
Coding standards as anchors for the CVPR CLIC video track

June 2021


Conference: CVPR Challenge on Learned Image Compression, Video Challenge paper

Authors:

Orange Innovation



Théo Ladune* and Pierrick Philippe*

Orange

pierrick.philippe@orange.com

Abstract

In 2021, a new track was introduced in the Challenge on Learned Image Compression (CLIC): the video track. This category explores technologies for the compression of short video clips at 1 Mbit/s. This paper proposes to generate coded videos using the latest standardized video coders, especially Versatile Video Coding (VVC). The objective is not only to measure the progress made by learning techniques compared to state-of-the-art video coders, but also to quantify their progress from year to year. With this in mind, this paper documents how to generate the video sequences fulfilling the requirements of this challenge, in a reproducible way, targeting the maximum performance for VVC.

1. Introduction

Since the 1990s, the standardization bodies ISO and ITU-T have defined several video coding standards [3]. Advanced Video Coding (AVC) was finalized in 2003, followed by High Efficiency Video Coding (HEVC) in 2013, and finally Versatile Video Coding (VVC) was released in 2020. From one generation to the next, the target is, among additional functionalities, to reduce the bit rate by a factor of two for an equivalent subjective quality. HEVC has effectively been shown to halve the bit rate compared to AVC. VVC also demonstrates 50% bit-rate savings compared to HEVC [7].

From one generation to the next, ITU/MPEG standards have consistently shown that they represent the state of the art in terms of image quality. VVC, the latest of these technologies, is therefore considered the flagship of the standardized solutions. In the context of the Challenge on Learned Image Compression (CLIC), it is therefore important to establish the level of performance of this latest iteration of video coding standards.

However, it is important to note that video coding standards specify only the format of the coded data, i.e. the bitstream, and the decoder. While two decoder implementations have to reproduce the same video sequence, the encoder itself is not constrained, as it can accommodate different trade-offs, e.g. in terms of complexity versus quality.

*The two authors contributed equally.

In the context of this challenge, it is therefore important to establish appropriate encoder configurations that maximize the performance of video standards while fulfilling the challenge requirements. This is what this paper proposes.

First, a brief overview of VVC is given, with a focus on the latest evolutions and on the tools appropriate for the challenge. Then, after an analysis of the challenge requirements, a general approach is proposed to obtain suitable encoder parameterizations. A last section presents the coding results obtained and compares them with the performance of HEVC with two encoder implementations.

2. Brief overview of VVC

This section gives some elements of VVC. The reader should refer to [3] for an overview of VVC and its development phase.

Like AVC and HEVC, VVC has a block-based hybrid coding architecture. This architecture combines Inter and Intra block predictions. Intra blocks are predicted from the current image, while Inter blocks are predicted from other images. To avoid any coding drift during these prediction processes, the encoder must rely on predictions identical to those performed at the decoder side. For inter coding, the coding order of the images in a sequence does not necessarily follow their presentation order after decoding: it is advised to use a hierarchical GOP (Group of Pictures) structure, in which a distant frame is encoded first and the intermediate frames are interpolated.

After the prediction is performed, a residual is computed as the difference between the original image and its prediction. This residual signal is transformed to reduce statistical dependencies and subsequently quantized at a selectable accuracy (using a quantization parameter called QP); the quantized values are then binarized and conveyed using Context Adaptive Binary Arithmetic Coding (CABAC). The block reconstruction is performed after arithmetic decoding and inverse transformation, by adding the spatial-domain residual block to the predicted block.

The type of prediction is determined based on a recursive subdivision of blocks (into Coding Units, CUs) from an initial maximal size, 128x128 pixels for VVC, down to 4x4 blocks. For each block, the encoder selects the most appropriate prediction scheme (intra/inter) and the corresponding parameters, then the coded residual is determined. This process acts in a competitive fashion: rate-distortion optimization is carried out to select the best block subdivision and coding parameters.

Building on the HEVC standard, VVC considerably extends the number of coding tools and provides additional flexibility. For example, the Coding Units (CUs) can be subdivided using quad-tree, binary-tree and also ternary-tree splits. The number of intra prediction angular modes is extended from 33 in HEVC to 93 in VVC. Inter prediction can also benefit from refinements using optical flow, geometric partitioning, etc.

For the transform stage, instead of using a single type of transform, as AVC does with the Discrete Cosine Transform (Type II), VVC uses Multiple Transform Selection (MTS) to provide additional transform kinds: the DCT Type VIII and the Type VII Discrete Sine Transform (DST). The transform sizes range from 4 to 64 to handle the different degrees of spatial stationarity.
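As a rough illustration of the additional transform kernels mentioned above, the following Python sketch builds floating-point DST-VII and DCT-VIII bases from their usual closed-form definitions and checks that each basis is orthonormal. It is only a sketch under stated assumptions: VVC specifies scaled integer approximations of these matrices rather than floating-point ones, and the function names (`dst7`, `dct8`, `is_orthonormal`) are illustrative, not part of any codec API.

```python
import math


def dst7(n):
    """Floating-point DST-VII basis of size n; row k is the k-th basis function."""
    scale = math.sqrt(4.0 / (2 * n + 1))
    return [[scale * math.sin(math.pi * (2 * k + 1) * (j + 1) / (2 * n + 1))
             for j in range(n)]
            for k in range(n)]


def dct8(n):
    """Floating-point DCT-VIII basis of size n; row k is the k-th basis function."""
    scale = math.sqrt(4.0 / (2 * n + 1))
    return [[scale * math.cos(math.pi * (2 * k + 1) * (2 * j + 1) / (4 * n + 2))
             for j in range(n)]
            for k in range(n)]


def is_orthonormal(matrix, tol=1e-9):
    """Check that the rows of a square matrix form an orthonormal set."""
    n = len(matrix)
    for a in range(n):
        for b in range(n):
            dot = sum(matrix[a][j] * matrix[b][j] for j in range(n))
            expected = 1.0 if a == b else 0.0
            if abs(dot - expected) > tol:
                return False
    return True


if __name__ == "__main__":
    for size in (4, 8, 16, 32):
        print(size, is_orthonormal(dst7(size)), is_orthonormal(dct8(size)))
```

Because both bases are orthonormal, the inverse transform is simply the transpose, which is what allows an encoder to try several kernels per block and keep the one with the lowest rate-distortion cost.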

In the context of this Challenge on Learned Image Compression, it is also worth noting that machine learning approaches were extensively used during the VVC development phase. In particular, VVC uses two tools inherited from learning-based approaches:

• For intra prediction, Matrix-based Intra Prediction (MIP) [8] uses a set of prediction matrices derived using a neural network approach. This tool was progressively simplified into a linear alternative and subsequently quantized [9] to allow deterministic and reliable implementations, including on fixed-point devices;

• Prediction residuals often exhibit directional patterns for which DCT/DST-based separable transforms are not adapted. Therefore, non-separable transforms called Low Frequency Non-Separable Transforms (LFNST) [7] have been designed. LFNST provides a set of transforms adapted to each intra prediction direction [2].

2.1. Tools for Screen Content Coding

To handle graphics coding, traditional hybrid coding with transforms is not advisable: the residual signal exhibits sharp edges that are not suited to DCT/DST-based transforms. Instead, transforms are avoided through the use of the Transform Skip alternative, where the residual is directly coded in the spatial domain. A sample-wise integer differential PCM can be applied to remove vertical or horizontal redundancies through the BDPCM tool [1].

Intra Block Copy is also beneficial for this content, as it copies and pastes previously coded areas. This can be viewed as a basic motion-compensated prediction, with integer pixel accuracy, conducted within the current picture. This set of tools is denoted as SCC (Screen Content Coding) tools in the ITU/MPEG terminology.

3. Adaptation of the VVC coding configuration to the challenge requirements

The video compression track asks participants to compress short video clips of 60 YUV frames with a vertical resolution of 720 lines for the luma component. The widths in the complete video set range from 948 to 1440 pixels. During the validation phase, a subset of 100 sequences is considered; they include resolutions from 959 pixels wide up to 1440.

The target bit rate is approximately 1 Mbit/s for the whole set. The decoder size is accounted for in the submission to avoid overfitting the data in the training process. Consequently, participants in the challenge have to minimize both the dataset size and the model size through a weighted sum:

$$T_{size} = \text{Submission Size} = \frac{\text{Data Size}}{0.019} + \text{Decoder Size}$$

The limit for the Submission Size is set to 1,309,062,500 bytes. Given this overall limit, the objective of the challenge is to maximize the MS-SSIM.
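The weighted sum can be checked numerically. The sketch below reads it as Data Size / 0.019 + Decoder Size and applies it to the data and decoder sizes reported in Table 2; with that reading, the relative sizes come out at the percentages listed in the table. The function and variable names are illustrative only.

```python
LIMIT_BYTES = 1_309_062_500   # submission size limit stated in the text
DATA_WEIGHT = 0.019           # divisor applied to the data size


def submission_size(data_size, decoder_size):
    """Weighted submission size in bytes: data size / 0.019 + decoder size."""
    return data_size / DATA_WEIGHT + decoder_size


def relative_size_percent(data_size, decoder_size):
    """Submission size expressed as a percentage of the challenge limit."""
    return 100.0 * submission_size(data_size, decoder_size) / LIMIT_BYTES


# (data size, decoder size) in bytes, as reported in Table 2.
ANCHORS = {
    "VVC / VTM": (24_830_109, 701_528),
    "HEVC / HM": (24_789_559, 355_643),
    "HEVC / x265": (24_864_775, 355_643),
}

if __name__ == "__main__":
    for name, (data_size, decoder_size) in ANCHORS.items():
        share = relative_size_percent(data_size, decoder_size)
        print(f"{name}: {share:.2f}% of the limit")
```

Since the data size is divided by 0.019, one byte of coded data weighs roughly 52.6 times as much as one byte of decoder, which is why the anchors spend almost all of the budget on the bitstreams.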

Therefore, the challenge objective can be turned into a classic rate-distortion optimization problem. This is commonly solved using a Lagrangian optimization method in which the distortion and the bit rate are combined into a single metric:

$$J(\lambda) = \text{MS-SSIM} + \lambda \cdot T_{size} \qquad (1)$$

As the MS-SSIM and the submission size are additive, the optimization is solved, for a given λ value, by finding the optimal rate-distortion point individually for each sequence. The multiplier λ, which enforces the size constraint, is selected so that the total matches the submission size. To optimize the rate-distortion cost, the encoder shall maximize a perceptual quality in line with the MS-SSIM metric. The set of optional tools needs to be carefully considered to provide optimal quality, since the validation sequences contain computer-generated content. Specific tools are needed: Intra Block Copy and BDPCM in particular are selected for those sequences.
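The per-sequence selection described above can be sketched as follows. This is a minimal illustration under several assumptions: each sequence comes with a list of hypothetical (size, MS-SSIM) candidates, the cost is written as quality minus λ times size with λ ≥ 0 (one possible sign convention for a quality that is maximized), and λ is found by bisection, which is a common choice but is not stated in the text. All names and numbers are made up for the example.

```python
def select(sequences, lam):
    """For each sequence, keep the candidate maximizing quality - lam * size."""
    return {name: max(points, key=lambda p: p[1] - lam * p[0])
            for name, points in sequences.items()}


def total_size(selection):
    """Sum of the sizes of the selected candidates."""
    return sum(size for size, _ in selection.values())


def allocate(sequences, budget, iterations=60):
    """Bisection on lam so that the selected points fit within the budget."""
    smallest = sum(min(size for size, _ in points)
                   for points in sequences.values())
    if smallest > budget:
        raise ValueError("even the smallest candidates exceed the budget")
    low, high = 0.0, 1.0
    # Grow the upper bound until the selection it yields is feasible.
    while total_size(select(sequences, high)) > budget:
        high *= 2.0
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if total_size(select(sequences, mid)) > budget:
            low = mid    # too large: penalize size more strongly
        else:
            high = mid   # feasible: try a weaker penalty
    return high, select(sequences, high)


# Hypothetical candidates: (size in bytes, MS-SSIM) per sequence.
SEQUENCES = {
    "clip_a": [(100, 0.95), (200, 0.97), (400, 0.98)],
    "clip_b": [(100, 0.90), (200, 0.96), (400, 0.99)],
}

if __name__ == "__main__":
    lam, chosen = allocate(SEQUENCES, budget=600)
    print(lam, chosen, total_size(chosen))
```

In this toy case the shared λ gives more bytes to the clip whose quality improves fastest with size, which is the behaviour expected from a single multiplier applied to every sequence.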

To optimize the quality, the coding structure can be relaxed to avoid unnecessary constraints. For example, only a single Intra frame is needed in this context, since no frequent random access point is required. Also, the coding structure is made flexible. First, the GOP (Group Of Pictures) size is set to 32, which is the largest power of 2 within 60 frames; then, for sequences with high motion, a shorter GOP size (16) is considered to handle rapid movements.

Option                               Description
--InputFile                          Selects the input file
--BitstreamFile                      Indicates the bitstream file
--SourceWidth                        Selects the video width
--SourceHeight                       Selects the video height
-c encoder_randomaccess_vtm.cfg      Selects the basic coding configuration, with GOP32
-c classSCC.cfg                      Selects the SCC tools when appropriate
--IntraPeriod=-1                     Intra period: a single Intra frame is selected
--QP qp                              Specifies the base value of the quantization parameter
--SliceChromaQPOffsetPeriodicity=1   Periodicity for inter slices that use the slice-level chroma QP offsets
--PerceptQPA=1                       Applies perceptually optimized QP adaptation

Table 1. VTM Software coding configuration.

Standard  Encoder  Data Size (bytes)  Decoder Size (bytes)  MS-SSIM  Relative submission size
VVC       VTM      24830109           701528                0.98777  99.88%
HEVC      HM       24789559           355643                0.98450  99.69%
HEVC      x265     24864775           355643                0.97968  100.00%

Table 2. Comparative performance of VVC relative to HEVC

To summarize, the desired VVC encoder configuration should include:

• Perceptual quality optimization, targeting MS-SSIM maximization if possible;

• A single Intra frame inserted at the beginning of the sequence;

• An adapted GOP size, maximized for stationary sequences and shorter for rapidly evolving sequences;

• Usage of SCC tools, especially when the sequence contains graphics.

The VVC standard includes a reference encoder [6] that provides selectable options accommodating most of these desired features: the Intra frame insertion mechanism can be selected and the GOP structure adapted to the challenge objectives. SCC tools can also be activated. Additionally, a perceptual optimization strategy [4] can be selected instead of the more frequently used PSNR approach.

These options translate into the VTM command line options listed in Table 1. SCC tools can be switched off for camera-generated sequences. For the GOP16 structure, the configuration file provided with the VTM (encoder_randomaccess_vtm_gop16.cfg) is invoked.
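To make the mapping from Table 1 to an actual invocation concrete, the sketch below assembles an argument list from the options in the table. Several points are assumptions: the executable name `EncoderApp`, the file names in the usage example, the use of the `=` form for every long option, and the configuration files being reachable from the working directory. Options that Table 1 does not list (for example frame rate or number of frames) are deliberately left out rather than guessed. The helper `vtm_command` is hypothetical and only builds the list; it does not run the encoder.

```python
def vtm_command(input_file, bitstream_file, width, height, qp,
                gop16=False, scc=False, encoder="EncoderApp"):
    """Build a VTM argument list from the options of Table 1 (sketch only)."""
    # Base random-access configuration: GOP32 by default, GOP16 on request.
    if gop16:
        base_cfg = "encoder_randomaccess_vtm_gop16.cfg"
    else:
        base_cfg = "encoder_randomaccess_vtm.cfg"
    cmd = [encoder, "-c", base_cfg]
    if scc:
        # Screen content tools, only for computer-generated sequences.
        cmd += ["-c", "classSCC.cfg"]
    cmd += [
        f"--InputFile={input_file}",
        f"--BitstreamFile={bitstream_file}",
        f"--SourceWidth={width}",
        f"--SourceHeight={height}",
        "--IntraPeriod=-1",                    # single Intra frame
        f"--QP={qp}",                          # base quantization parameter
        "--SliceChromaQPOffsetPeriodicity=1",
        "--PerceptQPA=1",                      # perceptual QP adaptation
    ]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, one candidate encoding per (configuration, QP).
    for qp in (24, 30, 36, 42):
        print(" ".join(vtm_command("clip.yuv", f"clip_qp{qp}.bin", 1280, 720, qp)))
```

Returning a list rather than a single string keeps each argument separate, so the result could be passed to `subprocess.run` without shell quoting issues.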

For VVC, the rate-distortion point is selected using the Quantization Parameter (QP) on the command line. A larger QP indicates a larger quantization step, leading to a smaller bit rate. In contrast, smaller QPs increase the quality. When the encoder is driven by a QP parameter, the encoding quality is mostly constant, as the coding noise level is directly related to the quantization step.
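The link between QP and quantization step can be illustrated with the approximation commonly quoted for AVC, HEVC and VVC, in which the step doubles every 6 QP units. This relation is not given in the text and is used here as an assumption; the actual quantizer in VVC includes further refinements that the sketch ignores.

```python
def approx_qstep(qp):
    """Approximate quantization step for a given QP (step doubles every 6 QP)."""
    return 2.0 ** ((qp - 4) / 6.0)


if __name__ == "__main__":
    # QP range used in this paper for the candidate encodings.
    for qp in range(24, 43, 6):
        print(qp, round(approx_qstep(qp), 2))
```

Under this approximation, moving from QP 24 to QP 42 multiplies the quantization step by 8, which gives an idea of the bit-rate range spanned by the candidate encodings.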

Each file in the validation set is encoded with a set of QPs: in practice, in this paper, the QP range is fixed to 24 to 42 to cover a sufficient bit-rate range. For the SCC sequences, the SCC configuration is selected. Both GOP16 and GOP32 are used; the Lagrangian optimization automatically selects the best configuration, sequence by sequence.

To handle misaligned YUV files, for which the luma (Y) component does not have twice the number of pixels of the chroma channels (U, V), an additional row of pixels was added during the conversion from PNG to YUV, prior to the encoding process. In practice, this happens in the validation phase only for the sequence "Lecture 1080P4991". After decoding, that additional row of pixels was cropped in the inverse conversion.
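The padding and cropping steps can be sketched on a single luma plane stored as a list of rows. The text does not say how the extra row is filled; replicating the last row is an assumption made here, and the function names are illustrative. No encoding takes place in the sketch, so the cropped plane is identical to the input, whereas in practice it would be the decoded, lossy version.

```python
def pad_to_even_height(plane):
    """Return (padded_plane, rows_added) so that the plane has an even height."""
    rows = [list(row) for row in plane]
    if len(rows) % 2 == 1:
        # Assumption: the additional row replicates the last row.
        rows.append(list(rows[-1]))
        return rows, 1
    return rows, 0


def crop_rows(plane, rows_added):
    """Remove the rows appended by pad_to_even_height."""
    return plane[:len(plane) - rows_added]


if __name__ == "__main__":
    luma = [[10, 20], [30, 40], [50, 60]]      # toy plane with an odd height
    padded, added = pad_to_even_height(luma)   # done before encoding
    restored = crop_rows(padded, added)        # done after decoding
    print(len(padded), added, restored == luma)
```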

4. Coding Results

The rate-distortion optimization process selects the best coding configuration and the appropriate QP for each sequence.

Figure 1 shows the frequency of the coding configurations selected during the coding and rate-distortion optimization process. Most of the sequences use the basic configuration with the maximum GOP size; the GOP16 and SCC configurations are used less often, as expected.

Figure 2 illustrates the distribution of the QP values. The average QP value is close to 31, which confirms that the selected QP range is sufficient.

Overall, the performance of the proposed VTM anchor is reported in Table 2.

Figure 1. Optimal coding configuration selected in the RD process. Bars: configuration share (%) for the BASIC, SCC and GOP16 configurations.

Figure 2. Selected QP values. Horizontal axis: quantization parameter (QP), from 22 to 40. Vertical axis: QP share (%).

4.1. Comparison with HEVC and x265

Using the same methodology, the performance of the HEVC standard is also reported using two encoder implementations:

• The HEVC reference software [5], version HM 16.22, is used with the main10 profile; as such, the SCC tools are not available in this configuration;

• The x265 software [10], version 3.5, is also included in the benchmark. The main10 profile is selected, and the tuning is performed using the SSIM option.

For those two encoders, the RDO process described for VVC is used, based on the sequences encoded according to the QP parameter. The command line for x265 was:

x265 --tune ssim --input inputFile --input-res WxH --profile main10 --crf QP

The performance of the VTM relative to those HEVC implementations is shown in Figure 3. The competition between the VTM configurations with SCC tools and GOP16, denoted as "VTM (all)", provides additional compression performance relative to the VTM with a fixed configuration. The rate-distortion optimization provides a considerable performance improvement for VVC compared to a configuration where all the sequences use the same quantization parameter ("VTM no RDO"). A bit-rate saving of roughly 40% is observed from HM to VVC, and likewise from x265 in its default configuration to HM.

Figure 3. Performance of VTM with competition. Horizontal axis: submission size relative to the challenge limit (%), from 50 to 150. Vertical axis: -10 x log10(1 - MS-SSIM). Curves: VTM (all), VTM, VTM no RDO, x265.

The performance of the HM 16.22 and x265 configurations at the challenge target is provided in Table 2.
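The vertical axis of Figure 3 expresses quality as -10 x log10(1 - MS-SSIM). As a small worked example, the sketch below applies that conversion to the MS-SSIM values of Table 2, which makes the gaps between the three anchors easier to read on a decibel-like scale. The function name is illustrative.

```python
import math


def msssim_db(msssim):
    """Convert an MS-SSIM value in [0, 1) to the -10 * log10(1 - MS-SSIM) scale."""
    return -10.0 * math.log10(1.0 - msssim)


# MS-SSIM values reported in Table 2.
ANCHOR_MSSSIM = {
    "VVC / VTM": 0.98777,
    "HEVC / HM": 0.98450,
    "HEVC / x265": 0.97968,
}

if __name__ == "__main__":
    for name, value in ANCHOR_MSSSIM.items():
        print(f"{name}: {msssim_db(value):.2f} dB")
```

On this scale the VTM anchor sits about 1 dB above HM and about 2.2 dB above x265 at the challenge target.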

5. Conclusion

This paper reports the generation of anchors for the CLIC21 video track. Modern standardized codec anchors are provided, including the latest ITU/MPEG standard, VVC. A rate-distortion process is described to provide a set of encoded sequences for which the toolset and the quality are adjusted to match the challenge requirements.

This paper attempts to make this anchor generation as reproducible as possible. The video bitstreams are available upon request by contacting the first author.

The proposed coding configurations and optimal point determination are still subject to further improvement: in this paper, only the reference VVC implementation (VTM) has been used, with limited changes to its parameterization. A more elaborate encoder, or a more finely tuned parameterization, would likely give a better level of performance.

References

[1] M. Abdoli, F. Henry, P. Brault, F. Dufaux, P. Duhamel, and P. Philippe. Intra block-DPCM with layer separation of screen content in VVC. In 2019 IEEE International Conference on Image Processing (ICIP), pages 3162–3166, 2019.

[2] A. Arrufat, P. Philippe, and O. Déforges. Mode-dependent transform competition for HEVC. In 2015 IEEE International Conference on Image Processing (ICIP), pages 1598–1602, 2015.

[3] B. Bross, J. Chen, J. R. Ohm, G. J. Sullivan, and Y. K. Wang. Developments in international video coding standardization after AVC, with an overview of Versatile Video Coding (VVC). Proceedings of the IEEE, pages 1–31, 2021.

[4] C. R. Helmrich, S. Bosse, M. Siekmann, H. Schwarz, D. Marpe, and T. Wiegand. Perceptually optimized bit-allocation and associated distortion measure for block-based image or video coding. In 2019 Data Compression Conference (DCC), pages 172–181, 2019.

[5] JVET. https://vcgit.hhi.fraunhofer.de/jvet/hm, 2021.

[6] JVET. https://vcgit.hhi.fraunhofer.de/jvet/vvcsoftware_vtm, 2021.

[7] M. Koo, M. Salehifar, J. Lim, and S. Kim. Low frequency non-separable transform (LFNST). In 2019 Picture Coding Symposium (PCS), pages 1–5, 2019.

[8] M. Schäfer, B. Stallenberger, J. Pfaff, P. Helle, H. Schwarz, D. Marpe, and T. Wiegand. A data-trained, affine-linear intra-picture prediction in the frequency domain. In 2019 Picture Coding Symposium (PCS), pages 1–5, 2019.

[9] M. Schäfer, B. Stallenberger, J. Pfaff, P. Helle, H. Schwarz, D. Marpe, and T. Wiegand. Efficient fixed-point implementation of matrix-based intra prediction. In 2020 IEEE International Conference on Image Processing (ICIP), pages 3364–3368, 2020.

[10] VideoLAN. https://www.videolan.org/developers/x265.html, 2021.
