
Scalable Video Coding


Marta Mrak and Ebroul Izquierdo
Department of Electronic Engineering, Queen Mary, University of London, UK

Definition: Scalable video coding targets seamless delivery of and access to digital content, enabling optimal, user-centered multi-channel and cross-platform media services, and providing a straightforward solution for universal video delivery to a broad range of applications.

The recent convergence of multimedia technology and telecommunications, along with the emergence of the Web as a strong competitor to conventional distribution networks, has generated an acute need to enrich the modalities and capabilities of digital media delivery. Within this new trend, a main challenge relates to the production of easily adaptable content capable of optimally fitting evolving, heterogeneous networks as well as interactive delivery platforms with specific content requirements. Network-supported multimedia applications involve many different transmission capabilities, including Web-based applications, narrowcasting, conventional terrestrial channels for interactive broadcasting, wireless channels, high-definition television for sensitive remote applications (e.g., remote medical diagnosis), etc. These applications deliver content to a wide range of terminals and users surrounded by different environments and acting under totally different circumstances. Conventional video coding systems encode video content at a fixed bit-rate tailored to a specific application; as a consequence, conventional video coding does not fulfill the basic requirements of new flexible digital media applications. In contrast, scalable video coding (SVC) emerges as a new technology able to satisfy these requirements. SVC targets seamless delivery of and access to digital content, enabling optimal, user-centered multi-channel and cross-platform media services, and providing a straightforward solution for universal video delivery to a broad range of applications.

There are several major engineering challenges in scalable video coding. The most important relates to the capability of automatically rendering broadcast content on a Web browser or other low bit-rate end-devices such as PDAs and 3G phones, and vice versa; that is, the capability of dynamically performing resolution changes in order to deliver the best quality under a given bandwidth budget. Since this functionality should be provided "on the fly" without expensive and time-consuming transcoders, it requires the production of content in a specific "hierarchically embedded" form. This also enables applications that are both network- and terminal-aware, i.e., applications that can tailor content quality and size to optimally fit bandwidth and terminal requirements automatically and without expensive or time-consuming additional processing.

Scalable coding is equally useful for still-image applications where efficient adaptation of compressed images is required. Such adaptations are usually context driven and involve adaptation of the quality, spatial resolution, color components (grey scale or full color) and, more importantly, regions of interest (ROI). In this article we discuss the scalability features required in such adaptations and review the current state of the art in scalable coding.


There are several types of scalability that should be supported by an advanced video coding system. The basic scalability types are:

  • spatial (resolution) scalability,
  • SNR (quality) scalability,
  • temporal (frame rate) scalability.

Scalable coding should also support combinations of these three basic scalability functionalities. Moreover, advanced scalable video coders may support:

  • complexity scalability,
  • region of interest scalability,
  • object based scalability,

and other features such as:

  • support for both progressive and interlaced material,
  • support for various color depths (including component scalability),
  • robustness to different types of transmission errors.

A scalable video coder produces an embedded bit-stream, which we refer to as the precoded stream. To address specific application or network conditions, a rate allocation mechanism, part of the scalable video transmitter, extracts the necessary data from the precoded bit-stream (see Figure 1).

Since bit allocation tables are provided, or appropriate flags are inserted in the bit-stream, the extractor has very low complexity. State-of-the-art scalable video coders feature at least two of the scalability types described above. Figure 2 depicts the most important scalability features in advanced video coders.
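The low-complexity extraction described above can be illustrated with a minimal sketch. The `extract` function and the layer organization are hypothetical simplifications, assuming a precoded stream stored as an ordered list of quality layers:

```python
# A minimal sketch of a bit-stream extractor, assuming a hypothetical precoded
# stream organized as quality layers ordered from base to enhancement.

def extract(layers, byte_budget):
    """Keep whole quality layers from an embedded stream until the
    rate budget is exhausted; no re-encoding is needed."""
    out = []
    used = 0
    for layer in layers:
        if used + len(layer) > byte_budget:
            break                 # truncate the embedded stream here
        out.append(layer)
        used += len(layer)
    return b"".join(out)

# Example: three layers of 100, 60 and 40 bytes; a 170-byte budget
# keeps the base layer and the first enhancement layer only.
stream = [b"\x00" * 100, b"\x01" * 60, b"\x02" * 40]
print(len(extract(stream, 170)))  # 160
```

Because the stream is embedded, adaptation reduces to truncation: no decoding, motion estimation or re-quantization is performed in the transmitter.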

Spatial transforms for SVC

There are two categories of transforms commonly used in video and still-image coding: the discrete cosine transform (DCT), or DCT-like transforms, and the discrete wavelet transform (DWT). The DCT is used in the popular JPEG still-image coding standard and in a wide variety of video coding standards, including MPEG and H.264. It is performed on blocks of the original frame or of motion-compensated error frames and is not resolution scalable. However, scalability may be imposed on DCT-based video coding schemes, which results in systems with low complexity but with limited scalability features and a significant decrease in compression efficiency.

The application of the wavelet transform in video coding has been exhaustively examined but is still not regarded as the main decorrelating transform in video coding. However, it is the basis of the more advanced JPEG2000 still-image coding standard which, besides excellent compression performance, offers a broad range of other functionalities, including full scalability.

The wavelet transform can be used efficiently in video compression, enabling simple selection of the desired bit allocation, rate optimization and progressive transmission. It can be applied to single frames, as a 2D transform on image pixels or on motion-compensated frames; alternatively, it can be used as a 3D transform on groups of frames in a video sequence. Accordingly, wavelet-based video coding schemes can be grouped as follows:
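To make the resolution scalability of the DWT concrete, the following is an illustrative one-level 2D Haar transform of a frame (a simplified stand-in for the longer filters used in practice, with orthonormal 1/sqrt(2) normalization per dimension):

```python
# Illustrative one-level 2D Haar wavelet transform of a frame.
# The LL band is a half-resolution version of the input, which is the
# basis of resolution scalability in wavelet coders.
import numpy as np

def haar_2d(frame):
    x = frame.astype(float)
    # filter along rows: low-pass (sums) and high-pass (differences)
    lo = (x[:, 0::2] + x[:, 1::2]) / np.sqrt(2)
    hi = (x[:, 0::2] - x[:, 1::2]) / np.sqrt(2)
    # filter along columns, producing the four subbands
    ll = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2)
    lh = (lo[0::2, :] - lo[1::2, :]) / np.sqrt(2)
    hl = (hi[0::2, :] + hi[1::2, :]) / np.sqrt(2)
    hh = (hi[0::2, :] - hi[1::2, :]) / np.sqrt(2)
    return ll, lh, hl, hh

frame = np.arange(16).reshape(4, 4)
ll, lh, hl, hh = haar_2d(frame)
# Decoding only LL already yields a valid half-resolution frame;
# the orthonormal transform preserves the signal energy across subbands.
```

Repeating the transform on the LL band yields the dyadic multi-resolution pyramid exploited by scalable coders.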

a. Wavelet in loop

Both intra-coded frames and motion-compensated residual frames are transformed using a 2D wavelet transform; basically, the DCT is replaced by the DWT. Since the decoder does not perform the same prediction as the coder, the so-called drift problem arises. The blocky edges caused by motion compensation (MC) cannot be efficiently coded using the wavelet transform and may introduce unpleasant ringing artifacts. To reduce the blockiness in the prediction-error images, overlapped block motion compensation (OBMC) can be used. Moreover, the drift problem can be avoided by performing motion estimation on the lowest quality and resolution layer. However, if a coder has to meet a wide range of bit-rates, its efficiency becomes very poor.

b. In-band prediction

In this approach, a 2D wavelet transform is applied first, and inter-frame redundancy is then removed in the wavelet domain: the wavelet coefficients are coded using temporal prediction and temporal context modeling. Motion compensation is also used; while drift is avoided for resolution scalability, it remains for SNR scalability. MC can be performed for each resolution level individually, thus enabling efficient resolution scalability. The drawback is that in the wavelet domain, spatial shifting results in phase shifting; motion compensation therefore does not perform well and may cause motion tracking errors in high-frequency bands [2]. To overcome this problem, overcomplete wavelet representations are recommended for MC.

c. Interframe Wavelet

Wavelet filtering is used in both the spatial and temporal domains. High degrees of scalability can be achieved without drift problems. Two classes can be distinguished: temporal filtering preceding spatial filtering (t+2D) and vice versa (2D+t). To achieve a higher degree of temporal decorrelation, motion-compensated interframe wavelet techniques are used. This process is called motion-compensated temporal filtering (MCTF).

Motion compensated temporal filtering (MCTF)

MCTF was first introduced in 3-D transform coding and has since been used in several scalable coding proposals. MCTF performs the wavelet transform after MC, enabling improved rate-distortion performance. The wavelet filtering can be implemented in a lifting structure in which the prediction and update steps include MC. The lifting scheme also enables sub-pixel realization and perfect reconstruction.

In MCTF-based schemes, subsequent video frames are transformed into temporal low-pass frames L and high-pass frames H. Qualitatively, in the simplest case, H is the motion-compensated difference between two subsequent frames and L is their mean.
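The simplest case above can be sketched as a Haar lifting step on a pair of frames. Motion compensation is omitted here for clarity; in actual MCTF the prediction and update steps would operate on motion-compensated pixels, and the function names are illustrative:

```python
# One level of Haar temporal lifting on a frame pair (A, B),
# without motion compensation (a simplification for illustration).
import numpy as np

def haar_mctf_pair(a, b):
    h = b - a          # prediction step: high-pass = frame difference
    l = a + h / 2.0    # update step: low-pass = mean of the two frames
    return l, h

def inverse_haar_mctf_pair(l, h):
    a = l - h / 2.0    # invert the update step
    b = h + a          # invert the prediction step
    return a, b

a = np.array([10.0, 20.0])
b = np.array([12.0, 18.0])
l, h = haar_mctf_pair(a, b)            # l = mean(a, b), h = b - a
ra, rb = inverse_haar_mctf_pair(l, h)  # ra == a, rb == b
```

Because each lifting step is inverted exactly by reversing it, perfect reconstruction holds by construction, which is the key property the lifting structure provides.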

A higher depth of temporal decomposition can be achieved by repeatedly filtering the L frames. This process is depicted in Figure 3. The temporal decomposition can also be non-dyadic, which enables wider implementation possibilities.
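The repeated filtering of L frames can be sketched as follows, using scalar "frames" and plain means/differences in place of motion-compensated filtering (an illustrative simplification):

```python
# Sketch of dyadic temporal decomposition: each level replaces pairs of
# low-pass frames with one L (mean) and one H (difference) frame, and the
# L frames are filtered again until a single L frame remains.
def decompose(frames):
    levels = []
    low = list(frames)
    while len(low) > 1:
        nxt, highs = [], []
        for a, b in zip(low[0::2], low[1::2]):
            highs.append(b - a)        # high-pass frame for this level
            nxt.append((a + b) / 2.0)  # low-pass frame, filtered again next
        levels.append(highs)
        low = nxt
    return low[0], levels  # final L frame + H frames per temporal level

l, hs = decompose([1.0, 3.0, 5.0, 7.0])
# 4 frames -> 2 temporal levels; decoding only the final L frame (or L plus
# some H levels) yields the reduced frame rates of temporal scalability.
```

Dropping the H frames of the finest level halves the frame rate, dropping two levels quarters it, and so on, which is exactly the temporal scalability the decomposition provides.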

The simplest implementation of MCTF in a lifting structure uses Haar filters, but more complex wavelet filter banks can achieve higher-quality interpolation. Longer filters make better use of temporal redundancies, but the doubling of the number of motion vectors to be transmitted negatively influences quality at lower bit-rates. To improve the prediction performance of MCTF schemes, the prediction tools of hybrid video codecs can be used.

A wide variety of spatial-domain MCTF (SD MCTF), also known as t+2D, codecs have been proposed. In these approaches, motion estimation and temporal filtering are the first steps. Since motion estimation is performed in the spatial domain on full-resolution frames, lower-quality, lower-resolution and lower-frame-rate streams contain the complete motion information, which does not necessarily represent the optimal block-matching decision. This motion information is also highly redundant for lower-resolution applications.

In 3D motion-compensated transformations based on in-band MCTF (IB MCTF, 2D+t), the input frames are first spatially transformed, and MCTF is performed on the transformed frames. Since the complete wavelet transform is shift-variant, a complete-to-overcomplete discrete wavelet transform (CODWT) may be used.

IB MCTF introduces a higher degree of scalability in video coding since the different resolutions are motion compensated separately. This allows tuning the motion information budget to the desired resolution. Complexity scalability is also easier to achieve because the MCTF may differ on each resolution level. Moreover, a variable number of temporal levels can be used for each resolution. IB MCTF and SD MCTF coding schemes may be implemented without the update step in MCTF, while the temporal filtering still remains invertible.

Overview of Scalability Features in Current Standards

Currently, no video coding standard provides full scalability. However, available standards feature basic scalability functionality. The following is an overview of the scalability features of the most popular video coding standards.

MPEG-2 video coding standard: The MPEG-2 video coding standard incorporates modes to support layered scalability. The signal may be encoded into a base layer and a few enhancement layers. The enhancement layers add spatial, temporal and/or SNR quality to the reconstructed base layer. Specifically, in SNR scalability the enhancement layer adds refinement data for the DCT coefficients of the base layer. For spatial scalability, the first enhancement layer uses predictions from the base layer without motion vectors. For temporal scalability, the enhancement layer uses motion-compensated predictions from the base layer.

MPEG-4 Fine-Granular Scalability and hybrid temporal-SNR scalability: Fine-Granular Scalability (FGS) enables progressive SNR coding. FGS is adopted in MPEG-4 in such a way that the image quality of the (non-scalable) base-layer frames can be enhanced using a single enhancement layer. Progression is achieved by embedded compression: MPEG-4 uses an embedded DCT based on bit-plane-by-bit-plane coding, with VLC tables for entropy coding.
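The bit-plane principle behind FGS can be shown with a toy coder. This sketch ignores the VLC entropy coding and sign handling of the actual MPEG-4 scheme and just demonstrates why a bit-plane-ordered stream is embedded:

```python
# Toy bit-plane coder: coefficient magnitudes are sent most-significant
# bit-plane first, so the enhancement stream can be truncated after any
# plane and still decode to a coarser approximation (progressive SNR).
def encode_bitplanes(coeffs, num_planes):
    planes = []
    for p in range(num_planes - 1, -1, -1):   # MSB plane first
        planes.append([(c >> p) & 1 for c in coeffs])
    return planes

def decode_bitplanes(planes, num_planes):
    out = [0] * len(planes[0])
    for i, plane in enumerate(planes):        # planes may be truncated
        p = num_planes - 1 - i
        for j, bit in enumerate(plane):
            out[j] |= bit << p
    return out

coeffs = [5, 12, 3, 9]                        # example coefficient magnitudes
planes = encode_bitplanes(coeffs, 4)
print(decode_bitplanes(planes[:2], 4))        # top 2 planes: [4, 12, 0, 8]
print(decode_bitplanes(planes, 4))            # all planes:   [5, 12, 3, 9]
```

Each additional plane halves the quantization step, so every prefix of the enhancement stream corresponds to a valid, progressively finer reconstruction.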

FGS in MPEG-4 also features temporal scalability. Temporal SNR FGS scalability is achieved by including only B-frames, called FGST frames, in the enhancement layer. Before the texture of an FGST frame is coded, a complete motion vector set for that frame has to be transmitted, whereas for base-layer frames motion vectors are transmitted before the texture information of each macroblock. The granularity of temporal scalability depends on the GOP structure determined by the coder.

MPEG and TVT scalable video coding: The MPEG standardization body has identified applications that require scalable and reliable video coding technologies and has defined the requirements for scalable video coding. The first Call for Proposals on Scalable Video Coding Technology, issued in October 2003, set out the evaluation methodology and the respective experimental conditions. The most promising proposals have been considered as the starting point for the development of a scalable video coding standard.

H.264/AVC was the first coding standard to achieve high efficiency in terms of rate-distortion performance. It is a block-based system (using a 4×4 DCT-like transform) with highly efficient motion compensation on variable block sizes and highly efficient context-based entropy coding (CABAC). Although the first version of H.264/AVC does not support scalable coding, it is regarded as the basis for several scalable video coding approaches.

