
Coding of Stereoscopic and 3D Images and Video - Introduction, Disparity estimation, Stereo video coding, Standardization Efforts, Conclusions


G. Triantafyllidis, N. Grammalidis, and M.G. Strintzis
Informatics and Telematics Institute, Thessaloniki, Greece

Definition: Since the bandwidth required to transmit stereoscopic and 3D image streams is large, efficient coding techniques should be employed to reduce the data rate.


Stereo vision provides a direct way of inferring depth information by using two images (a stereo pair) destined for the left and right eye respectively. The stereo image pair consists of two frames, labeled the left frame and the right frame.

A stereoscopic pair of image sequences, recorded with a difference in the view angle, allows the three-dimensional (3D) perception of the scene by the human observer, by exposing each eye to the respective image sequence. This creates an enhanced 3D feeling and increased “tele-presence” in teleconferencing and several other applications (medical, entertainment, etc.).

Since the bandwidth required to transmit both stereoscopic image streams is large, efficient coding techniques should be employed to reduce the data rate. Similar to other coding scenarios, compression for stereo images can be achieved by taking advantage of redundancies in the source data, e.g. spatial and temporal redundancies for monocular images and video. The simplest solution for compressing the two channels is by using independent coding for each image/video with existing compression standards such as JPEG or MPEG. However, in the case of stereo images/video, an additional source of redundancy stems from the similarity, i.e. the strong “binocular redundancy” between two images in a stereo pair, due to stereo camera geometry. Exploiting this binocular dependency allows achieving higher compression ratios.

The subjective quality of the reconstructed image sequence produced by such algorithms should be sufficiently high. Subjective tests have shown that image artifacts are more visible and annoying in a stereoscopic display than in standard (monoscopic) video displays.

One of the most important parameters in the study of stereo coding is disparity. Because of the different perspective, the same point on an object is mapped to different coordinates in the left and right images. The disparity is the difference between these coordinates.

Compression of the right channel can either remain compatible with monoscopic coding techniques, treating the two channels as independent and applying standard video coding to each, or abandon compatibility, e.g. by jointly coding both channels. Figure 1(a) shows a coding technique where each view is coded independently.

A simple modification to exploit the correlation between the two views would be to encode one (the left) view and the difference between the two views. However, this method is not very efficient because each object in the scene has a different disparity. Further improvement can therefore be achieved by adopting predictive coding, where a disparity vector field and the disparity-compensated difference frame are encoded. Figure 1(b) illustrates this approach, which attempts to exploit cross-view redundancy by coding the right image using the estimated disparity map to predict it from the left image. The corresponding residual image is also encoded to improve the reconstruction of the right image.
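The disparity-compensated prediction step described above can be sketched as follows. This is a minimal illustration, not any standard's actual algorithm: the function name, the block size, and the sign convention (a point at column x in the right image corresponds to column x + d in the left image) are assumptions, and the border handling is a simple clamp.

```python
import numpy as np

def disparity_compensate(left, disparities, block=8):
    """Predict the right image by copying, for each block, the block of the
    left image shifted by that block's horizontal disparity vector.
    Convention (assumed): right pixel at column x matches left pixel at x + d."""
    h, w = left.shape
    pred = np.zeros_like(left)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            d = disparities[by // block, bx // block]
            sx = min(max(bx + d, 0), w - block)  # clamp source column at borders
            pred[by:by + block, bx:bx + block] = left[by:by + block, sx:sx + block]
    return pred
```

The encoder then transmits the disparity field plus the residual (right minus prediction); the decoder reconstructs the right image by adding the decoded residual back onto the same prediction.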

Three major approaches are used for coding of stereo sequences: block-based, object-based and hybrid. The block-based approach has the advantage of simplicity and robustness, allowing more straightforward hardware implementations, but the subjective quality of reconstructed images may be unacceptable at low bit-rates. Object-based schemes alleviate the problem of annoying coding errors, providing a more natural representation of the scene, but require a complex analysis phase to segment the scene into objects and estimate their motion and structure. Finally, hybrid methods, combining block-based and object-based methods, are usually preferred since they can combine the merits of both approaches.

Due to the similarity between stereo images and video, many of the concepts and techniques used for video coding are applicable to stereo image coding. Predictive coding with motion estimation increases the coding gain by exploiting the temporal dependency. This is possible because consecutive images in a video sequence tend to be similar. In general, disparity estimation is similar to motion estimation in the sense that they both are used to exploit the similarity between two (or more) images in order to reduce the bit rate. However, the motion estimation schemes developed in video coding may not be efficient unless geometrical constraints for stereo imaging are taken into account.

A predictive coding system for stereoscopic images includes displacement (disparity) estimation/compensation, transform/quantization and entropy coding, so the overall encoding performance can be controlled by various factors. In particular, for stereo image coding, an efficient prediction reduces the “binocular redundancy” between the two images of a stereo pair. In addition, optimal quantization that takes this “binocular dependency” into account can further improve the overall encoding performance.

Occlusion regions mark disparity discontinuities, which can be used to improve stereo image encoding and transmission, as well as segmentation, motion analysis and object-identification processes that must preserve object boundaries.

Disparity estimation

To better understand the meaning of disparity, let us imagine overlaying the left and right pictures of a stereo pair by placing the left picture on the top of the right. Given a specific point A in the left picture, its matching point B in the right picture does not in general lie directly underneath point A. Therefore, we define the stereo (or binocular) disparity of the point pair (A,B) as the vector connecting B and A.

In the case of a parallel-axes camera configuration, the disparity reduces to the signed magnitude of the horizontal component of the displacement vector, since the vertical component is always zero. Furthermore, the parallel-axes camera geometry leads to a simple mathematical relationship between the disparity of a point pair and the distance (depth) of the object it represents: the disparity is inversely proportional to the depth.

Disparity estimation is one of the key steps in stereo coding, because it allows the similarity between the two views to be exploited through disparity compensation. In the predictive coding framework, the redundancy is reduced by predicting the target (right) image from the reference (left) image using the disparity vectors.

Various techniques have been proposed to determine stereo disparities. All of these methods attempt to match pixels in one image with their corresponding pixels in the other image. Some basic methods are summarized in Table 1.
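One of the simplest matching methods is block matching: for each block of the target image, search along the horizontal epipolar line (valid for parallel-axes geometry) for the shift into the reference image that minimises the sum of absolute differences (SAD). The sketch below is a generic illustration of this idea; the function name, block size and search range are assumptions, not part of any cited scheme.

```python
import numpy as np

def estimate_disparity(left, right, block=8, max_disp=16):
    """Block-matching disparity estimation along the horizontal epipolar
    line: for each block of the right image, find the horizontal shift d
    into the left image with the smallest SAD."""
    h, w = right.shape
    disp = np.zeros((h // block, w // block), dtype=int)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            target = right[by:by + block, bx:bx + block]
            best_d, best_sad = 0, float("inf")
            for d in range(0, max_disp + 1):
                if bx + d + block > w:
                    break  # candidate block would fall outside the image
                cand = left[by:by + block, bx + d:bx + d + block]
                sad = np.abs(target.astype(int) - cand.astype(int)).sum()
                if sad < best_sad:
                    best_d, best_sad = d, sad
            disp[by // block, bx // block] = best_d
    return disp
```

In a coder, the resulting vector field is what gets entropy-coded and used for the disparity-compensated prediction of the right view.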

Stereo video coding

Stereo video coding algorithms can be classified into: extensions of single-view video coding, rate-distortion optimized coding, coding based on semantic relevance, object-based coding and other techniques.

In traditional MPEG-2 compression, the video includes I-frames, which are coded independently, P-frames, which are predicted from previous I- or P-frames, and B-frames, which are predicted from both preceding and following I- or P-frames. A method to extend this coding technique to stereo has been proposed. In this method the left channel is coded independently, exactly as in the single-view coding case (MPEG-2). Each frame of the other channel is then predicted either from the corresponding frame of the left view, using disparity compensation, or from the previously coded frame of the right view, using motion compensation.
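The encoder's choice between the two predictors can be sketched as a simple residual-energy comparison. This is an illustrative decision rule under assumed names, not the normative MPEG-2 mode-selection procedure:

```python
import numpy as np

def choose_predictor(block_right, disp_pred, motion_pred):
    """For a block of the right view, choose between the disparity-
    compensated prediction (from the left view at the same time instant)
    and the motion-compensated prediction (from the previously coded
    right frame), picking whichever yields the smaller SAD residual."""
    sad_disp = np.abs(block_right - disp_pred).sum()
    sad_motion = np.abs(block_right - motion_pred).sum()
    if sad_disp <= sad_motion:
        return "disparity", disp_pred
    return "motion", motion_pred
```

A real coder would fold the cost of signalling the chosen mode and vector into this decision rather than compare raw SAD alone.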

A predictive video encoding system contains several stages such as motion estimation and compensation, disparity estimation and compensation in the case of stereo coding, transform, quantization and, finally, entropy coding. Therefore, the overall performance of the system can be optimized by adjusting a limited number of variables used in the above mentioned stages. It is observed that, especially in stereo video coding, the better the disparity prediction is, the more efficient the overall compression rate becomes. Rate distortion optimized coding schemes estimate the block-based motion/disparity field under the constraint of a target bitrate for the coding of the vector information. The entropy of the displacement vectors is employed as a measure of the bit rate needed for their lossless transmission.
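A rate-distortion optimized vector choice of the kind described above minimises a Lagrangian cost J = D + λR over the candidate displacement vectors, where D is the prediction distortion and R the bits needed to code the vector (estimated, for instance, from the entropy of the vector field). The sketch below is a generic illustration; the per-candidate bit costs are assumed to be supplied by an entropy model.

```python
import numpy as np

def rd_select_vector(target, reference, candidates, rate_bits, lam):
    """Choose the displacement vector minimising J = D + lam * R,
    with D the block SAD and R the estimated bits for the vector
    (rate_bits[i], e.g. derived from the vector field's entropy)."""
    best, best_cost = None, float("inf")
    for (dy, dx), bits in zip(candidates, rate_bits):
        pred = reference[dy:dy + target.shape[0], dx:dx + target.shape[1]]
        cost = np.abs(target - pred).sum() + lam * bits
        if cost < best_cost:
            best, best_cost = (dy, dx), cost
    return best, best_cost
```

Note how λ trades accuracy against rate: a small λ favours the vector with the best prediction even if it is expensive to code, while a large λ favours cheap vectors.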

If the relevancy of the content is considered while encoding, the definition of a semantic relevance measure for video content is crucial. Such relevance measures can be assigned by the user, or the content author can assign common relevance measures for all users. It is also possible to divide the input video into segments by considering various statistics along temporal segments (coding difficulty hints, motion hints, etc.) that affect the ease of coding, without taking any relevance issues into account.

Object-based coding has long attracted considerable attention as a promising alternative to block-based encoding, achieving excellent performance and producing fewer annoying effects, such as the blocking artifacts and mosquito effects that commonly occur in block-based hybrid DCT coders at moderate and low bit rates. Furthermore, important image areas such as facial details in face-to-face communications can be reconstructed with a higher image quality than with block-oriented hybrid coding. In addition, the ability of object-based coding techniques to describe a scene in a structural way, in contrast to traditional waveform-based coding techniques, opens new areas of application. The encoder consists of an analysis part and a synthesis part. The analysis part aims to subdivide the scene into a set of objects, representing each by a set of parameters: shape or boundary, motion, structure or depth, and texture or colour. These parameters are encoded and transmitted to the decoder, where the decoded parameters are used to synthesize an approximation of the original images. The analysis phase is the most sophisticated one, consisting of image segmentation and motion/structure estimation.

Standardization Efforts

Stereo video coding is already supported by MPEG-2, where a corresponding multi-view profile, defined in 1996, is available to transmit two video signals. The main application area of the MPEG-2 Multi-View Profile (MVP) is stereoscopic TV. The MVP extends the well-known hybrid coding scheme towards the exploitation of inter-view redundancies by implicitly defining disparity-compensated prediction. However, there are important disadvantages: disparity vector fields are sparse, so disparity compensation is inefficient and motion compensation is usually preferred. Furthermore, the technology is outdated, and interactive applications that involve view interpolation cannot be supported. To support interactive applications, enhanced depth and/or disparity information about the scene has to be included in the bitstream, which can also be used for synthesizing virtual views from intermediate viewpoints. A new MPEG activity, 3DAV (for 3D Audio-Visual), explores the need for standardization in this area to support these new applications. Experiments in encoding depth data with different video codecs, by putting the depth data into the luminance channel and simply changing the semantics of its description, have been conducted by MPEG and the ATTEST IST project. Results show that this approach makes it possible to achieve extreme compression of depth data while still maintaining a good quality level for both the decoded depth and any generated novel views. Furthermore, additional disparity and occlusion information can be included in the form of additional MPEG-4 Video Object Planes (VOPs), Multiple Auxiliary Components (MACs, defined by MPEG-4) or, preferably, Layered Depth Images, which are defined in a new part of MPEG-4 called the Animation Framework eXtension (AFX).
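Putting depth data into the luminance channel amounts to quantizing each metric depth value into the codec's 8-bit luma range. The sketch below uses a simple linear mapping between a near and a far clipping plane; this is an illustration only — actual systems such as ATTEST typically map inverse depth so that more quantization levels are spent on nearby objects, where depth errors are most visible in synthesized views.

```python
import numpy as np

def depth_to_luma(depth, z_near, z_far):
    """Quantize a metric depth map into an 8-bit luminance plane.
    Linear mapping between clipping planes (illustrative); near
    objects map to bright values."""
    t = np.clip((depth - z_near) / (z_far - z_near), 0.0, 1.0)
    return np.round(255 * (1.0 - t)).astype(np.uint8)

def luma_to_depth(luma, z_near, z_far):
    """Inverse mapping applied after the depth video is decoded."""
    t = 1.0 - luma.astype(float) / 255.0
    return z_near + t * (z_far - z_near)
```

The resulting 8-bit plane can then be fed to any ordinary video codec; only the semantics of the decoded samples change.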


Conclusions

Stereoscopic coding poses a distinctive problem because of the additional correlation that exists between the two views. The most common approach is the predictive coding of one channel with respect to the other, and it includes displacement (disparity) estimation/compensation, transform/quantization and entropy coding. Depending on the basic elements used, approaches can be classified as block-based, object-based and hybrid. Extensions of single-view video coding, rate-distortion optimized coding, coding based on semantic relevance, object-based coding and other techniques have been proposed for efficient stereoscopic coding.

Although both MPEG-2 and MPEG-4 have already standardized specific extensions for stereoscopic and multiview signals, new interactive applications require new coding approaches which are currently under development within a new MPEG activity to support such new applications, under the name of 3DAV (for 3D Audio-Visual).
