
Image and Video Quality Assessment - Introduction, Why Do We Need Quality Assessment?, Why is Quality Assessment So Hard?


Kalpana Seshadrinathan and Alan C. Bovik
The University of Texas at Austin, USA

Definition: Image and video quality assessment deals with quantifying the quality of an image or video signal as seen by a human observer using an objective measure.


In this article, we discuss methods to evaluate the quality of digital images and videos, where the final image is intended to be viewed by the human eye. The quality of an image that is meant for human consumption can be evaluated by showing it to a human observer and asking the subject to judge its quality on a pre-defined scale. This is known as subjective assessment and is currently the most common way to assess image and video quality. Clearly, this is also the most reliable method, as we are interested in evaluating quality as seen by the human eye. However, to account for human variability in assessing quality and to have some statistical confidence in the score assigned by the subject, several subjects are required to view the same image. The final score for a particular image can then be computed as a statistical average of the sample scores. Also, in such an experiment, the assessment depends on several factors such as the display device, the viewing distance, the content of the image, and whether or not the subject is a trained observer who is familiar with image processing. Thus, a change in viewing conditions would entail repeating the experiment. Imagine this process being repeated for every image that is encountered, and it becomes clear why subjective studies are cumbersome and expensive. It would hence be extremely valuable to formulate an objective measure that can predict the quality of an image.
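The statistical averaging of subjective scores can be illustrated with a minimal sketch. It assumes ratings on a 1 to 5 scale and a normal approximation for the confidence interval; the function name `mean_opinion_score` and the sample ratings are hypothetical and are not taken from any particular subjective study.

```python
import math
import statistics

def mean_opinion_score(scores, z=1.96):
    """Return (mean, half_width) where half_width is the half-width of an
    approximate 95% confidence interval (normal approximation)."""
    n = len(scores)
    mean = statistics.mean(scores)
    if n < 2:
        # A single rating gives no estimate of subject variability.
        return mean, float("inf")
    std_err = statistics.stdev(scores) / math.sqrt(n)
    return mean, z * std_err

# Hypothetical ratings of one image by eight subjects (1 = bad, 5 = excellent).
ratings = [4, 5, 3, 4, 4, 5, 3, 4]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

The interval narrows only as the square root of the number of subjects, which is one reason such studies become expensive.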

The problem of image and video quality assessment is to quantify the quality of an image or video signal as seen by a human observer using an objective measure. The quality assessment techniques that we present in this article are known as full-reference techniques, i.e. it is assumed that in addition to the test image whose quality we wish to evaluate, a “perfect” reference image is also available. We are, thus, actually evaluating the fidelity of the image, rather than the quality. Evaluating the quality of an image without a reference image is a much harder problem and is known as blind or no-reference quality assessment. Blind techniques generally reduce the storage requirements of the algorithm, which could lead to considerable savings, especially in the case of video signals. Also, in certain applications, the original uncorrupted image may not be available. However, blind algorithms are also difficult to develop as the interpretation of the content and quality of an image by the HVS depends on high-level features such as attentive vision, cognitive understanding, and prior experiences of viewing similar patterns, which are not very well understood. Reduced reference quality assessment techniques form the middle ground and use some information from the reference signal, without requiring that the entire reference image be available.

Why Do We Need Quality Assessment?

Image and video quality assessment plays a fundamental role in the design and evaluation of imaging and image processing systems. For example, the goal of image and video compression algorithms is to reduce the amount of data required to store an image and, at the same time, ensure that the resulting image is of sufficiently high quality. Image enhancement and restoration algorithms attempt to generate an image of better visual quality from a degraded image. Quality assessment algorithms are also useful in the design of image acquisition systems and in the evaluation of display devices. Communication networks have developed tremendously over the past decade, and images and video are frequently transported over optic fiber, packet switched networks like the Internet, and wireless systems. Bandwidth efficiency of applications such as video conferencing and Video on Demand (VOD) can be improved using quality assessment systems to evaluate the effects of channel errors on the transported images and video. Finally, quality assessment and the psychophysics of human vision are closely related disciplines. Evaluation of quality requires a clear understanding of the sensitivities of the HVS to several features such as luminance, contrast, texture, and masking, which are discussed in detail in the section on HVS-based approaches. Research on image and video quality assessment may lend deep insights into the functioning of the HVS, which would be of great scientific value.

Why is Quality Assessment So Hard?

At first glance, a reasonable candidate for an image quality metric might be the Mean-Squared Error (MSE) between the reference and distorted images. Consider a reference image and test image denoted by $R = R(i, j)$ and $T = T(i, j)$ respectively, where $0 \le i \le N-1$, $0 \le j \le M-1$. Then, the MSE is defined by:

$$\mathrm{MSE} = \frac{1}{NM} \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left[ R(i,j) - T(i,j) \right]^2$$

MSE is a function of the Euclidean distance between the two vectors R and T in an MN-dimensional space. Since the MSE is a monotonic function of the error between corresponding pixels in the reference and distorted images, it is a reasonable metric and is often used as a quality measure. Some of the reasons for the popularity of this metric are its simplicity, ease of computation and analytic tractability. However, it has long been known to correlate very poorly with visual quality. A few simple examples are sufficient to demonstrate that MSE is completely unacceptable as a visual quality predictor. This is illustrated in Figure 1, which shows several images that have identical MSE with respect to the reference but very different visual quality. The main reason for the failure of MSE as a quality metric is the absence of any kind of modeling of the sensitivities of the HVS.
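A small sketch makes the weakness of MSE concrete. It assumes images stored as lists of rows of grey levels; the 4x4 test patch and the two distortions are made up for illustration. A uniform brightness shift and a single badly corrupted pixel yield exactly the same MSE, although they would hardly be judged alike by a viewer.

```python
def mse(reference, test):
    """Mean-squared error between two equally sized 2-D images (lists of rows)."""
    n_rows, n_cols = len(reference), len(reference[0])
    total = 0.0
    for ref_row, test_row in zip(reference, test):
        for r, t in zip(ref_row, test_row):
            total += (r - t) ** 2
    return total / (n_rows * n_cols)

# Hypothetical flat 4x4 reference patch.
reference = [[100, 100, 100, 100] for _ in range(4)]

# Distortion A: every pixel is brightened by 2 grey levels.
shifted = [[p + 2 for p in row] for row in reference]

# Distortion B: a single pixel is off by 8 grey levels.
spiked = [row[:] for row in reference]
spiked[1][2] += 8

print(mse(reference, shifted), mse(reference, spiked))  # prints 4.0 4.0
```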

The difficulties in developing objective measures of image quality are best illustrated by example. Figure 1(a) and (b) show the original “Caps” and “Buildings” images respectively. Figure 1(c) and (d) show JPEG compressed versions of these images of approximately the same MSE. While the distortion in the “Buildings” image is hardly visible, it is visibly annoying in the “Caps” image. The perception of distortion varies with the actual image at hand and this effect is part of what makes quality assessment difficult. There is enormous diversity in the content of images used in different applications and even within images of a specific category, for example, the class of images obtained from the real world. Consistent performance of a quality assessment algorithm irrespective of the specific image at hand is no easy task. Additionally, different kinds of distortion produce different characteristic artifacts and it is very difficult for a quality assessment algorithm to predict degradation in visual quality across distortion types. For example, JPEG produces characteristic blocking artifacts and blurring of fine details (Figure 1(c) and (d)).

This is due to the fact that it is a block-based algorithm that achieves compression by removing the highest frequency components that the HVS is least sensitive to. JPEG 2000 compression eliminates blocking artifacts, but produces ringing distortions that are visible in areas surrounding edges, such as around the edges of the caps in Figure 1(e). Subband decompositions, such as those used in JPEG 2000, attempt to approximate the image using finite-duration basis functions and this causes ringing around discontinuities like edges due to the Gibbs phenomenon. Figure 1(f) shows an image that is corrupted by Additive White Gaussian Noise (AWGN), which looks grainy, as seen clearly in the smooth background regions of the image. This kind of noise is typically observed in many imaging devices and in images transmitted through certain communication channels. A generic image quality measure should predict visual quality in a robust manner across these and several other types of distortions.

Thus, it is not an easy task for a machine to automatically predict quality by computation, although the human eye is very good at evaluating the quality of an image almost instantly. We explore some state of the art techniques for objective quality assessment in the following sections.

Approaches to Quality Assessment

Techniques for image and video quality assessment can broadly be classified as bottom-up and top-down approaches. Bottom-up approaches attempt to model the functioning of the HVS and characterize the sensitivities and limitations of the human eye to predict the quality of a given image. Top-down approaches, on the other hand, usually make some high-level assumption on the technique adopted by the human eye in evaluating quality and use this to develop a quality metric. Top-down methods are gaining popularity due to their low computational complexity, as they don’t attempt to model the functioning of the entire HVS, but only try to characterize the features of the HVS that are most relevant in evaluating quality. Also, the HVS is a complex entity and even the low-level processing in the early visual pathway, which includes the optics, retina and striate cortex, is not understood well enough today, which limits the accuracy of existing HVS models. In this chapter, we categorize several state-of-the-art quality assessment techniques into three main categories, namely HVS modeling based approaches, structural approaches and information theoretic approaches. Each of these paradigms in perceptual quality assessment is explained in detail in the following sections.

HVS-based Approaches

Most HVS-based approaches can be summarized by the diagram shown in Figure 2. The initial step in the process usually involves the decomposition of the image into different spatial-frequency bands. It is well known that cells in the visual cortex are specialized and tuned to different ranges of spatial frequencies and orientations. Experimental studies indicate that the radial frequency selective mechanisms have constant octave bandwidths and the orientation selectivity is a function of the radial frequencies. Several transforms have been proposed to model the spatial frequency selectivity of the HVS and the initial step in an HVS-based approach is usually a decomposition of the image into different sub-bands using a filter-bank.

The perception of brightness is not a linear function of the luminance and this effect is known as luminance masking. In fact, the threshold of visibility of a brightness pattern is a linear function of the background luminance. In other words, brighter regions in an image can tolerate more noise due to distortions before it becomes visually annoying. The Contrast Sensitivity Function (CSF) provides a description of the frequency response of the HVS, which can be thought of as a band-pass filter. For example, the HVS is less sensitive to higher spatial frequencies and this fact is exploited by most compression algorithms to encode images at low bit rates, with minimal degradation in visual quality. Most HVS-based approaches use some kind of modeling of the luminance masking and contrast sensitivity properties of the HVS as shown in Figure 2.
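The linear relation between visibility threshold and background luminance can be sketched in a few lines. This is only an illustration under a simple proportional model; the constant `WEBER_FRACTION` and the luminance values are hypothetical and are not taken from any specific HVS model.

```python
WEBER_FRACTION = 0.02  # hypothetical constant, chosen only for illustration

def visibility_threshold(background_luminance, k=WEBER_FRACTION):
    """Smallest luminance change assumed visible on the given background."""
    return k * background_luminance

def is_visible(distortion_amplitude, background_luminance):
    """True if the distortion exceeds the threshold for this background."""
    return abs(distortion_amplitude) > visibility_threshold(background_luminance)

# The same noise amplitude is tested against a dark and a bright background.
noise = 3.0
for background in (50.0, 200.0):
    print(background, is_visible(noise, background))
```

With these values the noise exceeds the threshold on the darker background but not on the brighter one, which is the sense in which brighter regions tolerate more noise.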

In Figure 1, the distortions are clearly visible in the “Caps” image, but they are hardly noticeable in the “Buildings” image, despite the MSE being the same. This is a consequence of the contrast masking property of the HVS, wherein the visibility of certain image components is reduced due to the presence of other strong image components with similar spatial frequencies and orientations at neighboring spatial locations. Thus, the strong edges and structure in the “Buildings” image effectively mask the distortion, while it is clearly visible in the smooth “Caps” image. Usually, a HVS-based metric incorporates modeling of the contrast masking property, as shown in Figure 2.

In developing a quality metric, a signal is first decomposed into several frequency bands and the HVS model specifies the maximum possible distortion that can be introduced in each frequency component before the distortion becomes visible. This is known as the Just Noticeable Difference (JND). The final stage in the quality evaluation involves combining the errors in the different frequency components, after normalizing them with the corresponding sensitivity thresholds, using some metric such as the Minkowski error. The final output of the algorithm is either a spatial map showing the image quality at different spatial locations or a single number describing the overall quality of the image.
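The final pooling stage can be sketched as follows. This is a minimal illustration, assuming one error value and one JND threshold per frequency band; the function name `minkowski_pool`, the exponent and the sample numbers are hypothetical, and actual metrics differ in how they normalize and pool.

```python
def minkowski_pool(errors, thresholds, p=4.0):
    """Pool per-band errors, each normalized by its JND threshold,
    with a Minkowski sum of exponent p."""
    normalized = [abs(e) / t for e, t in zip(errors, thresholds)]
    return sum(v ** p for v in normalized) ** (1.0 / p)

# Hypothetical errors and JND thresholds for three frequency bands.
band_errors = [2.0, 1.0, 0.5]
jnd_thresholds = [1.0, 2.0, 4.0]

for p in (1.0, 2.0, 4.0):
    print(p, minkowski_pool(band_errors, jnd_thresholds, p))
```

A normalized error of 1 corresponds to a distortion right at the visibility threshold. Larger exponents let the most visible error dominate the pooled score, while p = 1 simply adds the normalized errors.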

Different proposed quality metrics differ in the models used for the blocks shown in Figure 2. Notable amongst the HVS-based quality measures are the Visible Difference Predictor, the Teo and Heeger model, Lubin’s model and Sarnoff’s JNDMetrix technology.

Structural Approaches

Structural approaches to image quality assessment, in contrast to HVS-based approaches, take a top-down view of the problem. Here, it is hypothesized that the HVS has evolved to extract structural information from a scene and hence, quantifying the loss in structural information can accurately predict the quality of an image. In Figure 1, the distorted versions of the “Buildings” image and the “Caps” image have the same MSE with respect to the reference image. The bad visual quality of the “Caps” image can be attributed to the structural distortions in both the background and the objects in the image. The structural philosophy can also accurately predict the good visual quality of the “Buildings” image, since the structure of the image remains almost intact in both distorted versions.

Structural information is defined as those aspects of the image that are independent of the luminance and contrast, since the structure of various objects in the scene is independent of the brightness and contrast of the image. The Structural SIMilarity (SSIM) algorithm, also known as the Wang-Bovik Index, partitions the quality assessment problem into three components, namely luminance, contrast and structure comparisons.

Let $\vec{x}$ and $\vec{y}$ represent N-dimensional vectors containing pixels from the reference and distorted images respectively. Then, the Wang-Bovik Index between $\vec{x}$ and $\vec{y}$ is defined by:

$$\mathrm{SSIM}(\vec{x}, \vec{y}) = \left( \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} \right)^{\alpha} \left( \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} \right)^{\beta} \left( \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3} \right)^{\gamma} \qquad (1)$$

$C_1$, $C_2$ and $C_3$ are small constants added to avoid numerical instability when the denominators of the fractions are small. $\alpha$, $\beta$ and $\gamma$ are non-negative constants that control the relative contributions of the three different measurements to the Wang-Bovik Index.

The three terms in the right hand side of Equation (1) are the luminance, contrast and structure comparison measurements respectively. $\mu_x$ and $\mu_y$ are estimates of the mean luminance of the two images and hence, the first term in Equation (1) defines the luminance comparison function. It is easily seen that the luminance comparison function satisfies the desirable properties of being bounded by 1 and attaining the maximum possible value if and only if the means of the two images are equal. Similarly, $\sigma_x$ and $\sigma_y$ are estimates of the contrast of the two images and the second term in Equation (1) defines the contrast comparison function. Finally, the structural comparison is performed between the luminance and contrast normalized signals, given by $(\vec{x} - \mu_x)/\sigma_x$ and $(\vec{y} - \mu_y)/\sigma_y$. The correlation or inner product between these signals is an effective measure of the structural similarity. The correlation between the normalized vectors is equal to the correlation coefficient between the original signals $\vec{x}$ and $\vec{y}$, which is defined by the third term in Equation (1), where $\sigma_{xy}$ denotes the covariance between $\vec{x}$ and $\vec{y}$. Note that the Wang-Bovik Index is also bounded by 1 and attains unity if and only if the two images are equal.
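The luminance, contrast and structure comparisons can be computed directly from sample statistics. The sketch below is a minimal illustration on a single vector of pixels, not a full implementation (practical versions typically evaluate the index over local windows). The default constants are assumptions: values commonly quoted for 8-bit images, with the third constant set to half of the second, and unit exponents.

```python
import math

def ssim_index(x, y, c1=6.5025, c2=58.5225, alpha=1.0, beta=1.0, gamma=1.0):
    """Product of luminance, contrast and structure comparisons for two
    equally long pixel vectors. Constants are assumed defaults."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Unbiased sample variances and covariance.
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    sigma_x, sigma_y = math.sqrt(var_x), math.sqrt(var_y)
    c3 = c2 / 2.0  # assumed choice for the third constant

    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sigma_x * sigma_y + c2) / (var_x + var_y + c2)
    structure = (cov_xy + c3) / (sigma_x * sigma_y + c3)
    return (luminance ** alpha) * (contrast ** beta) * (structure ** gamma)

# Hypothetical pixel vectors: the test signal is the reference plus 20 grey levels.
x = [10, 50, 90, 130, 170, 210, 250, 30]
y = [p + 20 for p in x]
print(ssim_index(x, x), ssim_index(x, y))
```

A pure brightness shift leaves the contrast and structure terms at (approximately) 1 and lowers only the luminance term, so the index stays close to, but below, unity. Note that the structure term can be negative, so non-integer values of `gamma` would need extra care.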

The structural philosophy overcomes certain limitations of HVS-based approaches such as computational complexity and inaccuracy of HVS models. The idea of quantifying structural distortions is not only novel, but also intuitive, and experimental studies show that the algorithm is competitive with several other state-of-the-art quality metrics.

Information Theoretic Approaches

Information theoretic approaches attempt to quantify the loss in the information that the HVS can extract from a given test image, as compared to the original reference image. Mutual information between two random sources is a statistical measure that quantifies the amount of information one source contains about the other. In other words, assuming the distorted and reference images to be samples obtained from two random sources, mutual information measures the statistical dependence between these sources, that is, how much of the information in the reference survives in the distorted image. Information theoretic approaches use this measure to quantify the amount of information that the human eye can obtain from a given image, to develop a metric that correlates well with visual quality. The Visual Information Fidelity (VIF) criterion, also known as the Sheikh-Bovik Index, assumes that the distorted image is the output of a communication channel that introduces errors in the image that passes through it. The HVS is also assumed to be a communication channel that limits the amount of information that can pass through it.

Photographic images of natural scenes exhibit striking structures and dependencies and are far from random. A random image generated assuming an independent and identically distributed Gaussian source, for example, will look nothing like a natural image. Characterizing the distributions and statistical dependencies of natural images provides a description of the subspace spanned by natural images, in the space of all possible images. Such probabilistic models have been studied by numerous researchers and one model that has achieved considerable success is known as the Gaussian Scale Mixture (GSM) model. This is the source model used to describe the statistics of the wavelet coefficients of reference images in the Sheikh-Bovik Index. Let $\vec{C}$ represent a collection of wavelet coefficients from neighboring spatial locations of the original image. Then, $\vec{C} \sim z\vec{U}$, where $z$ represents a scalar random variable known as the mixing density and $\vec{U}$ represents a zero-mean, white Gaussian random vector. Instead of explicitly characterizing the mixing density, the maximum likelihood estimate of the scalar $z$ is derived from the given image in the development of the Sheikh-Bovik Index.

Let $\vec{D}$ denote the corresponding coefficients from the distorted image. The distortion channel that the reference image passes through to produce the distorted image is modeled using:

$$\vec{D} = g\vec{C} + \vec{\nu}$$

This is a simple signal attenuation plus additive noise model, where $g$ represents a scalar attenuation and $\vec{\nu}$ is additive Gaussian noise. Most commonly occurring distortions such as compression and blurring can be approximated by this model reasonably well. This model has some nice properties such as analytic tractability, ability to characterize a wide variety of distortions and computational simplicity. Additionally, both reference and distorted images pass through a communication channel that models the HVS. The HVS model is given by:

$$\vec{C}_{out} = \vec{C} + \vec{N}_1, \qquad \vec{D}_{out} = \vec{D} + \vec{N}_2$$

where $\vec{N}_1$ and $\vec{N}_2$ represent additive Gaussian noise that is independent of the input image. The entire system is illustrated in Figure 3.

The VIF criterion is then defined for these coefficients using:

$$\mathrm{VIF} = \frac{I(\vec{C}; \vec{D}_{out} \mid z)}{I(\vec{C}; \vec{C}_{out} \mid z)} \qquad (2)$$

$I(\vec{C}; \vec{C}_{out} \mid z)$ represents the mutual information between $\vec{C}$ and $\vec{C}_{out}$, conditioned on the estimated value of $z$. The denominator of Equation (2) represents the amount of information that the HVS can extract from the original image. The numerator represents the amount of information that the HVS can extract from the distorted image. The ratio of these two quantities is hence a measure of the amount of information in the distorted image relative to the reference image and has been shown to correlate very well with visual quality. Closed-form expressions to compute this quantity have been derived in the literature. Note that wavelet coefficients corresponding to the same spatial location can be grouped separately; for example, coefficients in each subband of the wavelet decomposition can be collected into a separate vector. In this case, these different quality indices for the same spatial location have to be appropriately combined. This results in a spatial map containing the Sheikh-Bovik quality Index of the image, which can then be combined to produce an overall index of goodness for the image.
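For intuition, the ratio of the two information quantities can be sketched in the scalar case, where the mutual information of a Gaussian signal in additive Gaussian noise is half the base-2 logarithm of one plus the signal-to-noise ratio. The sketch below is a simplification under stated assumptions, not the published algorithm: each coefficient is taken to be a scalar equal to z times a Gaussian of variance `sigma_u2`, the per-coefficient values of z are taken as already estimated, and all names and numbers are hypothetical.

```python
import math

def vif_scalar(z_values, g, sigma_u2, sigma_v2, sigma_n2):
    """Ratio of information extracted from the distorted image to that
    extracted from the reference, for scalar Gaussian coefficients.

    z_values : estimated mixing multipliers, one per coefficient
    g        : scalar attenuation of the distortion channel
    sigma_u2 : variance of the underlying Gaussian source
    sigma_v2 : variance of the additive distortion noise
    sigma_n2 : variance of the HVS (neural) noise
    """
    info_distorted = 0.0
    info_reference = 0.0
    for z in z_values:
        signal_var = (z ** 2) * sigma_u2  # variance of the coefficient given z
        info_reference += 0.5 * math.log2(1.0 + signal_var / sigma_n2)
        info_distorted += 0.5 * math.log2(
            1.0 + (g ** 2) * signal_var / (sigma_v2 + sigma_n2)
        )
    return info_distorted / info_reference

z_hat = [0.5, 1.0, 2.0, 4.0]  # hypothetical estimates of z
print(vif_scalar(z_hat, g=1.0, sigma_u2=1.0, sigma_v2=0.0, sigma_n2=0.1))  # no distortion
print(vif_scalar(z_hat, g=0.7, sigma_u2=1.0, sigma_v2=0.5, sigma_n2=0.1))  # blur plus noise
```

With no attenuation and no distortion noise the ratio is exactly 1; attenuation or added noise lowers it toward 0.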

The success of the information theoretic paradigm lies primarily in the use of accurate statistical models for the natural images and the distortion channel. Natural scene modeling is in some sense a dual of HVS modeling, as the HVS has evolved in response to the natural images it perceives. The equivalence of this approach to certain HVS-based approaches has also been established. The idea of quantifying information loss and the deviation of a given image from certain expected statistics provides an altogether new and promising perspective on the problem of image quality assessment.

Figure 4 illustrates the power of the Wang-Bovik and Sheikh-Bovik indices in predicting image quality. Notice that the relative quality of the images, as predicted by both indices, is the same and agrees reasonably well with human perception of quality. In the Video Quality Experts Group (VQEG) Phase I FR-TV tests, which provide performance evaluation procedures for quality metrics, logistic functions are first fitted to obtain a non-linear mapping between objective and subjective scores before performance is evaluated. Hence, the differences in the absolute values of quality predicted by the two algorithms are not important.
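The reason absolute values do not matter can be shown with a short sketch: a monotonic logistic mapping rescales objective scores without changing their ordering. The three-parameter form and the parameter values below are assumptions chosen for illustration; they are not the specific function or fitted values used in the VQEG tests.

```python
import math

def logistic_map(score, b1, b2, b3):
    """Generic increasing logistic mapping (for b1, b2 > 0) from an
    objective score to a predicted subjective score."""
    return b1 / (1.0 + math.exp(-b2 * (score - b3)))

def ranking(values):
    """Indices of the values, ordered from lowest to highest."""
    return sorted(range(len(values)), key=lambda i: values[i])

# Hypothetical objective scores for four images and hypothetical parameters.
objective = [0.62, 0.81, 0.93, 0.45]
predicted = [logistic_map(s, b1=5.0, b2=8.0, b3=0.7) for s in objective]

print(ranking(objective))
print(ranking(predicted))
```

Both rankings are identical, so two metrics that order the images the same way are equivalent after such a fit, whatever their raw score ranges.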


In this chapter, we have attempted to present a short survey of image quality assessment techniques. Researchers in this field have primarily focused on techniques for images, as this is easier and usually the first step in developing a video quality metric. Although insights from image quality metrics play a huge role in the design of metrics for video, it is not always a straightforward extension of a two-dimensional problem into three dimensions. The fundamental change in moving from images to video is the motion of various objects in a scene. From the perspective of quality assessment, video metrics require modeling of the human perception of motion and of the quality of motion in the distorted video. Most of the algorithms discussed here have been extended to evaluate the quality of video signals and further details can be found in the references.

Considerable progress has been made in the field of quality assessment over the years, especially in the context of specific applications like compression and halftoning. Most of the initial work dealt with the threshold of perceived distortion in images, as opposed to supra-threshold distortion which refers to artifacts that are perceptible. Recent work in the field has concentrated on generic, robust quality measures in a full reference framework for supra-threshold distortions. No reference quality assessment is still in its infancy and is likely to be the thrust of future research in this area.

