Semantic Image Representation and Indexing

A Detection-Based Image Representation and Indexing Scheme


In essence, the objective of the structured framework is to represent an image in an application domain as a distribution of meaningful visual objects relevant to that domain, rather than as low-level features such as color, texture, and shape. In a systematic manner, an application developer designs a visual vocabulary of regions with intuitive meanings (e.g. faces, buildings, and foliage in consumer images; cerebrum, teeth, and skeletal joints in medical images), called Semantic Support Regions (SSRs), based on typical images of the application domain. Training samples of the visual vocabulary, cropped from typical images, are then used to construct modular visual detectors of these SSRs by statistical learning on visual features suited to characterizing the vocabulary.

To index an image from the same application domain as a distribution of SSRs, the learned visual detectors are first applied to the image in a multi-scale manner. The detection results are reconciled across multiple resolutions and aggregated spatially to form local semantic histograms, suitable for efficient semantic query and retrieval. The key to image representation and indexing here is not to record the primitive feature vectors themselves but to project them into a classification space spanned by semantic labels (instead of a low-level feature space) and to use the soft classification decisions as the local indexes for further aggregation.

To compute the SSRs from training instances, we could use Support Vector Machines (SVMs) on suitable features for a local image patch. The feature vectors depend on the application domain. Suppose we denote a feature vector as $z$. A support vector classifier $S_i$ is a detector for SSR $i$ on a region with feature vector $z$. The classification vector $T$ for region $z$ can be computed via the softmax function as:

$$T_i(z) = \frac{\exp S_i(z)}{\sum_j \exp S_j(z)} \qquad (1)$$
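As a minimal sketch of the softmax step, the following turns raw detector outputs into a classification vector whose entries sum to one. The function name and the example scores are hypothetical; in practice the scores would come from the trained SVM detectors.

```python
import math

def classification_vector(scores):
    """Map raw detector outputs S_i(z) to soft decisions T_i(z) via a softmax."""
    # Subtracting the maximum score improves numerical stability
    # and leaves the softmax result unchanged.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical detector outputs for three SSRs (e.g. face, building, foliage).
T = classification_vector([2.0, 0.5, -1.0])
```

The entries of T keep the ranking of the raw scores, so the most confident detector still dominates the vector.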

For illustration, we assume both color and texture are important, as in the case of consumer images. That is, a feature vector $z$ has two parts, namely a color feature vector $z_c$ and a texture feature vector $z_t$. As a working example, for the color feature, we can compute the mean and standard deviation of each color channel (i.e. $z_c$ has 6 dimensions). As another example, for the texture feature, we can compute the Gabor coefficients. Similarly, the means and standard deviations of the Gabor coefficients (e.g. 5 scales and 6 orientations) in an image block are computed as $z_t$ (60 dimensions). Zero-mean normalization is applied to both the color and texture features.
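The 6-dimensional color feature can be sketched as below. The block is assumed to be a plain list of 3-channel pixel tuples, and the ordering of the output (mean then standard deviation, channel by channel) is an illustrative choice, not something the text prescribes. The texture part and the normalization step are left out.

```python
import math

def color_feature(pixels):
    """Return [mean_0, std_0, mean_1, std_1, mean_2, std_2] for a pixel block."""
    n = len(pixels)
    feature = []
    for channel in range(3):
        values = [p[channel] for p in pixels]
        mean = sum(values) / n
        # Population variance over the block.
        variance = sum((v - mean) ** 2 for v in values) / n
        feature.extend([mean, math.sqrt(variance)])
    return feature

# A tiny hypothetical block of two pixels.
z_c = color_feature([(0, 0, 0), (2, 4, 6)])
```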

The distance or similarity measure depends on the kernel adopted for the SVMs. Based on past experimentation, and for illustration, polynomial kernels are used here. In order to balance the contributions of the color and texture features, the similarity measure sim(y,z) between feature vectors y and z is defined as:

where y•z denotes the dot product.
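One plausible way to balance the two parts, assumed here for illustration rather than taken from the text, is to average the length-normalized dot products of the color parts and of the texture parts, so that neither part dominates because of its dimensionality. The function names and argument layout are hypothetical.

```python
import math

def normalized_dot(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sim(y_c, y_t, z_c, z_t):
    """Assumed form: equal-weight average of color and texture similarity."""
    return 0.5 * (normalized_dot(y_c, z_c) + normalized_dot(y_t, z_t))

# Identical texture parts, orthogonal color parts.
s = sim([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```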

Mercer's condition for the kernel K(y,z) ensures the convergence of the SVM algorithm towards a unique optimum, because the SVM problem is convex whenever a Mercer kernel is used. Mercer's condition requires that, for any $g(z)$ such that $\int g(y)^2 \, dy$ is finite, $\iint K(y,z) \, g(y) \, g(z) \, dy \, dz \ge 0$. However, defining a kernel that satisfies Mercer's condition, or proving that a given kernel does, is non-trivial. This difficulty has not stopped researchers from experimenting with non-Mercer kernels of practical value.

To detect SSRs with translation and scale invariance in an image to be indexed, the image is scanned with windows of different scales. More precisely, given an image $I$ with resolution M × N, the middle layer (Figure 1), the Reconciled Detection Map (RDM), has a lower resolution of P × Q, with P ≤ M and Q ≤ N. Each pixel (p,q) in the RDM corresponds to a two-dimensional region of size $r_x \times r_y$ in $I$. We further allow tessellation displacements $d_x, d_y > 0$ in the X and Y directions respectively, such that adjacent pixels in the RDM along the X direction (Y direction) have receptive fields in $I$ that are displaced by $d_x$ pixels along the X direction ($d_y$ pixels along the Y direction). At the end of scanning an image, each pixel (p,q) that covers a region $z$ in the pixel-feature layer will consolidate the SSR classification vector $T(z)$ (Equation (1)).

As an empirical guideline, we can progressively increase the window size $r_x \times r_y$ from 20 × 20 to 60 × 60 at a displacement $(d_x, d_y)$ of (10,10) pixels, on a 240 × 360 size-normalized image. That is, after the detection step, we have 5 detection maps with dimensions ranging from 23 × 35 down to 19 × 31.
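The map dimensions quoted above follow from counting window positions along each axis. The sketch below assumes no padding at the image border and a window size that grows in steps of 10 pixels, which is inferred from the count of 5 maps; the function name is hypothetical.

```python
def detection_map_shape(height, width, window, step):
    """Number of window positions along each axis when scanning without padding."""
    rows = (height - window) // step + 1
    cols = (width - window) // step + 1
    return rows, cols

# Window sizes 20, 30, 40, 50, 60 on a 240 x 360 image with displacement 10.
shapes = [detection_map_shape(240, 360, w, 10) for w in range(20, 61, 10)]
# shapes == [(23, 35), (22, 34), (21, 33), (20, 32), (19, 31)]
```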

To reconcile the detection maps across different resolutions onto a common basis, we adopt the following principle: if the most confident classification of a region at resolution r is less confident than that of a larger region (at resolution r + 1) that subsumes the region, then the classification output of the region should be replaced by that of the larger region at resolution r + 1. For instance, if the detection of a face is more confident than that of a building at the nose region (assuming nose is not in the SSR vocabulary), then the entire region covered by the face, which subsumes the nose region, should be labeled as face.

Using this principle, we start the reconciliation from the detection map based on the largest scan window (60 × 60) and proceed to the detection map based on the next-to-smallest scan window (30 × 30). After 4 cycles of reconciliation, the detection map based on the smallest scan window (20 × 20) will have consolidated the detection decisions obtained at the other resolutions.
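A rough sketch of this coarse-to-fine reconciliation is given below. It rests on two assumptions that the text does not spell out: each scale enlarges the window by exactly one displacement step (as in the 20 to 60 example), so the window at grid position (p, q) is subsumed by the next-scale windows at (p-1..p, q-1..q); and when several larger windows qualify, the most confident one is taken. Each map is a nested list of classification vectors.

```python
def reconcile(maps):
    """Fold detection maps (ordered smallest window first) into the finest map."""
    coarse = maps[-1]
    # One cycle per remaining scale, from next-to-largest down to smallest.
    for fine in reversed(maps[:-1]):
        merged = []
        for p, row in enumerate(fine):
            new_row = []
            for q, vec in enumerate(row):
                best = vec
                # Larger windows that subsume the window at (p, q).
                for pp in (p - 1, p):
                    for qq in (q - 1, q):
                        if 0 <= pp < len(coarse) and 0 <= qq < len(coarse[0]):
                            candidate = coarse[pp][qq]
                            # Replace only if the larger region is more confident.
                            if max(candidate) > max(best):
                                best = candidate
                new_row.append(best)
            merged.append(new_row)
        coarse = merged
    return coarse

# Two hypothetical scales: a 2 x 2 fine map and a 1 x 1 coarse map.
fine_map = [[[0.6, 0.4], [0.5, 0.5]], [[0.9, 0.1], [0.55, 0.45]]]
coarse_map = [[[0.2, 0.8]]]
rdm = reconcile([fine_map, coarse_map])
```

With five maps, the loop runs four times, matching the 4 cycles described above.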

The purpose of spatial aggregation is to summarize the reconciled detection outcome in a larger spatial region. Suppose a region $Z$ comprises $n$ small equal regions with feature vectors $z_1, z_2, \ldots, z_n$ respectively. To account for the size of detected SSRs in the spatial area $Z$, the SSR classification vectors of the RDM are aggregated as

$$T_i(Z) = \frac{1}{n} \sum_{k=1}^{n} T_i(z_k) \qquad (3)$$
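The aggregation amounts to a per-class average of the classification vectors inside the block. A minimal sketch, with a hypothetical function name and toy vectors:

```python
def aggregate(vectors):
    """Average the classification vectors T(z_k) of the n regions in a block Z."""
    n = len(vectors)
    classes = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(classes)]

# Two regions, two SSR classes: the first class covers most of the block.
T_Z = aggregate([[1.0, 0.0], [0.5, 0.5]])
```

Because each input vector sums to one, the aggregated vector does too, so it can be read as a local semantic histogram in which a class that covers more of the block receives a larger share.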

This is illustrated in Figure 1, where a Spatial Aggregation Map (SAM) further tessellates over the RDM with A × B pixels, A ≤ P and B ≤ Q. This form of spatial aggregation does not encode spatial relations explicitly, but the design flexibility of $s_x$, $s_y$ allows us to specify the location and extent of the image areas to be focused on and indexed. We can choose to ignore unimportant areas (e.g. margins) and emphasize certain areas with overlapping tessellation. We can even attach different weights to the areas during similarity matching.

That is, for Query by Example (QBE), the content-based similarity $\lambda$ between a query $q$ and an image $x$ can be computed in terms of the similarity between their corresponding local tessellated blocks as:

$$\lambda(q, x) = \sum_j \omega_j \, \lambda(Z_j, X_j)$$

where $\omega_j$ are weights and $\lambda(Z_j, X_j)$ is the similarity between two image blocks. As an important example, the image block similarity based on the $L_1$ distance measure (city block distance) is defined as:

$$\lambda(Z_j, X_j) = 1 - \frac{1}{2} \sum_i \left| T_i(Z_j) - T_i(X_j) \right|$$

This is equivalent to histogram intersection, except that the bins have a semantic interpretation as SSRs. Indeed, the SAM has a representation scheme similar to local color histograms and hence enjoys similar invariance properties: it is invariant to translation and to rotation about the viewing axis, and changes only slowly under change of angle of view, change of scale, and occlusion.
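The equivalence with histogram intersection can be checked numerically. The sketch below assumes the L1-based block similarity takes the form one minus half the L1 distance, which coincides with histogram intersection whenever both vectors sum to one; the function names, blocks, and weights are hypothetical.

```python
def block_similarity(t_a, t_b):
    """Assumed L1-based similarity: 1 minus half the city block distance."""
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(t_a, t_b))

def histogram_intersection(t_a, t_b):
    """Sum of bin-wise minima of two histograms."""
    return sum(min(a, b) for a, b in zip(t_a, t_b))

def image_similarity(query_blocks, image_blocks, weights):
    """Weighted sum of similarities between corresponding tessellated blocks."""
    return sum(w * block_similarity(z, x)
               for w, z, x in zip(weights, query_blocks, image_blocks))

# Two blocks per image, two SSR classes, equal block weights.
query = [[0.5, 0.5], [1.0, 0.0]]
image = [[1.0, 0.0], [1.0, 0.0]]
score = image_similarity(query, image, [0.5, 0.5])
```

Unequal weights let a query emphasize some blocks (for example the image center) over others.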

There is a trade-off between content symmetry and spatial specificity. If we want images of similar semantics but different spatial arrangement (e.g. mirror images) to be treated as similar, we can use larger tessellated blocks (i.e. closer to a global histogram). However, in applications where spatial locations are considered differentiating, local histograms provide good sensitivity to spatial specificity.

The averaging in Equation (3) will not dilute $T_i(Z)$ into a flat histogram. As an illustration, Table 1 shows the values $T_i(Z) > 0.1$ of SSRs (related to consumer images, see Semantic Consumer Image Indexing) for the 3 tessellated blocks (outlined in red) in Figure 2. We observe that the dominant $T_i(Z)$ shown capture the essence of the content in each block, with small values distributed over the other bins.

Note that we have presented the features, distance measures, kernel functions, and window sizes of SSR-based indexing in concrete forms to facilitate understanding. The SSR-based indexing scheme is indeed generic and flexible to adapt to application domains.

The SSR-based representation and indexing scheme supports a visual concept hierarchy. For instance, a two-level IS-A hierarchy has been designed and implemented for consumer images (see Semantic Consumer Image Indexing). The learning and detection of SSR classes are based on the more specific SSR classes $S_i$, such as People:Face, Sky:Clear, and Building:City, and the detection value $A_k$ of a more general concept $C_k$ (e.g. People, Sky, Building) within an image region $Z$ can be derived from the detection values $T_i(Z)$ of those SSR classes $S_i$ that are subclasses of $C_k$. Since the subclasses $S_i$ under $C_k$ are assumed to be disjoint, $A_k(Z)$ can be computed as:

$$A_k(Z) = \sum_{S_i \text{ is-a } C_k} T_i(Z)$$
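Because the subclasses are disjoint, the natural reading is that the detection value of a general concept is the sum over its subclasses. A minimal sketch under that reading follows; People:Face, Sky:Clear, and Building:City come from the text, while People:Crowd, the numeric values, and the dictionary layout are hypothetical.

```python
def general_concept_values(t, hierarchy):
    """Sum the detection values T_i(Z) of the subclasses of each general concept."""
    return {concept: sum(t[name] for name in subclasses)
            for concept, subclasses in hierarchy.items()}

# Hypothetical aggregated detection values for one image block Z.
t_block = {
    "People:Face": 0.5,
    "People:Crowd": 0.25,
    "Sky:Clear": 0.125,
    "Building:City": 0.125,
}
hierarchy = {
    "People": ["People:Face", "People:Crowd"],
    "Sky": ["Sky:Clear"],
    "Building": ["Building:City"],
}
A = general_concept_values(t_block, hierarchy)
```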

On the other hand, a complex visual object can be represented in terms of its parts, i.e. a Part-Whole hierarchy. For instance, a human figure can be represented and detected by the presence of a face and a body. Indeed, interesting approaches to recognizing objects by their components have been proposed and applied to people detection based on an adaptive combination of classifiers. This approach is especially useful when a 3D object has no consistent shape representation in a 2D image. Detecting multiple parts of a complex object can help to enhance detection accuracy, although not every part of an object is a good candidate for detection (e.g. besides the wheels, the other parts of a car may not possess consistent color, texture, or shape features for reliable detection).

Similar to the detection in the IS-A hierarchy, the detection value $B_k$ of a multi-part object $C_k$ within an image region $Z$ can be inferred from the detection values $T_i(Z)$ of those SSR classes $S_i$ that correspond to the parts of $C_k$. Since the parts $S_i$ of $C_k$ can co-occur and they occupy spatial areas, $B_k(Z)$ can be derived as:

This article has focused on local semantic regions learned and extracted from images. The local semantic regions illustrated are based on consumer images (see also the article on Semantic Consumer Image Indexing). For another illustration on medical images, please refer to the article on Semantic Medical Image Indexing.

The supervised learning approach described here allows the design of visual semantics with statistical learning. To alleviate the load of labeling image regions as training samples, an alternative approach based on semi-supervised learning to discover local image semantics has been explored.

When the semantics is associated with the entire image, image categorization is another approach to bridging the semantic gap and has received more attention lately. For example, a progressive approach to classifying vacation photographs based on low-level features such as color and edge directions has been attempted. For semantic image indexing related to class information, please see the article on Semantic Class-Based Image Indexing. Both local and global image semantics can also be combined in image matching to achieve better retrieval performance (see the article on Combining Intra-Image and Inter-Class Semantics for Image Matching).

