
Video Automatic Annotation - Shot and Scene detection, Low level content annotation, Motion-based frame algorithms

Alberto Del Bimbo and Marco Bertini
Università di Firenze, Firenze, Italy

Definition: Automatic annotation of video refers to the automatic extraction of information about video content, which can serve as the first step for different data access modalities such as browsing, searching, comparison, and categorization.

Advances in digital video technology and the ever increasing availability of computing resources have resulted, in the last few years, in an explosion of digital video data. Moreover, the increased availability of Internet bandwidth has defined new means of video distribution, other than physical media. The major web search engines have already started to provide specific services to index, search and retrieve videos on the Internet.


Improving video accessibility is the true challenge. Access to video data requires that video content be appropriately indexed, but manually annotating or tagging video is a laborious and economically infeasible process. Therefore, an important subject of research has been the study of novel techniques to extract information about video content automatically. This annotation process serves as a first step for different data access modalities such as browsing, searching, comparison and categorization.


Automatic video annotation can be carried out at different levels, from the low syntactic level, where audiovisual features are extracted, up to the high semantic level, where concepts are recognized. Shot detection is the most basic temporal video segmentation task, as it is intrinsically and inextricably linked to the way video is produced; it segments a video into more manageable parts and is very often the first step of other algorithms that operate at both the syntactic and the semantic level. At the syntactic level, video annotation is concerned with the estimation of low- and mid-level features, such as motion descriptors, and with the derivation of compact visualizations of video content, like the extraction of a representative set of frames, either independent or organized in short sequences. Such a visualization can be used as a substitute for the whole video for the purposes of searching, comparison and categorization, and is especially useful for video browsing. At the semantic level, video annotation regards the identification and recognition of meaningful entities represented in the video, like settings, text captions, people and objects, or meaningful highlights and events.


The state of the art and principal contributions in each of these subjects of investigation are discussed in the following sections.


Shot and Scene detection


Shot-change detection is the process of identifying the elementary units of a video and the changes in its scene content, so that alternate representations may be derived for browsing and retrieval, e.g. key-frames or mosaic images, or so that further processing by other algorithms and techniques can extract information used for content-based indexing and classification. A common definition of shot is: “a sequence of frames that was (or appears to be) continuously captured from the same camera”. This is usually the definition adopted when comparing shot detection algorithms. A shot can encompass several types of camera motion (pan, tilt and zoom), but shot-detection algorithms may also react to changes caused by significant camera and object motion, unless global motion compensation is performed. Shot changes may be of different types: hard (cut), gradual (dissolve, wipe and fade) and others (special effects like push, slide, etc.).


A large body of literature has been produced on the subject of shot detection, with proposals of methods working either in the compressed or in the uncompressed domain. Survey papers have reviewed and compared the most effective solutions. Boreczky et al. compared five different algorithms: global histograms, region histograms, global histograms with twin-comparison thresholding, motion-compensated pixel difference and DCT-coefficient difference; the histograms were grayscale histograms. Dailianas et al. compared algorithms based on histogram differencing, moment invariants, pixel-value changes and edge detection (the Zabih algorithm). The histogram-differencing methods studied were: bin-to-bin difference, weighted histogram difference, histogram difference after equalization, intersection of histograms and squared histogram difference (chi-square). Kasturi, Gargi et al. have also recently compared histogram-based algorithms to MPEG-based and block-matching algorithms. The frame-difference measures analyzed were: bin-to-bin difference, chi-square histogram difference, histogram intersection and average color. Three block-matching and six MPEG-based methods were also analyzed, and the effects of different color space representations and MPEG encoders were evaluated.
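
As a concrete illustration of one of the frame-difference measures above, the following sketch flags candidate hard cuts when the chi-square distance between consecutive gray-level histograms exceeds a threshold. It is a minimal example assuming OpenCV and NumPy; the bin count and threshold value are illustrative choices, not values taken from the cited surveys.

    import cv2
    import numpy as np

    def chi_square_distance(h1, h2, eps=1e-10):
        # Chi-square distance between two normalized histograms.
        return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

    def detect_cuts(video_path, bins=64, threshold=0.35):
        # Returns the indices of frames where a hard cut is likely to occur.
        cuts, prev_hist, frame_idx = [], None, 0
        cap = cv2.VideoCapture(video_path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            hist = cv2.calcHist([gray], [0], None, [bins], [0, 256]).ravel()
            hist /= hist.sum() + 1e-10  # normalize to a probability distribution
            if prev_hist is not None and chi_square_distance(prev_hist, hist) > threshold:
                cuts.append(frame_idx)  # a dissimilarity spike suggests a hard cut
            prev_hist, frame_idx = hist, frame_idx + 1
        cap.release()
        return cuts

Gradual transitions such as dissolves and fades spread the histogram change over many frames, which is why methods like twin-comparison thresholding also accumulate differences above a second, lower threshold.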


Another possible partition of the video stream is the “scene”. Traditionally, a scene is a continuous sequence that is temporally and spatially cohesive in the real (or represented) world, but not necessarily in the projection of that world onto the video, and it may in fact be composed of several shots (e.g. a dialogue in a movie). Scene detection is a very subjective task that depends on human cultural conditioning, professional training and even intuition, and it is also strictly connected to the genre of the video and its subject domain. Moreover, since it focuses on real-world actions and on the temporal and spatial configurations of objects and people, it requires the ability to understand the meaning of images. For these reasons the automatic segmentation of a video into scenes ranges from very difficult to intractable, and is usually carried out using low-level visual features or knowledge of well-defined patterns present in the video domain being analyzed. An approach that tries to overcome these difficulties has been proposed, which employs dominant color grouping and tracking to evaluate a shot correlation measure and then performs shot grouping.
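
A simplified sketch of shot-to-scene grouping is given below: consecutive shots whose key-frame color histograms correlate strongly are merged into the same scene. This color-correlation criterion stands in for the dominant-color grouping and tracking mentioned above; the correlation threshold and histogram quantization are illustrative assumptions.

    import cv2

    def shot_histogram(key_frame, bins=(8, 8, 8)):
        # One coarse HSV color histogram per shot, computed from its key-frame.
        hsv = cv2.cvtColor(key_frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins), [0, 180, 0, 256, 0, 256])
        return cv2.normalize(hist, hist).flatten()

    def group_shots_into_scenes(key_frames, min_correlation=0.6):
        # key_frames: one representative frame per detected shot, in temporal order.
        hists = [shot_histogram(f) for f in key_frames]
        scenes, current = [], [0]
        for i in range(1, len(hists)):
            corr = cv2.compareHist(hists[i - 1], hists[i], cv2.HISTCMP_CORREL)
            if corr >= min_correlation:
                current.append(i)       # visually coherent shots stay in the same scene
            else:
                scenes.append(current)  # correlation drops at a scene boundary
                current = [i]
        scenes.append(current)
        return scenes

A real system would also compare non-adjacent shots within a temporal window, so that the alternating shots of a dialogue are merged into a single scene rather than split at every cut.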


Low level content annotation


Low level content annotation of videos requires setting indexes of primitives, usually extracted from the segmented parts of the video stream. These indexes may be distinguished into indexes on visual primitives, like objects and motion; indexes on audio primitives, like silence, music, words, environmental or special sounds; and indexes on the meaning conveyed by primitives, which can be used for video summarization. MPEG-7 provides several descriptors for these indexes of primitives.


Object primitives are usually extracted from key-frames and can be used in retrieval applications for comparison with primitives extracted from a query frame; object segmentation techniques similar to those employed in image analysis can be applied. Motion indexes, on the other hand, are more complex to extract because of the temporal dependency between video frames, which requires studying the motion of pixels across subsequent images of the sequence. Motion is induced by camera operations and/or by objects moving independently in the scene; the former is often referred to as global motion, while the latter is called local motion. Extraction of moving objects also provides the initial step for extracting semantic information from the video, and is useful to identify key-frames and to produce condensed representations based on significant frames or video mosaics.
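
As an example of how a condensed representation can be built, the sketch below selects one key-frame per shot as the frame whose gray-level histogram is closest to the shot average. This is a common heuristic used here purely for illustration, not the specific method of the approaches cited in this article.

    import cv2
    import numpy as np

    def select_key_frame(shot_frames, bins=64):
        # Pick the most "typical" frame of a shot: the one nearest the mean histogram.
        hists = []
        for frame in shot_frames:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            h = cv2.calcHist([gray], [0], None, [bins], [0, 256]).ravel()
            hists.append(h / (h.sum() + 1e-10))
        mean_hist = np.mean(hists, axis=0)
        distances = [np.linalg.norm(h - mean_hist) for h in hists]
        return int(np.argmin(distances))  # index of the selected key-frame within the shot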


Most approaches to motion estimation use direct (gradient-based) techniques. Gradient-based methods exploit the relation between the spatial and temporal variations of the luminance intensity (the brightness-constancy constraint underlying optical flow). This relation can be used to segment images according to the apparent speed of each point, and may be considered an approach to the more general problem of motion segmentation.
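
The sketch below uses the dense Farneback optical flow available in OpenCV as a stand-in for the direct motion-estimation techniques described above: a per-pixel displacement field is computed from the intensity variations between two frames, and thresholding its magnitude gives a crude motion segmentation. The flow parameters and the magnitude threshold are illustrative assumptions.

    import cv2
    import numpy as np

    def dense_motion_mask(prev_bgr, curr_bgr, magnitude_threshold=2.0):
        prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
        curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
        # Dense optical flow: one displacement vector (dx, dy) per pixel.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                            pyr_scale=0.5, levels=3, winsize=15,
                                            iterations=3, poly_n=5, poly_sigma=1.2,
                                            flags=0)
        magnitude = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)
        # Pixels moving faster than the threshold form a rough motion segmentation.
        return magnitude > magnitude_threshold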


An approach commonly used when annotating video acquired from a stationary camera is to segment each image into a set of regions representing the moving objects by using a background differencing algorithm. More recently, Elgammal et al. have proposed background modeling using a generalization of Gaussian mixture models, where the background is modeled as a set of N Gaussian distributions centered on the pixel values of the previous N frames. The average probability of each pixel value under the N Gaussian distributions is then evaluated, expressing the likelihood that the pixel belongs to the background; this approach makes it possible to process video streams with a time-varying background.
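
A compact sketch of this kind of per-pixel background model follows: each pixel's recent history provides N Gaussian kernels, and the averaged kernel response gives the likelihood that the current value belongs to the background. The kernel bandwidth and the probability threshold are illustrative assumptions, and the sketch works on gray-level frames for brevity.

    import numpy as np

    def background_probability(history, frame, sigma=10.0):
        # history: array of shape (N, H, W) with the previous N gray-level frames.
        # frame:   current gray-level frame of shape (H, W).
        diff = history.astype(np.float32) - frame.astype(np.float32)
        kernels = np.exp(-0.5 * (diff / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        return kernels.mean(axis=0)  # average over the N Gaussian kernels

    def foreground_mask(history, frame, sigma=10.0, threshold=1e-3):
        # Low background likelihood -> the pixel is labelled as a moving object.
        return background_probability(history, frame, sigma) < threshold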


Other approaches are based on correspondence techniques. Instead of observing every point, they extract from the image some characteristics, ideally belonging to the objects of interest, that are supposed to remain constant in everything but position. Searching for these characteristics in two consecutive frames makes it possible to extract a map of velocity vectors defined only for the “interesting” points (see Figure 2).
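
The following sketch illustrates this correspondence-based alternative: interest points are detected in two consecutive frames, their descriptors are matched, and a sparse map of displacement (velocity) vectors is read off the matches. ORB features and brute-force matching are used purely as an illustrative choice of "interesting" points.

    import cv2
    import numpy as np

    def sparse_motion_vectors(prev_bgr, curr_bgr, max_features=200):
        orb = cv2.ORB_create(nfeatures=max_features)
        kp1, des1 = orb.detectAndCompute(cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY), None)
        kp2, des2 = orb.detectAndCompute(cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY), None)
        if des1 is None or des2 is None:
            return []
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        vectors = []
        for m in matcher.match(des1, des2):
            p1 = np.array(kp1[m.queryIdx].pt)
            p2 = np.array(kp2[m.trainIdx].pt)
            vectors.append((p1, p2 - p1))  # (position, displacement) for each matched point
        return vectors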


Motion features may be employed as a cue to detect video highlights. Several authors have defined indices related to “excitement”. Usually a multimodal approach that takes into account frequency of shot change, motion activity and sound energy is employed to derive these indices. Maxima of these indices, or maxima of entropy metrics based on these indices, are used to derive highlight time curves.
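
A toy illustration of such an index is sketched below: per-interval shot-change frequency, motion activity and sound energy are normalized, summed and smoothed, and the local maxima of the resulting curve are taken as candidate highlights. The equal weighting, the smoothing window and the minimum level are assumptions made for the sketch.

    import numpy as np

    def excitement_curve(shot_change_rate, motion_activity, sound_energy, smooth=5):
        # Each input holds one value per time interval (e.g. per second).
        def normalize(x):
            x = np.asarray(x, dtype=np.float32)
            return (x - x.min()) / (x.max() - x.min() + 1e-10)
        curve = normalize(shot_change_rate) + normalize(motion_activity) + normalize(sound_energy)
        kernel = np.ones(smooth) / smooth
        return np.convolve(curve, kernel, mode="same")  # temporal smoothing

    def highlight_candidates(curve, min_value=2.0):
        # Local maxima of the excitement curve above a minimum level.
        peaks = (curve[1:-1] > curve[:-2]) & (curve[1:-1] > curve[2:]) & (curve[1:-1] > min_value)
        return np.nonzero(peaks)[0] + 1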


Typical low-level audio analysis consists in classifying audio segments into at least four main classes: silence, music, speech and other sounds, using loudness and spectrum features. Sounds belonging to the fourth class can be further discriminated into more specific sounds, usually depending on the domain, such as cries, gunshots and explosions in movies, or whistles in sports videos.
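
The rule-based sketch below gives a flavor of such a classification at the level of a single short audio frame, using RMS energy as a loudness feature and the zero-crossing rate as a crude spectral feature. The thresholds are illustrative assumptions; a practical system would extract richer spectral features and use a trained classifier instead of fixed rules.

    import numpy as np

    def classify_audio_frame(samples, silence_rms=0.01, speech_zcr=0.1, music_zcr=0.05):
        # samples: mono audio frame with values in [-1, 1].
        samples = np.asarray(samples, dtype=np.float32)
        rms = np.sqrt(np.mean(samples ** 2))                    # loudness
        zcr = np.mean(np.abs(np.diff(np.sign(samples)))) / 2.0  # zero-crossing rate
        if rms < silence_rms:
            return "silence"
        if zcr > speech_zcr:
            return "speech"  # alternation of voiced/unvoiced parts raises the ZCR
        if zcr < music_zcr:
            return "music"   # sustained tonal content keeps the ZCR low and stable
        return "other"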


 
