Video Automatic Annotation - Shot and Scene detection, Low level content annotation, High level content annotation


High level content annotation aims to produce semantic annotation by combining low level features, appropriate pattern recognition techniques, and usually domain knowledge. Domain knowledge is required to reduce the semantic gap between the observable features of the multimedia material and the interpretation that a user may give it. Narrowing the application domain is the most typical approach. A good recent and comprehensive review of multimodal video annotation is provided in. High level content annotation is typically concerned with the detection of settings, people, text, and highlights or meaningful events. Settings are usually classified as indoor or outdoor, although finer classifications can also be obtained. Several visual features may be employed to recognize the setting of a video scene: color histograms, color coherence vectors, DCT coefficients, edge direction histograms, edge coherence vectors, and texture. Motion features are not used since settings are usually static. Classification of settings is typically performed on visual information extracted from keyframes. Special sounds or cheers can also be used to reinforce classification.
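As a rough illustration of keyframe-based setting classification, the sketch below builds a coarse color histogram and assigns the label of the most similar prototype. The 4-bins-per-channel quantization, the histogram intersection measure, and the names `color_histogram` and `classify_setting` are assumptions made for this example; the text does not prescribe a particular classifier.

```python
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def color_histogram(pixels: Sequence[Pixel], bins: int = 4) -> List[float]:
    """Normalized joint RGB histogram of a keyframe given as a flat pixel list."""
    step = 256 / bins
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Quantize each channel, clamping so that 255 falls in the last bin.
        rq = min(int(r / step), bins - 1)
        gq = min(int(g / step), bins - 1)
        bq = min(int(b / step), bins - 1)
        hist[(rq * bins + gq) * bins + bq] += 1.0
    total = float(len(pixels)) or 1.0
    return [h / total for h in hist]


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity in [0, 1] for two normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def classify_setting(keyframe: Sequence[Pixel],
                     prototypes: Dict[str, List[float]]) -> str:
    """Return the label whose prototype histogram best matches the keyframe."""
    hist = color_histogram(keyframe)
    return max(prototypes,
               key=lambda label: histogram_intersection(hist, prototypes[label]))
```

In practice the prototypes would be learned from labeled keyframes, and the other features listed above (edge direction histograms, texture, and so on) would be concatenated into the feature vector.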

Most of the approaches to people detection in videos deal with face detection. In general, problems arise from the large variety in lighting conditions, orientation, scale and location. Most of the approaches reduce the problem of face detection to that of single frame processing. In this case a full view of the face must be included in the frame (see Figure 3). An effective technique, recently integrated in the OpenCV library, is the Viola and Jones detector. The algorithm relies on a number of simple classifiers, each devoted to signaling the presence of a particular feature of the object to be detected. In the case of faces, this could be the alignment of the two eyes or the symmetry in frontal views. A large number of these simple features are initially selected. Then, a complex classifier is constructed by selecting a small number of important features using AdaBoost. The method also combines increasingly complex classifiers in a “cascade”, which allows background regions of the image to be discarded quickly while more computation is spent on promising object-like regions. A comprehensive survey of face detection and recognition techniques can be found in and.
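The cascade idea can be sketched without any library. The example below computes an integral image, evaluates one two-rectangle Haar-like feature in constant time, and rejects a window at the first stage whose threshold is not met. The feature layout and the thresholds are hypothetical; in the real detector the features and thresholds are selected by AdaBoost training, which is not shown here.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column at the top/left."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y), using 4 lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Haar-like feature: sum of the lower half minus sum of the upper half."""
    half = h // 2
    return rect_sum(ii, x, y + half, w, half) - rect_sum(ii, x, y, w, half)


def cascade_accepts(ii, window, stages):
    """Run (feature, threshold) stages in order; reject at the first failure."""
    x, y, w, h = window
    for feature, threshold in stages:
        if feature(ii, x, y, w, h) < threshold:
            return False  # background-like window discarded early
    return True
```

Because most windows in an image are background, placing cheap stages first means the expensive stages run on only a small fraction of windows.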

Textual information included in images or videos often carries information that is not available in other information channels, such as the name and role of an interviewed person in news videos. Video text analysis usually requires three processing steps: detection (position in time and space), extraction and processing of the text image (to separate it from the background), and finally text recognition. Text may be scene text (e.g. a billboard in the background), superimposed text or closed captions. The first type of text is the most difficult to detect and process since it may appear slanted, tilted or partially occluded; nevertheless in some cases it may be treated as superimposed captions (e.g. in commercials). The second type of text has received a lot of attention, and several approaches to the first two processing steps have been proposed. Typically, solutions exploit the fact that text appears for several frames in the same position (see Figure 4). Some approaches use image segmentation and then connected components analysis to select regions as character candidates; constraints on text appearance and size are used.
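The temporal persistence cue mentioned above can be sketched as follows. The example assumes that each frame has already been reduced to a binary mask of candidate text pixels (for instance by an edge or contrast test, not shown), and keeps only the pixels that stay on for a minimum number of consecutive frames. The function name and the `min_frames` parameter are illustrative.

```python
def persistent_mask(frame_masks, min_frames):
    """Keep pixels that are set in at least `min_frames` consecutive frames.

    frame_masks: list of 2D lists (rows of 0/1 values), all the same size.
    Returns a 2D list of booleans.
    """
    h, w = len(frame_masks[0]), len(frame_masks[0][0])
    run = [[0] * w for _ in range(h)]   # current consecutive run per pixel
    best = [[0] * w for _ in range(h)]  # longest run seen so far per pixel
    for mask in frame_masks:
        for y in range(h):
            for x in range(w):
                run[y][x] = run[y][x] + 1 if mask[y][x] else 0
                if run[y][x] > best[y][x]:
                    best[y][x] = run[y][x]
    return [[best[y][x] >= min_frames for x in range(w)] for y in range(h)]
```

The surviving pixels would then be grouped, for example by connected components analysis, and filtered by size and appearance constraints as described above.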

Other approaches use texture as a distinctive feature. A review of these approaches is provided in. Another group of algorithms relies on the fact that superimposed captions usually have a high contrast with respect to the background, and therefore search for regions composed of sharp borders or groups of corners. The third type of text does not require the extraction steps, although it still requires some processing, since it may suffer from the typical phenomena that characterize spoken language, such as hesitation, ellipses, etc. An example of the use of closed captions for video abstraction was presented in. Recently closed captions have been selected as one of the features used by Google’s video search engine.

Highlight detection and recognition is dependent on the video subject domain. In the following we provide an overview of some of the most interesting solutions for highlight annotation techniques, distinguished by the domain which they have been designed for.

News video annotation has been thoroughly analyzed by many researchers because of the well defined structure of news video, which alternates anchorman shots and report shots. After video segmentation, shots are classified into one of these two classes. Approaches based on template matching may perform the classification using the spatial structure of anchorman shots, calculating the mean and variance of histogram and pixel values of the areas that should belong to the anchorman or the news logo. Other approaches, based on syntactic and structural matching, instead use the structure of the video, identifying repeated shots that have strong similarity and low motion. Finally, probabilistic methods may use HMMs trained with several clues, including feature vectors based on difference images, average frame color and audio signal. An example of this approach is presented in. Semantic annotation of the news shots is performed through video OCR or speech recognition, as in the Informedia Project led by Carnegie Mellon.
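The structural approach (repeated shots with strong similarity and low motion) can be illustrated with a small sketch. It assumes that each shot is already summarized by a normalized histogram and a motion score in [0, 1]; the thresholds and the `find_anchor_shots` name are hypothetical choices for the example.

```python
def intersection(h1, h2):
    """Histogram intersection of two normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def find_anchor_shots(shots, sim_threshold=0.8, motion_threshold=0.2,
                      min_repeats=2):
    """Return indices of low-motion shots that recur elsewhere in the video.

    shots: list of dicts with keys 'hist' (normalized histogram)
           and 'motion' (0..1 motion activity score).
    """
    anchors = []
    for i, shot in enumerate(shots):
        if shot["motion"] > motion_threshold:
            continue  # anchorman shots are assumed to be nearly static
        repeats = sum(
            1
            for j, other in enumerate(shots)
            if j != i
            and other["motion"] <= motion_threshold
            and intersection(shot["hist"], other["hist"]) >= sim_threshold
        )
        if repeats >= min_repeats:
            anchors.append(i)
    return anchors
```

Shots not returned by this function would be treated as report shots under the two-class scheme described above.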

Due to their huge commercial appeal, sports videos represent an important application domain for automatic video annotation. It is possible to identify sports videos based on the detection of slow motion replays, large areas of superimposed captions, and specific camera motion. Furthermore it is possible to distinguish which type of sport is being shown by analyzing features related to the playfield, like ground color and lines. In shots of sports video are classified into three classes, according to the most common scenes that are played, namely playfield, players’ close-ups and crowd. A feature vector composed of edges, segments and color information is employed. Figure 5 shows the features used to distinguish the three classes. Once playfield shots have been classified it is possible to perform sport classification according to the special characteristics of the playfield. Neural networks have been employed to perform the classification.

Recognition of specific highlights has been studied for different sports such as soccer, tennis, basketball, volleyball, baseball and American football. Usually these methods exploit low and mid level audio and visual cues, such as the detection of the referee’s whistle, excited speech, crowd cheering, color and edge related features, playfield zone identification, players and ball tracking, motion indexes, etc., and relate them to a-priori domain knowledge of the sport or to knowledge of production rules. In the first case sport rules and the spatio-temporal evolution of typical actions are used. In the second case special production rules employed by directors (like the use of slow motion replays) are exploited. An example of the first approach has been presented in where each highlight is modeled with a Finite State Machine: key events, defined in terms of estimated visual cues (camera motion, playfield zone framed, and players’ position), determine the transition from one state to the following. Highlight models are checked against the current observations, using a model checking algorithm.
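A Finite State Machine model of a highlight can be sketched as a transition table driven by a stream of cue events. The states, the event names and the policy of ignoring unrelated events are assumptions made for this illustration; they are not taken from the cited work, and no model checking algorithm is implemented here.

```python
class HighlightFSM:
    """Tiny finite state machine that recognizes one highlight model."""

    def __init__(self, transitions, start, accept):
        self.transitions = transitions  # {(state, event): next_state}
        self.start = start
        self.accept = accept

    def matches(self, events):
        """Return True if the event stream drives the machine to `accept`."""
        state = self.start
        for event in events:
            # Assumption: events with no transition leave the state unchanged.
            state = self.transitions.get((state, event), state)
            if state == self.accept:
                return True
        return False


# Hypothetical model of a "shot on goal", defined over visual cues.
shot_on_goal = HighlightFSM(
    transitions={
        ("idle", "fast_pan_toward_goal"): "attacking",
        ("attacking", "goal_zone_framed"): "near_goal",
        ("near_goal", "players_in_goal_box"): "highlight",
    },
    start="idle",
    accept="highlight",
)
```

For example, `shot_on_goal.matches(["fast_pan_toward_goal", "goal_zone_framed", "players_in_goal_box"])` returns `True`, while the same events in a different order do not reach the accepting state.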

An example of the second approach has been presented in where highlight detection in soccer video is performed using both shot sequence analysis and shot visual cues. It is assumed that the presence of highlights can be inferred from the occurrence of one or several slow motion shots and from the presence of shots where the referee and/or the goal box is framed.
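The production-rule assumption can be expressed as a simple rule over a shot sequence. In the sketch below a slow motion shot is accepted as evidence of a highlight only if a shot framing the referee or the goal box occurs within a few shots before it. The window size, the dictionary keys and the function name are hypothetical; the cited work may combine these cues differently.

```python
def detect_highlight_replays(shots, window=3):
    """Return indices of slow motion shots supported by nearby cue shots.

    shots: list of dicts with boolean keys
           'slow_motion', 'goal_box' and 'referee'.
    window: how many preceding shots are inspected for supporting cues.
    """
    confirmed = []
    for i, shot in enumerate(shots):
        if not shot["slow_motion"]:
            continue
        context = shots[max(0, i - window):i]
        if any(s["goal_box"] or s["referee"] for s in context):
            confirmed.append(i)
    return confirmed
```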

Feature films are even less structured than sports videos, thus highlight detection is much more general. Movie genre classification, based on four visual features (shot length, color variance, motion, lighting key) extracted from movie previews, has been presented in. Dialogue scenes may be detected using audio analysis, face detection and localization. Alternatively, similar and temporally close shots are analyzed, exploiting the fact that in many movies dialogues are filmed following the shot-reverse-shot technique. Similarity may be evaluated using only visual features or by also adding audio features, like the classification of audio segments into silence, speech, music and miscellaneous sound. The detection of patterns allows classifying the scene as dialogue, action or other story units. Other classifications for movie scenes are those of “violent” and “sex” scenes; this is usually done taking into account both visual and audio features. Violent scenes can be detected by checking for the presence of blood colored regions, a high shot change rate and high motion activity. An abrupt change in audio energy can also be used as an additional feature.
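The shot-reverse-shot cue can be illustrated on a sequence of shot labels, where similar shots have already been grouped under the same label by some similarity measure (not shown). The sketch looks for runs in which two different labels strictly alternate; the `min_length` of four shots is an assumed value.

```python
def find_dialogues(shot_labels, min_length=4):
    """Return (start, end) ranges, end exclusive, of A-B-A-B style runs."""
    ranges = []
    i, n = 0, len(shot_labels)
    while i < n - 1:
        j = i + 2
        if shot_labels[i] != shot_labels[i + 1]:
            # Extend the run while each shot matches the one two steps back.
            while j < n and shot_labels[j] == shot_labels[j - 2]:
                j += 1
            if j - i >= min_length:
                ranges.append((i, j))
                i = j
                continue
        i += 1
    return ranges
```

A detected range is only a candidate dialogue; audio classification (speech versus music or silence) or face detection, as mentioned above, could be used to confirm it.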

 
