Physical Video Property Based Methods
Most existing video summarization systems start with key frame extraction. In this technique, certain properties of frames are used to identify some frames as key frames. For instance, one can treat frames with a large amount of motion or abrupt color changes as key frames (while avoiding frames with camera motion and special effects). The detected key frames can either be inserted into a summary as they are or used to segment the video. For example, video segmentation algorithms may be used to “split” the video into homogeneous segments at key frames. One can then construct a summary by selecting a certain number of frames from each of these segments and concatenating them.
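The key frame idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any particular system's method: frames are assumed to be flat lists of 0-255 gray levels, "abrupt color change" is approximated by an L1 distance between coarse intensity histograms, and the names `gray_histogram`, `detect_key_frames`, `segment_at_key_frames` and the `threshold` value are hypothetical.

```python
def gray_histogram(frame, bins=8):
    """Normalized intensity histogram of a frame (flat list of 0-255 values)."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(frame)
    return [count / total for count in counts]


def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms (ranges from 0 to 2)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_key_frames(frames, threshold=0.5):
    """Indices of frames whose histogram differs sharply from the previous one."""
    if not frames:
        return []
    key_frames = [0]  # assumption: the first frame always opens a segment
    previous = gray_histogram(frames[0])
    for index in range(1, len(frames)):
        current = gray_histogram(frames[index])
        if histogram_distance(previous, current) > threshold:
            key_frames.append(index)
        previous = current
    return key_frames


def segment_at_key_frames(num_frames, key_frames):
    """Split the video into half-open (start, end) segments at the key frames."""
    bounds = key_frames + [num_frames]
    return list(zip(bounds[:-1], bounds[1:]))
```

A summary could then be assembled by taking a fixed number of frames from each returned segment. A real system would use color histograms and motion estimates rather than this gray-level stand-in.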
The MoCA system composes film previews by picking special events, such as zooming of actors, explosions, shots, etc. In other words, image processing algorithms are used to detect when selected objects or events occur within a video, and some of the frames in which these events occur end up in the summary. The authors propose an approach to the segmentation of video objects based on motion cues. Motion analysis is performed by estimating local spatio-temporal orientation using three-dimensional structure tensors. These estimates are integrated into an active contour model, in which an evolving curve is stopped when it reaches the spatial boundary associated with a moving object. Each segmented video object is then classified using the contours of its occurrences in successive video frames. The classification is performed by matching curvature features of the video object contour against a database containing preprocessed views of prototypical objects. Object recognition can be performed at different levels of abstraction.
Yahiaoui, Merialdo et al propose an automatic video summarization method in which the most important content in a video is defined and identified by means of similarities and differences between video segments. For example, consider a TV series. Different videos associated with the series will have a common part (e.g. the opening sequence) as well as various differences. Their algorithms identify the common themes as well as the differences. Common parts are identified by extracting characteristic vectors from the sequence of frames: in particular, the set of analyzed features is a combination of color histograms applied to different portions of a frame. These vectors are used by a clustering procedure that produces classes of video frames with similar visual content. The frequency with which frames from each video occur within the classes is used to compute the importance of the various classes. Once video frames have been clustered, the video can be described as a set of classes of frames. A global summary is constructed from representative images of video content selected from the most important classes. They also suggest a new criterion for evaluating the quality of the summaries, based on the maximization of an objective function.
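The cluster-then-rank idea can be illustrated with a small Python sketch. It is not the authors' procedure: greedy leader clustering stands in for their unspecified clustering step, class importance is assumed to be simply the number of frames in a class, and `leader_cluster`, `summarize`, `radius` and `num_classes` are hypothetical names. Each frame is assumed to be already reduced to a characteristic vector such as a concatenated color histogram.

```python
def l1_distance(u, v):
    """L1 distance between two equal-length feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def leader_cluster(vectors, radius):
    """Greedy leader clustering (a stand-in for the clustering step).

    Each vector joins the first class whose leader lies within `radius`;
    otherwise it becomes the leader of a new class.
    Returns a list of classes, each a list of frame indices.
    """
    clusters = []  # pairs of (leader_vector, member_indices)
    for index, vector in enumerate(vectors):
        for leader, members in clusters:
            if l1_distance(leader, vector) <= radius:
                members.append(index)
                break
        else:
            clusters.append((vector, [index]))
    return [members for _, members in clusters]


def summarize(vectors, radius, num_classes):
    """Pick one representative frame from each of the largest classes."""
    classes = leader_cluster(vectors, radius)
    # Assumption: importance of a class = number of frames it contains.
    ranked = sorted(classes, key=len, reverse=True)
    # Assumption: the earliest frame of a class represents it.
    return sorted(members[0] for members in ranked[:num_classes])
```

For a multi-episode setting, the importance score would more plausibly count how many different videos contribute frames to a class, so that shared material such as an opening sequence stands out.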
Shao et al propose an approach to automatically summarize music videos based on an analysis of both the video and the music tracks. The audio track is separated from the visual track and analyzed to compute linear prediction coefficients, zero crossing rates, and Mel-Frequency Cepstral Coefficients (MFCCs). Based on these features, and using an adaptive clustering method, they group the music frames and generate a structure describing the musical content of the video. This structure is crucial for the generation of summaries, which are computed from the detected structure together with domain-specific music knowledge. After summarizing the musical content, they turn the raw video sequences into a structured data set in which the boundaries of all camera shots are identified and visually similar shots are grouped together. Each cluster is then represented by its longest shot. A video summary is generated by collecting the representative shots of all the clusters. The final step is an alignment operation that aims to partially align the image segments in the video summary with the associated music segments. The authors evaluated the quality of the summaries through a subjective user study and compared the results with those obtained by analyzing the audio track only. The subjects enrolled in the experiments rated the conciseness and coherence of the summaries on a 1 to 5 scale. Conciseness pertains to the terseness of the music video summary and how well the summary captures the essence of the music video. Coherence instead pertains to the consistency and natural drift of the segments in the music video summary. This method is primarily applicable to music videos rather than to other types of videos.
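Two of the steps described above are simple enough to sketch in Python: the zero crossing rate of an audio frame, and the choice of the longest shot as the representative of each shot cluster. The function names and data layout (shots as `(start, end)` frame ranges with a parallel list of cluster labels) are assumptions made for illustration; the adaptive clustering and the audio-video alignment are not shown.

```python
def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def representative_shots(shots, cluster_ids):
    """Longest shot of each cluster, returned in temporal order.

    shots:       list of (start, end) frame ranges, end exclusive
    cluster_ids: cluster label of each shot, same length as `shots`
    """
    best = {}
    for shot, cluster_id in zip(shots, cluster_ids):
        length = shot[1] - shot[0]
        current = best.get(cluster_id)
        if current is None or length > current[1] - current[0]:
            best[cluster_id] = shot
    return sorted(best.values())
```

Concatenating the returned shots gives the unaligned video summary; when two shots in a cluster tie for length, this sketch keeps the earlier one.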
DeMenthon et al represent a changing vector of frame features (such as overall macroblock luminance) as a multi-dimensional curve and apply a curve simplification algorithm to select key frames. In particular, they extend the classic binary curve splitting algorithm, which recursively splits a curve into curve segments until these segments can be replaced by line segments. A curve segment can be replaced when its distance from the corresponding line segment is small. They show how to adapt the classic algorithm for splitting a curve of dimension N into curve segments of any dimension between 1 and N. The frames at the edges of the segments are used as key frames at different levels of detail. While this approach works well for key frame detection, it does not consider the fact that certain events have higher priorities than others, and that continuity and repetition are important.
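The classic binary curve splitting step can be sketched as follows, in the style of the Ramer-Douglas-Peucker algorithm generalized to N-dimensional points. This is only the classic recursion, not the authors' extension to segments of intermediate dimension; `split_curve` and `tolerance` are hypothetical names, and each frame is assumed to be given as a tuple of feature values.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b (any dimension)."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(c * c for c in ab)
    if denom == 0:
        t = 0.0  # degenerate segment: a and b coincide
    else:
        t = max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)


def split_curve(points, tolerance, first=0, last=None):
    """Recursively split the curve; return sorted indices of the key frames.

    A stretch first..last is kept as one line segment when every point in
    between lies within `tolerance` of it; otherwise it is split at the
    farthest point and both halves are processed the same way.
    """
    if last is None:
        last = len(points) - 1
    if last <= first:
        return [first]
    worst_index, worst_distance = None, tolerance
    for i in range(first + 1, last):
        d = point_segment_distance(points[i], points[first], points[last])
        if d > worst_distance:
            worst_index, worst_distance = i, d
    if worst_index is None:
        return [first, last]
    left = split_curve(points, tolerance, first, worst_index)
    right = split_curve(points, tolerance, worst_index, last)
    return left + right[1:]  # drop the duplicated split point
```

Running the function with decreasing values of `tolerance` yields progressively more key frames, which corresponds to the different levels of detail mentioned above.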
Ju et al propose another key frame approach that chooses frames based on motion and gesture estimation. They focus on videos involving slide presentations; for example, a video showing a lecture in a computer science department would fall into this category. Assuming that a camera is focused on the speaker’s slides, they estimate the global image motion between every two consecutive frames using a robust regression method. The extracted motion information is used to decide whether a sequence of consecutive frames shows the same slide. The detected frame sequences are processed to extract key frames that represent the slides shown during the presentation. They identify gestures in the video by computing a pixel difference between the key frames and the corresponding frames in the “stabilized” image sequence. They recognize some of these gestures (e.g. pointing towards the slides) by using a deformable contour model and analyzing its shape and motion over time. With a little additional inference, they can also (sometimes) zero in on the part of the slide that the speaker is referring to.
Zhou et al attempt to extract and cluster features within a video so as to classify video content semantically. They use an interactive decision-tree learning method to define a set of if-then rules that can be easily applied to a set of low-level feature matching functions. In particular, the set of low-level features that they can automatically extract includes motion, color and edge related features. Sample video clips from different semantic categories are used to train the classification system by means of the selected low-level features. The set of rules in the decision tree is defined as a combination of appropriate features and relative thresholds that are automatically determined in the training process. They then apply their rule-based classification system to basketball videos and report on the results.
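To make the notion of a learned if-then rule concrete, the Python sketch below trains a single-threshold rule (a decision stump, i.e. a one-level decision tree) from labeled clips. It is a simplification, not the interactive decision-tree method described above; the feature names in the usage note and the functions `learn_stump` and `classify` are hypothetical.

```python
def learn_stump(samples, labels):
    """Learn one rule of the form "if feature > threshold then label".

    samples: list of dicts mapping feature name -> numeric value
    labels:  list of booleans (True = clip belongs to the category)
    Returns (feature, threshold, label_if_above) with the fewest
    training errors, or None if no feature takes two distinct values.
    """
    best = None
    best_errors = len(samples) + 1
    for feature in samples[0]:
        values = sorted({sample[feature] for sample in samples})
        # Candidate thresholds are midpoints between adjacent observed values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            for label_if_above in (True, False):
                errors = sum(
                    ((sample[feature] > threshold) == label_if_above) != label
                    for sample, label in zip(samples, labels)
                )
                if errors < best_errors:
                    best_errors = errors
                    best = (feature, threshold, label_if_above)
    return best


def classify(sample, rule):
    """Apply a learned rule to a new clip's feature values."""
    feature, threshold, label_if_above = rule
    return (sample[feature] > threshold) == label_if_above
```

For instance, training on clips described by hypothetical `motion` and `edge` scores might produce a rule such as "if motion > 0.5 then fast-break". A full decision tree would apply the same search recursively to the clips on each side of the threshold.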
Ma et al present a generic framework for video summarization based on estimated user attention. They attempt to identify how a user’s attention is captured by motion, objects, audio and language. For each frame in a video, an attention value is computed, and the result for a given video is an attention curve that indicates which frame or sequence of frames is more likely to attract the user’s attention. The number of key frames in a video shot is then determined by the number of wave crests on the attention curve. Their summary consists of the peaks of the attention curve thus created.
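Selecting key frames at the crests of an attention curve can be sketched in Python as below. The weighted-sum fusion of per-channel attention values is an assumption made for illustration, as are the names `fuse_attention`, `attention_peaks` and `min_height`; the sketch treats only strict local maxima as crests and ignores flat plateaus.

```python
def fuse_attention(channels, weights):
    """Combine per-channel attention values into one curve.

    channels: dict mapping channel name -> list of per-frame values
    weights:  dict mapping channel name -> weight
    Assumption: the channels are fused by a simple weighted sum.
    """
    length = len(next(iter(channels.values())))
    return [
        sum(weights[name] * values[i] for name, values in channels.items())
        for i in range(length)
    ]


def attention_peaks(curve, min_height=0.0):
    """Indices of strict local maxima (wave crests) of the attention curve."""
    return [
        i
        for i in range(1, len(curve) - 1)
        if curve[i] > curve[i - 1]
        and curve[i] > curve[i + 1]
        and curve[i] >= min_height
    ]
```

In practice the curve would usually be smoothed before peak detection so that small fluctuations do not each produce a key frame; `min_height` is a crude substitute for that step.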