Semantic Video Property Class
This class of video summarization algorithms performs an elementary analysis of the video’s semantic content and then uses this data, together with information about user content preferences, to determine which frames go into the summary and which do not.
The Video Skimming System from Carnegie Mellon finds key frames in documentaries and news bulletins by detecting important words in the accompanying audio. The authors propose a method to extract significant audio and video information and create a “skim” video that represents a very short synopsis of the original. The goal of this work is to show the utility of integrating language and image understanding techniques for video skimming by extracting significant information, such as specific objects, audio keywords and relevant video structure. The resulting skim video is much shorter, with compaction as high as 20:1, and yet retains the essential content of the original segment.
In contrast to the above work, the CPR system of Fayzullin et al. provides a robust framework within which an application developer can specify functions that measure the continuity c(S) and the degree of repetition r(S) in a video summary S. The developer can also specify functions to assess the priority p(f) of each video frame f (or, more generally, if the video is broken up into a sequence of blocks, the priority p(b) of each block b). The priority p(S) of the summary can then be set to the sum of the priorities of all blocks in S. The developer can assign a weight to each of the three functions and combine them into a single objective function. Given the maximal desired size k of a summary, the system tries to find a set S of video blocks such that the size of S is less than or equal to k blocks and the objective function is maximized.
The authors show that the problem of finding such an “optimal” summary is NP-complete and proceed to provide four algorithms. The first is an exact algorithm that takes exponential time but finds an S that does in fact maximize the objective function’s value. The other three algorithms may not return the best S, but they run in polynomial time and find summaries that are often as good as the ones found by the exact algorithm.
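The weighted objective and the exact search can be illustrated with a minimal sketch. It is not the CPR implementation: the names `objective` and `exact_summary`, the weights `alpha`, `beta` and `gamma`, and the sign convention (rewarding continuity and priority, penalizing repetition) are assumptions made for illustration only.

```python
from itertools import combinations


def objective(summary, continuity, repetition, priority,
              alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted score of a candidate summary.

    Assumed sign convention: continuity and priority are rewarded,
    repetition is penalized. The weights are hypothetical parameters.
    """
    total_priority = sum(priority(b) for b in summary)  # p(S) = sum of p(b)
    return (alpha * continuity(summary)
            + beta * total_priority
            - gamma * repetition(summary))


def exact_summary(blocks, k, score):
    """Score every subset of at most k blocks and keep the best one.

    The number of candidates grows exponentially with the number of
    blocks, so this is only practical for very small inputs.
    """
    best, best_score = (), float("-inf")
    for size in range(1, k + 1):
        # combinations() preserves input order, so blocks stay in temporal order
        for candidate in combinations(blocks, size):
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
    return list(best), best_score
```

A caller would bind its own continuity, repetition and priority functions into `score`, for example with a lambda, and pass the list of blocks in temporal order.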
Some examples of the continuity, repetition, and priority functions provided by Fayzullin et al. include:
- Continuity can be measured by summing the number of common objects shared by adjacent summary blocks and dividing by the total number of objects in those adjacent blocks. Thus, the more objects adjacent summary blocks share, the more continuous the summary is. To measure continuity (or, rather, discontinuity) one can also sum the color histogram differences between adjacent blocks. The lower this sum, the more continuous the summary is.
- Repetition can be computed as the ratio of the total number of objects occurring in the summary to the number of distinct objects. Alternatively, one can consider repetition to be inversely proportional to the standard deviation of the color histogram in summary blocks. The fewer color changes occur in a summary, the more repetitive the summary is.
- Priority of a block can be computed as the sum of user defined priorities for objects occurring in the block or based on a set of rules that describe desired combinations of objects and events.
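The object-based variants of these three measures can be sketched as follows. This is one possible reading rather than the authors' code: each block is assumed to be a Python set of object labels, the "total number of objects in adjacent blocks" is interpreted as the size of the union of the two blocks, and the function names are hypothetical.

```python
def continuity(summary):
    """Shared objects between adjacent blocks over all objects in those pairs."""
    shared = total = 0
    for a, b in zip(summary, summary[1:]):
        shared += len(a & b)   # objects common to both adjacent blocks
        total += len(a | b)    # assumed reading: objects in either block
    return shared / total if total else 0.0


def repetition(summary):
    """Total object occurrences divided by the number of distinct objects."""
    occurrences = sum(len(block) for block in summary)
    distinct = len(set().union(*summary)) if summary else 0
    return occurrences / distinct if distinct else 0.0


def block_priority(block, object_priority):
    """Sum of user-defined priorities of the objects in one block."""
    return sum(object_priority.get(obj, 0.0) for obj in block)


# Toy summary of three blocks from a hypothetical soccer video
summary = [{"ball", "player"}, {"player", "referee"}, {"referee"}]
print(continuity(summary))   # 2 shared objects over 5 objects in adjacent pairs
print(repetition(summary))   # 5 occurrences over 3 distinct objects
print(block_priority(summary[1], {"ball": 2.0, "referee": 1.0}))
```

Objects without a user-defined priority contribute zero here, which is a design choice of the sketch, not something the source specifies.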
In a subsequent paper, Albanese et al. retain the core idea that the CPR criteria are important. However, they use a completely different approach to the problem of finding good summaries fast. Their priority curve algorithm (PriCA for short) completely eliminates the objective function upon which the previous algorithms were based, but captures the same intuitions in a compelling way. The algorithm proceeds through the following steps.
- Block creation. They first split the video into blocks – blocks could either be of equal sizes, or they could be obtained as a result of segmenting the video using any standard video segmentation algorithm. The resulting segments are usually relatively small.
- Priority assignment. Each block is then assigned a priority based on the objects and events occurring in that block. Alternatively, the audio stream and/or accompanying text associated with the video can be used to identify the priority of each block. The priority assignment can be done automatically using object and event detection algorithms or manually, by employing a human annotator; in fact, their video summarization application uses both image processing algorithms and some manual annotation.
- Peak detection. They then plot a graph whose x axis consists of block numbers and whose y axis gives the block priorities. The peaks of this graph represent segments whose priorities are high. The blocks associated with these peaks are identified using a peak identification algorithm that the authors developed. Figure 1 shows such a graph, together with examples of such peaks.
- Block merging. Suppose now that several different blocks are identified as peaks. If two peaks are adjacent to one another, then it is likely that the two adjacent blocks in question jointly refer to a continuous event in the video. They merge adjacent blocks on the intuition that the same or related events occur in these blocks even though the video segmentation algorithm has put them into different segments. This is because standard video segmentation algorithms use image information alone to segment and do not take semantics into account. However, it may be possible to use audio and/or accompanying text streams to identify similar video blocks without relying on visual similarity alone.
- Block elimination. The authors then run a block elimination algorithm which eliminates certain unworthy blocks whose priority is too low for inclusion in the summary. This is done by analyzing the distribution of block priorities, as well as the relative sizes of the blocks involved, rather than by setting an artificial threshold.
- Block resizing. Finally, the authors run a block resizing algorithm that shrinks the remaining blocks so that the final summary consists of these resized blocks adjusted to fit the desired total length.
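The steps above can be sketched as a small pipeline. It is a simplified stand-in, not the published PriCA algorithm: peaks are taken to be plain local maxima, elimination uses a cutoff of one standard deviation below the mean segment priority (the authors analyze the priority distribution and block sizes instead), and resizing is a uniform proportional shrink. All function names are hypothetical.

```python
from statistics import mean, pstdev


def find_peaks(priorities):
    """Indices of local maxima of the priority curve (simple stand-in)."""
    peaks, n = [], len(priorities)
    for i, p in enumerate(priorities):
        left = priorities[i - 1] if i > 0 else float("-inf")
        right = priorities[i + 1] if i < n - 1 else float("-inf")
        if p >= left and p >= right and (p > left or p > right):
            peaks.append(i)
    return peaks


def merge_adjacent(peaks):
    """Merge runs of consecutive peak indices into (start, end) segments."""
    segments = []
    for i in peaks:
        if segments and i == segments[-1][1] + 1:
            segments[-1][1] = i          # extend the current segment
        else:
            segments.append([i, i])      # start a new segment
    return [tuple(s) for s in segments]


def eliminate(segments, priorities):
    """Drop segments whose mean priority falls below an assumed cutoff."""
    if not segments:
        return []
    seg_priority = [mean(priorities[s:e + 1]) for s, e in segments]
    cutoff = mean(seg_priority) - pstdev(seg_priority)  # illustrative rule only
    return [seg for seg, p in zip(segments, seg_priority) if p >= cutoff]


def resize(segments, block_lengths, target_length):
    """Shrink segments proportionally so their total fits target_length."""
    lengths = [sum(block_lengths[s:e + 1]) for s, e in segments]
    total = sum(lengths)
    scale = min(1.0, target_length / total) if total else 1.0
    return [(s, e, length * scale) for (s, e), length in zip(segments, lengths)]


# Ten equal-length blocks (10 seconds each) with made-up priorities
priorities = [1, 5, 5, 1, 0.5, 2, 1, 1, 6, 1]
segments = merge_adjacent(find_peaks(priorities))
kept = eliminate(segments, priorities)
print(resize(kept, [10] * len(priorities), target_length=15))
```

In this toy run the two adjacent high-priority blocks are merged into one segment, the weak isolated peak is dropped, and the remaining segments are halved to fit the 15-second budget.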
Both the CPR and the PriCA systems provide a semi-automatic mechanism for describing and indexing video content. A preprocessing stage automatically detects shot boundaries in video streams and extracts some information about the content of each block. The user can then provide further information about those events and objects that cannot be detected and/or recognized by existing image processing and/or content analysis algorithms. In their implementations, the authors of both systems adapted an existing video segmentation algorithm based on the biological mechanisms of visual attention. More precisely, the shot-change detection method relies on computing, at each time instant, a consistency measure of the fixation sequences generated by an ideal observer looking at the video. The algorithm aims at detecting both abrupt and gradual transitions between shots using a single technique, rather than a set of dedicated methods.
Specialized image processing algorithms can then be applied in order to identify particular classes of events that occur in the blocks. In particular, both the CPR and the PriCA systems have been tested on a collection of soccer match videos, and an existing algorithm has been used to detect a predefined set of events, namely goals, yellow cards and red cards. The algorithm takes into account both visual and audio features in order to identify the selected events. A neural network is used to analyze visual content, while audio features are taken into account through a simple RMS (root mean square) analysis.
Figure 2 shows a screenshot of the PriCA system, which lets users characterize video content through the semi-automatic process described above and summarize a video using any of the four summarization algorithms described in these works.
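As a reminder of what an RMS measure of audio computes, here is a minimal sketch. The windowing scheme and the function names are assumptions for illustration; the source does not describe how the event detector actually uses the RMS values.

```python
import math


def rms(samples):
    """Root mean square of a sequence of audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def windowed_rms(samples, window):
    """RMS energy of consecutive, non-overlapping windows of samples."""
    return [rms(samples[i:i + window]) for i in range(0, len(samples), window)]


# A quiet stretch followed by a loud one yields a rising RMS curve
print(windowed_rms([1, -1, 1, -1, 4, -4, 4, -4], window=4))
```

A sustained rise in windowed RMS is one plausible cue for crowd noise around an event such as a goal, though that interpretation goes beyond what the text states.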