
Video Databases - Video data models, Video Extraction, Video Query Language, Video Index Structures


C. Cesarano 1, M. Fayzullin 2, A. Picariello 1, and V.S. Subrahmanian 2
1 Dipartimento di Informatica e Sistemistica, Università di Napoli “Federico II”, Napoli, Italy
2 Department of Computer Science, University of Maryland, College Park, MD, USA

Definition: Video database research falls into the following categories: video data models, video extraction, video query language, and video index structures.

The past decade has seen explosive growth in the ability of individuals to create and/or capture digital video, slowly leading to large-scale personal and corporate digital video banks. Over the last 8-10 years, there has been a tremendous amount of work on creating video databases. Video database research falls primarily into the following categories:

  • Video data models. What kind of data about a video should we store?
  • Video extraction. How should this data be automatically extracted from a video?
  • Video query language. How should we query this data?
  • Video index structures. How should we index this data for faster retrieval?

We discuss multiple potential answers to these four important questions.

Video data models

Throughout this paper, we will assume that a video v is divided up into a sequence b1, ..., blen(v) of blocks. The video database administrator can choose what a block is – he could, for instance, choose a block to be a single frame, or to be the set of frames between two consecutive I-frames (in the case of MPEG video), or something else. The number len(v) is called the length of video v. If 1 ≤ l ≤ u ≤ len(v), then we use the expression block sequence to refer to the closed interval [l,u], which denotes the set of all blocks b such that l ≤ b ≤ u. Associated with any block sequence [l,u] is a set of objects. These objects fall into four categories, as shown in Figure 1.

  • Visual Entities of Interest (Visual EOIs for short): An entity of interest is a region of interest in a block sequence (usually, when identifying entities of interest, a single block, i.e. a block sequence of length one, is considered). Visual EOIs can be identified using appropriate image processing algorithms. For example, Figure 1 shows a photograph of a stork and identifies three regions of interest in the picture using active vision techniques. Each of these rectangular regions may have various attributes associated with it, such as an id, a color histogram, a texture map, etc.
  • Visual Activities of Interest (Visual AOIs): An activity of interest is a motion of interest in a video segment. For example, a dance motion is an activity of interest; likewise, a flying bird might give rise to an activity of interest (“flight”). Visual AOIs include dancing, gestures, and many other motions, and numerous techniques are available in the image processing literature to extract them.
  • Textual Entities of Interest : Many videos are annotated with information about the video. For example, news videos often have textual streams associated with the video stream. There are also numerous projects that allow textual annotations of video – such annotations may explicitly annotate a video with objects in a block sequence or may use a text stream from which such objects can be derived. A recent commercial product to do this is IBM’s Alpha Works system where a user can annotate a video while watching it.
  • Textual Activities of Interest : The main difference between Textual EOIs and Textual AOIs is that the latter pertains to activities, while the former pertains to entities. Both are textually marked up.

Any database model to store information about video content must be rich enough to store all these phenomena. Data models to store video information have been proposed based on the relational model of data, the object-oriented data model, and the XML data model.
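To make the model concrete, the following minimal sketch (in Python, with all class and field names chosen purely for illustration) shows one way a block sequence and its associated objects, drawn from the four categories above, might be represented:

```python
# Illustrative sketch of the data model: a block sequence is a closed
# interval [l, u] over a video's blocks, and it carries a set of objects
# belonging to one of the four categories described above.
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List

class Category(Enum):
    VISUAL_ENTITY = "visual_eoi"       # Visual Entities of Interest
    VISUAL_ACTIVITY = "visual_aoi"     # Visual Activities of Interest
    TEXTUAL_ENTITY = "textual_eoi"     # Textual Entities of Interest
    TEXTUAL_ACTIVITY = "textual_aoi"   # Textual Activities of Interest

@dataclass
class VideoObject:
    obj_id: str
    category: Category
    attributes: Dict[str, Any] = field(default_factory=dict)  # e.g. color histogram, texture map

@dataclass
class BlockSequence:
    l: int                                    # 1 <= l <= u <= len(video)
    u: int
    objects: List[VideoObject] = field(default_factory=list)
```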

Video Extraction

A vast variety of image processing algorithms can be used to populate video data with visual and/or textual entities and activities of interest. From the image processing point of view, object and event detection are still considered an interesting challenge, due to significant variations in shape and color and to the high dimensionality of the feature vectors used for processing.

In the past, template matching approaches were used extensively, relying on a set of templates or parameterized curves. However, real objects are not always describable via a set of rigid templates – in such cases, these approaches prove inadequate and difficult to extend, and a significant amount of domain knowledge needs to be supplied a priori. In more recent research, a number of learning-based techniques have been proposed. Despite the diversity of approaches, there is a common underlying theme consisting of the following steps: a) provide a collection of target images containing the object class under consideration, together with negative examples; b) transform the images into (feature) vectors using a certain representation; c) use the extracted features to train a pattern recognition classifier (statistical, neural network, and so on) that separates target from non-target objects.
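The generic pipeline above (collect positive and negative examples, map images to feature vectors, train a classifier) can be sketched as follows; HOG features and a linear SVM are used here only as one common choice, and the image paths are placeholders:

```python
# Minimal sketch of the three-step learning-based pipeline:
# (a) positive/negative example images, (b) image -> feature vector,
# (c) train a classifier separating target from non-target objects.
import numpy as np
from skimage import io, transform
from skimage.feature import hog
from sklearn.svm import LinearSVC

def image_to_features(path, size=(64, 64)):
    """Step (b): transform an image into a fixed-length feature vector."""
    img = io.imread(path, as_gray=True)
    img = transform.resize(img, size)
    return hog(img, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def train_object_classifier(positive_paths, negative_paths):
    """Steps (a) and (c): learn a target vs. non-target classifier."""
    X = [image_to_features(p) for p in positive_paths + negative_paths]
    y = [1] * len(positive_paths) + [0] * len(negative_paths)
    clf = LinearSVC()
    clf.fit(np.asarray(X), np.asarray(y))
    return clf
```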

In the area of event detection, the proposed techniques can be broadly divided into two classes: human action recognition and general motion-based recognition. As in the preceding case, most approaches transform each frame of the image sequence into a certain feature space, and action recognition is then performed on those features, building a 3-D model of the action and/or some measurement-based temporal model. Motion has been widely recognized through the computation of basic flow fields, estimated by principal component analysis (PCA) followed by robust estimation techniques, or by using a “motion-history image” (MHI), which represents the motion at the corresponding spatial location in an image sequence.
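As an illustration of the motion-history image idea mentioned above, the following sketch (plain numpy; the decay constant and motion threshold are arbitrary assumptions) updates an MHI so that recently moving pixels are bright and older motion fades toward zero:

```python
# Minimal sketch of a motion-history image (MHI) update over two
# consecutive grayscale frames given as numpy arrays.
import numpy as np

def update_mhi(mhi, prev_frame, cur_frame, tau=30, motion_thresh=25):
    """Return the updated MHI: moving pixels are set to tau, others decay."""
    diff = np.abs(cur_frame.astype(np.int16) - prev_frame.astype(np.int16))
    moving = diff > motion_thresh                    # binary motion mask
    return np.where(moving, tau, np.maximum(mhi - 1, 0))

# Usage sketch: mhi = np.zeros(gray_frame.shape, dtype=np.int16), then call
# update_mhi for each consecutive pair of frames in the sequence.
```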

Video text is also used to improve event detection. Technically speaking, there are two kinds of video text: text shown in video scenes, often called scene text, and text added to the video during a post-processing phase, or caption text. For certain kinds of video, such as sports, captions are semantically more important than scene text. In both cases, an existing OCR (Optical Character Recognition) engine may be adopted to recognize the text, transforming a binary image representation into an ASCII string; the result can be used to detect objects and/or events in a video scene more efficiently.
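A caption-text recognition step might look like the following sketch, which assumes the third-party pytesseract wrapper around the Tesseract OCR engine and OpenCV for binarization; any OCR engine could be substituted:

```python
# Minimal sketch: binarize a frame and run OCR on it to obtain the caption
# text as a string.
import cv2
import pytesseract

def read_caption_text(frame_bgr):
    """Binarize a BGR frame and return the recognized text."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary)
```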

Finally, some authors propose a strategy based on a combination of different video and audio features in order to detect specific events; for example, “goal” detection in soccer matches has been accomplished using both visual (color histogram) and audio (RMS) features computed on video shots, together with simple reasoning about the conjunctive, simultaneous presence of several detected events.
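The conjunctive reasoning over simultaneously detected events can be illustrated with a toy sketch such as the one below; the shot-level features (a dominant-field-color ratio and an audio RMS value) and the thresholds are illustrative assumptions, not the exact features used in published systems:

```python
# Toy sketch: flag a shot as a candidate "goal" only when a visual cue and
# an audio cue are both present in the same shot (conjunctive reasoning).
def detect_goal_shots(shots, green_ratio_thresh=0.3, rms_thresh=0.7):
    """shots: list of dicts with precomputed 'green_ratio' and 'audio_rms'."""
    candidates = []
    for i, shot in enumerate(shots):
        visual_event = shot["green_ratio"] < green_ratio_thresh  # camera leaves the pitch
        audio_event = shot["audio_rms"] > rms_thresh             # crowd-noise spike
        if visual_event and audio_event:                         # simultaneous presence
            candidates.append(i)
    return candidates
```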

Video Query Language

Algorithms to query video databases can and should be built on top of classical database models such as the relational calculus and the relational algebra. Picariello et al. propose an extension of the relational algebra to query videos. In their work, they assume the existence of a set of base “visual” predicates that can be implemented using image processing algorithms. Examples of visual predicates include:

  • color(rect1,rect2,dist,d): This predicate succeeds if the color histograms of two rectangles (sub-images of an image) are within distance dist of each other when using a distance metric d.
  • texture(rect1,rect2,dist,d): This predicate succeeds if the texture histograms of two rectangles (sub-images of an image) are within distance dist of each other when using a distance metric d.
  • shape(rect1,rect2,dist,d): This predicate succeeds if the shapes of two rectangles are within distance dist of each other when using a distance metric d.

They define selection conditions based on some a priori defined set of visual predicates. For example, the selection condition O.color.blue > 200 is satisfied by objects whose “blue” field has a value over 200, while O1.color.blue > O2.color.blue is satisfied by pairs of objects in which the first is more “blue” than the second. Similarly, color(rect1,rect2,10,L1) is a visual predicate that succeeds if rectangle rect1 is within 10 units of distance of rectangle rect2 using, say, the well-known L1 metric. They then define what it means for a video block to satisfy a video selection condition. The analog of the relational “select” operation applied to video databases is then defined as the set of all blocks in a video (or a set of videos) that satisfy the desired selection condition.
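A minimal sketch of such a visual predicate and of the video select operation might look like the following; the assumption that each block exposes a list of objects with precomputed color histograms is purely illustrative, not the authors' actual implementation:

```python
# Sketch of a color predicate and the video analog of relational select.
import numpy as np

def l1(a, b):
    """The well-known L1 (Manhattan) distance between two histograms."""
    return float(np.abs(np.asarray(a) - np.asarray(b)).sum())

def color(rect1_hist, rect2_hist, dist, d=l1):
    """color(rect1, rect2, dist, d): succeeds if the two color histograms
    are within distance dist of each other under the metric d."""
    return d(rect1_hist, rect2_hist) <= dist

def video_select(blocks, condition):
    """Video select: all blocks satisfying the selection condition."""
    return [b for b in blocks if condition(b)]

# Illustrative use, assuming each block has objects carrying a color field:
# blue_blocks = video_select(blocks, lambda b: b.objects[0].color["blue"] > 200)
```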

The projection operation is a little more complex. Given a video whose blocks contain a number of different objects, projection takes as input a specification of the objects that the user is interested in and deletes from the input video all objects not mentioned in the object specification list. Of course, the pixels that in the original video correspond to the eliminated objects must be set to some value; one proposal uses a recoloring strategy for objects that are eliminated in this way.
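A sketch of projection with recoloring could look as follows, assuming frames are numpy arrays and each detected object carries a boolean pixel mask (both assumptions about the underlying representation):

```python
# Sketch of video projection: keep only the objects in the specification
# list and recolor the pixels of every other object.
import numpy as np

def video_project(frames, objects_per_frame, keep_ids, fill_color=(0, 0, 0)):
    """frames: list of HxWx3 arrays; objects_per_frame: per-frame list of
    dicts with an 'id' and a boolean HxW 'mask'."""
    projected = []
    for frame, objects in zip(frames, objects_per_frame):
        out = frame.copy()
        for obj in objects:
            if obj["id"] not in keep_ids:          # object not mentioned
                out[obj["mask"]] = fill_color      # recolor its pixels
        projected.append(out)
    return projected
```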

The Cartesian Product operation (in terms of which join is defined) looks at two videos and is parameterized by a merge function. A merge function takes two frames and merges them in some way. For example, suppose f1 and f2 are two frames. The “left-right split merge” function returns a new frame with f1 occupying its left half and f2 occupying its right half. Likewise, the “top-down split merge” returns a frame with f1 in the top half and f2 in the bottom half. In an “embedded split merge,” f1 is embedded in the top left corner of f2. The Cartesian Product operator looks at all pairs of frames – one from each video – and arranges the merges of these pairs in some order. Figure 2 below shows an example of Cartesian Product under left-right, top-down, and embedded split merge.
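The three merge functions can be sketched as follows, assuming the two frames are equally sized numpy arrays; halving is done here by simply dropping alternate rows or columns, which is only one possible choice:

```python
# Sketch of the merge functions used by the Cartesian Product operator.
import numpy as np

def left_right_split_merge(f1, f2):
    """New frame: f1 squeezed into the left half, f2 into the right half."""
    return np.hstack([f1[:, ::2], f2[:, ::2]])   # drop alternate columns

def top_down_split_merge(f1, f2):
    """New frame: f1 squeezed into the top half, f2 into the bottom half."""
    return np.vstack([f1[::2], f2[::2]])         # drop alternate rows

def embedded_split_merge(f1, f2, scale=4):
    """Embed a crudely downscaled copy of f1 in the top left corner of f2."""
    out = f2.copy()
    small = f1[::scale, ::scale]
    out[:small.shape[0], :small.shape[1]] = small
    return out
```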

The join of two videos is defined in the usual way in terms of selection and Cartesian product. A whole set of other video algebra operations has also been defined.

Video Index Structures

There are numerous index structures for videos. Subrahmanian describes variants of the R-tree and the segment tree to store video data. It is impossible to store video content on a frame-by-frame basis (to fix this concept: at 30 frames per second, a single 90-minute video contains 90 × 60 × 30 = 162,000 frames). There is a critical need to develop compact representations of a video and of the parts into which it may be decomposed (shots, scenes). Two main data structures are used in this context: frame segment trees and R-segment trees (or RS-trees).

The basic idea behind a frame segment tree is simple. First, we construct two arrays, OBJECTARRAY and ACTIVITYARRAY. The OBJECTARRAY contains, for each integer i, an object oi. With each element of this array is associated an ordered linked list of pointers to nodes in the frame segment tree. For example, the linked list associated with object number 1 (o1) may contain a number of pointers, e.g. 15, 16, 19, 20, representing nodes in the frame segment tree. Similarly, the ACTIVITYARRAY contains, for each element i, an activity ai; each of its elements likewise has an ordered list of pointers to nodes in the frame segment tree.

The frame segment tree is a binary tree constructed as follows (a minimal sketch in code appears after the list):

  1. Each node in the frame segment tree represents a frame sequence [x,y), starting at frame x and including all frames up to, but not including, frame y.
  2. Every leaf is at the same level r. The leftmost leaf denotes the interval [z1,z2), the second from the left represents the interval [z2,z3), and so on. If N is a node with two children representing the intervals [p1,p2) and [p2,p3), then N represents the interval [p1,p3). Thus, the root of the segment tree represents the interval [q1,q2) if q2 is a power of 2; otherwise it represents the interval [q1,∞).
  3. The number inside each node may be viewed as the address of that node.
  4. The set of numbers placed next to a node denotes the id numbers of the video objects and activities that appear in the entire frame sequence associated with that node.
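A minimal sketch of a frame segment tree with the OBJECTARRAY-style bookkeeping described above might look like the following; node addresses follow the usual implicit binary-heap numbering, and the exact layout is an assumption rather than the published data structure:

```python
# Sketch of a frame segment tree: leaves cover unit frame intervals, an
# object appearing in [l, u) is stored at the canonical nodes covering the
# interval, and an OBJECTARRAY-like dict keeps pointers back into the tree.
class FrameSegmentTree:
    def __init__(self, num_frames):
        self.n = 1
        while self.n < num_frames:              # round up to a power of two
            self.n *= 2
        self.labels = {}                        # node id -> set of object/activity ids
        self.object_array = {}                  # object id -> list of node ids (pointers)

    def insert(self, l, u, obj_id, node=1, lo=0, hi=None):
        """Record that obj_id appears in every frame of [l, u)."""
        if hi is None:
            hi = self.n
        if u <= lo or hi <= l:                  # no overlap with this node's interval
            return
        if l <= lo and hi <= u:                 # node interval fully covered
            self.labels.setdefault(node, set()).add(obj_id)
            self.object_array.setdefault(obj_id, []).append(node)
            return
        mid = (lo + hi) // 2
        self.insert(l, u, obj_id, 2 * node, lo, mid)
        self.insert(l, u, obj_id, 2 * node + 1, mid, hi)

    def objects_at(self, frame):
        """Collect the ids stored on the root-to-leaf path for a frame."""
        found, node, lo, hi = set(), 1, 0, self.n
        while True:
            found |= self.labels.get(node, set())
            if hi - lo == 1:                    # reached the leaf for this frame
                break
            mid = (lo + hi) // 2
            if frame < mid:
                node, hi = 2 * node, mid
            else:
                node, lo = 2 * node + 1, mid
        return found

# Usage sketch (activities would use an analogous ACTIVITYARRAY):
# tree = FrameSegmentTree(162000); tree.insert(100, 250, "o1"); tree.objects_at(120)
```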

The R-Segment Tree (RS-Tree) is very similar to the frame segment tree, with one major distinction. Although the concepts of OBJECTARRAY and ACTIVITYARRAY remain the same as before, instead of using a segment tree to represent frame sequences, we take advantage of the fact that a sequence [s,e) can be viewed as a rectangle of length (e-s) and width 0. These rectangles are arranged in an R-tree, and each R-tree node has a special structure specifying, for each rectangle, which object or activity is associated with it. The main advantage that an RS-tree has over a frame segment tree is that it is well suited to retrieving pages from disk, since each disk access brings back a page containing not one rectangle but several proximate rectangles.
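One way to experiment with the RS-tree idea is to store the degenerate rectangles in an off-the-shelf R-tree, for example via the third-party rtree package (a wrapper around libspatialindex); the sketch below is an approximation of the idea rather than the exact published structure:

```python
# Sketch of the RS-tree trick: a frame sequence [s, e) becomes a rectangle
# of length (e - s) and width 0, stored in an R-tree for disk-friendly access.
from rtree import index

class RSTree:
    def __init__(self):
        self.idx = index.Index()
        self.payload = {}                       # entry id -> object/activity id
        self.next_id = 0

    def insert(self, s, e, obj_id):
        """Store object obj_id for the frame sequence [s, e)."""
        self.idx.insert(self.next_id, (s, 0.0, e, 0.0))   # degenerate rectangle
        self.payload[self.next_id] = obj_id
        self.next_id += 1

    def objects_at(self, frame):
        """All objects whose stored interval contains the given frame
        (boundary handling at e is simplified in this sketch)."""
        hits = self.idx.intersection((frame, 0.0, frame, 0.0))
        return {self.payload[i] for i in hits}
```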
