Simone Santini
Universidad Autónoma de Madrid, Spain and University of California,
San Diego, USA

Definition: Video search is often regarded as the next logical step after image search, but this view is in many respects superficial: it shows a lack of appreciation for the peculiar characteristics of the medium and for the new class of problems that the temporal nature of video poses.

If, to this, one adds that image search has, to this day, failed to produce a satisfactory general purpose solution (and there are some reasons to believe that such a solution may be conceptually impossible), one can realize that video search is still in a very preliminary stage, quite possibly one in which its main intellectual features and cultural impact still have to be identified.

Unlike other media, video shows a substantial discontinuity among its various forms and, correspondingly, one can expect a similar discontinuity in the methods and principles of search for different types of video. A broad distinction in organization and search principles can be made between semiotic and phenoestetic video.

Semiotic Video

Semiotic video is, loosely speaking, video specifically designed, constructed, or assembled to deliver a message. Produced videos, from films to TV programs, commercials, and music videos, are examples of semiotic video: they carry an explicit message (or multiple messages), which is encoded not only in their content, but also in the syntax of the video itself.

In other words, in addition to the information carried by the scenes represented in the video, and by the actions and situations depicted therein, one has an additional language superimposed on it, a semiotic system whose syntax was defined, in its essential elements, by the work of Russian constructivists such as Eisenstein and Vertov in the 1920s. This language is that of montage, based on elements such as shots (an uninterrupted camera take covering a certain time span), scenes (a collection of shots that forms a narrative unit), and transitions between them.

The language of montage has extended beyond cinema through its adoption by other cultural media like TV or music videos. Its relative importance as a carrier of the semiosis of a video depends, of course, on the general characteristics of the form of expression in which it is employed, or even on the personal characteristics of the creator of a given message: we have at one extreme film directors who make a very Spartan use of montage and, at the other extreme, music videos in which nearly the whole visual message is expressed by montage.

Montage is not the only symbolic system at work in semiotic video, although it is the most common and the easiest to detect. Specific genres of video use other devices that long use has promoted to the status of symbols in order to express certain ideas. In cartoons, for instance, characters running very fast always leave a cloud behind them, and whenever they are hit on the head, they vibrate. These messages are, to an extent, iconic (certain materials do vibrate when they are hit) but their use is now largely symbolic and is being extended, rather crudely, to other genres (such as action movies) that rely on visuals more than on dialog to describe an action: thanks to the possibilities offered by sophisticated computer techniques, the language of action movies is moving away from realism and getting closer and closer to a cartoonish symbolism.

Attempts at an automatic analysis of semiotic video have traditionally relied on the semiotic system constituted by montage, mainly due to the existence of relatively robust algorithms for the detection of cuts and of transition effects, which permit the extraction of—at least—a level of semantics without requiring the solution of the cognitive problems of image recognition. The detection of the structure of montage serves, grosso modo, two purposes in video search.
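The text does not name a specific cut-detection algorithm. As an illustration only, the following minimal sketch uses one simple and widely used idea: declare a cut wherever the gray-level histograms of two consecutive frames differ by more than a threshold. The frame representation (a flat sequence of gray levels), the function names, the number of bins, and the threshold value are all assumptions made for this example.

```python
from typing import List, Sequence

# Assumption: a frame is a flat sequence of gray levels in the range 0..255.
Frame = Sequence[int]


def gray_histogram(frame: Frame, bins: int = 8) -> List[float]:
    """Return the normalized gray-level histogram of a frame."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(frame) or 1
    return [c / total for c in counts]


def detect_cuts(frames: Sequence[Frame], threshold: float = 0.5) -> List[int]:
    """Return every index i such that a cut is declared between frames i-1 and i."""
    histograms = [gray_histogram(f) for f in frames]
    cuts = []
    for i in range(1, len(histograms)):
        # The L1 distance between two normalized histograms lies in [0, 2].
        distance = sum(abs(a - b) for a, b in zip(histograms[i - 1], histograms[i]))
        if distance > threshold:
            cuts.append(i)
    return cuts
```

A sketch of this kind responds only to abrupt changes. Gradual transitions such as dissolves spread the change over many frames and would need a different test, for instance one that accumulates differences over a window.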

Firstly, it is used to break the temporal structure of the video into a spatial one in which the temporal structure is summarized, and where it can be stored and accessed as an instantaneous whole. The break-up of the temporal structure is sometimes stored in a two-level hierarchy in which shots are grouped into scenes (sometimes this hierarchy is referred to as a four-level one, the bottom level being represented by the individual frame and the top level by the whole film; occasionally, additional levels can be present, as in a film divided into episodes). The term scene as used in video retrieval is only partially related to what a film director would call a scene, but it is based on the same idea of a collection of thematically and narratively related shots. The criteria for clustering shots into scenes are, to an extent, dependent on the genre of the film (what works for, say, a documentary probably won't work for a music video). For narrative cinema, clustering criteria include color (the dominant colors are given by the background, which tends to be the same in the various shots of a scene), statistically uniform audio, and the detection of a dialog situation (represented, e.g., by the alternation of the same faces).

In some other cases, video has a structure that can constitute a priori information used to guide the segmentation process. This is true for highly structured “format” video programs, such as news: in these cases one can start with a structure graph describing the temporal relations of the various elements of the video (anchor-person shots, stories, commercial interruptions) and, on one hand, create expectations that can be used to drive the segmentation process and, on the other hand, fill in the slots of the model with the segments detected.
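The color criterion mentioned for narrative cinema can be sketched as follows. This is a simplified illustration, not a method taken from the text: it assumes that each shot has already been summarized by a normalized color histogram, and it starts a new scene whenever a shot differs too much from the shot immediately before it. The function names and the threshold are hypothetical.

```python
from typing import List, Sequence


def l1_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """L1 distance between two histograms of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))


def group_shots_into_scenes(
    shot_histograms: Sequence[Sequence[float]], threshold: float = 0.4
) -> List[List[int]]:
    """Group consecutive shots into scenes, returned as lists of shot indices.

    A shot joins the current scene when its color histogram is close to that
    of the previous shot; otherwise it opens a new scene.
    """
    scenes: List[List[int]] = []
    for index, histogram in enumerate(shot_histograms):
        if scenes:
            previous = shot_histograms[scenes[-1][-1]]
            if l1_distance(previous, histogram) <= threshold:
                scenes[-1].append(index)
                continue
        scenes.append([index])
    return scenes
```

A practical system would probably combine this with the other cues listed above (audio uniformity, alternating faces in a dialog) rather than rely on color alone.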

The information that is associated with these fragments of video is to an extent dependent on the application that the system is expected to serve. In many cases of retrieval, a suitable set of features is extracted and is used to index the fragment; these can consist of features extracted from the whole fragment (e.g. features containing motion information) or features extracted from some representative frames. Representative (or “key”) frames are also used to represent the fragments in the case of interactive “browsing” systems. An alternative to this is to maintain the temporal nature of video in the summarization by building abstracts, that is, short segments of video meant to convey the essential contents of the whole video.
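One simple way to pick a representative frame, shown here only as an illustrative assumption, is to choose the frame whose feature vector lies closest to the mean feature vector of the fragment. The feature vectors themselves (for example, color histograms) are assumed to have been computed already, and the function name is hypothetical.

```python
from typing import Sequence


def select_key_frame(features: Sequence[Sequence[float]]) -> int:
    """Return the index of the frame closest to the fragment's mean feature vector."""
    if not features:
        raise ValueError("a fragment must contain at least one frame")
    dimension = len(features[0])
    mean = [sum(f[d] for f in features) / len(features) for d in range(dimension)]

    def squared_distance(vector: Sequence[float]) -> float:
        return sum((x - m) ** 2 for x, m in zip(vector, mean))

    # min() returns the first index in case of ties.
    return min(range(len(features)), key=lambda i: squared_distance(features[i]))
```

Other choices, such as the first or the middle frame of a shot, are cheaper but less sensitive to the content of the fragment.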

Secondly, shots and scenes can be used as the basis for features that characterize the contents of video. This approach goes back to montage seen as a symbolic (and, therefore, conventional) semiotic system. To give but one example, a cut in a film typically signifies either no time discontinuity at all or a relatively short one, while a dissolve marks the passage of a certain amount of time. Conventions such as these can be used to determine important aspects of the semantics of a video. There is, here as in many other areas at the intersection of technology and creative expression, the ever-present risk of simplification: the use of the language of montage is highly idiosyncratic, depending on the personal language of the director, on the language of the genre in which a film is placed, and on the general conventions in use at a certain time. Trivialization is an ever-present danger when one deals with a semiotic system as complex as video. The worst risk in this sense is that a technological trivialization will push, under the influence of marketing and of the advantages that technological complacency can have for it, towards a simplification of the film language itself.

Together with other features (the semiotics of color, for instance, is quite well known), montage has been used to create a taxonomy of advertisements according to their narrative structure. Similarly, montage has been used to classify video according to genre although, in this case, the feasibility of such an operation is more doubtful given the uncertain (culturally mediated and, to an extent, subjective) definition of genres (is The Maltese Falcon a noir or an action film?).

Phenoestetic Video

Phenoestetic video is characterized by the almost absolute absence of cultural references or expressive possibilities that rest on a shared cultural background. It is, to use an evocative albeit simplistic image, video that just happens to be. Typical examples of phenoestetic video are given by security cameras or by the increasingly common phenomenon of web cameras.

In this type of video, there is in general no conscious effort to express a meaning except, possibly, through the actions of the people in the video. Medium-specific semiotic systems, such as montage, are absent and, in most cases, the identities of the people present in the video will afford no special connotation. In a few words, phenoestetic video is a stream of undifferentiated, uninterrupted narrative, a sort of visual stream of consciousness of a particular situation.

Imposing a structure on this kind of video for the purpose of searching it is much more problematic than is the case with semiotic video, since all the syntactic structures to which the semantic structure is anchored are absent. It would be futile, for instance, to try to infer the character of a phenoestetic video by looking at the average length of the shots or the semiotic characteristics of its color distribution, for there are no shots to be detected and colors are not purposefully selected.

One organizational principle on which search can be based is constituted by events. The concept of event is, of course, highly application-specific: what constitutes an event in a security environment might not constitute one for a consumer-behavior study in a supermarket. However, once a set of primitive events has been selected (and the proper feature extraction algorithms and analysis procedures have been set up for their detection and placement in time), general primitives can be used to compose primitive events into complex ones.

With events as objects of the query process, a number of data base techniques can be used or modified to serve the needs of video search. One possibility in this sense is to use temporal logic for describing video sequences as well as for expressing query conditions, while other options are offered by the various event specification formalisms devised for event data bases. One advantage of these formal approaches is that they provide the means to specify the intended semantics of event detection (intended here as formal semantics, e.g. denotational or operational).
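To make the temporal-logic option concrete, the sketch below evaluates a few linear-time operators over a finite sequence of time steps, each annotated with the set of primitive events detected at that step. It is an illustrative assumption rather than a formalism taken from the text: the event labels, the function names, and the choice of a finite-trace semantics with a strong "until" (the second condition must actually occur) are all hypothetical.

```python
from typing import Callable, List, Set

# Assumption: a video is abstracted as one set of detected event labels per time step.
Trace = List[Set[str]]
Formula = Callable[[Trace, int], bool]


def atom(name: str) -> Formula:
    """True at step t when the event `name` is detected at t."""
    return lambda trace, t: name in trace[t]


def and_(f: Formula, g: Formula) -> Formula:
    return lambda trace, t: f(trace, t) and g(trace, t)


def eventually(f: Formula) -> Formula:
    """True at t when f holds at t or at some later step of the trace."""
    return lambda trace, t: any(f(trace, k) for k in range(t, len(trace)))


def until(f: Formula, g: Formula) -> Formula:
    """Strong until: g holds at some step k >= t, and f holds at every step from t to k-1."""
    def check(trace: Trace, t: int) -> bool:
        for k in range(t, len(trace)):
            if g(trace, k):
                return True
            if not f(trace, k):
                return False
        return False
    return check


def matches(formula: Formula, trace: Trace) -> List[int]:
    """Return the time steps at which the formula holds."""
    return [t for t in range(len(trace)) if formula(trace, t)]
```

With a hypothetical trace such as `[set(), {"enter"}, {"loiter"}, {"loiter"}, {"exit"}, set()]`, the query `and_(atom("enter"), eventually(atom("exit")))` selects the step at which a person enters and later leaves.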

Consider the detection of a composite event composed of an instance of an event A followed by an instance of an event B, and suppose that in a certain situation four primitive events are detected, placed in time as in the figure below:

How many composite events are detected? Depending on the semantics of “follows”, one can detect just one composite event, two composite events, four composite events, and so on. Several possible semantics for event detection can be defined depending on the requirements of the application.
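The dependence of the count on the chosen semantics can be illustrated with a small sketch. The arrangement used here is an assumption, since the original figure is not reproduced: two instances of A at times 1 and 2, followed by two instances of B at times 3 and 4. The three semantics and their names are hypothetical examples chosen for illustration, not definitions taken from the text.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[float, float]  # (time of the A instance, time of the B instance)


def all_pairs(a_times: Sequence[float], b_times: Sequence[float]) -> List[Pair]:
    """Every A instance combines with every B instance that occurs after it."""
    return [(a, b) for a in sorted(a_times) for b in sorted(b_times) if a < b]


def one_to_one(a_times: Sequence[float], b_times: Sequence[float]) -> List[Pair]:
    """Each B consumes the oldest pending A that precedes it; every instance is used at most once."""
    pending = sorted(a_times)
    pairs: List[Pair] = []
    for b in sorted(b_times):
        if pending and pending[0] < b:
            pairs.append((pending.pop(0), b))
    return pairs


def first_only(a_times: Sequence[float], b_times: Sequence[float]) -> List[Pair]:
    """Only the earliest A that has a follower, paired with its earliest following B."""
    return all_pairs(a_times, b_times)[:1]


# Hypothetical placement in time of the four primitive events.
A_TIMES = [1.0, 2.0]
B_TIMES = [3.0, 4.0]
```

Under this assumed arrangement, `all_pairs` detects four composite events, `one_to_one` detects two, and `first_only` detects one, which matches the range of answers described above.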

