
Exploiting Captions for Multimedia Data Mining


Neil C. Rowe
U.S. Naval Postgraduate School, USA


Captions are text that describes some other information; they are especially useful for describing non-text media objects (images, audio, video, and software). Captions are valuable metadata for managing multimedia, since they help users better understand and remember (McAninch, Austin, & Derks, 1992-1993) and permit better indexing of media. Captions are essential for effective data mining of multimedia data, since only a small amount of text in typical documents with multimedia—1.2% in a survey of random World Wide Web pages (Rowe, 2002)—describes the media objects. Thus, standard Web browsers do poorly at finding media without knowledge of captions. Multimedia information is increasingly common in documents, as computer technology improves in speed and ability to handle it, and as people need multimedia for a variety of purposes like illustrating educational materials and preparing news stories.

Captions also are valuable, because non-text media rarely specify internally the creator, date, or spatial and temporal context, and cannot convey linguistic features like negation, tense, and indirect reference. Furthermore, experiments with users of multimedia retrieval systems show a wide range of needs (Sutcliffe et al., 1997) but a focus on media meaning rather than appearance (Armitage & Enser, 1997). This suggests that content analysis of media is unnecessary for many retrieval situations, which is fortunate, because it is often considerably slower and more unreliable than caption analysis. But using captions requires finding them and understanding them. Many captions are not clearly identified, and the mapping from captions to media objects is rarely easy. Nonetheless, the restricted semantics of media and captions can be exploited.



Much text in a document near a media object is unrelated to that object, and text explicitly associated with an object often may not describe it (e.g., “JPEG picture here” or “Photo39573”). Thus, we need clues to distinguish and rate a variety of caption possibilities and words within them, allowing for more than one caption for an object or more than one object for a caption. Free commercial media search engines (e.g., images.google.com, multimedia.lycos.com, and www.altavista.com/image) use a few simple clues to index media, but their accuracy is significantly lower than that for indexing text. For instance, Rowe (2005) reported that none of five major image search engines could find pictures for “President greeting dignitaries” in 18 tries. So research is exploring a broader range of caption clues and types (Mukherjea & Cho, 1999; Sclaroff et al., 1999).

Sources of Captions

Some captions are explicitly attached to media objects by adding them to a digital library or database. On Web pages, HTML “alt” and “caption” tags also explicitly associate text with media objects. Clickable text links to media files are another good source of captions, since the text must explain the link. A short caption can be the name of the media file itself (e.g., “socket_wrench.gif”).
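The explicit Web sources named above (“alt” text, clickable links to media files, and the media file name) can be collected with a short parser. The following is a minimal sketch in Python using only the standard library; the class name, the helper names, and the list of media extensions are illustrative assumptions, not part of any cited system.

```python
from html.parser import HTMLParser
import os
import re

# Assumed list of media file extensions; extend as needed.
MEDIA_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".wav", ".mp3", ".mpg", ".avi"}


def is_media(url):
    """True if the URL path ends in one of the assumed media extensions."""
    return os.path.splitext(url.split("?")[0])[1].lower() in MEDIA_EXTENSIONS


def filename_words(url):
    """Turn a media file name such as 'socket_wrench.gif' into 'socket wrench'."""
    base = os.path.splitext(os.path.basename(url.split("?")[0]))[0]
    return " ".join(w for w in re.split(r"[^A-Za-z]+", base) if w)


class CaptionCandidateFinder(HTMLParser):
    """Collects (media_url, source_type, text) triples from one HTML page."""

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._link_target = None  # href of the media link being read, if any
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            src = attrs["src"]
            if attrs.get("alt"):
                self.candidates.append((src, "alt", attrs["alt"].strip()))
            words = filename_words(src)
            if words:
                self.candidates.append((src, "filename", words))
        elif tag == "a" and is_media(attrs.get("href") or ""):
            self._link_target = attrs["href"]
            self._link_text = []

    def handle_data(self, data):
        if self._link_target is not None:
            self._link_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._link_target is not None:
            text = " ".join("".join(self._link_text).split())
            if text:
                self.candidates.append((self._link_target, "link", text))
            self._link_target = None
```

Feeding a page to `CaptionCandidateFinder().feed(html)` fills `candidates` with one entry per source found; each entry is only a candidate and still has to be rated by the clues discussed below.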

Less explicit captions use conventions like centering or font changes to text. Titles and headings preceding a media object also can serve as captions, as they generalize over a block of information, but they can be overly general. Paragraphs above, below, or next to media also can be captions, especially short paragraphs.

Other captions are embedded directly into the media, like characters drawn on an image (Lienhart & Wernicke, 2002) or explanatory words at the beginning of audio. These require specialized processing like optical character recognition to extract. Captions can be attached through a separate channel of video or audio, as with the “closed captions” associated with television broadcasts that aid hearing-impaired viewers and students learning languages. “Annotations” can function like captions, although they tend to emphasize analysis or background knowledge.

Cues for Rating Captions

A caption candidate’s type affects its likelihood, but many other clues help rate it and its words (Rowe, 2005):

  • Certain words are typical of captions, like those having to do with communication, representation, and showing. Words about space and time (e.g., “west,” “event,” “above,” “yesterday”) are good clues, too. Negative clues like “bytes” and “page” can be equally valuable as indicators of text unlikely to be captions. Words can be made to be more powerful clues by enforcing a limited or controlled vocabulary for describing media, like what librarians use in cataloging books (Arms, 1999), but this requires cooperation from caption writers and is often impossible.
  • Position in the caption candidate matters: Words early in the text are four times more likely to describe a media object (Rowe, 2002).
  • Distinctive phrases often signal captions (e.g., “the X above,” “you can hear X,” “X then Y”) where X and Y describe depictable objects.
  • Full parsing of caption candidates (Elworthy et al., 2001; Srihari & Zhang, 1999) can extract more detailed information about them, but it is time-consuming and prone to errors.
  • Candidate length is a clue, since true captions average 200 characters with few under 20 or over 1,000.
  • A good clue is words in common between the candidate caption and the name of the media file, such as “Front view of woodchuck burrowing” and image file “northern_woodchuck.gif.”
  • Nearness of the caption candidate to its media actually is not a clue (Rowe, 2002), since much nearby text in documents is unrelated.
  • Some words in the name of a media file affect captionability (e.g., “view” and “clip” as positive clues and “icon” and “button” as negative clues).
  • “Decorative” media objects occurring more than once on a page or three times on a site are 99% certain not to have captions (Rowe, 2002). Text generally captions only one media object except for headings and titles.
  • Media-related clues are the size of the object (small objects are less likely to have captions) and the file format (e.g., JPEG images are more likely to have captions). Other clues are the number of colors and the ratio of width to length for an image.
  • Consistency with the style of known captions on the same page or at the same site is also a clue because many organizations specify a consistent “look and feel” for their captions.
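Several of the clues listed above are simple enough to compute directly from a caption candidate and the name of its media file. The sketch below, in Python, turns a few of them into boolean features. The word lists contain only the examples given in the text, the length bounds are the 20 and 1,000 characters mentioned above, and the function and feature names are hypothetical.

```python
import re

# Example words taken from the text; a real system would learn longer lists.
NEGATIVE_TEXT_WORDS = {"bytes", "page"}
POSITIVE_FILENAME_WORDS = {"view", "clip"}
NEGATIVE_FILENAME_WORDS = {"icon", "button"}


def words(text):
    """Lowercased alphabetic words of a string, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def clue_features(candidate, media_filename):
    """Compute a few of the listed clues for one (candidate, media file) pair."""
    text_words = words(candidate)
    stem = media_filename.rsplit(".", 1)[0]  # drop the file extension
    name_words = words(stem)
    return {
        "shares_word_with_filename": bool(text_words & name_words),
        "typical_length": 20 <= len(candidate) <= 1000,
        "negative_text_word": bool(text_words & NEGATIVE_TEXT_WORDS),
        "positive_filename_word": bool(name_words & POSITIVE_FILENAME_WORDS),
        "negative_filename_word": bool(name_words & NEGATIVE_FILENAME_WORDS),
    }
```

For the woodchuck example above, `clue_features("Front view of woodchuck burrowing", "northern_woodchuck.gif")` reports a shared word and a typical length. Such features carry no weight by themselves; how strongly each one indicates a caption has to be estimated from data.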

Quantifying Clues

Clue strength is the conditional probability of a caption given appearance of the clue, estimated from statistics by c/(c+n), where c is the number of occurrences of the clue in a caption and n is the number of occurrences of the clue in a noncaption. If we have a representative sample, clue appearances can be modeled as a binomial process with expected standard deviation sqrt(cn/(c+n)). This can be used to judge whether a clue is statistically significant, and it rules out many potential word clues. Recall-precision analysis also can compare clues; Rowe (2002) showed that text-word clues were the most valuable in identifying captions, followed in order by caption type, image format, words in common between the text and the image filename, image size, use of digits in the image file name, and image-filename word clues.
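As a rough illustration, clue strength and one possible significance check can be written in a few lines of Python. The check is an assumption for illustration only: it compares the observed caption count c with the count expected if the clue were uninformative (total occurrences times an overall caption rate, here called `base_rate`) and asks whether the difference exceeds `z` binomial standard deviations. The function names, `base_rate`, and the default `z` of 2 are hypothetical choices, not values from the text.

```python
import math


def clue_strength(c, n):
    """Estimated P(caption | clue) from c caption and n noncaption occurrences."""
    return c / (c + n)


def binomial_std(total, p):
    """Standard deviation of a binomial count with `total` trials and rate p."""
    return math.sqrt(total * p * (1 - p))


def is_significant(c, n, base_rate, z=2.0):
    """Assumed test: does c differ from the uninformative expectation
    by more than z standard deviations?"""
    total = c + n
    expected = total * base_rate
    return abs(c - expected) > z * binomial_std(total, base_rate)
```

With 30 caption and 10 noncaption occurrences and a base rate of 0.5, the strength is 0.75 and the clue passes this check; with 3 and 2 occurrences it does not, which is how rare words get ruled out.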

Methods of data mining (Witten & Frank, 2000) can combine clues to get an overall likelihood that some text is a caption. Linear models, Naive-Bayes models, and case-based reasoning have been used. The words of the captions can be indexed, and the likelihoods can be used by a browser to sort media for presentation to the user that match a set of keywords.
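A Naive-Bayes combination of clues, followed by likelihood-sorted keyword retrieval, might look like the following Python sketch. It assumes each clue's likelihood under captions and under noncaptions has already been estimated, that all probabilities lie strictly between 0 and 1, and that clues are independent given the class (the usual Naive-Bayes assumption). The function names and the index layout are hypothetical.

```python
import math


def naive_bayes_caption_probability(prior, clue_likelihoods):
    """Combine observed clues into P(caption | clues).

    prior: P(caption) before seeing any clue.
    clue_likelihoods: list of (P(clue | caption), P(clue | noncaption))
        pairs, one per clue observed on the candidate text.
    """
    log_odds = math.log(prior / (1 - prior))
    for p_caption, p_noncaption in clue_likelihoods:
        log_odds += math.log(p_caption / p_noncaption)
    return 1 / (1 + math.exp(-log_odds))


def rank_media(index, keywords):
    """Return media URLs whose caption words include all keywords,
    most likely captions first.

    index: list of (media_url, caption_word_set, caption_likelihood).
    """
    wanted = {k.lower() for k in keywords}
    hits = [(likelihood, url)
            for url, caption_words, likelihood in index
            if wanted <= caption_words]
    return [url for likelihood, url in sorted(hits, reverse=True)]
```

Working in log odds avoids underflow when many clues are multiplied together; a linear model would instead learn one weight per clue.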



Studies show that users usually consider media data as “depicting” a set of objects (Jorgensen, 1998) rather than a set of textures arranged in space or time. Captions can be:

  • Component-Depictive: The caption describes objects and/or processes that correspond to particular parts of the media; for instance, the caption “President speaking to board” with a picture that shows a president behind a podium with several other people. This caption type is quite common.
  • Whole-Depictive: The caption describes the media as a whole. This is often signaled by media-type words like “view,” “clip,” and “recording”; for instance, “Tape of City Council 7/26/04” with some audio. Such captions summarize overall characteristics of the media object and help distinguish it from others. Adjectives are especially helpful, as in “infrared picture,” “short clip,” and “noisy recording”; they specify distributions of values. Dates and locations for associated media can be found in special linguistic formulas (Smith, 2002).
  • Illustrative-Example: The media presents only an example of the phenomenon described by the caption; for instance, “War in the Gulf” with a picture of tanks in a desert.
  • Metaphorical: The media represents something related to the caption but does not depict it or describe it; for instance, “Military fiction” with a picture of tanks in a desert.
  • Background: The caption only gives background information about the media; for instance, “World War II” with a picture of Winston Churchill. National Geographic magazine often uses caption sentences of this kind after the first sentence.

Media Properties and Structure

The structure of media objects can be referenced by component-depictive caption sentences to orient the viewer or listener. Valuable information that captions do not convey is also often contained in the sub-objects of a media object. Images, audio, and video are multidimensional signals for which local changes in the signal characteristics help segment them into sub-objects (Aslandogan & Yu, 1999). Color or texture changes in an image suggest separate objects; changes in the frequency-intensity plot of audio suggest beginnings and ends of sounds; and many simultaneous changes between corresponding locations in two video frames suggest a new shot (Wactlar et al., 2000). But segmentation methods are not especially reliable. Also, some media objects have multiple colors or textures, like images of trees or human faces, and domain-dependent knowledge must group regions into larger objects.
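The video case can be illustrated with a very simple detector that flags a new shot when many corresponding pixels change at once between consecutive frames. This Python sketch is not the method of the cited work; frames are assumed to be equal-sized grids of grayscale intensities (0 to 255), and both thresholds are arbitrary illustrative values.

```python
def changed_fraction(frame_a, frame_b, pixel_threshold=30):
    """Fraction of corresponding pixels whose intensity differs by more
    than pixel_threshold. Frames are lists of rows of intensities."""
    total = 0
    changed = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += 1
            if abs(a - b) > pixel_threshold:
                changed += 1
    return changed / total if total else 0.0


def shot_boundaries(frames, pixel_threshold=30, fraction_threshold=0.5):
    """Indices i at which frame i appears to start a new shot."""
    return [
        i
        for i in range(1, len(frames))
        if changed_fraction(frames[i - 1], frames[i], pixel_threshold)
        > fraction_threshold
    ]
```

A detector this simple is fooled by fast camera motion or lighting changes, which is one reason segmentation is described above as not especially reliable.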

Software can calculate properties of segmented regions and classify them. Mezaris, Compatsiaris, and Strinzis (2003), for instance, classify image regions by color, size, shape, and relative position, and then infer probabilities for what they could represent. Additional laws of media space can rule out possibilities: objects closer to a camera appear larger, and gravity is downward, so support relationships between objects often can be found (e.g., people on floors). Similarly, the pattern of voices and the duration of their speaking times in an audio recording can suggest in general terms what is happening. The subject of a media object often can be inferred, even without a caption, since subjects are typically near the center of the media space, not touching its edges, and well distinguished from nearby regions in intensity or texture.
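The three subject cues just mentioned (near the center, not touching the edges, well distinguished from neighbors) can be combined into a simple score over already-segmented regions. The Python sketch below is only one possible heuristic: the `Region` type, the 0-to-1 `contrast` value, and the halving penalty for touching an edge are all assumptions made for illustration.

```python
from dataclasses import dataclass
import math


@dataclass
class Region:
    label: str
    left: int
    top: int
    right: int      # exclusive
    bottom: int     # exclusive
    contrast: float  # assumed 0..1 difference from neighboring regions


def subject_score(region, width, height):
    """Higher for regions that are central, contrasting, and off the edges."""
    cx = (region.left + region.right) / 2
    cy = (region.top + region.bottom) / 2
    # Distance from the image center, scaled so a corner is at distance 1.
    dist = math.hypot(cx - width / 2, cy - height / 2) / math.hypot(width / 2, height / 2)
    touches_edge = (region.left <= 0 or region.top <= 0
                    or region.right >= width or region.bottom >= height)
    score = (1 - dist) * region.contrast
    return score * 0.5 if touches_edge else score  # arbitrary edge penalty


def likely_subject(regions, width, height):
    """Region with the highest subject score."""
    return max(regions, key=lambda r: subject_score(r, width, height))
```

In a 100 by 100 image, a high-contrast region in the middle outscores a tall region running down the left edge, matching the expectation that the latter is background.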

Caption-Media Correspondence

While finding the caption-media correspondence for component-depictive captions can be generally difficult, there are easier subcases. One is the recognition and naming of faces in an image (Satoh, Nakamura, & Kanda, 1999). Another is captioned graphics, since their structure is easier to infer than most images (Preim et al., 1998).

In general, grammatical subjects of a caption often correspond to the principal subjects within the media (Rowe, 2005). For instance, “Large deer beside tree” has the grammatical subject “deer,” and we would expect to see all of it in the picture near the center, whereas “tree” has no such guarantee. Exceptions are undepictable abstract subjects (e.g., “Jobless rate soars”). Present-tense principal verbs and verbals can depict dynamic physical processes, such as “eating” in “Deer eating flowers,” and direct objects of such verbs and verbals usually are fully depicted in the media when they are physical like “flowers.” Objects of physical-location prepositions attached to the principal subject are also depicted in part (but not necessarily as a whole). Subjects that are media objects like “view” defer viewability to their objects. Motion-denoting words can be depicted directly in video, audio, and software, rather than just their subjects and objects. They can be translational (e.g., “go”), configurational (“develop”), property-changing (“lighten”), relationship-changing (“fall”), social (“report”), or existential (“appear”).

Captions are “deictic,” using the linguistic term for expressions whose meaning requires assimilation of information from outside the expression itself. Spatial deixis refers to spatial relationships between objects or parts of objects and entails a set of physical constraints (DiTomaso et al., 1998; Pineda & Garza, 2000). Spatial deixis expressions like “above” and “outside” are often “fuzzy” in that they do not define a precise area but rather associate a probability distribution with a region of space (Matsakis et al., 2001). It is important to determine the reference location of the referring expression, which is usually the characters of the text itself but can be previously referenced objects like “right” in “the right picture below.” Some elegant theory has been developed, although captions on media objects that use such expressions are not especially common.
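The fuzziness of an expression like “above” can be illustrated with a graded membership function in place of a yes/no test. The Python sketch below is a deliberately simple angle-based stand-in, not the model of the cited work: it assumes image coordinates with y increasing downward, treats both objects as points, and uses a squared-cosine falloff chosen only for illustration.

```python
import math


def above_degree(target, reference):
    """Degree (0..1) to which `target` is above `reference`.

    Both arguments are (x, y) points in image coordinates, where y grows
    downward. Returns 1.0 directly above, falling to 0.0 at the same
    level or below.
    """
    dx = target[0] - reference[0]
    dy = reference[1] - target[1]  # positive when target is higher in the image
    if dy <= 0:
        return 0.0
    angle = math.atan2(abs(dx), dy)  # 0 when directly above, pi/2 when level
    return math.cos(angle) ** 2
```

A point straight above the reference scores 1.0, a point on the upper diagonal scores about 0.5, and anything level or lower scores 0.0. The reference point itself must be chosen first, for instance the position of the caption text.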

Media objects also can occur in sets with intrinsic meaning. The media can be a time sequence, a causal sequence, a dispersion in physical space, or a hierarchy of concepts. Special issues arise when captions serve to distinguish one media object from another (Heidorn, 1999). Media-object sets also can be embedded in other sets. Rules for set correspondences can be learned from examples (Cohen, Wang, & Murphy, 2003).

For deeper understanding of media, the words of the caption can be matched to regions of the media. This permits applications like calculating the size and contrast of media subobjects mentioned in the caption, recognizing the time of day when it is not mentioned, and recognizing additional unmentioned objects. Matching must take into account the properties of the words and regions, and the constraints relating them, and must try to find the best matches. Statistical methods similar to those for identifying clues for captions can be used, except that there are many more categories, entailing problems of obtaining enough data. Some help is provided by knowledge of the settings of things described in captions (Sproat, 2001). Machine learning methods can learn the associations between words and types of image regions (Barnard et al., 2003; Roy, 2000, 2001).

Generating Captions

Since captions are so valuable in indexing and explaining media objects, it is important to obtain good ones. The methods described above for finding caption candidates can be used to collect text for a caption when an explicit one is lacking. Media content analysis also can provide information that can be paraphrased into a caption; this is most feasible with graphics. Discourse theory can help to make captions sound natural by providing “discourse strategies” such as organizing the caption around one media attribute that determines all the others (e.g., the department in a budget diagram) (Mittal et al., 1998). Then guidelines about how much detail the user wants, together with a ranking of the importance of specific details, can be used to assemble a reasonable set of details to mention in a caption. Semi-automated techniques also can construct captions by allowing users to point and click within media objects and supply audio (Srihari & Zhang, 2000). Captions also can be made “interactive” so that changes to them cause changes in corresponding media (Preim et al., 1998).


Future multimedia-retrieval technology will not be dramatically different, although multimedia will be increasingly common in many applications. Captions will continue to provide the easiest access via keyword search, and caption text will remain important to explain media objects in documents. But improved media content analysis (aided by speed increases in computer hardware) will increasingly help in both disambiguating captions and mapping their words to parts of the media object. Machine-learning methods will be used increasingly to learn the necessary associations.


Captions are one of the most powerful forms of metadata and are essential tools for managing and manipulating multimedia objects. A good multimedia data-mining system needs to include captions and their management in its design. This includes methods for finding them in unrestricted text as well as ways of mapping them to the media objects. With good support for captions, media objects are much better integrated with the traditional text data used by information systems.
