
Content Extraction and Metadata - MPEG-7, Content extraction and description, Visual descriptors, Audio descriptors, The role of metadata


Oge Marques
Department of Computer Science and Engineering
Florida Atlantic University, Boca Raton, Florida, USA

Definition: Content extraction deals with extracting relevant data (metadata) from complex multimedia files.

Multimedia files contain complex audiovisual representations that are expensive to produce, store, and process. Extracting relevant content from these files – preferably in a (semi-)automatic manner – in a way that facilitates future indexing, search, and retrieval is a challenging task. The ever-growing amount of digitally encoded multimedia content produced, stored, and distributed worldwide makes it an even harder and more relevant job.

Even though the creation, storage, and dissemination of multimedia content have become easier and cheaper than ever before, the resulting content has value only if it can be discovered and used. The most popular way to discover content on the Internet is a text-based search engine, yet general-purpose text-based search engines usually do a poor job of searching for multimedia content, primarily because they must rely on potentially subjective, incomplete, and/or ambiguous keyword and free-text annotations describing that content. In other words, even when the multimedia data is relevant, its accompanying description (metadata) does not do it justice. There is an urgent need to overcome these limitations and find ways to “automatically and objectively describe, index and annotate multimedia information, notably audiovisual data, using tools that automatically extract (possibly complex) audiovisual features from the content to substitute or complement manual, text-based descriptions”.

Much of the research in multimedia over the past decade has addressed these issues in one way or another, mostly through the creation of algorithms for specific subtasks, often integrated into research prototypes or commercial products. We will review some of the technical aspects involved in content extraction and description of multimedia data, as well as some of the most relevant attempts to standardize the representation of the associated metadata.


In 1996, the Moving Picture Experts Group (MPEG) recognized the need to address the problem of standardized metadata representation, and work on the MPEG-7 Multimedia Content Description Interface (ISO/IEC 15938) started.

The goals of MPEG-7 are to:

  • Create a standardized multimedia description framework
  • Enable content-based access to and processing of multimedia information on the basis of descriptions of multimedia content and structure (metadata)
  • Support a range of abstraction levels for metadata, from low-level signal characteristics to high-level semantic information

The guiding principles behind MPEG-7 include:

  • Wide application base: MPEG-7 shall be applicable to content associated with any application, real-time or not, irrespective of whether it is made available online, off-line, or streamed.
  • Wide array of data types: MPEG-7 shall consider a large variety of data types (e.g., speech, audio, image, graphics, video, etc.). No new description tool should be developed for textual data; existing solutions such as Extensible Markup Language (XML) or Resource Description Framework (RDF) should be used instead.
  • Media independence: MPEG-7 shall be applicable independently of the medium that carries the content.
  • Object-based: MPEG-7 shall support object-based description of content. Since the MPEG-4 standard is built around an object-based data model, the two standards complement each other very well.
  • Format independence: MPEG-7 shall be applicable independently of the content representation format, whether analog or digital, compressed or not.
  • Abstraction level: MPEG-7 shall support description capabilities with different levels of abstraction, from low-level features (which most likely will be extracted automatically using algorithms beyond the scope of the standard) to high-level features conveying semantic meaning.

More details on the MPEG-7 standard, and the latest updates on the MPEG-7 program of work, can be found in the standard documents and on the MPEG committee's web pages.

Content extraction and description

Automatic extraction of contents from multimedia files is a challenging topic that has been extensively investigated over the past decade, often in connection with image and video indexing and retrieval systems, digital libraries, video databases, and related systems. We will look at content extraction from the perspective of the associated MPEG-7-compatible descriptors. Consequently, we will also adopt the MPEG-7 division of the extraction and description steps into two main categories: visual and audio.

Visual descriptors

Visual descriptors, as the name suggests, provide standardized descriptions of the visual aspects of a multimedia file (e.g., a video clip). They can be of two main types: general or domain-specific. The main types of general visual descriptors are:

  • Color descriptors:

    • Color space descriptor: allows a selection of a color space – RGB, HSV, YCbCr, or HMMD (Hue-Max-Min-Diff) – to be used in the description. The associated color quantization descriptor specifies how the selected color space is partitioned into discrete bins.
    • Dominant color descriptor (DCD): allows specification of a small number of dominant color values as well as their statistical properties, such as distribution and variance. The extraction procedure for the DCD is described in the literature.
    • Scalable color descriptor (SCD): is obtained by employing a Haar transform-based encoding scheme across values of a color histogram in the HSV color space.
    • Group-of-frames (GoF), or group-of-pictures (GoP), descriptor: is an extension of the SCD to a group of frames in a video sequence.
    • Color structure descriptor (CSD): aims at identifying localized color distributions using a small structuring window.
    • Color layout descriptor (CLD): captures the spatial layout of the representative colors on a grid superimposed on a region or image.
  • Texture descriptors:

    • Homogeneous texture descriptor (HTD): provides a quantitative representation of the region texture using the mean energy and energy deviation from a set of frequency channels.
    • Texture browsing descriptor (TBD): compact descriptor that specifies the perceptual characterization of a texture in terms of regularity, coarseness, and directionality.
    • Edge histogram descriptor (EHD): represents local edge distribution in an image.
  • Shape descriptors:

    • Region-based shape descriptor: expresses pixel distribution within a 2-D object or region. It is based on both boundary and internal pixels and can describe complex objects consisting of multiple disconnected regions as well as simple objects with or without holes.
    • Contour-based shape descriptor: is based on the Curvature Scale-Space (CSS) representation of the contour of an object and is said to emulate well the shape similarity perception of the human visual system.
    • 3-D shape descriptor: is based on the histogram of local geometrical properties of the 3-D surfaces of the object.
  • Motion descriptors:

    • Motion activity: captures the intuitive notion of ‘intensity (or pace) of action’ in a video segment, by encoding the intensity, direction, spatial distribution, and temporal distribution of activity.
    • Camera motion: encodes all camera operations such as translations, rotations, and changes of focal length, as well as all possible combinations of these.
    • Motion trajectory: describes the displacement of objects over time, where an object is any (set of) spatiotemporal region(s) whose trajectory is relevant in the context in which it is used.
    • Parametric motion: represents the motion and/or deformation of a region or image, using one of the following classical parametric models: translational, rotational, affine, perspective, or quadratic.
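
As a toy illustration of the idea behind the scalable color descriptor – a color histogram encoded with a Haar transform so that coefficients can be dropped for a coarser, more scalable description – consider the following sketch. The function names, bin count, and hue-only quantization are assumptions for illustration, not the normative MPEG-7 extraction procedure:

```python
# Toy sketch of a scalable-color-style descriptor (hypothetical, not the
# normative MPEG-7 extraction).

def hue_histogram(hues, bins=16):
    """Quantize hue values (0-360 degrees) into a fixed number of bins."""
    hist = [0] * bins
    for h in hues:
        hist[min(int(h / 360 * bins), bins - 1)] += 1
    return hist

def haar_1d(values):
    """Full 1-D Haar decomposition: repeated pairwise averages/differences."""
    coeffs = list(values)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        sums = [(coeffs[2 * i] + coeffs[2 * i + 1]) / 2 for i in range(half)]
        diffs = [(coeffs[2 * i] - coeffs[2 * i + 1]) / 2 for i in range(half)]
        coeffs[:n] = sums + diffs
        n = half
    return coeffs

hist = hue_histogram([10, 15, 200, 210, 340], bins=8)
descriptor = haar_1d(hist)
# descriptor[0] is the overall average; later coefficients add detail and
# can be truncated for a more compact (scalable) description.
```

Truncating the trailing Haar coefficients yields progressively coarser descriptions from the same data, which is the sense in which the SCD is "scalable."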

An example of domain-specific visual descriptor is the face descriptor, which uses Principal Component Analysis (PCA) to represent the projection of a face vector onto a set of 48 basis vectors that span the space of all possible face vectors. These basis vectors are derived from eigenvectors of a set of training faces and are reasonably robust to view-angle and illumination changes.
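
The projection step behind such a PCA-based face descriptor can be illustrated with a toy example. Here a 4-element "face vector" is projected onto two hypothetical orthonormal basis vectors (instead of the 48 used by the standard); all values are invented for illustration:

```python
# Hypothetical illustration of projecting a flattened face vector onto a
# small set of basis vectors; the resulting coefficients form the compact
# face description.

def project(face, basis):
    """Return the coefficient of `face` on each basis vector (dot products)."""
    return [sum(f * b for f, b in zip(face, vec)) for vec in basis]

# Toy 4-pixel "face" and two orthonormal basis vectors (assumed values).
face = [0.5, 0.5, 0.5, 0.5]
basis = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
]
coeffs = project(face, basis)  # one coefficient per basis vector
```

In the standard, the basis vectors are eigenvectors derived from a training set of faces, so the few projection coefficients capture most of the variation between faces.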

Audio descriptors

  • Low-level audio descriptors:

    • Basic descriptors: AudioWaveformType (describes the minimum and maximum sampled amplitude values reached by the audio signal within the sample period) and AudioPowerType (describes the instantaneous power of the samples in the frame).
    • Basic spectral descriptors: provide a very compact description of the signal’s spectral content.
    • Basic signal parameters: include the computation of the fundamental frequency of the audio signal and two measures of the harmonic nature of the signal’s spectrum.
    • Temporal timbre descriptors: used within an audio segment and intended to compute parameters of the signal envelope.
    • Spectral timbre descriptors: capture timbral features of the signal in the spectral domain, such as the harmonic spectral centroid, deviation, spread, and variation.
    • Spectral basis representations: low-dimensional projections of a spectrum, used primarily as building blocks for sound classification and indexing tools.
    • Silence segment: attaches the semantic of ‘silence’ (i.e., no significant sound) to an audio segment.
  • High-level audio descriptors: lower-level audio descriptors can be used as building blocks of higher-level description tools, specialized in tasks such as: general sound recognition, spoken content description, musical instrument timbre description, or melody description.
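
As a rough sketch of the two basic descriptors mentioned above, the following computes per-frame minimum/maximum amplitude (in the spirit of AudioWaveformType) and instantaneous mean power (in the spirit of AudioPowerType). The frame size and helper names are assumptions, not part of the standard:

```python
# Minimal sketch of two basic low-level audio descriptors over a signal
# split into fixed-size frames (names and framing are assumptions).

def frames(signal, size):
    """Split a sample sequence into consecutive frames of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal), size)]

def audio_waveform(frame):
    """Min and max sampled amplitude within the frame (AudioWaveform-like)."""
    return (min(frame), max(frame))

def audio_power(frame):
    """Mean squared amplitude of the frame's samples (AudioPower-like)."""
    return sum(s * s for s in frame) / len(frame)

signal = [0.0, 0.5, -0.5, 1.0, -1.0, 0.25, 0.0, -0.25]
summary = [(audio_waveform(f), audio_power(f)) for f in frames(signal, 4)]
```

Such compact per-frame summaries are what allow an audio segment to be indexed and compared without re-reading the raw samples.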

The role of metadata

“Metadata is the value-added information which documents the administrative, descriptive, preservation, technical and usage history and characteristics associated with resources.” It provides the foundation upon which digital asset management systems (such as digital libraries or video databases) rely to provide fast, precise access to relevant resources across networks and between organizations. Some of the main challenges associated with metadata are its cost, its unreliability, its subjectivity, its lack of authentication and its lack of interoperability with respect to syntax, semantics, vocabularies, languages and underlying models. In this section we look at ongoing activity and relevant standards for representation of metadata associated with multimedia contents.

XML Technologies

XML and its associated technologies, such as XML Namespaces, XML Schema Language, and XML Query languages, have become the key to enabling automated computer processing, integration, and exchange of information over the Internet.

Extensible Markup Language (XML)

XML is a simple and flexible (meta-)language that makes it possible to exchange data in a standard format among heterogeneous machines and networks. It has become the de facto standard for representing metadata descriptions of resources on the Internet.
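
As a minimal illustration of exchanging metadata in a standard XML format, the snippet below builds a small description with Python's standard library and parses it back, as a consumer on another machine would. The element names are invented for illustration and do not come from any real schema:

```python
# Build a tiny XML metadata description and round-trip it through
# serialization and parsing (element names are hypothetical).
import xml.etree.ElementTree as ET

desc = ET.Element("MediaDescription")
title = ET.SubElement(desc, "Title")
title.text = "Sample clip"
ET.SubElement(desc, "Duration", attrib={"seconds": "30"})

xml_text = ET.tostring(desc, encoding="unicode")  # serialized description

# A consumer parses the same text back into a tree and reads the metadata.
parsed = ET.fromstring(xml_text)
title_text = parsed.find("Title").text            # "Sample clip"
duration = parsed.find("Duration").get("seconds")  # "30"
```

Because the serialized form is plain, self-describing text, heterogeneous machines and networks can exchange it without sharing any binary format.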

XML Schema Language

XML Schema Language provides a means for defining the structure, content and semantics of XML documents. It provides an inventory of XML markup constructs which can constrain and document the meaning, usage and relationships of the constituents of a class of XML documents: data types, elements and their content, attributes and their values, entities and their contents and notations. Thus, the XML Schema language can be used to define, describe and catalogue XML vocabularies for classes of XML documents, such as metadata descriptions of web resources or digital objects. XML Schemas have been used to define metadata schemas in the MPEG-7 standard.

XML Query (XQuery)

XML Query provides flexible query facilities to extract data from real and virtual documents and collections both locally and on the World Wide Web.

Semantic Web-related technologies

The Semantic Web vision of “bringing structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users” will only be fully realized when programs and applications are created that collect Web content from diverse sources, process the information and exchange the results with other programs.

Two of the key technological building blocks for the Semantic Web are:

  • Formal languages for expressing semantics, such as the Resource Description Framework (RDF) and OWL (Web Ontology Language).
  • The ontologies which are being constructed from such languages.


The Resource Description Framework (RDF) uses triples to make assertions that particular things (people, Web pages, or whatever) have properties (such as “is a sister of,” “is the author of”) with certain values (another person, another Web page). The triples of RDF form webs of information about related things. Because RDF uses URIs to encode this information in a document, the URIs ensure that concepts are not just words in a document but are tied to a unique definition that everyone can find on the Web.
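
The triple model can be sketched in a few lines; the URIs and property names below are invented for illustration:

```python
# RDF's data model as (subject, predicate, object) triples; querying a
# web of assertions is then simple pattern matching over the triples.
# All URIs below are hypothetical examples.

triples = [
    ("http://example.org/alice", "http://example.org/isSisterOf", "http://example.org/bob"),
    ("http://example.org/alice", "http://example.org/isAuthorOf", "http://example.org/page1"),
    ("http://example.org/carol", "http://example.org/isAuthorOf", "http://example.org/page2"),
]

def objects(subject, predicate):
    """All values asserted for a given subject/property pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

pages_by_alice = objects("http://example.org/alice", "http://example.org/isAuthorOf")
```

Because every subject and property is a URI rather than a bare word, two independent documents asserting triples about `http://example.org/alice` are automatically talking about the same thing.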


The OWL Web Ontology Language is designed for use by applications that need to process the content of information instead of just presenting information to humans. OWL facilitates greater machine interpretability of Web content than that supported by XML, RDF, and RDF Schema (RDF-S) by providing additional vocabulary along with a formal semantics.

Integration between multimedia standards, such as MPEG-7, and Semantic Web standards, such as RDF and OWL, is a current research topic; see the literature for recent examples of activity in this direction.

