Indexing 3D Scenes - 3D scenes as pure geometric worlds, Localizing real world objects in a 3D scene

Ioan Marius BILASCO, Jérôme GENSEL, Hervé MARTIN, and Marlène VILLANOVA-OLIVER
Laboratoire LSR IMAG, Grenoble, France

Definition: Semantic queries on indexed objects allow the reuse of 3D scenes.

Nowadays, 3D is a rapidly expanding medium. In particular, with the emergence of dedicated standards such as VRML and X3D, 3D animations are widely used on the Web. The continuous evolution of the computing capabilities of desktop computers is another factor that facilitates the large-scale deployment of 3D information content. At the same time, the demand for 3D information is becoming more and more sustained in various domains such as spatial planning, risk management, telecommunications, transport, defense, and tourism. 3D information should represent a real-world scene as accurately as possible and should exhibit properties (such as topological relations) that allow complex spatial analysis.

The construction of a 3D scene is a complex and time-consuming task. Thus, being able to reuse 3D scenes is a very important issue for the multimedia community. In order to meet this goal, the indexing process is essential. Indexing implies the enrichment of the raw information contained in a multimedia document. In general, indexing is achieved by means of signal analysis, or by manual or semi-automatic annotations. Signal indexing supposes the automatic extraction of implicit features from the document. For instance, analyzing a 2D image at the signal level results in the extraction of the dominant color, the color histogram, etc.

Usually, a 3D scene models only the geometric features of the scene, paying very little attention to the semantic information that should guide and help reuse. The identification of interesting/reusable objects in the scene is part of the indexing process. The granularity of a reusable object can vary from a simple geometric element (e.g., a cube) to a full scene (e.g., a casual office scene, a building). Since a 3D scene is built up from different geometric elements, the identification of an object is performed by localizing its geometric elements. In order to facilitate the reuse of 3D objects, some semantic information should be added. Semantic queries on indexed objects would then yield the most appropriate result according to the intent of reuse.

3D scenes as pure geometric worlds

The 3D community benefits from the support of a very active consortium, the Web3D Consortium, involving many companies and institutions (NASA, HP, NVIDIA, Sun Microsystems) as well as academic communities (the Communications Research Centre of Canada, the GIS Research Center at Feng Chia University, and others). The research efforts of the consortium are directed towards the development of a widely adopted standard for deploying 3D information all over the Web.

The Extensible 3D (X3D) standard emerged in mid-2002. X3D was proposed as a revised version of the Virtual Reality Modeling Language (VRML97). In July 2002, the Web3D Consortium made available the final working draft of X3D. X3D was accepted by ISO as standard ISO/IEC 19775 for the communication of real-time 3D scenes in August 2004. The final specifications were produced by the end of October 2004.

This standard defines a runtime environment and a delivery mechanism for 3D content and applications running over a network. It combines geometry descriptions, runtime behavioral descriptions and control features. It proposes several types of encodings, including an Extensible Markup Language (XML) encoding.

X3D is extensible, as it is built on a set of components organized in profiles. Each profile contains a set of components, and a component introduces a specific collection of nodes. Future extensions of the standard will be made by defining new components and organizing them into new profiles.

An X3D document represents the scene as an n-ary tree. The tree is composed of nodes supported by the selected profile. Among the most important nodes are: geometric primitives (Cube, Box, IndexedLineSet, etc.), geometric transformations (Transform, ensuring translations and rotations), composite objects (Group), alternate content (Switch), multi-level representation (LOD), etc. The tree also contains environment elements (lights, viewpoints, etc.) and a metadata node (WorldInfo).

Figure 2 presents the tree corresponding to a 3D scene describing a researcher's office. A view of the scene is shown in Figure 1: the office contains a desk and a chair, and on the desk there are two stacks of books and papers. An excerpt of the X3D code corresponding to the materialization of the chair can be found in Figure 3.

The desk is built up using three boxes: one for the desk top and two for the desk legs. The chair is composed of three boxes as well: one for the back side and the back-side legs, one for the front legs and one to sit on. The books are modeled by three boxes: two small ones for the books on the left and a bigger one for the stack of papers.

The resulting X3D model of a 3D scene exhaustively treats the geometry and the environmental aspects (lights, etc.), and describes the levels of user interaction. However, the scene is not self-descriptive: it does not contain any information on the real-world objects it includes. The main purpose of an X3D scene model is to ensure the delivery of the scene to the user's rendering device. Semantic aspects are scarcely included.

A 3D scene is a condensed information space (geometric description, spatial organization, etc.). The richness of a 3D scene cannot be fully exploited if this information is not explicitly materialized. In our example, the fact that the chair is under the table cannot be directly deduced from the X3D description of the scene. Geometric constructs embed information that is easily understood by human actors, but the raw information, globally, remains unexploitable by means of queries without further analysis. Consequently, some spatial analysis is required.

Even though the structural organization of an X3D file facilitates the retrieval of attribute information (position, color, appearance), a series of spatial or semantic queries still remain unanswered. A complementary description, by means of annotations, should take into account the semantic aspects of information associated with the scene (e.g., the chair allows the researcher to sit at the table – a semantic relation; the chair cannot support a weight heavier than 200 kg – a semantic property). Firstly, the annotations should allow identifying the geometric elements that correspond to the representation of a real-world object. Secondly, semantic properties should be associated with the related real-world object, as well as the relations it has with its environment.

Each box is given an adequate size using the scale attribute of Transform nodes, as in Figure 3. Boxes are grouped together into Group nodes, placed at the right position using the translation attribute, and given a fixed orientation using the rotation attribute.
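As an illustration (the article's Figure 3 is not reproduced here), an X3D fragment materializing the chair might look as follows; the id, dimensions and positions are illustrative assumptions, not the original code:

```xml
<Group id="chair">
  <!-- seat: a flattened box, sized by scale and placed by translation -->
  <Transform translation="0 0.45 0" scale="0.45 0.05 0.45">
    <Shape><Box/></Shape>
  </Transform>
  <!-- back side and back-side legs: one tall thin box -->
  <Transform translation="0 0.6 -0.2" scale="0.45 0.6 0.05">
    <Shape><Box/></Shape>
  </Transform>
  <!-- front legs: one short thin box -->
  <Transform translation="0 0.2 0.2" scale="0.45 0.2 0.05">
    <Shape><Box/></Shape>
  </Transform>
</Group>
```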

Localizing real world objects in a 3D scene

In order to explain the localization of real-world objects in a 3D scene, one should also take into account the work done in the domain of 2D images. Localization supposes the identification of a set of elements that constitutes the representation of the object. For instance, in the case of a 2D image, objects are associated with sub-regions delimited by sets of 2D polygons.

The nature of localized elements can vary from pure geometric elements (points, lines, surfaces, volumes) to complex document entries (e.g., a cluster of geometric elements). Hence, we can distinguish two types of localization: geometric and structural.

Structural localization uses the structure of a document to indicate the document entries that correspond to the target element. For instance, if the document is XML-like, a structural localization would likely be composed of a series of XPath expressions indicating the parts of the object in the scene.

Let us consider the scene described in Figure 3. The localization of the desk legs can be expressed as the combination of the two following XPATH expressions:

/X3D/Scene/Group/Group[id='desk']/Transform[position()=2]
and
//Group[id='desk']/Transform[position()=3].

This localisation contains information about the elements that materialize the legs (Shape) as well as their position (Transform) inside the local cluster (Group id="desk"). The structural localisation should target the smallest structural element that offers an adequate description of the object.
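As a sketch of how such a structural localisation could be evaluated programmatically, the following Python fragment applies equivalent XPath expressions (using attribute syntax, @id, as supported by ElementTree's XPath subset) to a hypothetical, simplified version of the scene:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal X3D scene: a 'desk' Group containing three
# Transforms (desk top + two legs), mirroring the structure described
# in the article. The translation values are illustrative assumptions.
X3D = """
<X3D>
  <Scene>
    <Group>
      <Group id="desk">
        <Transform translation="0 1 0"><Shape/></Transform>
        <Transform translation="-1 0 0"><Shape/></Transform>
        <Transform translation="1 0 0"><Shape/></Transform>
      </Group>
    </Group>
  </Scene>
</X3D>
"""

root = ET.fromstring(X3D)

# Structural localisation of the two desk legs: the 2nd and 3rd
# Transform children of the Group whose id is 'desk'.
legs = [root.find(".//Group[@id='desk']/Transform[2]"),
        root.find(".//Group[@id='desk']/Transform[3]")]
print([leg.get("translation") for leg in legs])  # prints ['-1 0 0', '1 0 0']
```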

Sometimes, this condition cannot be satisfied: the structure of the document may not be as fine-grained as necessary. In our example, we could imagine that the bottom of the stack of papers contains articles published in the last month, while the top of the stack contains articles about multimedia indexing. The structural localisation does not allow localizing the articles at the bottom of the stack, since no structural element corresponds to the respective object: the whole stack is defined as a single box, and hence the same structural element is used to represent two real-world objects. The only solution to identify the object is to complete the structural localisation with a geometric localisation. In our example, we can isolate the papers related to multimedia indexing with a volume such as a cube having the same basis as the box and a fixed height, its coordinates being defined relatively to the box. In this case, a structural localisation is mixed with a geometric one. However, the geometric localisation could also be completely separated from the structure of the scene; in that case, the volume should be defined in the global reference system. This situation has many similarities with multimedia documents that do not have any internal structure, which is often the case with raw images, simple Digital Terrain Models, etc.
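A minimal sketch of the geometric side of this mixed localisation: an axis-aligned sub-volume, defined relative to the box modeling the stack of papers, isolates points belonging to the top of the stack. All coordinates and dimensions below are illustrative assumptions:

```python
# Geometric localisation sketch: a sub-volume with the same basis as
# the stack's box but a fixed height, isolating the top of the stack.

def inside(point, corner_min, corner_max):
    """True if the point lies within the axis-aligned volume."""
    return all(lo <= p <= hi
               for p, lo, hi in zip(point, corner_min, corner_max))

# The stack of papers: a box from (0, 0, 0) to (0.3, 0.4, 0.3),
# expressed in the box's local reference system.
papers_min, papers_max = (0.0, 0.0, 0.0), (0.3, 0.4, 0.3)

# Localisation of the multimedia-indexing papers: same basis,
# but only the top 0.1 units of the stack.
top_min, top_max = (0.0, 0.3, 0.0), (0.3, 0.4, 0.3)

print(inside((0.1, 0.35, 0.1), top_min, top_max))  # point in the top slice: True
print(inside((0.1, 0.05, 0.1), top_min, top_max))  # point near the bottom: False
```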

When geometric localisation is used, geometric queries are applied on the scene elements in order to obtain the precise geometric elements composing the object. The granularity of the retrieved geometric elements (pixels, primitives, etc.) is a subject of negotiation, chosen in order to meet specific requirements.

As in the case of 2D images region identification, one can think of three ways of performing localisation:

  • manual localisation: the user defines the set of objects and their localisation in the scene.
  • semi-automatic localisation: the machine suggests to the user a series of possible objects in the scene. In the case of an X3D structured scene, the machine could associate a virtual object with each Group node, then let the user choose the relevant ones. In the case of an unstructured 3D scene (e.g., a Digital Terrain Model), a signal-level analysis could yield interesting regions using algorithmic methods similar to those employed in 2D contour detection. Other propositions suggest the use of specific tools, like intelligent scissoring of 3D meshes [5], in order to define a precise geometric localisation.
  • automatic localisation: the system performs all the work. In order to achieve this degree of generality, domain-dependent knowledge should be implemented in the localisation process.

Indexing multimedia content using MPEG-7

The Moving Picture Experts Group (MPEG) is a working group of the ISO/IEC in charge of developing standards for the coded representation of digital audio and video. Established in 1988, the group has produced MPEG-1, MPEG-2, MPEG-4, MPEG-7 and the MPEG-21 "Multimedia Framework". MPEG-1, MPEG-2 and MPEG-4 are basically standards concerning the encoding and transmission of audio and video streams. MPEG-7 is a standard that addresses the semantic description of media resources. MPEG-21 is considered more as the description of a multimedia framework (coding, transmission, adaptation, etc.) than as a specific standard. Even if MPEG-7 and MPEG-21 were proposed in the context of digital audio and video data, they are highly extensible and could cover other areas. Due to this capability for evolution, we consider MPEG-7 a valuable candidate for fulfilling requirements in terms of semantic annotations inside a 3D scene. Hereafter, we focus on MPEG-7.

MPEG-7, formally named "Multimedia Content Description Interface", was officially approved as a standard in 2001. It provides multimedia content description utilities for the browsing and retrieval of audio and visual contents. The standard provides normative elements such as Descriptors (Ds), Description Schemes (DSs) and a Description Definition Language (DDL).

The Ds are indexation units describing the visual, audio and semantic features of media objects. They allow the description of low-level audio-visual features (color, texture, animation, sound level, etc.) as well as attributes associated with the content (localisation, duration, quality, etc.). Visual and audio features can be extracted automatically from the encoding level of the media object, while semantic features are mainly added manually.

The DSs are used to group several Ds and other DSs into structured, semantic units. A DS models a real-life entity and the relations it holds with its environment. DSs are usually employed to express high-level features of media content, such as objects, events and segments, as well as metadata linked to the creation, generation and usage of media objects. As with semantic features, DSs are rarely filled in automatically.

In order to offer an important degree of extensibility to the standard, the DDL is included as a tool for extending the predefined set of Ds and DSs. The DDL defines the syntax for specifying, expressing and combining Ds and DSs, allowing the creation of new DSs.

The existing DSs cover the following areas: visual description (VDS), audio description (ADS), and multimedia content description (MDS) – general attributes and features related to any simple or composed media object. We focus on the MDS, as it addresses organization aspects that could serve as a valuable starting point for extending indexing capabilities towards 3D documents. The VDS and the ADS are matched against the physical/logical organization or the semantics of the document.

MDSs propose metadata structures for annotating real-world or multimedia entities. The MDS is decomposed along the following axes: Content Organization, Navigation and Access, User Interaction, Basic Elements, Content Management, and Content Description.

We discuss in more detail the Content Description axis, as it offers DSs for characterizing the physical and logical structures of the content. It also ensures the semantic description using real-world concepts. The basic structural element is called a segment. A segment corresponds to a spatial, temporal or spatio-temporal partition of the content. A segment (Audio Segment DS, Visual Segment DS, Audiovisual Segment DS, Moving Regions DS) is associated with a section of the content. It can be decomposed into smaller segments, creating a hierarchical segmentation of the media content. Each segment is individually indexed using the available tools (visual DSs, audio DSs).

The conceptual aspects of the media content are formulated using the Semantic DS. Objects (ObjectDS), events (EventDS), abstract concepts (ConceptDS), places (SemanticPlaceDS) and moments of time (SemanticTimeDS) are all parts of the Semantic DS. As for the segment-based description of the content, the semantics can be organized as trees or graphs, where nodes represent semantic notions/concepts and links define semantic relations between concepts. A semantic characterization of the office scene (Figure 1) using MPEG-7 tools is illustrated below.
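In the absence of the original illustration, a simplified sketch of such a semantic characterization might look as follows; the element names follow the MPEG-7 MDS in spirit, but the identifiers, labels and relation URN are assumptions, and the fragment is not intended to be schema-valid:

```xml
<Semantics>
  <!-- a real-world object of the scene -->
  <SemanticBase xsi:type="ObjectType" id="chair">
    <Label><Name>Chair</Name></Label>
  </SemanticBase>
  <!-- the place containing the objects -->
  <SemanticBase xsi:type="SemanticPlaceType" id="office">
    <Label><Name>Researcher's office</Name></Label>
  </SemanticBase>
  <!-- semantic relation: the office is the location of the chair -->
  <Relation type="urn:...:locationOf" source="#office" target="#chair"/>
</Semantics>
```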

The structural schemas and the semantic ones can be linked to each other. Hence, the content description can combine content structure and semantic structure.


In this article, we have presented some problems that are inherent to the process of indexing and reusing 3D scenes. Widely accepted standards exist for the modeling of 3D scenes (X3D) and multimedia indexing (MPEG-7). They support, respectively, the Web deployment of 3D information and the management of multimedia content. However, to our knowledge, no widely accepted research project aims at solving the interoperability issues between the two standards in order to support the management – and notably the reuse – of 3D multimedia content.

The interoperability issues concern the capacity to address 3D content inside a multimedia document. Specific 3D region/object locators must be designed. A set of specific 3D spatial description schemes has to be provided in order to facilitate the spatial analysis required by most 3D application domains (spatial planning, transport).

Research efforts will address the interoperability issues between the standards in order to improve the flexibility and reuse of 3D scenes. Work is to be performed to enhance the capacity of MPEG-7 to address and characterize 3D content. The semantics added to the pure geometric X3D modeling of the scene enhance the reuse of 3D objects. Complex geometric, spatial and/or semantic queries could then be formulated in order to extract the most appropriate 3D content according to the specific needs of applications or scene designers.
