
Multimedia Semantics - Conceptualization of the multimedia content; Extraction and representation of multimedia semantics

Giorgos Stamou and Stefanos Kollias
Institute of Communication and Computer Systems,
National Technical University of Athens, Greece

Definition: Multimedia semantics deals with the question of how to conceptually index, search, and retrieve digital multimedia content, that is, how to extract and represent the semantics of raw multimedia data in a human- and machine-understandable way.

Conceptualization of the multimedia content

Recent advances in computer and communication technologies have caused a huge increase in the digital multimedia information distributed over the web. On the one hand, the digital multimedia production line can now be followed by everyone: production houses shoot high-quality video in digital format; organizations that hold multimedia content (TV channels, film archives, museums, libraries, etc.) digitize it and use different digital formats for preservation, management, and distribution; and most people own digital cameras, scanners, and similar devices and produce image and video content in MPEG and JPEG formats, ready to be delivered. On the other hand, thanks to the maturity of storage and data network technologies, digital formats now provide the cheapest, safest, and easiest way to store and deliver multimedia content, even at high resolutions. A great role in this chain has, of course, been played by the standardization activities that produced MPEG, JPEG, and other digital coding formats.

The amount of multimedia information delivered on the Web is now so huge that improvements in coding standards and web technologies alone cannot satisfy the needs of end users searching for and retrieving content. Although the standardization activities of ISO and other communities (MPEG-7, MPEG-21, Dublin Core, etc.) have provided standards for describing content, using formal representation languages like XML, these standards have not been widely adopted, mainly for two reasons. The first is that manually annotating multimedia content in the standard metadata form is difficult and time-consuming, and thus very expensive. The second is that the representation languages used by the standards (such as XML and XML Schema) do not provide any way to formally represent semantics, and thus do not solve the problem. The main question is how to conceptually index, search, and retrieve digital multimedia content, that is, how to extract and represent the semantics of raw multimedia data in a human- and machine-understandable way.

The problem was pointed out by Tim Berners-Lee some years ago and has boosted research and standardization in the emerging technology of the Semantic Web. The W3C built on recent research in knowledge representation and reasoning and started standardizing a new content representation language with formal semantics, the latest result of which is the Web Ontology Language (OWL). Still, the question remains whether Semantic Web technology is mature enough to be used for multimedia content annotation. Several approaches have been proposed to address this problem. Some try to transfer all the multimedia standards into the new knowledge representation languages such as OWL. Others keep some of the metadata in the structural form of the multimedia standards and represent the rest, which needs semantics, in RDF or OWL. In any case, the clear message is that the multimedia standards have to adopt Semantic Web technology, even though it is still "under construction": not only because the representation languages of the Semantic Web (like OWL) will in the very near future be the only standard for representing content with formal semantics on the web (as HTML and XML are for the serialization of unstructured and structured information), but also because they provide a formal framework for representing the knowledge needed for the automatic analysis of raw multimedia content and the extraction of its semantic annotation.

Extraction and representation of multimedia semantics

Multimedia documents are complex spatio-temporal signals providing information at several levels of abstraction. The user conceptualizes and understands the multimedia content by capturing all the different cues (audio, video, speech, images, text), identifying the sources of semantic information (a scene of a video, a shot, a frame, a specific object, a text box, a narration, a specific sound, etc.) at various levels of detail, and then understanding the semantic information provided by these sources and their interrelations. Following this line of reasoning, the information provided by a multimedia document can be formalized, represented, analyzed, and processed at three different levels of abstraction: the subsymbolic, the symbolic, and the logical (see Figure 1).

The subsymbolic level of abstraction covers the raw multimedia information represented in the well-known formats of the different cues, such as video, image, audio, text, and metadata. At this level, the information is processed by various processing, feature extraction, and classification tools, and the final result is the extraction, representation, and description of the different information sources. The detection of these sources is essentially a matter of temporal, spatial, and spatio-temporal multi-cue segmentation and feature extraction, which provides the structure of the multimedia document, i.e. a tree (or, more generally, a graph) representing all the different sources, their characteristics, and their interrelations in a symbolic manner. This information forms the symbolic level of abstraction. Moreover, at this level information (metadata) covering the whole lifecycle of the multimedia document, from pre-production to post-production and use, is stored. All the above information is serialized in descriptive representation languages like XML and processed by feature-based recognition systems that provide the semantics of the multimedia document. The semantics are in fact mappings between the symbolically labeled, structured information sources and a terminology (a higher knowledge) described in a formal knowledge representation language like OWL. This formal terminology forms the logical level of abstraction, at which the implicit knowledge in the multimedia document description can be made explicit with the aid of reasoning algorithms. This means that syllogistic processing of the existing, formally represented knowledge, interpreted and instantiated in the world of the multimedia document description, provides further knowledge about the multimedia document, again serialized in formal descriptive languages and semantically interpreted with the aid of the terminology.
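The interplay between the symbolic and logical levels can be sketched with plain Python structures. This is a minimal illustration, not any actual MPEG-7 or OWL vocabulary: the concept names and the subclass hierarchy are invented for the example, and the "reasoning" is reduced to walking the hierarchy to make implicit types explicit.

```python
# Logical level: a tiny terminology with subclass relations (invented names).
SUBCLASS = {
    "Face": "BodyPart",
    "BodyPart": "PhysicalObject",
}

# Symbolic level: a detected information source mapped to a concept.
annotations = {
    "region-1": "Face",  # a segmented image region labeled by the analysis
}

def infer_types(label):
    """Make implicit knowledge explicit by walking the subclass hierarchy."""
    types = [label]
    while label in SUBCLASS:
        label = SUBCLASS[label]
        types.append(label)
    return types

print(infer_types(annotations["region-1"]))
# ['Face', 'BodyPart', 'PhysicalObject']
```

A region asserted to be a "Face" is thus also known, through the terminology, to be a "BodyPart" and a "PhysicalObject", without those labels ever being stated in the document description itself.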

The analysis of a multimedia document and its automatic annotation is in fact a continuous process across all the levels of abstraction that generally performs the following steps: it detects possible information sources with the aid of processing and clustering; extracts the appropriate features and represents them; tries to recognize the semantic meaning of each information source with the aid of semantic interpretation (assertion through feature matching); reasons with the aid of the terminology; and checks the consistency of the resulting knowledge. For example, suppose, without loss of generality, that the multimedia document is a simple image containing a face (as a small object in the image). With the aid of a spatial color segmentation algorithm, a possible information source is detected. Suppose also that this area is in fact the area of the face (of course it is not yet recognized and labeled with the "face" symbol). Then, several predefined features can be extracted, and all the above information (the possible object and its descriptors) is serialized in MPEG-7 XML. The next step is to match the represented features against those stored in an existing knowledge base of objects. If the features match the "face" object, then a semantic association is defined between the detected object and the "face" object of the higher knowledge, and the analysis tool has thereby recognized the object. Continuing this process, several objects can be recognized, and then, through reasoning, additional objects or relations between the existing ones can be added to the description of the image.
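The feature-matching step of this example can be sketched as follows. The feature vectors, the prototype values, and the distance threshold are all invented for illustration; a real system would extract standardized descriptors (e.g. MPEG-7 color or shape descriptors) and use a matching scheme tuned to them.

```python
import math

# "Higher knowledge": prototype feature vectors for known concepts
# (illustrative values, not real descriptors).
prototypes = {
    "face": [0.8, 0.3, 0.5],
    "car":  [0.1, 0.9, 0.2],
}

def match(features, threshold=0.3):
    """Assert a semantic label through feature matching:
    pick the nearest prototype, but only if it is close enough."""
    best_label, best_dist = None, float("inf")
    for label, proto in prototypes.items():
        dist = math.dist(features, proto)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= threshold else None

# A region produced by a (hypothetical) spatial color segmentation step.
region = {"id": "r1", "features": [0.75, 0.35, 0.45]}
print(match(region["features"]))  # face
```

If no prototype is close enough, `match` returns `None` and the region remains an unlabeled information source, which is exactly the situation where a human annotator steps in under the semi-automatic approaches discussed below.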

The above scenario is very optimistic, and for several obvious reasons there are no real-world systems providing all the steps of this analysis for general multimedia documents. Unfortunately, this also seems likely to remain the case in the very near future (the next five years). The automatic analysis process faces two main obstacles: detecting the right candidate information sources, and defining the semantic interpretation through the feature matching process. Nevertheless, in semi-automatic approaches these two steps are generally straightforward for a human, and the whole process can be performed whenever the higher knowledge is available (at least formally providing the terminology, i.e. the names of the objects and their interrelations, in a formal knowledge representation language like OWL). Moreover, at every step of the multimedia production chain, from pre-production to post-production, useful information that cannot be automatically recognized (for example, the name of the color corrector of a video sequence) can be described as metadata, again in a formal descriptive language like XML, and semantically interpreted with the aid of the higher knowledge. Thus, the definition and formal representation of multimedia semantics at all levels of abstraction remains a critical issue for both the short and the long term.

Several standards have been proposed and used in the literature for the representation of multimedia document descriptions and their semantic interpretation (Dublin Core, MPEG-7, MPEG-21, etc.). These standards mainly focus on the extraction of a predefined set of categories and types of metadata. For example, MPEG-7 provides a very rich set of Descriptors and Description Schemes covering almost all levels of abstraction. Using this predefined set, it is not difficult to provide the semantics. For this representation, XML is used (with some minor extensions). Moreover, a less powerful way of defining new descriptors is provided (mainly through XML Schema). Although this approach is perfectly suited to the structural description of multimedia documents and to metadata such as director, producer, etc., it is rather inappropriate for the semantic description of their content. The main reason for this shortcoming is the framework's inability to provide formal semantics and inference services over arbitrary description structures. On the other hand, the work on knowledge representation and reasoning and the standards provided by the W3C community (DAML+OIL, OWL) are ideal for the representation of the content, providing formal semantics and inference services, but they are very difficult to use for the structural description of multimedia content. Moreover, since the main focus of this standardization was not the representation of multimedia knowledge, some language extensions are needed before these standards can be widely used. Consequently, a combination of the above standards seems to be the most promising way to describe multimedia documents in the near future.
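One way such a combination could look is a structural XML description that carries a pointer into a formal ontology. The following sketch builds a simplified, MPEG-7-style description: the element and attribute names are stand-ins that are not valid against the actual MPEG-7 schema, and the ontology URI is invented.

```python
import xml.etree.ElementTree as ET

# Structural level: a simplified XML description of one image region.
desc = ET.Element("MultimediaDescription")
region = ET.SubElement(desc, "StillRegion", id="r1")
ET.SubElement(region, "Label").text = "face"

# Semantic level: link the symbolic label to a concept in a formal
# ontology (the URI is a hypothetical example).
ET.SubElement(region, "SemanticRef",
              href="http://example.org/ontology#Face")

print(ET.tostring(desc, encoding="unicode"))
```

The XML part remains easy to validate and process with existing multimedia tools, while an OWL reasoner can follow the `SemanticRef` link to obtain formal semantics and inference services for the labeled region.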

