
Real-World Multimedia Systems - Real World Capture, Analysis, Association, Querying and Retrieval, Conclusion


Harry Agius, Chris Crockford, and Arthur G. Money
Brunel University, Uxbridge, UK

Definition: Real-world multimedia systems can be defined as systems that utilize real-world data to support the querying and retrieval of multimedia content.

Multimedia systems have always been at the heart of technical convergence, fusing video and audio with text, images, and graphics. However, increased and cheaper processing power, combined with other advances such as affordable sensor equipment, has enabled this convergence to take on new dimensions, whereby multimedia content and real-world data converge within a real-world multimedia system. Real-world multimedia systems can be defined as systems that utilize real-world data to support the querying and retrieval of multimedia content. This retrieval may be either pulled or pushed. Real-world multimedia systems vary substantially, but broadly they can be seen to embody the architecture shown in Figure 1.

Various real-world data is captured from the environment and/or human subjects. Sensors may be used to capture the data automatically, or semi-automatically, since some sensors require manual adjustment at multiple stages of capture. The use of intelligent sensors enables some analysis to be automatically undertaken on the real-world data at the point of capture. If sensors are not used, the data may be derived from various non-real-time sources and input into the system manually or, for suitable real-world data, derived from real-world multimedia content using noninvasive techniques such as intelligent computer vision during the Analysis component. Real-world data may optionally be complemented by captured real-world multimedia content of the environment or human subjects, such as surveillance video footage, sporting events, consumer digital photographs, reconnaissance video footage, video depicting users' facial expressions, and so on. More conventional, non-real-world multimedia content may also be used by the system, such as movies, TV programs, and news footage. During Analysis, real-world data is analyzed together with the real-world and non-real-world multimedia content to interpret and structure the data and derive further information. During the Association component of the architecture, relationships between the real-world data and the multimedia content within the system are expressed, so that all data and content may be stored in a structured and integrated format to create a real-world multimedia resource. This resource is used in a variety of ways during the Querying and Retrieval components, ranging from direct retrieval based on real-world content-based or contextual data to navigation within interactive, immersive real-world-like environments.
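The flow through the four components can be illustrated with a minimal sketch. All class and function names here are hypothetical and purely illustrative; the "analysis" and "association" steps are trivial stand-ins for what real systems do with far richer models:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the architecture of Figure 1:
# Capture -> Analysis -> Association -> Querying/Retrieval.

@dataclass
class Reading:
    sensor: str        # e.g. "gps", "temperature"
    value: float
    timestamp: float   # seconds

@dataclass
class MediaItem:
    uri: str
    start: float       # capture start time (seconds)
    end: float         # capture end time (seconds)
    metadata: dict = field(default_factory=dict)

def analyze(readings):
    """Analysis: group raw sensor readings per sensor (a trivial
    stand-in for interpretation and abstraction)."""
    by_sensor = {}
    for r in readings:
        by_sensor.setdefault(r.sensor, []).append(r)
    return by_sensor

def associate(by_sensor, items):
    """Association: attach readings to media items that overlap them in
    time, yielding an integrated real-world multimedia resource."""
    for item in items:
        for sensor, rs in by_sensor.items():
            item.metadata[sensor] = [
                r.value for r in rs if item.start <= r.timestamp <= item.end
            ]
    return items

def query(resource, sensor, predicate):
    """Querying/Retrieval: pull items whose associated real-world data
    satisfies a predicate."""
    return [i for i in resource
            if any(predicate(v) for v in i.metadata.get(sensor, []))]

# Capture: raw sensor readings plus captured media content.
readings = [Reading("temperature", 31.0, 10.0), Reading("temperature", 18.0, 60.0)]
items = [MediaItem("clip1.mp4", 0.0, 20.0), MediaItem("clip2.mp4", 50.0, 70.0)]
resource = associate(analyze(readings), items)
hot = query(resource, "temperature", lambda t: t > 25)  # clips captured in hot weather
```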

Real World Capture

Environmental data can provide a wealth of information regarding artifacts in the world and properties of the world itself at specific times. This data includes geographic positioning using GPS (Global Positioning System) or differential GPS; camera parameters such as pitch or yaw; weather data such as wind speed, temperature, and humidity; detection of smoke, flames, and radiation; topographic maps depicting shape and elevation; traffic speed and movements; and so on. In the Informedia Experience-on-Demand system, GPS units wired into wearable vests are exploited to generate panoramic views from various perspectives for video captured by users wearing the vests. Arbeeny and Silver use a GPS receiver to encode spatial and temporal coordinates of the location of a mobile camera, together with a digital compass sensor to encode direction of travel in relation to magnetic north. In this way, captured video streams depicting real-world images may be suitably geo-referenced. In the situated documentary augmented reality system created at Columbia University, a single-sensor inertial/magnetometer orientation tracker is mounted rigidly on a head band attached to a see-through HMD (head mounted display) worn by the user. The tracker enables the position of the user to be determined through a differential GPS system. 3D graphics, imagery and sound are then overlaid via the HMD on top of the real world viewed by the user. Additional material is presented on a hand-held pen computer. The situated documentaries tell the stories of events that took place on the campus. For virtualized architectural heritage systems, which seek to recreate accurate, virtual historic and cultural landmarks, a range of contact (touch) and non-contact (camera-based) capture technologies may be used. Examples of the former include sonic, optical, electromagnetic and satellite triangulation, laser- and target-based time-of-flight (TOF) ranging, and stereo photogrammetry.
Examples of the latter include stereo video auto-photogrammetry, sonic time-of-flight, and phase-based laser interferometry. The use of RFID (radio frequency identification) devices has also proven useful for data capture in real-world multimedia systems. Volgin et al. describe a multimedia retrieval system for images captured by mobile devices, where sensor motes are used to capture environment-specific data such as light intensity, temperature, humidity, location, and users in proximity. These sensor motes communicate with gateway motes attached to laptops, which collect this data and turn it into metadata for use in the multimedia retrieval system.
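Geo-referencing a video stream in the spirit of the GPS/compass approach above typically has to bridge a rate mismatch: GPS fixes arrive far less often than video frames. A minimal sketch (all names hypothetical) interpolates the bracketing fixes to tag each frame timestamp with a position:

```python
import bisect

# Hypothetical sketch: linearly interpolate sparse GPS fixes to assign a
# latitude/longitude to an arbitrary video frame timestamp.

def interpolate_fix(fixes, t):
    """fixes: time-sorted list of (timestamp, lat, lon); t: frame time.
    Returns (lat, lon), clamping to the first/last fix outside the range."""
    times = [f[0] for f in fixes]
    i = bisect.bisect_left(times, t)
    if i == 0:
        return fixes[0][1:]
    if i == len(fixes):
        return fixes[-1][1:]
    (t0, lat0, lon0), (t1, lat1, lon1) = fixes[i - 1], fixes[i]
    w = (t - t0) / (t1 - t0)                     # fractional position in time
    return (lat0 + w * (lat1 - lat0), lon0 + w * (lon1 - lon0))

fixes = [(0.0, 51.533, -0.473), (10.0, 51.534, -0.471)]  # illustrative fixes
lat, lon = interpolate_fix(fixes, 5.0)  # frame halfway between the two fixes
```

Linear interpolation is a reasonable assumption only over short inter-fix intervals; a real system might also smooth fixes or fuse them with compass and inertial data.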

Human subject data concerns real-world data about human subjects, typically users of the real-world multimedia system. This data concerns their biological properties or their physical or mental behavior, including heart rate, galvanic skin response (GSR), blood volume pulse (BVP), respiration, gestures, facial expressions, motion, and content browsing and usage behavior. In 3D Live, an augmented reality video conferencing system, users wear a video see-through HMD connected to a computer. Real name cards contain a black tracking pattern, which is used by the system to calculate the user's viewpoint. This is then used to overlay virtual video onto the card, taken from fourteen video cameras that surround a remote collaborator. The Conductor's Jacket is a sensor interface that gathers its wearer's gestures and physiology. It aims to interpret and synthesize music to accompany conducting gestures. Camurri et al. use a range of sensors, some on-body, as well as live audio and video captured from the user, to detect the expressiveness and affection of users as they communicate in various non-traditional ways, such as through gestures, postures, facial expressions, movement or voice. This is used in a range of systems to provide personalized multimedia content which alters according to the expressiveness or affection of the users, e.g. changing how characters are animated or the type or style of music which is played back. The Point At system and the Interactive Blackboard system allow users to communicate with natural hand gestures to indicate which part of a painting interests them so that audio information about the subject or object can be provided (in the former) or to zoom in and out of various zones on a map and select icons to gain further information (in the latter). In both cases, cameras capture the gestures, which are then analyzed and interpreted.
Aizawa et al. capture a range of real-world data to support retrieval of multimedia life logs (which record a user's daily experiences in detail). This includes a brain wave signal, acquired from a brain wave analyzer, which is used to determine a person's arousal status at the time they captured the real-world video. This can later be used to interpret whether the user was interested in the content of the video at that time. Motion sensors are used to capture information about activities occurring during video capture. These motion sensors include an acceleration sensor and a gyro sensor, which can be used to identify activities such as walking, running, or standing still. Logs of users' video browsing behavior have also proven useful for video summarization, where they are later subjected to analysis in order to infer user interest.
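A common way to turn an acceleration signal into activity labels of the kind mentioned above is to threshold a simple statistic over a short window. The sketch below is purely illustrative, not Aizawa et al.'s method; the variance thresholds are invented and would need per-device, per-user calibration:

```python
# Hypothetical sketch: classify an activity from a window of acceleration
# magnitudes (in g) by the variance of the signal. Thresholds illustrative.

def activity_from_accel(samples, still_max=0.05, walk_max=0.6):
    """samples: acceleration magnitudes over a short window.
    Low variance -> standing, moderate -> walking, high -> running."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    if var < still_max:
        return "standing"
    if var < walk_max:
        return "walking"
    return "running"

still = activity_from_accel([1.0, 1.01, 0.99, 1.0])     # near-constant signal
walk = activity_from_accel([0.6, 1.4, 0.7, 1.5, 0.5])   # moderate swings
run = activity_from_accel([0.1, 2.5, 0.2, 2.8, 0.0])    # large swings
```

Production systems would typically combine several features (and the gyro signal) in a trained classifier rather than hand-set thresholds.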

As an example of how non-real-world multimedia content may be exploited by a real-world multimedia system, GeoPlot relates real-world data to news videos, calculating actual geographic distances so that patterns in the geospatial relationships of news events may be discovered.

Analysis

Once all data and content have been captured, a real-world multimedia system analyzes them jointly, interpreting them and consolidating and aggregating them into higher levels of abstraction, where each layer adds progressively more useful data. Typically, human subject data requires more detailed analysis than environmental data, although this is not always the case. In the Point At system, mentioned in the previous section, two digital cameras capture the visitors' pointing action and computer vision techniques are used to calculate the screen location that they are pointing at. Similarly, in the Interactive Blackboard system, the user's gesture is captured by a single digital camera, positioned high above the blackboard and pointed at the inclined screen. The system computes the zone the user points to in real time, after image processing and interpretation. In Camurri et al.'s approach, also mentioned previously, a layered framework is used to distinguish and map between signal-based descriptions of syntactic features of human movements, descriptions of these movements within a trajectory space, and semantic descriptions relating to meaning, affect, emotion, and expressiveness. The 3D Live system, discussed above, processes video from the cameras surrounding the remote collaborator to segment the person from the background, so that a real-time view generation algorithm, based on shape-from-silhouette information, can be used to generate a synthetic view from the user's viewpoint. In virtualized reality systems, view synthesis involves the rendering of a viewpoint perspective to the user. A variety of techniques may be used for analysis here, such as image flow or pixel correspondence, or densely sampling the viewing space and using interpolation to cater for missing views. Additional knowledge about the scene or the imaging process may also contribute.
In surveillance-based real-world multimedia systems, analysis frequently involves identifying and classifying objects from camera video streams. For example, objects may be tracked in time, identified as people, and then analyzed to determine their posture. From this posture, a given event may be identified: for example, a transition from a standing or sitting posture to a lying down posture may indicate a 'falling down' event. In the LucentVision system, video streams taken from eight cameras observing a tennis match (two for player tracking and six for ball tracking) are fed into a domain-specific, real-time tracking subsystem. This subsystem analyzes the streams to determine player and ball motion trajectories and to assign a player trajectory to a specific player via domain knowledge. This domain knowledge incorporates tennis rules and the current score to determine which player is on which side of the court and can be seen by which camera. Naaman et al. use captured longitude and latitude data for a given photo to derive the country, province/state, and county, as well as further optional location information such as city, park, nearby landmarks, and so on. Together with time, this location data is then used to derive further data, such as light status, weather status and temperature, elevation, season, and time zone, using various online Web resources.
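The posture-transition idea described above can be sketched as a simple scan over a per-frame posture label stream. This is an illustrative toy, not any cited system's algorithm; the label set and the persistence window are assumptions:

```python
# Hypothetical sketch: flag a 'falling down' event when a "standing" or
# "sitting" posture transitions to a "lying" posture that persists for
# several frames (the persistence check filters out classifier noise).

def detect_fall(postures, min_lying_frames=3):
    """postures: per-frame labels in {"standing", "sitting", "lying"}.
    Returns the frame index where a fall is detected, or None."""
    for i in range(1, len(postures)):
        if postures[i] == "lying" and postures[i - 1] in ("standing", "sitting"):
            window = postures[i:i + min_lying_frames]
            if len(window) == min_lying_frames and all(p == "lying" for p in window):
                return i
    return None

stream = ["standing"] * 4 + ["lying"] * 5
fall_at = detect_fall(stream)  # the transition occurs at frame index 4
```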

Association

Using the output from the Analysis component, the Association component of the architecture structures and integrates the real-world data and multimedia content within the system for storage within a real-world multimedia resource. Metadata schemes are key to this. The LucentVision system just discussed uses a relational database to organize data by the hierarchical structure of events in tennis, that is, a ‘match’ consisting of ‘sets’ consisting of ‘games’ consisting of ‘points’. Each event has an associated identifier, temporal extent, and score. Trajectories corresponding to the two players and the ball are associated with every point. Each point also has pointers to video clips from the broadcast production. Because the relational database structure does not support spatiotemporal queries based on analysis of trajectory data, an additional spatiotemporal analysis structure is used on top of the relational structure. In the XML-based VSDL-RW (Real World Video Stream Description Language), a variety of user-defined real-world data, such as location and map route patterns, may be represented in a hierarchical structure. This may be linked to the video streams through a variety of referencing methods, such as matching name labels, map patterns, or texture properties. A number of metadata standards also exist, which permit representation of real-world data. EXIF specifies how digital still cameras represent a variety of data associated with digital photographs, including real-world data such as the make and model of the camera, location and lighting conditions, time that the photograph was taken, and so on. In the Multimedia Description Schemes (MDS) of the MPEG-7 standard, a variety of real-world data may be described and integrated with multimedia content. 
These include data regarding persons, groups, organizations, citizenship, addresses, locations (including geographic position via longitude, latitude and altitude for various geodetic datum systems), relative and absolute times (down to fractions of a second), affective responses to multimedia content by users, usage histories, real-world entities and properties (such as length, height, weight, and temperature) and (for both multimedia content and metadata) creators, creation tools, creation locations and creation times.
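The hierarchical relational organization described for LucentVision can be sketched with a small schema: each level of the match/set/game/point hierarchy carries an identifier, a temporal extent, and a score, and points carry trajectory references and pointers to broadcast clips. Table and column names below are illustrative, not LucentVision's actual schema:

```python
import sqlite3

# Hypothetical sketch of a match -> set -> game -> point hierarchy, with
# trajectories and broadcast clip pointers attached at the point level.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tennis_match (id INTEGER PRIMARY KEY, start REAL, end_ REAL, score TEXT);
CREATE TABLE set_  (id INTEGER PRIMARY KEY,
                    match_id INTEGER REFERENCES tennis_match,
                    start REAL, end_ REAL, score TEXT);
CREATE TABLE game  (id INTEGER PRIMARY KEY, set_id INTEGER REFERENCES set_,
                    start REAL, end_ REAL, score TEXT);
CREATE TABLE point (id INTEGER PRIMARY KEY, game_id INTEGER REFERENCES game,
                    start REAL, end_ REAL, score TEXT,
                    player1_traj TEXT, player2_traj TEXT, ball_traj TEXT,
                    clip_uri TEXT);
""")
conn.execute("INSERT INTO tennis_match VALUES (1, 0, 7200, '2-1')")
conn.execute("INSERT INTO set_  VALUES (1, 1, 0, 2400, '6-4')")
conn.execute("INSERT INTO game  VALUES (1, 1, 0, 300, '40-15')")
conn.execute("INSERT INTO point VALUES (1, 1, 0, 12, '15-0', "
             "'t1.bin', 't2.bin', 'b.bin', 'clip_001.mp4')")

# Retrieve the broadcast clips for all points in a given match.
clips = conn.execute("""
    SELECT point.clip_uri FROM point
    JOIN game ON point.game_id = game.id
    JOIN set_ ON game.set_id = set_.id
    WHERE set_.match_id = 1
""").fetchall()
```

As the article notes, a plain relational structure like this does not directly support spatiotemporal trajectory queries, which is why LucentVision layers an additional spatiotemporal analysis structure on top.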

Querying and Retrieval

The Querying and Retrieval components of the architecture concern the use of the real-world multimedia resource created in the other parts of the architecture. Querying may be explicit, where a formal query is made by the user, e.g. using a formal query language, or implicit, where the user expresses their requirements indirectly, e.g. through interaction with the system or through filtering processes. In Volgin et al.'s multimedia retrieval system, mentioned earlier, the captured real-world data allows users to query and retrieve images based on real-world contextual properties, such as whether it was hot or cold on the day that the user took the picture, or which users were nearby when the picture was taken. In LucentVision, mentioned previously, the user may query the database through a visualization interface to retrieve various reconstructions of tennis events, ranging from high-quality video representations to compact summaries, such as a map of players' coverage of the court. Content-based queries are also supported, as well as spatiotemporal queries, which may be combined with score-based queries.

In the MediaConnector framework, digital cameras record time, position, and heading metadata, even when users are not taking pictures but simply have their cameras with them. This enables a distinctive approach to retrieval, whereby a user could retrieve pictures taken at the same time as their own but from different viewpoints, or taken while their own camera was not in use.
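The first of those retrieval styles can be sketched as a filter over time and position metadata. The record fields and thresholds below are hypothetical, not MediaConnector's actual data model:

```python
from math import hypot

# Hypothetical sketch: given one of the user's own photos, find photos by
# other users taken at roughly the same time but from a sufficiently
# different position, i.e. a different viewpoint on the same moment.

def other_viewpoints(my_photo, all_photos, max_dt=60.0, min_dist=5.0):
    """Photos are dicts with 'owner', 'time' (s), 'x', 'y' (m), 'uri'."""
    results = []
    for p in all_photos:
        if p["owner"] == my_photo["owner"]:
            continue  # skip the user's own photos
        same_time = abs(p["time"] - my_photo["time"]) <= max_dt
        far_enough = hypot(p["x"] - my_photo["x"],
                           p["y"] - my_photo["y"]) >= min_dist
        if same_time and far_enough:
            results.append(p)
    return results

mine = {"owner": "alice", "time": 1000.0, "x": 0.0, "y": 0.0, "uri": "a.jpg"}
others = [
    {"owner": "bob",   "time": 1010.0, "x": 20.0, "y": 0.0, "uri": "b.jpg"},
    {"owner": "carol", "time": 5000.0, "x": 25.0, "y": 0.0, "uri": "c.jpg"},
]
matches = other_viewpoints(mine, others)  # only bob's photo is close in time
```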

Conclusion

Through their use of real-world data, real-world multimedia systems more naturally reflect, and behave more like, ourselves and our real-world environments. This can help to reduce the semantic and sensory gaps. The sensory gap is the gap between an object in the real world and the data in a computational description derived from multimedia content analysis, whereas the semantic gap is the lack of coincidence between the data that one can extract from multimedia content and the interpretation that that same data has for a user in a given situation. Integrating other sources of information, particularly real-world contextual data, provides much data that could not be derived otherwise but is still required by users, thus making the process of multimedia querying and retrieval more effective. However, while many advances have been made in real-world multimedia systems, we are still learning how best to use much of this real-world data, particularly human subject data, for which there are no hard and fast rules as to how it may best be interpreted and associated with multimedia content. Consequently, further research is required for each component of the architecture presented in Figure 1, in order to improve both efficiency and effectiveness and to enable new applications in domains that we cannot yet envisage.

