
Multimedia Synchronization - Area Overview - Historical Perspective, The Axes of Synchronization, Temporal Synchrony: Basic Constructs, Temporal Synchrony: Basic Mechanics


Wayne Robbins
Defense R&D Canada (DRDC), Future Forces Synthetic Environments,
DRDC Ottawa; Ottawa, Ontario, Canada

Definition: Multimedia synchronization refers to the coordination of multimedia information along three axes: content, space, and time.

Informally, synchronization has long referred to “being at the right place at the right time”. In many ways, this maxim forms the essence of multimedia synchronization, which can be described as the coordination of multimedia information along three orthogonal axes: content, space and time – that is, “the right stuff, the right place, and the right time”.

Historical Perspective

The breadth and diversity of the multimedia arena have led to multimedia synchronization needing to address a broad array of situations and issues. An area of avid interest and intense research for well over a decade, multimedia synchronization originated as a necessity to address the convergence of networked and multimedia systems combined with the increased use of continuous, time-based media, such as audio and video. Originally, multimedia systems primarily dealt with “traditional media” such as images and text, which were stored, manipulated and rendered locally, on individual workstations. Consequently, issues in handling media centered on local computer infrastructure (e.g., storage and retrieval, encoding/decoding, display management and so forth). Timing considerations (i.e., synchronization), however, were not a significant issue because the media did not have a temporal component; therefore, best effort timing was adequate and utilized for simplicity.

Given the introduction of time-based media, however, best effort synchronization became insufficient. Aspects related to workstation processing (such as those associated with general purpose, multitasking operating systems) soon necessitated that more emphasis be placed on synchronization. Combined with the introduction of network technologies, distributed multimedia systems also required that data transfer be considered. Examples included managing latencies to facilitate interactivity and providing synchronization both internally (the actual media, both individually and with respect to each other) and externally (e.g., relative to different distributed endpoints and devices). Increasingly, these aspects needed to be considered in the context of complex multimedia scenarios consisting of multiple time-based media of different temporal characteristics (e.g., frame rates) coming from different sources, each with its own data transfer characteristics (e.g., network QoS).

The evolution of the expectations and execution context for multimedia systems has paralleled their growing pervasiveness and increased integration into wider computational, communication and collaborative systems. Originally, multimedia functionality was realized solely by dedicated applications, and multimedia synchronization addressed a variety of fundamental topics, ranging from taxonomies and modeling to formal specification methods, algorithms and software architectures. While such topics still offer challenges and research issues, the contemporary approach to multimedia synchronization is as an infrastructural service, used and configured by applications and other system layers as required. As such, synchronization services need to interface with lower-level devices and system layers while also responding to higher-level application issues. In the context of a networked environment, the source (e.g., server), the receiver (e.g., client application) and the infrastructure (e.g., operating system, storage, network, middleware, etc.) all need to be taken into account, both in terms of classic distributed system issues as well as multimedia-specific considerations.

The Axes of Synchronization

The three axes of content, space and time provide a useful way to categorize the “organization” of multimedia information. Within the literature, the term multimedia synchronization has been applied to each of these dimensions. In contemporary practice, however, multimedia synchronization typically refers to the temporal axis, while issues of space and content are more commonly addressed under the terms “layout” and “content management.”

Spatial synchronization refers to the physical arrangement (i.e., layout) of multimedia objects and the corresponding relationships between them on a particular output device (e.g., display monitor). Conversely, content synchronization refers to the maintenance of a media object’s relationship to (dependency on) a particular piece of data. Content synchronization can be seen as part of a larger topic known as content management, the managing of electronic content throughout its lifecycle, ranging from creation and storage to dissemination and destruction. The premise is that the media content can exist as independent components within the content management system and then be “assembled” into presentations at “run time.”

Temporal synchronization refers to the maintenance of temporal relationships between media. For example, consider the classic “lip sync” relationship between the audio/video of someone speaking, or the animated “build” of a typical PowerPoint slide. For the audio/video pairing, if the media are not synchronized (i.e., the temporal relation is not maintained), their intended meaning can be degraded to the point of being completely lost. In the case of the animated slide build, the clarity of the presentation is often determined by the appropriate ordering, timing and layout of objects on the slide. Therefore, should any of these aspects (e.g., ordering or timing) be incorrect, the slide can become difficult to understand, or possibly even convey incorrect information. Consequently, the specification, characterization and facilitation of such temporal relationships can be an important but involved effort.

With these issues in mind, the rest of this article will discuss the various facets of temporal synchronization, the de facto meaning of multimedia synchronization in contemporary usage.

Temporal Synchrony: Basic Constructs

In the nomenclature traditionally associated with multimedia synchronization, independent media entities are generally known as media (or multimedia) objects. To more clearly identify media types, classifications are used to identify particular objects (e.g., a video object). Because objects have variable durations, each is divided into a sequence of (one or more) informational units for purposes of synchronization. These subdivisions are usually known as Logical Data Units (LDUs). For example, an image would consist of a single LDU while a video object would consist of multiple LDUs, typically known as “frames.”

To meet the needs of modern multimedia environments, synchronization between media objects must consider relationships between time-dependent (i.e., time-based) media objects and time-independent (i.e., non-time-based) media objects. Time-dependent media are referred to as continuous media while time-independent media objects are known as discrete. A discrete medium has a single LDU while a continuous medium consists of a series of LDUs which are isochronous (i.e., regularly spaced) in nature. Consequently, continuous media objects are often characterized (and abstracted) as a media stream. Playback of a media stream therefore constitutes rendering its LDUs in sequence and at the appropriate time. The terms object and stream are generally used interchangeably, with the choice usually being context (e.g., medium and activity) dependent.
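
As an illustration of these constructs, the following sketch models discrete and continuous media objects as sequences of LDUs. It is written in Python, and all class and field names are hypothetical, chosen purely for exposition:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class LDU:
        """One Logical Data Unit, e.g., a video frame or an audio block."""
        index: int
        payload: bytes = b""

    @dataclass
    class MediaObject:
        """A media object abstracted as a sequence of LDUs."""
        name: str
        ldus: List[LDU]

        @property
        def is_continuous(self) -> bool:
            # A discrete medium (e.g., an image) has a single LDU, while a
            # continuous medium (a "stream") has an isochronous series of them.
            return len(self.ldus) > 1

    image = MediaObject("logo.png", [LDU(0)])                     # discrete
    video = MediaObject("clip.mpg", [LDU(i) for i in range(30)])  # 30 frames
    print(image.is_continuous, video.is_continuous)               # False True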

To convey the meaning of a particular medium, its individual temporal requirements must be met. For example, a video object must be played back at the appropriate frame rate. Additionally, those media objects used in combination must also consider the temporal requirements of their composite group. For example, a presentation may specify the concurrent display of a text string with a video clip for n seconds, followed by an image. Enabling such a presentation requires specifying the temporal relationships between the media objects as well as deriving their playback schedule. Furthermore, to ensure that playback conforms to the schedule, a synchronization mechanism must be used to enforce it. Large-grain synchrony at an object level must therefore be provided in order to correctly begin playback. Media objects within a presentation, however, may have different temporal dimensions (i.e., continuous vs. discrete). Therefore, the synchronization mechanism must support multiple synchronization granularities so that continuous media can also be rendered correctly. The combination of media objects with differing synchronization characteristics has been characterized as a multisynchronous space.
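
To suggest how such a playback schedule might be derived, the following sketch composes media objects with parallel and sequential temporal relations and computes their start and stop times. The composition operators and names are hypothetical, intended only to make the example above concrete:

    def schedule(node, start=0.0, out=None):
        """Recursively assign start/stop times; returns (schedule, end_time)."""
        if out is None:
            out = []
        kind = node[0]
        if kind == "media":                      # ("media", name, duration)
            _, name, dur = node
            out.append((name, start, start + dur))
            return out, start + dur
        if kind == "par":                        # children start together
            end = start
            for child in node[1:]:
                _, child_end = schedule(child, start, out)
                end = max(end, child_end)
            return out, end
        if kind == "seq":                        # children play back-to-back
            t = start
            for child in node[1:]:
                _, t = schedule(child, t, out)
            return out, t
        raise ValueError(f"unknown node kind: {kind}")

    # "A text string displayed concurrently with a video clip for 10 seconds,
    # followed by an image shown for 5 seconds."
    scenario = ("seq",
                ("par", ("media", "text", 10.0), ("media", "video", 10.0)),
                ("media", "image", 5.0))

    plan, total = schedule(scenario)
    for name, t0, t1 in plan:
        print(f"{name}: {t0:.1f}s -> {t1:.1f}s")
    print(f"total duration: {total:.1f}s")       # 15.0 s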

To address such a mix of synchronization characteristics, multimedia synchronization is typically classified in relation to the number of media objects within a temporal relationship as well as their temporal dimension:

  • intra-stream vs. inter-stream
  • event vs. continuous

Intra-stream synchronization (or continuity) refers to synchronization internal to a media stream; i.e., managing the flow of data within a single stream so that it is played back correctly. Conversely, inter-stream synchronization refers to the synchronization between independently running media streams; i.e., aligning the playback of two or more streams. These categories refer to synchronization autonomy, be that within a single stream or between multiple ones.

Event synchronization denotes the alignment of a media object’s start to a specific time or event; for example, the display of an image when the background music starts or the execution of an action in response to a hyperlink being navigated (i.e., “clicked”). Continuous synchronization, however, refers to the fine-grain synchrony required by continuous media within the duration of the object. For a continuous medium, continuous synchronization is the means by which intra-stream synchronization (continuity) is achieved. The result is the typical isochronous rhythm associated with continuous media. Conversely, the classic “lip sync” problem between audio and video streams is an illustration of continuous inter-stream synchronization. Of course, for continuous inter-stream synchronization to be realized, it is assumed that each stream has also been intra-stream synchronized.
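
The following sketch suggests how these granularities interact in practice: a simple playback loop keeps a pair of streams isochronous (intra-stream continuity) while rendering one in lockstep with the other (continuous inter-stream synchronization). The function names, the master/slave pairing and the drop-when-late policy are illustrative assumptions rather than a standard algorithm:

    import time

    def render(ldu):
        pass  # stand-in for handing the LDU to an output device

    def play(master, slave, fps=30.0, tolerance=0.080):
        """Render two streams of LDUs in lockstep against a shared clock."""
        period = 1.0 / fps
        t0 = time.monotonic()
        for i, (m_ldu, s_ldu) in enumerate(zip(master, slave)):
            target = t0 + i * period          # intra-stream: scheduled beat
            now = time.monotonic()
            if now < target:
                time.sleep(target - now)      # early: wait for the beat
            elif now - target > tolerance:
                continue                      # late beyond tolerance: drop
            render(m_ldu)                     # e.g., audio (the master)
            render(s_ldu)                     # e.g., video, aligned per beat

    play(range(90), range(90))                # two 3-second streams at 30 fps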

Another classification scheme refers to the artificiality of the temporal relationships within or between streams. Specifically, the categorization of live vs. synthetic refers to the mode of a medium in two distinct but related ways: (1) its stored vs. non-stored source; and (2) how its temporal aspects were determined. In terms of its source, media mode refers to whether the object is being captured “live” in real-time from a real-world sensor (e.g., camera, microphone, etc.). Conversely, synthetic refers to those media which are being retrieved from secondary storage, even though they may have originally been captured “live” and then stored. Such differences directly impact the approaches (e.g., mechanisms) used to control synchronization within a system, such as buffering techniques and the ability to defer data retrieval. The second interpretation of mode refers to how the temporal aspects of a medium were determined. Specifically, live synchronization attempts to reproduce the temporal relations that (naturally) existed during the capture process. In contrast, synthetic synchronization utilizes temporal relations that are artificially specified. The combination of synthetic and live synchronization in a single environment has been referred to as mixed mode synchrony.

Various traditional synchronization issues are also relevant within a multimedia context, such as timeliness, precision, granularity, causality and paradigm. For example, timeliness applies to many levels within a multimedia system, including network QoS and the availability of LDUs at display devices. Variations from the appropriate times result in asynchrony and can lower system quality as perceived by the user. Timeliness is also necessary to achieve natural and fluid communication between participants in conversational and interactive environments. However, pinpoint precision is not always required in order to preserve semantics; rather, a tolerable degree of asynchrony at the user interface is acceptable. Consider the two well-known and classic cases of lip synchronization (or “lip sync”) and pointer synchronization. The first refers to the alignment of the audio and video tracks of a human speaking, while the second refers to the placement of the (mouse) pointer over an image relative to accompanying audio data. Lip synchronization is crucial to ensuring people’s level of comfort and attention within real-time conferencing, or any system involving vocal accompaniment to “speaking” visuals. In particular, the audio/video tracks in the traditional “lip sync” scenario must render their corresponding LDUs within 80 ms of each other in order to appear synchronized; otherwise, user comprehension is degraded. Similarly, “pointer lag” must be bounded in order for the user to relate its action to the audio commentary. Indeed, Steinmetz’s seminal work on measuring human tolerances to asynchrony for different media combinations in different visual alignments (e.g., “talking heads”) continues to serve as an empirical baseline for system design and performance.
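
As a trivial but concrete illustration of such a tolerance, the following check (a hypothetical function) tests whether two rendered stream positions would appear synchronized under the roughly 80 ms lip-sync bound cited above:

    LIP_SYNC_TOLERANCE = 0.080   # seconds; the ~80 ms bound cited above

    def appears_synchronized(audio_pos, video_pos, tol=LIP_SYNC_TOLERANCE):
        """True if the audio/video skew is within the perceptual tolerance."""
        return abs(audio_pos - video_pos) <= tol

    print(appears_synchronized(12.500, 12.540))   # True: 40 ms is tolerable
    print(appears_synchronized(12.500, 12.620))   # False: 120 ms is noticeable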

Temporal Synchrony: Basic Mechanics

Beyond the abstractions, constructs and frameworks to describe multimedia synchronization requirements, there is the need to address the mechanisms to implement them. Primarily, this issue involves three questions: (1) what kinds of mechanisms are required; (2) where the mechanisms should be placed; and (3) what their implementation considerations are.

Two general techniques (mechanisms) have been proposed in the literature: synchronization markers and synchronization channels. Synchronization markers function as tags (akin to timestamps) by which media streams can correlate their temporal position during rendering. These tags effectively mark off sections of the media stream and could be transmitted as part of a raw data stream or generated externally and imposed on the data stream. For example, a video clip could be transmitted as individual frames with inter-frame markers inserted between each frame; the SMPTE (Society of Motion Picture and Television Engineers) code used by high-end video equipment is an example of synchronization markers that are embedded within each video frame itself. The use of synchronization channels is designed to isolate the control to a separate communications channel running in parallel to the data stream. The control information within the synchronization channel contains references to the data transmitted in data-only channels, directing the synchronization mechanism at the receiver as to how to align the data.
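
A minimal sketch of the marker technique follows, assuming markers are simple per-frame timestamps generated at the source and honored by the receiver's rendering loop. The names are illustrative; real marker schemes such as SMPTE timecode embed richer structures:

    import time

    def tag_with_markers(frames, fps=30.0):
        """Source side: attach a synchronization marker to each frame."""
        return [(i / fps, frame) for i, frame in enumerate(frames)]

    def play_marked(marked_frames):
        """Receiver side: schedule rendering from the embedded markers."""
        t0 = time.monotonic()
        for marker, frame in marked_frames:
            delay = (t0 + marker) - time.monotonic()
            if delay > 0:
                time.sleep(delay)   # hold the frame until its marker time
            print(f"render {frame} at marker {marker:.3f}s")

    play_marked(tag_with_markers([f"frame{i}" for i in range(5)], fps=5))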

Numerous approaches to media synchronization within presentational systems can be found in the literature. In general, the underlying mechanisms can be characterized as follows:

  • Layered construction: Synchronization is addressed at multiple stages, using different entities and mechanisms to correct asynchrony as it becomes noticeable (e.g., the network and playback levels, within and between streams, etc.).
  • Object abstraction: Media data and system components are modeled as independent but interacting objects.
  • Event synchronization enabled through scheduling: Coarse-grain event-level synchronization is facilitated by using scheduled media object playback times. The specification of the scenario’s timeline can either mimic a real situation or be completely artificial.
  • Continuous synchronization enabled through fine-grain temporal intervals: Fine-grain continuous synchronization is achieved through the division of a media object into a series of small temporal sub-divisions which are individually aligned to their correct playback time.

In contrast, conversational systems tend to rely more on protocol-based techniques. Well-known examples include MBONE tools such as ivs, vat and rat, which are based on RTP (the Real-time Transport Protocol). In these systems, the synchronization mechanism is a protocol engine ensuring that the media data conforms to the protocol. Such an approach is in contrast to presentational systems in which the synchronization mechanism ensures the data conforms to an external specification (i.e., the presentation’s timeline). In conjunction with the protocol-oriented approach, conversational systems also tend to use the stream-oriented abstraction, and media playback is generally intended to reflect the temporal characteristics of the media as obtained during the capture process.
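
The following simplified sketch is in the spirit of such protocol-driven playout, not the actual engine of ivs, vat or rat: each packet carries a media timestamp, and the receiver maps it onto its local clock plus a fixed playout delay (an assumed 100 ms here) to absorb network jitter:

    import time

    PLAYOUT_DELAY = 0.100   # seconds of buffering to absorb jitter (assumed)

    class PlayoutBuffer:
        """Map media timestamps onto the local clock for scheduled playout."""
        def __init__(self, local_clock=time.monotonic):
            self.clock = local_clock
            self.offset = None            # media time -> local time mapping

        def playout_time(self, media_timestamp):
            if self.offset is None:       # first packet anchors the mapping
                self.offset = self.clock() - media_timestamp
            return media_timestamp + self.offset + PLAYOUT_DELAY

    buf = PlayoutBuffer()
    print(buf.playout_time(0.000))   # first packet plays ~100 ms from now
    print(buf.playout_time(0.020))   # later packets reuse the same mapping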

In tandem with the selection of the kind of synchronization mechanism, there is also the question of where to place it. Within a distributed multimedia system, there are basically three possible locations: the media source, the receiving site and the intervening network.

Synchronization at the source (e.g., server) implies that data is synchronized before transmission to the receiving site. Consequently, the temporal relationships imposed prior to transmission must be maintained during transmission and up until playback at the receiving site. A fundamental assumption of this method is that all the media to be synchronized are located at the same source and can be “wrapped up” or multiplexed into a single data stream. However, these assumptions may not always be realistic, desirable or amenable to the system in question. Furthermore, this technique implies an intelligent source that is knowledgeable about the media it is providing as well as its intended use at the receiver. Should source synchronization techniques be employed with multiple sources, there will still be the need to provide an additional coordination layer at the sink(s) to align the separate incoming streams.
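
As a sketch of the multiplexing step this method assumes, the following interleaves co-located streams into a single timestamp-ordered stream before transmission; the stream layout and names are illustrative:

    import heapq

    def multiplex(*streams):
        """Source side: interleave co-located streams into one stream,
        ordered by timestamp, so their temporal relations survive transit.
        Each stream is an iterable of (timestamp, medium, ldu) tuples,
        assumed already sorted by timestamp."""
        yield from heapq.merge(*streams)

    audio = [(t / 50.0, "audio", f"a{t}") for t in range(5)]   # 50 blocks/s
    video = [(t / 25.0, "video", f"v{t}") for t in range(3)]   # 25 frames/s

    for ts, medium, ldu in multiplex(audio, video):
        print(f"{ts:.3f}s  {medium}:{ldu}")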

Synchronization at the receiving site (i.e., client or sink) enforces the required temporal constraints on each media stream after it is received from its source over the network. This kind of synchronization mechanism allows each media stream to originate from its own source and does not require any inter-stream management by the network. However, it can place a large processing burden on the receiver and provides (as with source-based synchronization) a solitary approach to the synchronization of media and their interactions. For those systems consisting of multiple receiving sites, there is also the need to ensure coordination between the sites as well as synchronization at each one.

Synchronization within the network was traditionally viewed as a protocol-based approach that formed an elaborate “hand-shaking” mechanism with precisely defined timing. As such, it proved a complex technique that was network-dependent and prohibitively difficult when a large number of media streams became involved. As with source-based synchronization, this technique also assumed an intelligent source and a receiving site that would do nothing that could somehow induce asynchrony between the streams. The latter assumption is completely unrealistic in terms of contemporary computing reality.

The prime difficulty with the above categorization is the implicit notion that the mechanisms would be used in isolation. An additional problem is that it assumes the “old-style” computer vs. network model: the source and sink are the computational entities while the network is a “dumb” transport agent. However, by employing a combination of mechanisms deployed at various parts of the distributed system (e.g., source, sink and network), the complexity and overhead of synchronization can be amortized throughout the system. It also allows more flexibility in what kinds of synchronization aspects are dealt with at the various locations. The ability to combine these approaches has been augmented by the development of intelligent, service-based networks, allowing synchronization to be transformed into a network service using computationally-enabled “smart” communications systems. These and other implementation considerations need to be addressed in terms of how actual technological choices (ranging from operating system to design practices) can influence the ability to provide synchronization within a multimedia environment. Some of these aspects are addressed in the companion article on infrastructure and engineering.

