
High Definition Live Streaming - Video Acquisition and Compression, HD Video Transmission, Rendering of HD Video Streams, Delay Measurements, Application Areas


Min Qin and Roger Zimmermann
Computer Science Department
University of Southern California, Los Angeles, USA

The high-definition (HD) video standard has been developed as the successor to analog, standard definition (SD) video, which (in the United States) dates back to the first proposal by the National Television System Committee (NTSC) in 1941. The HD standard has been developed by the Advanced Television Systems Committee (ATSC) and was approved in 1995. The ATSC standard encompasses an all-digital system and was developed by a Grand Alliance of several companies. Even though the ATSC standard supports eighteen different formats, we are restricting our discussion here to the ones that provide the most improvement in picture quality over NTSC. The two best-known HD formats are 1080i (1920 × 1080, 60 fields per second interlaced) and 720p (1280 × 720, 60 frames per second progressive), compared with 480i for SD.

Only after the year 2000 did digital HD equipment become somewhat common and affordable. High-definition displays can be found in many living rooms. The United States Congress has now mandated that all US broadcasters must fully switch to digital and cease analog transmissions by 2009. On the acquisition side, HD has started to migrate from professional equipment to consumer devices. In 2003, JVC introduced the first affordable camcorder with 720 lines of resolution. A consortium of companies has since announced an augmented standard for consumer equipment under the name HDV. The HDV standard calls for the video sensor streams to be MPEG-2 compressed at a rate that is suitable for storage on MiniDV tapes (i.e., less than 25 Mb/s). On the networking side, significant bandwidth is now available, and therefore high-quality live video streaming over the Internet has become feasible. As HD streaming is graduating from research environments to commercial settings, existing applications — such as video conferencing, CSCW (computer supported cooperative work) and distributed immersive performances — are taking advantage of HD live streaming to achieve enhanced user experiences. Early tests were performed by the University of Washington and the Research Channel in 1999 at the Internet2 member meeting in Seattle. Another example is the UltraGrid project, which focuses on uncompressed stream transmissions. Early commercial implementations have been announced by companies such as LifeSize, Polycom, and others.

Because most traditional store-and-forward HD-quality streaming systems use elaborate buffering techniques that introduce significant latencies, these techniques and algorithms are not well suited for real-time environments. Three of the main design parameters for HD live streaming are: (1) the video quality (mostly determined by the resolution of each video frame, the number of bits allocated for each of the three transmitted colors, the number of frames per second, and the compression technique used); (2) the transmission bandwidth; and (3) the end-to-end delay. Due to the massive amount of data required to transmit such streams, simultaneously achieving low latency and low bandwidth are contradictory requirements. Packet losses and transmission errors are further problems in streaming HD live video. Because errors in one video frame may propagate to subsequent frames when certain compression algorithms are used, small transmission errors may result in significant picture-quality degradation over a period of time. Also, many live streaming systems do not recover packet losses in order to reduce end-to-end delay.

Figure 1 shows an experimental HD live video streaming system called HYDRA. The source streams captured by a JVC JY-HD10U camera are packetized by a Linux machine and sent through the network. On the receiving side, a similar setup is employed: a Linux machine recovers the stream and performs video decompression and rendering in real time.

Video Acquisition and Compression

Two commonly used high-definition video formats are 1080i and 720p. There is also an Ultra High Definition Video format proposed by NHK, which has a resolution of 7,680 × 4,320 pixels. Without compression, the bandwidth required to transmit raw HD video can rise as high as 1.485 Gb/s (e.g., the SMPTE 292M standard). Many existing IP-based networks do not provide such capacity to the end user. With compression, however, the bandwidth can be reduced significantly; a typical rate is approximately 20 Mb/s. Table 1 lists the bandwidth requirements for some MPEG-2 compressed HD video formats.
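As a rough illustration of why compression is essential, the raw bit rate can be estimated from resolution, frame rate, and bits per pixel. The Python sketch below assumes 8-bit 4:2:2 sampling (about 16 bits per pixel); the 1.485 Gb/s SMPTE 292M figure is higher because the serial interface also carries blanking intervals and ancillary data.

```python
def raw_video_bitrate(width, height, fps, bits_per_pixel):
    """Estimate the uncompressed video bit rate in bits per second."""
    return width * height * fps * bits_per_pixel

# 720p at 60 frames/s, 8-bit 4:2:2 sampling (~16 bits per pixel on average)
bps_720p = raw_video_bitrate(1280, 720, 60, 16)
print(f"720p raw:  {bps_720p / 1e9:.3f} Gb/s")   # ~0.885 Gb/s

# 1080i at 30 frames/s (60 interlaced fields), same sampling
bps_1080i = raw_video_bitrate(1920, 1080, 30, 16)
print(f"1080i raw: {bps_1080i / 1e9:.3f} Gb/s")  # ~0.995 Gb/s
```

Compared with these figures, the roughly 20 Mb/s of an MPEG-2 compressed HDV stream represents a reduction of well over an order of magnitude.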

Video cameras provide the sensor media stream in either uncompressed or compressed format. A common uncompressed output format is HD-SDI, which must be either captured with a compatible interface board in a computer or sent through an HD compressor. Some cameras directly produce a compressed stream, often delivered via an IEEE 1394 bus (i.e., FireWire) connection. Although video compression algorithms can reduce the bandwidth requirement, they often introduce additional delays. Commonly used encoding methods for HD video are H.262 (MPEG-2) and H.264 (MPEG-4 Part 10). A number of proprietary formats suitable for production and editing environments (e.g., HDCAM) also exist. There are three ways to encode a frame in an MPEG-2 video bitstream: intra-coded (I), forward predicted (P) and bidirectional predicted (B). I-frames contain independent image data, which is passed through a series of transformations, including DCT, run-length and Huffman encoding. On the other hand, P and B frames only encode the difference between previous and/or subsequent images. As a result, P and B pictures are much smaller than I frames. A typical MPEG-2 video stream looks like this (in display order):

I B B P B B P B B I B B P B B ...

A sequence of frames starting with an I-frame and including all subsequent non-I frames is called a group of pictures (GOP, see Figure 2). The fact that P and B frames depend on other frames has two important implications. First, a codec must keep several frames in memory to compute the differences across multiple frames. This increases the latency from the time the video sensor acquires an image until the compressed bitstream can be transmitted. Second, transmission errors in a previous frame may propagate to the current frame. For real-time and networked applications, transmission errors are common, especially when the bit rate is high. Therefore, different strategies may be employed to reliably transmit HD video over a network. We will cover some of them in the next section.
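The latency implication of B frames can be made concrete by comparing display order with the order in which frames can actually be emitted. The short Python sketch below illustrates the reordering for a hypothetical GOP pattern (it is not taken from any particular encoder); real codecs add further rate-control and buffering delays.

```python
# Display order of one hypothetical GOP (M=3, N=9), followed by the next GOP's I frame.
display = ["I", "B", "B", "P", "B", "B", "P", "B", "B", "I"]

def transmission_order(frames):
    """B frames can only be emitted after the future reference they depend on."""
    sent, waiting = [], []
    for idx, ftype in enumerate(frames):
        if ftype == "B":
            waiting.append(idx)        # buffer the B frame
        else:
            sent.append(idx)           # emit the I/P reference first
            sent.extend(waiting)       # then the B frames that were waiting for it
            waiting = []
    return sent

order = transmission_order(display)
print(" ".join(f"{display[i]}{i}" for i in order))
# -> I0 P3 B1 B2 P6 B4 B5 I9 B7 B8
max_buffered = 2                       # with M=3, at most two B frames wait at a time
print(f"extra encoder-side delay ~ {max_buffered / 30:.3f} s at 30 frames/s")
```

The same dependency chain explains error propagation: a damaged reference frame corrupts every P and B frame that is predicted from it until the next I-frame arrives.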

Compared to the H.262 codec, the newer H.264 codec allows a higher compression ratio. The H.264 codec supports multi-frame motion compensation, which allows a frame to compute motion vectors by referencing multiple previous pictures. However, due to its complexity, the H.264 codec may introduce a longer delay.

1 SMPTE: Society of Motion Picture and Television Engineers.

2 Codec is short for compressor-decompressor.

HD Video Transmission

To send the captured HD video to the remote destination, the source devices must interface with a network. Data streams are digitized and transmitted in discrete packets. The Real-time Transport Protocol (RTP) is often used to transmit video streams over a network. Since the purpose of many live streaming systems is to enable video conferencing among multiple sites, the H.323 protocol is often incorporated. To host a conference among multiple participants, H.323 introduces the concept of a multipoint control unit (MCU), which serves as a bridge to connect participants. Popular H.323 clients (non-HD) include GnomeMeeting and Microsoft NetMeeting.
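As an illustration of the packet framing involved (not the system described above), the fixed 12-byte RTP header defined in RFC 3550 can be assembled with a few lines of Python. Carrying the video as an MPEG-2 transport stream is an assumption here; payload type 33 is the static type registered for MPEG-2 TS.

```python
import struct

def rtp_header(seq, timestamp, ssrc, payload_type=33, marker=0):
    """Build the fixed 12-byte RTP header (RFC 3550)."""
    version, padding, extension, csrc_count = 2, 0, 0, 0
    byte0 = (version << 6) | (padding << 5) | (extension << 4) | csrc_count
    byte1 = (marker << 7) | payload_type
    return struct.pack("!BBHII", byte0, byte1,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

# Example: wrap seven 188-byte MPEG-2 TS packets (1316 bytes) into one RTP packet.
ts_packet = b"\x47" + bytes(187)              # a TS packet starts with sync byte 0x47
rtp_packet = rtp_header(seq=1, timestamp=90000, ssrc=0x12345678) + ts_packet * 7
print(len(rtp_packet))                        # 12 + 1316 = 1328 bytes
```

The sequence number and timestamp fields in this header are what allow a receiver to detect losses and to reconstruct the timing of the stream.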

To transmit HD video from a source to a destination, two factors need to be considered: (1) the actual bandwidth limitation of the link, and (2) its packet loss characteristics. Due to the high bandwidth requirement of HD video, the network may experience temporary congestion if the link is shared with other traffic. Moreover, adapting to this congestion is not instantaneous. As a result, packet losses may occur. For raw HD video, packet losses in one frame only affect that frame. However, for most codecs that use inter-frame compression, the error caused by packet loss in one frame may propagate to later frames. To recover packet losses, there exist three commonly used techniques: (1) forward error correction (FEC); (2) concealment; and (3) retransmission. FEC adds redundant data to the original video stream so that the receiver can reconstruct the original data if packet losses occur. However, FEC adds a fixed bandwidth overhead to a transmission, irrespective of whether data loss occurs or not. Unlike FEC, the concealment solution does not recover packets. The receiver substitutes lost packets with interpolated data calculated from previous frames. The goal is to keep the residual errors below a perceptual threshold. The receiver must be powerful enough to carry the workload of both rendering and data recovery. Finally, retransmission is a bandwidth-efficient solution to packet loss. Here, the receiver detects lost data and requests it again from the source. This approach introduces at least one additional round-trip delay, and therefore it may not be suitable for networks with long round-trip times. The HYDRA live streaming system, for example, uses a single-retransmission strategy to recover packet losses. Buffering is kept to a minimum to maintain a low end-to-end latency.
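The retransmission approach can be sketched as a simple NACK loop. The snippet below is only a schematic illustration of the single-retransmission idea described above; the function names are hypothetical and do not correspond to HYDRA's actual protocol.

```python
import time

def send_nack(seq):
    print(f"NACK {seq}")          # in a real system: send a retransmission request

def render(payload):
    pass                          # in a real system: hand the data to the decoder

expected_seq = 0                  # next sequence number to deliver in order
buffered = {}                     # seq -> payload, held briefly for reordering
nack_sent = {}                    # seq -> time the single NACK was sent
MAX_WAIT = 0.050                  # give the retransmission ~50 ms before giving up

def on_packet(seq, payload):
    """Called for every received packet (original or retransmitted)."""
    global expected_seq
    buffered[seq] = payload
    # Request each missing packet exactly once (single-retransmission policy).
    for missing in range(expected_seq, seq):
        if missing not in buffered and missing not in nack_sent:
            send_nack(missing)
            nack_sent[missing] = time.monotonic()
    # Deliver in-order data; skip a packet once its retransmission deadline passes.
    while True:
        if expected_seq in buffered:
            render(buffered.pop(expected_seq))
            expected_seq += 1
        elif (expected_seq in nack_sent and
              time.monotonic() - nack_sent[expected_seq] > MAX_WAIT):
            expected_seq += 1     # concede the loss to keep latency bounded
        else:
            break
```

Limiting each gap to a single request and a short deadline keeps the added delay close to one round-trip time, at the cost of occasionally conceding a loss.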

Rendering of HD Video Streams

When a media stream is transmitted over a network, the receiver side requires a low-delay rendering component to display the HD video stream. Various hardware and software options for decoding HD video streams are currently available. We classify them into the following two categories.

  1. Hardware-based : When improved quality and picture stability are of paramount importance, hardware-based video rendering systems are often preferred. Commercial systems include, for example, the CineCast HD decoding board from Vela Research. It can communicate with the host computer through the SCSI (Small Computer Systems Interface) protocol. Other stand-alone decompressors require the input stream to be provided in the professional DVB-ASI format. The hardware solution has the advantage of usually providing a digital HD-SDI (uncompressed) output for very high picture quality and a genlock input for external synchronization. However, this type of solution may be costly and some of the hardware decoders include significant buffers which increase the rendering delay.
  2. Software-based : A cost-effective alternative to hardware-based rendering is the use of software tools. Recall that rendering HD video streams is computationally expensive. An example software rendering tool is the libmpeg2 library, a highly optimized decoder that provides hardware-assisted MPEG decoding on current-generation graphics adapters. Through the XvMC extension of the X11 graphical user interface on Linux, libmpeg2 utilizes the motion compensation and iDCT capabilities of modern GPUs. This can be a very cost-effective solution, since suitable graphics boards can be obtained for less than one hundred dollars. In terms of performance, this setup can easily achieve real-time decoding with a modern CPU/GPU combination. Table 2 illustrates the measurements we recorded with the libmpeg2 library when decoding HD MPEG-2 streams on an Nvidia FX5200 GPU. Two sub-algorithms in the MPEG decoding process, motion compensation (MC) and the inverse discrete cosine transform (iDCT), can be performed either in software on the host CPU or on the graphics processing unit (GPU). The tests were performed with dual Xeon 2.6 GHz CPUs. Without hardware support, the machine can only barely satisfy the requirement of HDV 720p (see the sketch after this list).
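A quick way to check whether a software decoder keeps up is to compare its mean per-frame decode time with the frame period. The figures below are placeholders rather than the Table 2 measurements; the point is only the comparison against a budget of 1000/fps milliseconds per frame.

```python
def realtime_ok(decode_ms_per_frame, fps):
    """A decoder keeps up only if its mean decode time fits in one frame period."""
    frame_budget_ms = 1000.0 / fps
    return decode_ms_per_frame <= frame_budget_ms, frame_budget_ms

# Placeholder figures (not the Table 2 measurements): CPU-only vs. GPU-assisted decode.
for label, ms in [("CPU-only MC+iDCT", 31.0), ("GPU MC+iDCT via XvMC", 12.0)]:
    ok, budget = realtime_ok(ms, fps=30)
    print(f"{label}: {ms:.1f} ms/frame vs {budget:.1f} ms budget -> "
          f"{'OK' if ok else 'too slow'}")
```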

Delay Measurements

When transmitting live, interactive media streams, one of the most critical aspects is the end-to-end latency experienced by the participants. The ideal transmission system would achieve very high quality while requiring minimal bandwidth and producing low latency. However, in a real-world design, a system must find a good compromise among all these requirements. Delays may result from many factors, including video encoding, decoding and network latency. Even monitors may introduce latency; for example, plasma displays contain video scalers that can introduce several frames of delay. Because many rendering tools demultiplex video and audio streams, it is important to measure both the video and the audio latency to understand whether the two streams remain synchronized.

Many available devices constrain some of the design space, and therefore parameters cannot be chosen arbitrarily. Let us consider the JVC JY-HD10U camera as an example video source. The built-in MPEG compressor is very effective, considering that the camcorder is designed to run on battery power for several hours. The unit produces a reasonable output bandwidth of 20 Mb/s. The high compression ratio is achieved by removing temporal redundancies across groups of six video frames (inter-frame coding in the MPEG standard with a group-of-pictures (GOP) size of 6). Therefore, video-frame data must be kept in the camera memory for up to six frames, resulting in (6 frames)/(30 frames per second) = 0.2 seconds of camera latency. Figure 3 shows the complete end-to-end delay across the following chain: camera (sensor and encoder), acquisition computer, network transmission, rendering computer (decoder), and display. The camera was directed at a stop-watch application running on a computer screen. A still picture was then taken showing both the stop-watch application and its rendered video. The latency for video only was approximately 310 milliseconds, which, considering the complex computations involved in encoding and decoding HD MPEG-2, is surprisingly good.
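One way to reason about where the roughly 310 milliseconds go is to keep an explicit per-stage latency budget, as in the sketch below. Only the 0.2 s GOP buffering and the approximately 310 ms total come from the measurements above; the remaining stage values are assumptions for illustration.

```python
# Illustrative latency budget; values other than the camera GOP buffering are assumed.
budget_ms = {
    "camera GOP buffering (6 frames / 30 fps)": 6 / 30 * 1000,   # 200 ms, from the text
    "acquisition + packetization (assumed)": 20,
    "network + jitter buffer (assumed)": 30,
    "decode + render + display (assumed)": 60,
}
total = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"{stage:45s} {ms:6.1f} ms")
print(f"{'total':45s} {total:6.1f} ms   (measured end-to-end: ~310 ms)")
```

A budget of this kind makes it clear that the encoder's GOP buffering dominates, so shortening the GOP is the most direct way to reduce end-to-end delay, at the cost of a higher bit rate.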

Audio delay can be measured with the set-up shown in Figure 4. An audio source produces a sound signal, which is then split with a Y-cable into two paths. The signal on the test path is sent to the microphone inputs of the camera and then passed through the HD live-streaming system. The audio signal is then converted back to its analog form in the rendering computer via a regular sound card. The measuring device records two audio signals arriving via the path directly from the source and the test path.

Figure 5 shows the original (top) and delayed (bottom) audio signals acquired by the measuring recorder. Due to effects of compression/decompression, the delayed audio signal looks different (i.e., with added noise) from the original waveform. Cross-correlation can be used to compare the two audio files and calculate the estimated delay.
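A cross-correlation based delay estimate can be sketched in a few lines of NumPy. This is a generic approach rather than the exact tool used to produce Figure 5, and it assumes both recordings share the same sample rate.

```python
import numpy as np

def estimate_delay(reference, delayed, sample_rate):
    """Estimate how many seconds `delayed` lags `reference` via cross-correlation."""
    ref = reference - np.mean(reference)          # remove DC offset before correlating
    sig = delayed - np.mean(delayed)
    corr = np.correlate(sig, ref, mode="full")    # lag of the correlation peak = delay
    lag = np.argmax(corr) - (len(ref) - 1)
    return lag / sample_rate

# Synthetic check: a 1 s noise burst delayed by 310 ms at 48 kHz, with added noise
# standing in for compression artifacts.
fs = 48000
burst = np.random.randn(fs)
delayed = np.concatenate([np.zeros(int(0.310 * fs)), burst])
delayed = delayed + 0.05 * np.random.randn(len(delayed))
print(f"estimated delay: {estimate_delay(burst, delayed, fs):.3f} s")   # ~0.310
```

Because the correlation peak is insensitive to the added noise, this estimate remains reliable even when the decompressed waveform differs visibly from the original.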

Application Areas

Many applications can benefit from HD live video streaming technology. Here we list several examples.

Video conferencing has become a common business tool, often replacing traditional face-to-face meetings. The user experience is very much dependent on the quality of the video and audio presented. Ideally, the technology should “disappear” and give way to natural interactions. HD video conferencing can aid in this goal in several ways. For example, users can be portrayed in life-size, which helps to create the effect of a “window” between the conferencing locations. In addition, non-verbal cues, which are important to human interactions, can be communicated via these high-resolution environments.

The AccessGrid is a collection of resources that support video and audio collaborations among a large number of groups. The resources include large display systems, powerful dedicated machines, high quality presentation and visualization environments, and high network bandwidth. AccessGrid can involve more than 100 sites in collaboration and each site is able to see all others. Unlike conventional H.323-based video conferencing systems, AccessGrid does not incorporate any MCU in its design. It uses IP-multicast to transmit the data from one site to all the others. An example AccessGrid system that delivers HD video was implemented at the Gwangju Institute of Science & Technology (GIST) Networked Media Laboratory. It uses the VideoLAN client (VLC) and DVTS (Digital Video Transport System) for high-quality video delivery.
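IP multicast delivery of the kind AccessGrid relies on can be sketched with standard sockets, as below; the group address and port are arbitrary examples, and a real deployment layers RTP framing and session management on top.

```python
import socket
import struct

GROUP, PORT = "239.1.2.3", 5004      # arbitrary example multicast group and port

def multicast_send(data: bytes):
    """Send one datagram that reaches every member of the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 32)
    sock.sendto(data, (GROUP, PORT))

def multicast_receive():
    """Join the group and receive one datagram sent by any site."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock.recvfrom(65535)
```

Because the network replicates multicast packets, each site sends its stream once regardless of how many of the hundred-plus participating sites are listening, which is what makes an MCU unnecessary in this design.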

The Distributed Immersive Performance (DIP) concept envisions the ultimate collaboration environment: a virtual space that enables live musical performances among multiple musicians, enjoyed by audiences, with all participants distributed around the globe. The participants, including subsets of musicians, the conductor and the audience, are in different physical locations. Immersive audio, high-quality video, and low-latency networks are the three major components required to bring DIP to reality. Early versions of the DIP concept have been successfully demonstrated by a number of research groups. An example is a June 2003 networked duo, in which two musicians played a tango together while physically present in two different locations.
