
Video Over IP


Artur Ziviani
National Laboratory for Scientific Computing (LNCC), Petropolis, Brazil

Marcelo Dias de Amorim
National Center for Scientific Research (CNRS/LIP6), Paris, France

Definition: Video over IP refers to the challenging task of defining standards and protocols for transmitting video over IP (Internet Protocol) networks.


The Internet is undoubtedly one of the greatest success stories in information technology. Its evolution can be explained from two complementary viewpoints. On the one hand, advances in communication and information technologies have allowed a rapid increase in transmission capacity in both wired and wireless domains. On the other hand, users have become more demanding and ask to transmit larger amounts of data of many kinds. In this context, transmitting video over IP networks is particularly challenging. Indeed, when the Internet was designed, the data rates expected at terminals were a few Kbps. With the advent of multimedia communications, bandwidth requirements are now measured in Mbps.

The Internet operates as a packet-switched network that interconnects end nodes implementing the TCP/IP protocol stack. In the TCP/IP protocol stack, the network layer protocol is the Internet Protocol (IP). Under IP, each host that communicates directly with the Internet has an address assigned to it that is unique within the network. This is known as the IP address of each host. Further, each IP address is subdivided into smaller parts: a network identifier and a host identifier. The former uniquely identifies the access network to which the host is attached. The latter indicates the host within a particular network. Routers periodically exchange routing information about address identifiers of the concerned access networks. As a result of this periodic information exchange, routers are able to build routing tables that guide packet forwarding among different networks. These tables are used at each intermediate router along a path within the network to indicate the forwarding interface an IP packet (datagram) should take in order to get to its destination.
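The split of an address into a network identifier and a host identifier can be illustrated with a minimal sketch based on Python's standard `ipaddress` module. The prefix length (`/24`) and the address itself are assumptions chosen for illustration; the text above does not fix where the boundary between the two parts lies.

```python
import ipaddress

def split_address(cidr):
    """Split 'address/prefix' into (network identifier, host identifier)."""
    iface = ipaddress.ip_interface(cidr)
    network_id = str(iface.network.network_address)
    # The host identifier is the part of the address not covered by the netmask.
    host_id = int(iface.ip) & int(iface.network.hostmask)
    return network_id, host_id

# 192.0.2.0/24 is a documentation-only range, used here as a placeholder.
print(split_address("192.0.2.77/24"))
```

Routers only need the network identifier to forward a packet toward the right access network; the host identifier matters once the packet reaches that network.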

In the IP layer, if the packet size exceeds the maximum frame size of the network connected to the output interface, the packet is further divided into smaller parts in a process called fragmentation. These fragments are then forwarded by the network in separate packets toward the destination. The destination is responsible for reassembling the fragments into the original packet. Note that if a fragment is lost within the network, the destination is unable to reassemble the packet that has been fragmented, and the surviving fragments are then discarded. The forwarding decision for each packet is taken individually at each intermediate node. As incoming IP packets arrive at an intermediate router, they are stored in a buffer while they wait for the router to make a routing decision that indicates the appropriate output interface for each of them. Persistent packet buffering at routers is known as network congestion. Furthermore, if a packet arrives at a router and the buffer is full, the packet is simply discarded by the router. Therefore, severe network congestion causes packet losses as buffers fill up. As a consequence of these characteristics, the IP service offers no guarantees of bounded delay or limited packet loss rate. There is also no assurance that packets of a single flow will follow the same path within the network or will arrive at the destination in the same order in which they were transmitted by the source.
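The fragmentation step can be sketched as follows. This is an illustrative model rather than a full IP implementation: it assumes IPv4 with a 20-byte header without options, and it follows the IPv4 convention that fragment offsets are expressed in units of 8 bytes. The function name and the tuple layout are hypothetical.

```python
IP_HEADER_BYTES = 20  # assumption: IPv4 header without options

def fragment(payload_len, mtu):
    """Return a list of (offset in 8-byte units, data bytes, more-fragments flag)."""
    # Every fragment except the last must carry a multiple of 8 data bytes.
    max_data = (mtu - IP_HEADER_BYTES) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    offset = 0
    while offset < payload_len:
        size = min(max_data, payload_len - offset)
        more = offset + size < payload_len
        fragments.append((offset // 8, size, more))
        offset += size
    return fragments

# A 4000-byte payload crossing a link with a 1500-byte maximum frame size.
for frag in fragment(4000, 1500):
    print(frag)
```

Because the destination needs every fragment to rebuild the packet, a large video packet split into three fragments is lost if any one of the three is dropped.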

IP provides a connectionless best-effort service to the transport layer protocols. The main transport protocols of the TCP/IP protocol stack are the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP). TCP offers a connection-oriented and reliable service. In turn, UDP provides a connectionless best-effort service. The choice of transport protocol depends on the requirements of the application. From the viewpoint of an application, UDP simply extends the service provided by IP with port-based multiplexing and a checksum. Hence, applications using UDP see the IP service as it is. In contrast with UDP, TCP seeks to mask the network service provided by IP from the application protocols by applying connection management, congestion control, error control, and flow control mechanisms.

The Real-time Transport Protocol (RTP) supports applications streaming audio and/or video with real-time constraints over the Internet. RTP is composed of a data part and a control part. The data part of RTP provides functionality suited for carrying real-time content, e.g., the timing information required by the receiver to correctly play out the received packet stream. The control part, called the Real-time Transport Control Protocol (RTCP), offers control mechanisms for synchronizing different streams with timing properties prior to decoding them. RTCP also provides reports to a real-time stream source on the network Quality of Service (QoS) offered to receivers in terms of the delay, jitter, and packet loss rate experienced by the received packets. These functionalities are also available for the multicast distribution of real-time streams.
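To make the timing information carried by the data part more concrete, the sketch below packs and parses the 12-byte fixed RTP header as laid out in RFC 3550 (version, marker, payload type, sequence number, timestamp, and synchronization source identifier). Padding, header extensions, and contributing-source lists are left out, and the helper names are hypothetical.

```python
import struct

def build_rtp_header(payload_type, seq, timestamp, ssrc, marker=False):
    """Pack a minimal RTP fixed header (no padding, extension, or CSRC list)."""
    byte0 = 2 << 6  # version 2 in the two most significant bits
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

def parse_rtp_header(data):
    """Unpack the first 12 bytes of an RTP packet into a dictionary."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", data[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }

# Payload type 31 is the static RTP/AVP assignment for H.261 video.
header = build_rtp_header(payload_type=31, seq=1, timestamp=90000, ssrc=0x1234)
print(parse_rtp_header(header))
```

The sequence number lets a receiver detect loss and reordering, while the timestamp tells it when each packet's content should be played out.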

The transmission of video over the Internet has received great attention from both the academic and industrial communities. In this text, we summarize some of the most important topics related to video transmission over the Internet. We give an overview of the domain by briefly addressing standards, characteristics, communication techniques, and video quality. This list is not exhaustive, but it is representative of the greatest advances in video transmission in recent years.

Video Requirements

The nature of the video content directly influences the compression rate achieved by a video encoder and the resulting traffic to be transmitted. For example, on the one hand, a news video sequence usually shows a person just narrating events and, as a consequence, most of the scene is still, which favors video compression techniques based on motion estimation. On the other hand, an action movie is less amenable to compression because of frequent camera movements and object displacements in its scenes. Furthermore, scene changes produce disruptions that result in larger coded frames. Therefore, video sequences with frequent scene changes, like music video clips, generate highly bursty network traffic.

The transmission of video sequences over the Internet imposes different requirements on video quality delivery. Processing, transmission, propagation, and queuing delays make up the total delay that a packet takes to travel from the video source to its destination. The processing delay consists of the coding and packetization at the source, the packet treatment at intermediate routers, and the depacketization and decoding at the receiver. The transmission delay is a function of the packet size and the transmission capacity of the links. The propagation delay is a characteristic of each communication medium. The queuing delay is unpredictable because it depends on the concurrent traffic that video packets encounter at each intermediate router. Maximum delay is an important metric for interactive applications with real-time requirements, such as videoconferencing and distance collaboration. Streaming video applications also depend on packets arriving within a bounded delay for timely content reproduction. If a packet arrives too late at the receiver, the packet is considered lost since it is useless for video playback. A solution is to transmit some packets in advance and accumulate them in a buffer before starting the video reproduction. Bounded jitter is also desirable because it reduces the buffer capacity needed at receivers to compensate for these variations in delay. IP is a connectionless protocol and hence packets may follow different paths through the network. As a consequence, they may arrive at the receiver out of order. All these issues affect the video quality perceived at the receiver by an end user.
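One common way to quantify the delay variation discussed above is the running interarrival jitter estimate that RFC 3550 defines for RTCP receiver reports. The sketch below applies that estimator to lists of send and receive times; the function name and the millisecond values are assumptions made for illustration.

```python
def interarrival_jitter(send_times, recv_times):
    """Running jitter estimate in the style of RFC 3550: J += (|D| - J) / 16."""
    jitter = 0.0
    for i in range(1, len(send_times)):
        # D compares the spacing of two packets at the receiver and at the sender.
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        jitter += (abs(d) - jitter) / 16
    return jitter

# Packets sent every 40 ms; the third one is delayed by an extra 16 ms.
print(interarrival_jitter([0, 40, 80], [10, 50, 106]))
```

A constant network delay yields zero jitter, whatever its magnitude; only variations in delay drive the estimate up, and these variations are what the playout buffer has to absorb.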

Video Compression Standards

As shown before, video data are in general demanding of network resources. In some cases, bandwidth is abundant and the requirements for transmitting video are not really an issue. However, the Internet is highly heterogeneous and still includes a large number of low-capacity terminal links. Algorithms for compressing information, and video data in particular, are in high demand and have been the focus of many research efforts in recent decades. In general, the idea behind compression is to assign fewer bits to low-priority events and more bits to high-priority events (the term priority is used here in a broad sense and can have different meanings depending on the context). Two bodies have contributed the most widely used algorithms and standards for video compression: MPEG and ITU.

The Moving Picture Experts Group (MPEG), launched in 1988, is an ISO/IEC working group that develops standards for both audio and video digital formats and for multimedia description frameworks. The main standards for video coding released by this group include MPEG-1, MPEG-2, and MPEG-4. The MPEG-1 standard defines a series of encoding techniques for both audio and video (video is part 2 of the standard), designed for generating flows of up to 1.5 Mbps. The main goal of MPEG-1 was to address the problem of storing video on CD-ROMs, and it has become a successful format for video exchange over the Internet. However, rates higher than the 1.5 Mbps of MPEG-1 rapidly became necessary. This led to the definition of MPEG-2, which defines rates from 1.5 to tens of Mbps. MPEG-2 is based on MPEG-1, but proposes a number of new techniques to address a much larger number of potential applications, including digital video storage and transmission, high-definition television (HDTV), and digital video disks (DVDs). MPEG-1 and MPEG-2 achieve good compression ratios by implementing causal and non-causal prediction. Briefly, a video flow is defined as a sequence of groups of pictures (GOPs), each composed of three types of frames (or pictures): I, P, and B. I-frames (intra-coded frames) are compressed without reference to other frames, at relatively low compression ratios, and serve as references for the computation of P and B frames. P-frames are obtained from a past I-frame or P-frame by using motion prediction, and can thus be encoded with higher compression ratios. B-frames, or bidirectional frames, are based on both previous and future I and P frames within a GOP, which leads to high compression ratios. Figure 1 illustrates this hierarchical structure of GOPs.
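The GOP structure can be made concrete with a small sketch that generates the frame-type pattern of one GOP in display order. The parameters `n` (GOP length) and `m` (distance between consecutive anchor frames, i.e. I or P) follow a common naming convention but are assumptions here; the text does not prescribe a particular pattern.

```python
def gop_pattern(n, m):
    """Frame types of one GOP in display order.

    n: number of frames in the GOP (assumed parameter name)
    m: spacing between anchor frames (I or P); m - 1 B-frames sit between them
    """
    if n < 1 or m < 1:
        raise ValueError("n and m must be positive")
    frames = []
    for i in range(n):
        if i == 0:
            frames.append("I")   # each GOP starts with an intra-coded frame
        elif i % m == 0:
            frames.append("P")   # predicted from the previous anchor frame
        else:
            frames.append("B")   # predicted from surrounding anchor frames
    return "".join(frames)

print(gop_pattern(12, 3))
```

With `n = 12` and `m = 3`, only one frame in twelve is intra-coded, which is where most of the compression gain comes from, at the price of dependencies between frames.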

MPEG-4 part 2 (see also H.264 below) covers a gap in the objectives of the previous two standards: the need for a flexible framework that can adapt to the wide range of applications and the high heterogeneity found in the Internet. MPEG-4 targets scaling from a few Kbps to moderately high bit rates (about 4 Mbps). The main innovation brought by MPEG-4 was the concept of video objects, in which a scene is decomposed into a number of objects that can be treated differently from one another during the video transmission (e.g., prioritization among objects).

The International Telecommunication Union (ITU), in its H series (e.g., H.310, H.320, and H.324), has also addressed the problem of transmitting compressed video over the Internet. The first standard proposed by ITU is the H.261 video codec. In order to deal with a wide range of communication patterns, the H.261 standard defines a number of methods for coding and decoding video at rates of p×64 Kbps, where p varies in the range of 1 to 30. Later, three other standards appeared: H.262, H.263, and H.264. The H.262 standard targets higher bit rates (in fact, H.262 is the result of a partnership that also led to MPEG-2), and has not been widely implemented in the H.320 series. Both H.261 and H.263 are based on the same principles, although H.263 introduces a number of improvements that lead to equivalent video quality at as little as half of the bandwidth. H.261 supports both QCIF (176×144 pixels) and CIF (352×288 pixels) resolutions, whereas H.263 also supports SQCIF (128×96 pixels), 4CIF (704×576 pixels), and 16CIF (1408×1152 pixels). The H.264 standard, also known as MPEG-4 part 10 or H.264/AVC, is the result of joint work between ITU and MPEG (the partnership is called the Joint Video Team, JVT), with the objective of defining a codec capable of generating, without making the design impractically complex, good video quality at lower bit rates than previous formats (H.262, MPEG-2, MPEG-4 part 2). H.264/AVC makes use of advanced coding techniques in order to generate video for a wide range of applications, from low to high resolution and at varying bit rates (for example, DVD, broadcast, 3G mobile content).
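A short calculation helps put the p×64 Kbps rates in perspective. The sketch below compares the top H.261 rate with the raw bit rate of uncompressed CIF video. The 30 frames per second and the 12 bits per pixel (4:2:0 chroma subsampling) are assumptions introduced for this illustration, not figures from the text.

```python
def h261_rate_kbps(p):
    """Channel rate of H.261 for a given multiplier p (1 to 30)."""
    if not 1 <= p <= 30:
        raise ValueError("p must be between 1 and 30")
    return p * 64

def raw_bitrate_bps(width, height, fps, bits_per_pixel=12):
    """Uncompressed bit rate; 12 bits/pixel assumes 4:2:0 chroma subsampling."""
    return width * height * bits_per_pixel * fps

raw = raw_bitrate_bps(352, 288, 30)       # CIF at an assumed 30 frames/s
coded = h261_rate_kbps(30) * 1000         # highest H.261 rate, in bit/s
print(raw, coded, round(raw / coded, 1))  # raw rate, coded rate, ratio
```

Under these assumptions, even the highest H.261 rate requires the encoder to shrink the raw stream by more than an order of magnitude, and the lower values of p demand far more.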

Session Setup

An important issue in multimedia communications is how sessions are set up. Session establishment applies to multimedia communications in general, and a session can be set up in two different ways: by announcement or by invitation. A TV program is an example of a session established by announcement, whereas a videophone call is an example of a session established by invitation. The Internet Engineering Task Force (IETF) has proposed a set of protocols for session setup, covering the description, announcement, and invitation phases. The ITU-T has also proposed a standard, H.323, but in this document we focus on the IETF’s solutions.

The common protocol for session setup is the Session Description Protocol (SDP), which is used to describe the characteristics of the session to be established. In SDP, sessions are characterized by a well-defined set of descriptors in textual form, including the name of the session, its objective, the associated protocols, and information about codecs and timing, among others. SDP complements the protocols described next. The Session Announcement Protocol (SAP) is a very simple protocol for announcing future multimedia sessions. It basically sends the SDP description of the session periodically over a multicast session. Somewhat more complex is the Session Initiation Protocol (SIP), proposed to control multimedia sessions by invitation. The main contribution of SIP is the way it addresses corresponding nodes. In the classical telephone network, the initiator of a call knows exactly where the destination phone is physically located, but can be sure neither that the person who answers is the one being sought nor that this person will be at the other end of the line. In SIP, the idea is to call a person and not the phone this person may be near.
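As an illustration of the textual descriptors, the sketch below assembles a minimal SDP description for a multicast video session. The field letters (`v=`, `o=`, `s=`, `i=`, `c=`, `t=`, `m=`, `a=`) follow the SDP specification, while the session name, addresses, port, and the choice of H.261 are placeholder values chosen for this example.

```python
def build_sdp(session_name, info, origin_host, group_addr, port, payload_type, encoding):
    """Assemble a minimal SDP session description (all argument values are examples)."""
    lines = [
        "v=0",                                      # protocol version
        f"o=- 1 1 IN IP4 {origin_host}",            # origin: user, session id, version, address
        f"s={session_name}",                        # name of the session
        f"i={info}",                                # objective of the session
        f"c=IN IP4 {group_addr}/127",               # multicast group and TTL
        "t=0 0",                                    # timing: unbounded session
        f"m=video {port} RTP/AVP {payload_type}",   # media type, port, transport, format
        f"a=rtpmap:{payload_type} {encoding}",      # codec name and clock rate
    ]
    return "\r\n".join(lines) + "\r\n"

print(build_sdp("Lecture", "Weekly seminar", "192.0.2.10",
                "224.2.17.12", 49170, 31, "H261/90000"))
```

A description of this kind is what SAP would announce periodically over multicast, or what SIP would carry in the body of an invitation.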

Video Distribution

The way the video is transmitted from the source to the receivers has a direct impact both on the utilization of network resources and on the quality of the video perceived at the terminals. The trivial way of sending a video to a destination is to establish a unicast connection between the source and the destination. This approach is, however, not efficient if the same video must be sent to more than one receiver (potentially hundreds or thousands of them). Another class of communication, extensively studied in recent years, provides interesting properties for video communication: multicast. In this paradigm, data sent by the source are received by a number of destinations belonging to a multicast group. The advantage of multicasting is that the source transmits only one copy of the data to multiple receivers, instead of occupying the medium with multiple copies of the same data, thus avoiding the inefficient utilization of network resources.

Distributing video in multicast mode raises a number of technical issues. End systems in the Internet are quite heterogeneous in terms of receiving capacity. Indeed, access speeds vary from a few Kbps up to many Mbps. Thus, there is the problem of determining the rate at which the source should send the video. On the one hand, if the source sets the rate to the minimum receiving capacity (in order to maximize the number of destinations able to receive the video), high-speed terminals are underutilized. On the other hand, if the source increases the transmission rate, high-speed terminals can be provided with improved video quality, but slow terminals can no longer be served.

In order to address this problem, two techniques have been proposed so far: multiple flows and multi-layering. In the first one, the raw video is coded at different rates and each flow is sent over a different multicast session. In this way, slow terminals join multicast groups that transfer low-rate flows, whereas high-speed receivers join higher-rate multicast sessions. This solution satisfies the receivers, but leads to higher network loads; indeed, bandwidth is wasted on transmitting different flows of the same video. The technique of multi-layering addresses exactly this point. This approach relies on the capacity of some coders to split the raw video into multiple layers. These layers are hierarchically organized into a base layer and one or more enhancement layers. An enhancement layer cannot be decoded without the base layer and the enhancement layers of higher priority. In this way, the base layer plus two enhancement layers result in better video quality than the base layer plus one enhancement layer. A system that adopts a multi-layer approach is similar to one that uses multiple flows. Each layer is transmitted over a particular multicast group, and receivers join as many groups as they want or are able to receive simultaneously. The multi-layering solution, although more expensive to implement because it needs a multi-layer codec, solves both the satisfaction and the overhead problems.
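The two techniques can be compared with a small sketch. It assumes a hypothetical video split into three layers of 128, 256, and 512 Kbps, and that the multiple-flows alternative encodes the same three quality levels as independent flows. The function names and rates are illustrative assumptions.

```python
def layers_to_join(layer_rates_kbps, capacity_kbps):
    """Number of layers (base first, then enhancements in order) a receiver can afford."""
    joined, total = 0, 0
    for rate in layer_rates_kbps:
        if total + rate > capacity_kbps:
            break  # a layer is useless without all the layers below it
        total += rate
        joined += 1
    return joined

def source_load_kbps(layer_rates_kbps, layered):
    """Total rate the source must send for all quality levels."""
    if layered:
        return sum(layer_rates_kbps)  # each layer is sent once
    # Multiple flows: every quality level is a complete, independent flow.
    load, cumulative = 0, 0
    for rate in layer_rates_kbps:
        cumulative += rate
        load += cumulative
    return load

rates = [128, 256, 512]
print(layers_to_join(rates, 400))             # receiver with 400 Kbps of capacity
print(source_load_kbps(rates, layered=True))  # multi-layering
print(source_load_kbps(rates, layered=False)) # multiple flows
```

In this example both approaches offer the same three quality levels, but the layered source sends 896 Kbps in total instead of 1408 Kbps, which is the overhead saving described above.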

More recently, some interesting solutions have proposed using peer-to-peer concepts to distribute video over the Internet, for both video on demand and streaming. In the peer-to-peer communication paradigm, end systems establish connections among themselves, forming a logical structure called an overlay network. Peer-to-peer techniques can be used for video communication in two different, not necessarily orthogonal, ways. The first one deals with the absence of central servers, where each end system is both a server and a client. The second one, more closely related to the context of this text, concerns real-time video distribution. Although native multicast, as described above, has been shown to be an efficient solution for multimedia distribution, it has not been widely deployed yet. For this reason, many solutions for overlay multicast have been proposed. With this solution, a logical distribution tree is established from the source to the receivers, where the receivers themselves occupy nodes of the tree. The advantage of this approach is that the load on the source’s output link is dramatically reduced without requiring the implementation of native multicast.
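The reduction of the load on the source's output link can be seen in a toy overlay tree. The sketch below attaches receivers breadth-first, giving each node at most `fanout` children. This placement rule is an assumption made for illustration; actual overlay multicast proposals use more elaborate criteria such as measured delay or available bandwidth.

```python
from collections import deque

def build_overlay_tree(source, receivers, fanout):
    """Attach receivers breadth-first; each node forwards to at most `fanout` children."""
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    children = {source: []}
    open_nodes = deque([source])  # nodes that may still accept children
    for receiver in receivers:
        while len(children[open_nodes[0]]) >= fanout:
            open_nodes.popleft()
        parent = open_nodes[0]
        children[parent].append(receiver)
        children[receiver] = []
        open_nodes.append(receiver)
    return children

receivers = [f"r{i}" for i in range(1, 7)]
tree = build_overlay_tree("source", receivers, fanout=2)
print(tree)
# With unicast the source would send len(receivers) copies of the stream;
# with the overlay tree it sends only len(tree["source"]) copies.
print(len(receivers), len(tree["source"]))
```

The remaining copies are forwarded by the receivers themselves, which is how the overlay trades load on the source for load on the end systems.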

Quality of Service (QoS)

Quality video delivery over the widely deployed IP-based networks is required by several applications, such as distance learning and collaboration, video distribution, telemedicine, video conferencing, and interactive virtual environments. Nevertheless, the best-effort model of the conventional Internet has become inadequate to deal with the very diverse network QoS requirements of video streams. The hierarchical structure of video encoding, with possible error propagation through its frames, makes it very difficult to send video streams over lossy networks, because small packet loss rates may translate into much higher frame error rates. Besides being lost, some packets may also suffer unpredictable amounts of delay or jitter due to network congestion at intermediate routers, compromising their ability to meet real-time constraints. All these issues related to the best-effort service of IP may seriously degrade the quality perceived by an end user during video reproduction.
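The amplification from packet loss rate to frame error rate can be quantified with a simple model. The sketch assumes that packet losses are independent and that a frame becomes undecodable as soon as one of its packets is lost. Both are simplifying assumptions, and error propagation to dependent frames is not modeled.

```python
def frame_error_rate(packet_loss_rate, packets_per_frame):
    """Probability that at least one packet of a frame is lost (independent losses)."""
    return 1 - (1 - packet_loss_rate) ** packets_per_frame

# A 1% packet loss rate applied to frames carried in 1, 5, and 20 packets.
for n in (1, 5, 20):
    print(n, round(frame_error_rate(0.01, n), 4))
```

Under this model, a large frame spread over 20 packets is lost far more often than any single packet, and the loss of a reference frame further affects every frame predicted from it.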

One way to recover from transmission errors, which can render a frame undecodable, is to apply error concealment techniques. When a frame is missing due to errors, error concealment may replace the lost frame with a previous one or roughly estimate it from adjacent well-received frames. Further, error concealment techniques differ in the roles the encoder and decoder play in recovering from errors. We direct the interested reader to the literature for a full review of error concealment techniques. Different QoS schemes may also be applied to the video stream in order to adapt it for transmission under the given network conditions. These adaptive strategies involve either applying redundancy, by using Forward Error Correction (FEC) to tolerate some losses, or using a different compression factor in the video encoding, which may be achieved by changing the adopted GOP pattern. FEC schemes protect video streams against packet losses up to a certain level at the expense of data redundancy. Adopting different frame patterns allows a video stream to better adapt itself to the available transmission conditions. Within the network, unequal protection based on frame type may avoid quality degradation due to the loss of one particularly important frame and the possible propagation of this loss throughout the hierarchical structure of compressed video streams. The joint adoption of these QoS strategies may also contribute to better delivery quality for video streams. Figure 2 shows sample visual results of the negative effects that the best-effort service of IP may impose on video sequences and how the application of QoS mechanisms can enhance the video delivery quality for an end user.
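A very simple form of FEC illustrates the trade-off between redundancy and loss tolerance. The sketch below adds one XOR parity packet to a group of equal-length packets, which allows the receiver to rebuild any single lost packet of the group. This parity scheme is chosen only for illustration; the text does not specify which FEC code is used, and practical systems often rely on stronger codes.

```python
def xor_parity(packets):
    """Byte-wise XOR of a group of equal-length packets."""
    parity = bytearray(len(packets[0]))
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)

def recover_missing(received, parity):
    """Rebuild the single missing packet (marked None) from the others and the parity."""
    if received.count(None) != 1:
        raise ValueError("exactly one packet must be missing")
    present = [p for p in received if p is not None]
    # XOR-ing the parity with all surviving packets cancels them out.
    return xor_parity(present + [parity])

group = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(group)                 # one redundant packet per three data packets
print(recover_missing([b"abcd", None, b"ijkl"], parity))
```

Here the redundancy is one extra packet for every three data packets, and the protection stops at one loss per group; two losses in the same group cannot be repaired.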

Summary and Outlook

An ever-increasing number of users rely on the Internet to communicate and exchange data using a wide range of applications. In particular, multimedia applications, such as those that make use of video transmission, are in high demand. Nevertheless, the conventional best-effort model of the Internet does not necessarily meet the requirements that video applications impose on both network-level and application-level QoS. In this text, we review the issues associated with transmitting video over IP, the network-layer protocol that interconnects the different networks that compose the Internet. We discuss relevant characteristics of the Internet protocols, pointing out the challenges they pose to the quality transmission of video streams in IP-based networks. We also present techniques and standards for video coding, distribution, and quality of service provision given the requirements imposed by video applications.

Currently, we are witnessing voice over IP applications becoming very popular as the offered quality increases and users are attracted by telephone-like quality at significantly lower costs. Video over IP has the potential to follow a similar path in the foreseeable future, given that it enables a wide range of interesting applications, including videoconferencing, video on demand, telemedicine, distance learning and collaboration, remote surveillance, and interactive virtual environments. Further developments in areas such as video adaptability to ever-changing network conditions, QoS provision and monitoring, and video distribution are essential for these video over IP applications to become a reality in the daily life of the general public.
