Audio Streaming - Introduction, Audio Compression, Dissemination over the Network, Real-time Transport Protocol (RTP)

Shervin Shirmohammadi
University of Ottawa, Canada

Jauvane C. de Oliveira
National Laboratory for Scientific Computation, Petropolis, RJ, Brazil

Definition: Audio streaming refers to the transfer of audio across the network such that the audio can be played by the receiver(s) in real-time as it is being transferred.


Audio streaming can be used for live media, such as the Internet broadcast of a concert, or for stored media, such as listening to an online jukebox. Real-time transfer and playback are the key requirements of audio streaming. As such, other approaches, such as downloading an entire file before playing it, are not considered to be streaming. From a high-level perspective, an audio streaming system needs to address three issues: audio compression, dissemination over the network, and playback at the receiver.

Audio Compression

Whether the audio comes from a pre-stored file or is captured live, it needs to be compressed to make streaming over a network practical. Uncompressed audio is bulky and usually unsuitable for transmission over the network. For example, even low-quality 8 kHz, 8-bit speech in the PCM format takes from 56 to 64 kbps; anything of higher quality takes even more bandwidth. Compression is therefore necessary for audio streaming. Table 1 lists several streaming standards, with the typical target bitrate, relative delay, and usual target applications.
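
The PCM figures above follow from simple arithmetic: bitrate is the sampling rate times the bits per sample times the number of channels. The following is a minimal sketch of that calculation; the function name is hypothetical, and the reading of the 56 kbps figure as 7 usable bits per sample, as well as the CD-quality example, are assumptions added for illustration rather than figures taken from Table 1.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Raw (uncompressed) PCM bitrate in kilobits per second (1 kbps = 1000 bit/s)."""
    return sample_rate_hz * bits_per_sample * channels / 1000


# 8 kHz, 8-bit mono speech, as in the text.
speech_kbps = pcm_bitrate_kbps(8000, 8)

# Assumption: the 56 kbps lower figure corresponds to 7 usable bits per sample.
speech_7bit_kbps = pcm_bitrate_kbps(8000, 7)

# Assumption for comparison: 44.1 kHz, 16-bit stereo audio.
cd_kbps = pcm_bitrate_kbps(44100, 16, channels=2)

print(speech_kbps, speech_7bit_kbps, cd_kbps)
```

Under these assumptions the speech stream needs 64 kbps (or 56 kbps with 7 bits), while the higher-quality stereo stream needs well over 1 Mbps, which illustrates why compression is needed.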

It should be noted that the delays listed in Table 1 are algorithmic delays; that is, they depend on the compression algorithm. For example, for PCM at 8 kHz sampling, there is one sample every 0.125 milliseconds. Sample-based compression schemes, such as PCM and ADPCM, are usually much faster than those that achieve compression based on a model of the human vocal system or a psychoacoustic model, such as LPC, CELP, and MP3. So, for delay-conscious applications such as audio streaming, which needs to be in real time, one may select a sample-based encoding scheme, bandwidth permitting; otherwise one of the latter is a better choice, as they achieve higher compression. For a detailed discussion of audio compression schemes, please see the “Compression and Coding Techniques, Audio” article. In the streaming context, the Real Audio (ra) and Windows Media Audio (wma) formats, from Real Networks and Microsoft Corp. respectively, are also used quite often.

Dissemination over the Network

Unlike elastic traffic such as email or file transfer, which is not severely affected by delays or irregularities in transmission speed, continuous multimedia data such as audio and video are inelastic. These media have a “natural” flow and are not very flexible. Interruptions in streamed audio are undesirable and create a major problem for the end user because they distort its real-time nature. It should be pointed out that delay is not always detrimental for audio, as long as the flow is continuous. For example, consider a presentational application where the audio is played back to the user with limited interaction capabilities such as play/pause/open/close. In such a scenario, if the entire audio is delayed by a few seconds, the user’s perception of it is not affected, because there is no reference point, as long as there are no interruptions. However, for a conversational application such as audio conferencing, where users interact with each other, audio delay must not exceed certain thresholds because of the interaction and the existence of reference points between the users.

The transport protocol used for audio streaming must be able to handle its real-time nature. One of the most commonly used real-time protocols for audio streaming is the Real-time Transport Protocol (RTP), which is typically used with the Real Time Streaming Protocol (RTSP) for exchanging commands between the player and the media server, and sometimes with the Real-time Transport Control Protocol (RTCP) for Quality of Service (QoS) monitoring and other functions. These protocols are briefly discussed next.

Real-time Transport Protocol (RTP)

To accommodate the inelasticity of audio streaming, there is a need for special networking protocols. The most common such protocol is the Real-time Transport Protocol (RTP). It is usually implemented as an application-level framing protocol on top of UDP, as shown in Figure 1. It should be noted that RTP is named as such because it is used to carry real-time data; RTP itself does not guarantee real-time delivery of data. Real-time delivery depends on the underlying network; therefore, a transport-layer or an application-layer protocol cannot guarantee real-time delivery because it cannot control the network. What makes RTP suitable for multimedia data, compared to other protocols, are two of its header fields: Payload Type, which indicates what type of data is being transported (Real Audio, MPEG Video, etc.), and Timestamp, which provides the temporal information for the data. Together with the Sequence Number field of the RTP header, these fields enable real-time playing of the audio at the receiver, network permitting. RTP supports multi-point to multi-point communications, including UDP multicasting.
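
To make the role of the Payload Type, Sequence Number, and Timestamp fields concrete, the following is a minimal sketch that packs and parses the 12-byte fixed RTP header. It assumes the standard RTP version 2 layout with no padding, no header extension, and no contributing sources; the version, marker, and SSRC fields are part of that layout but are not discussed in this article. The function and class names are hypothetical, and payload type 0 (PCM mu-law audio in the standard audio/video profile) is used only as an example.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    marker: bool
    payload_type: int      # what kind of media the payload carries
    sequence_number: int   # lets the receiver detect loss and reordering
    timestamp: int         # sampling instant, in units of the media clock
    ssrc: int              # identifier of the sending source


def pack_rtp_header(payload_type: int, sequence_number: int,
                    timestamp: int, ssrc: int, marker: bool = False) -> bytes:
    """Build a 12-byte RTP header (version 2, no padding/extension/CSRC)."""
    byte0 = 2 << 6
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1,
                       sequence_number & 0xFFFF,
                       timestamp & 0xFFFFFFFF,
                       ssrc & 0xFFFFFFFF)


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the fixed part of an RTP header from the start of a packet."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the 12-byte RTP header")
    byte0, byte1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(byte0 >> 6, bool(byte1 >> 7), byte1 & 0x7F, seq, ts, ssrc)


# Example: second packet of an 8 kHz stream carrying 20 ms (160 samples) per packet.
header = pack_rtp_header(payload_type=0, sequence_number=1,
                         timestamp=160, ssrc=0x1234ABCD)
print(parse_rtp_header(header + b"audio payload bytes"))
```

A receiver can use the sequence number to notice missing packets and the timestamp to schedule each packet's playback, which is what makes playout possible even though UDP itself provides neither.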

Real-time Transport Control Protocol (RTCP)

RTP is only responsible for transferring the data. For more capabilities, RTP’s companion protocol, the Real-time Transport Control Protocol (RTCP), can be used. RTCP is typically used in conjunction with RTP, and it also uses UDP as its delivery mechanism. RTCP provides many capabilities; the most used ones are:

  • QoS feedback: receivers can report the quality of their reception to the sender. This can include the number of lost packets or the round-trip delay, among other things. This information can be used by the sender to adapt the source, if possible. Note that RTCP does not specify how the media should be adapted; that functionality is outside of its scope. RTCP’s job is to inform the sender about the QoS conditions currently experienced in the transmission. It is up to the sender to decide what actions to take for a given QoS condition.
  • Intermedia synchronization: information that is necessary for the synchronization of sources, such as between audio and video, can be provided by RTCP.
  • Identification: information such as the e-mail address, phone number, and full name of the participants can also be provided.
  • Session Control: participants can send small notes to each other, such as “stepping out of the office”, or indicate they are leaving using the BYE message, for example.
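
The QoS feedback item above can be illustrated with a small sketch. The first function estimates packet loss from the RTP sequence numbers a receiver has seen, which is the kind of figure a receiver could report. The second is a purely hypothetical sender policy, since RTCP leaves the reaction to the sender; the thresholds and the bitrate ladder are assumptions chosen for illustration. For brevity the sketch ignores 16-bit sequence number wraparound.

```python
from typing import Iterable, Sequence, Tuple


def reception_stats(received_seqs: Iterable[int]) -> Tuple[int, int, float]:
    """Return (expected, lost, fraction_lost) for the observed sequence numbers.

    Assumes no sequence number wraparound; duplicates are counted once.
    """
    unique = set(received_seqs)
    if not unique:
        raise ValueError("no packets received")
    expected = max(unique) - min(unique) + 1
    lost = expected - len(unique)
    return expected, lost, lost / expected


def adapt_bitrate(current_kbps: int, fraction_lost: float,
                  ladder: Sequence[int] = (16, 32, 64)) -> int:
    """Hypothetical sender policy: step down on heavy loss, up on clean reception."""
    idx = list(ladder).index(current_kbps)
    if fraction_lost > 0.05 and idx > 0:
        return ladder[idx - 1]
    if fraction_lost < 0.01 and idx < len(ladder) - 1:
        return ladder[idx + 1]
    return current_kbps


# Packets 3 and 6 never arrived.
expected, lost, fraction = reception_stats([1, 2, 4, 5, 7])
print(expected, lost, round(fraction, 3))
print(adapt_bitrate(64, fraction))
```

The split between the two functions mirrors the division of labour described above: the report only describes the conditions, and the adaptation decision stays with the sender.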

Real Time Streaming Protocol (RTSP)

Unlike RTP, which transfers real-time data, the Real Time Streaming Protocol (RTSP) is only concerned with sending commands between a receiver’s audio player and the audio source. These commands include methods such as SETUP, PLAY, PAUSE, and TEARDOWN. Using RTSP, an audio player can set up a session between itself and the audio source. The audio is then transmitted over some other protocol, such as RTP, from the source to the player. Similar to RTP, RTSP is not real-time by itself. Real-time delivery depends on the underlying network.
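
RTSP commands are text messages similar in shape to HTTP requests: a request line, a CSeq header that numbers the request, optional further headers, and a blank line. The following is a minimal sketch of how a player might format such requests. The helper name, the client port numbers, and the session identifier are hypothetical; in a real exchange the session identifier is returned by the server in its reply to SETUP.

```python
from typing import Dict, Optional


def build_rtsp_request(method: str, url: str, cseq: int,
                       headers: Optional[Dict[str, str]] = None) -> bytes:
    """Format an RTSP/1.0 request as bytes ready to send over a TCP connection."""
    lines = [f"{method} {url} RTSP/1.0", f"CSeq: {cseq}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


url = "rtsp://www.audioserver.com/audio.mp3"

# Ask the server to prepare an RTP-over-UDP stream to hypothetical client ports.
setup = build_rtsp_request("SETUP", url, 1, {
    "Transport": "RTP/AVP;unicast;client_port=5004-5005",
})

# "12345678" stands in for the session identifier the server would return.
play = build_rtsp_request("PLAY", url, 2, {"Session": "12345678"})
teardown = build_rtsp_request("TEARDOWN", url, 3, {"Session": "12345678"})

print(setup.decode("ascii"))
```

Note that none of these messages carries audio; they only control the session, while the media itself flows over RTP.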

A Typical Scenario

Figure 2 demonstrates a typical scenario of audio streaming.

Here, the client first goes to a web site, where there is a link to the audio. Upon clicking on that link, the Web server sends to the receiver the URL where the audio can be found. In the case of an RTSP session, the link looks something like rtsp://www.audioserver.com/audio.mp3. Note that the Web server and the audio server do not have to be the same entity; it is quite possible to separate them for better maintenance. After receiving the above link, the client’s player establishes an RTSP session with the audio server through the SETUP command. The client can then interact with the server by sending PLAY, PAUSE, and other commands. Once the audio is requested for playing, RTP is used to carry the actual audio data. At the same time, RTCP can be used to exchange control and feedback information between the client and the server. Finally, the session is finished with the TEARDOWN command of RTSP.

HTTP Streaming

HTTP streaming is an alternative to using RTP. The idea here is that the player simply requests the audio from the Web server over HTTP, and plays it as the audio data comes in from the Web server. The disadvantages of this approach are the lack of the RTP/RTCP features discussed above, and the fact that HTTP uses TCP, which is not considered a real-time protocol, especially under less-than-ideal network conditions. As such, there can be more interruptions and delay associated with HTTP streaming compared to RTP streaming. However, HTTP streaming is used quite commonly in cases where the receiver has a high-speed Internet connection such as DSL and the audio bandwidth is not very high. In these cases, using HTTP streaming can be justified by its advantages; namely, that HTTP is usually allowed to go through firewalls, and that HTTP streaming is easier to implement, as one can simply use an existing Web server.
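
The essence of HTTP streaming is that the player consumes the response body in small pieces as it arrives instead of waiting for the whole file. The following is a minimal sketch of that loop over any file-like byte source. The function names and the `play_chunk` callback are hypothetical; the callback stands in for a real decoder and audio output. In practice the source could be the response object returned by `urllib.request.urlopen`, which supports `read(n)`; an in-memory buffer is used here so that the example runs without a network.

```python
import io
from typing import BinaryIO, Callable, Iterator


def stream_chunks(source: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield data from the source as soon as each chunk has been read."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:          # empty bytes means end of stream
            break
        yield chunk


def play_stream(source: BinaryIO, play_chunk: Callable[[bytes], None],
                chunk_size: int = 4096) -> int:
    """Hand each chunk to the player as it arrives; return total bytes played."""
    total = 0
    for chunk in stream_chunks(source, chunk_size):
        play_chunk(chunk)
        total += len(chunk)
    return total


# Stand-in for an HTTP response body: 10,000 bytes of fake audio data.
fake_response = io.BytesIO(b"\x00" * 10_000)
played = []
total_bytes = play_stream(fake_response, played.append)
print(total_bytes, [len(c) for c in played])
```

Because the data arrives over TCP, a lost segment stalls `read` until it is retransmitted, which is the source of the extra interruptions and delay mentioned above.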

Playback at the Receiver

Although the coding and transmission techniques described above contribute significantly to the audio streaming process, the ultimate factor determining the real-time delivery of audio is the network condition. As mentioned above, delay severely affects audio in conversational applications. For presentational applications, delay is less detrimental as long as it is reasonable for the given application and relatively constant. However, even for presentational applications, the variance of delay, known as jitter, has an adverse effect on the presentation. In order to smooth out the delay, the player at the receiver’s end usually buffers the audio for a certain duration before playing it. This provides a “safety margin” in case the transmission is interrupted for short durations. Note that the buffer cannot be too large, since that makes the user wait too long before actually hearing the audio, and it cannot be too small, since it would then not really mitigate the effect of jitter.
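
The buffering trade-off can be illustrated with a small simulation. In this sketch, packets carry a fixed duration of audio, playback starts a chosen buffering delay after the first packet arrives, and a packet is late if it arrives after the moment it is due to be played. The function name and the arrival times are assumptions made up for illustration.

```python
from typing import List, Sequence


def late_packets(arrival_ms: Sequence[float], packet_period_ms: float,
                 buffer_ms: float) -> List[int]:
    """Return the indices of packets that miss their playout deadline.

    Packet i is due at: first arrival + buffer_ms + i * packet_period_ms.
    """
    playout_start = arrival_ms[0] + buffer_ms
    return [i for i, arrived in enumerate(arrival_ms)
            if arrived > playout_start + i * packet_period_ms]


# 20 ms packets; packets 1, 3, and 4 are delayed by network jitter.
arrivals = [0, 25, 40, 95, 96, 100]

for buffer_ms in (0, 10, 40):
    print(buffer_ms, late_packets(arrivals, 20, buffer_ms))
```

With no buffer three packets miss their deadline, with 10 ms of buffering two still do, and with 40 ms none do, at the cost of the user waiting 40 ms longer before hearing anything.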

An extension to the above buffering technique is the faster-than-natural transmission of audio data. Depending on the buffer size of the receiver, the sender can transmit the audio faster than its normal playing speed so that, if there are transmission interruptions, the player has enough data to play back for the user. This technique works for stored audio, but it does not apply to live applications, where the source produces audio at its natural speed.
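
A rough way to reason about this technique: if the sender transmits at a multiple of the playing speed, the receiver's buffer gains the difference for every second that passes, up to the buffer's capacity. The following is a simplified sketch under the assumptions that playback starts immediately with an empty buffer and that the transfer is otherwise uninterrupted; the function names and the numbers are hypothetical.

```python
def buffered_ahead_s(elapsed_s: float, speedup: float, capacity_s: float) -> float:
    """Seconds of audio held in the buffer beyond what has been played."""
    gained = (speedup - 1.0) * elapsed_s
    return min(capacity_s, max(0.0, gained))


def survives_outage(elapsed_s: float, speedup: float,
                    capacity_s: float, outage_s: float) -> bool:
    """True if the buffered audio covers a transmission outage of outage_s seconds."""
    return buffered_ahead_s(elapsed_s, speedup, capacity_s) >= outage_s


# Sending at 1.5x for 20 s would bank 10 s of audio, but the buffer holds only 8 s.
print(buffered_ahead_s(20, 1.5, capacity_s=8))
print(survives_outage(20, 1.5, capacity_s=8, outage_s=5))

# A live source can only send at natural speed (1.0x), so nothing accumulates.
print(survives_outage(20, 1.0, capacity_s=8, outage_s=5))
```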

Interleaved Audio

Interleaved audio transmission is a technique that is sometimes used to alleviate the effect of network loss by acting as a packet loss resilience mechanism. The idea is to send alternate audio samples in different packets, as opposed to sending consecutive samples in the same packet. The difference between the two approaches is shown in Figure 3. In 3a we see 24 consecutive samples being transmitted in one packet. If this packet is lost, there will be a gap in the audio equal to the duration of the samples. In 3b we see the interleaved approach for the same audio samples as in 3a, where alternate samples are sent in separate packets. This way, if one of the packets is lost, we only lose every other sample, and the receiver hears somewhat distorted audio as opposed to a complete gap for that duration. In the case of stereo audio, we can adapt this technique to send the left channel and the right channel in separate packets, so that if the packet for one of the channels is lost, the receiver still hears the other channel.
