
Motion Estimation - Introduction, Two-Stage Video Compression Model, Matching Criteria, Algorithms, Sub-pixel


Jauvane C. de Oliveira
National Laboratory for Scientific Computation, Petropolis, RJ, Brazil

Definition: Motion estimation exploits the temporal redundancy inherent in video sequences, and it represents a basis for lossy video compression.

Beyond video compression, motion estimation can also serve as the basis for powerful video analysis and video processing.


A standard movie, also known as a motion picture, can be defined as a sequence of several scenes, where a scene is a sequence of several seconds of motion recorded without interruption, usually lasting at least three seconds. A movie in the cinema is shown as a sequence of still pictures at a rate of 24 frames per second. Similarly, a TV broadcast consists of a transmission of 30 frames per second (NTSC, and some flavors of PAL such as PAL-M), 25 frames per second (PAL, SECAM), or anything from 5 to 30 frames per second for typical videos on the Internet. The name motion picture comes from the fact that a video, once encoded, is nothing but a sequence of still pictures shown at a reasonably high frequency, which gives the viewer the illusion of watching continuous motion. Each frame is shown for a small fraction of a second, more precisely 1/k seconds, where k is the number of frames per second. Coming back to the definition of a scene, where the frames are captured without interruption, one can expect consecutive frames to be quite similar to one another, as very little time passes before the next frame is captured. With all this in mind we can conclude that each scene is composed of at least 3 × k frames (since a scene is at least three seconds long). In the NTSC case, for example, that means a movie is composed of a sequence of various segments (scenes), each of which has at least 90 frames similar to one another.

Figure 1 shows a sequence of five consecutive frames of the standard “mother and daughter” reference video in a 10-fps video stream. The changes from frame 1 to frame 2 are almost negligible; only the changes in frames 3 through 5 are in fact noticeable, due to the hand movement. Notice, however, that while the mother’s hand moves, the rest of each frame is quite similar to the previous frame. Such similarity between neighboring frames is known as “temporal redundancy”, while the similarity of pixels within a given frame (for instance those that render the wall in the frames in Figure 1) is known as “spatial redundancy”. Motion estimation is a technique often used to exploit the temporal redundancy described above.

Before going further with details on motion estimation we need to describe briefly how a video sequence is organized. As mentioned earlier, a video is composed of a number of pictures. Each picture is composed of a number of pixels, or pels (picture elements). A video frame has its pixels grouped in 8×8 blocks. The blocks are then grouped into macroblocks (MBs), each composed of 4 luminance blocks (plus the equivalent chrominance blocks). Macroblocks are organized into “groups of blocks” (GOBs), which are grouped into pictures (or into layers and then pictures). Pictures are further grouped into scenes, as described above, and scenes can be considered grouped into movies. Motion estimation is most often performed in the macroblock domain. For simplicity’s sake we will refer to macroblocks as blocks, but we should remember that the macroblock domain is usually the one used for motion estimation.

The idea of motion estimation is that a block b of the current frame C is sought in a previous (or future) reference frame R. If a block of pixels similar enough to b is found in R, then instead of transmitting the whole block only a “motion vector” is transmitted. What “similar enough” means is detailed in the Matching Criteria section below.

Figure 2 shows a pair of neighboring frames (A and B), B being the frame currently encoded while A is a previous reference frame. The block containing the top-left portion of the star, when sought in frame A, is found shifted by the amount denoted by the arrow representing the motion vector (C). All of this is summarized in (D). So, instead of encoding the pixels of the current block of frame B, one can simply send the motion vector, which in this example would suffice.

Ideally, a given macroblock would be sought in the whole reference frame; however, due to the computational complexity of the motion estimation stage, the search is usually limited to a pre-defined region around the macroblock’s position. Most often such a region extends 15 or 7 pixels in all four directions in the reference frame. The search region is often denoted by the interval [-p, p], meaning that it includes p pixels in each direction. Figure 3 shows this concept.
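As an illustration, the candidate displacements inside a [-p, p] search region can be enumerated as follows; this is a Python sketch (all names are illustrative, not from any standard), which also clips away candidates that would push the block outside the frame:

```python
def search_offsets(x, y, block, p, width, height):
    """Enumerate candidate displacements (i, j) in [-p, p] x [-p, p] for a
    square block of side `block` anchored at (x, y), keeping the displaced
    block entirely inside a width x height frame."""
    offsets = []
    for j in range(-p, p + 1):        # vertical displacement
        for i in range(-p, p + 1):    # horizontal displacement
            if 0 <= x + i and x + i + block <= width and \
               0 <= y + j and y + j + block <= height:
                offsets.append((i, j))
    return offsets
```

Away from the frame borders this yields the full (2p + 1)² candidate positions; near a border the list shrinks accordingly.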

Two-Stage Video Compression Model

The video compression model is a two-stage procedure. The first stage takes advantage of the temporal redundancy; it is followed by a procedure similar to that used for lossy image compression, which aims at exploiting the spatial redundancy. Figure 4 shows the two stages in halves A and B, using the last two consecutive frames from Figure 1. In the temporal redundancy exploitation stage (A) we have motion estimation of the current frame (C) using the reference frame (R). This first stage produces both a set of motion vectors (i, j) and difference macroblocks (C-R). The difference macroblocks then go through the second stage, which exploits spatial redundancy. One may notice that the difference frame usually has very high spatial redundancy, since it only stores the differences of motion-estimated macroblocks, as well as macroblocks for which a good match was not found in the reference frame(s). The matching criteria are detailed in the next section.

We should notice that in Figure 4 the reference to C-R is a didactic shorthand for what is in fact each macroblock of C minus its matching block in R. It is also worth mentioning that the image shown as C-R has been negated, so that pixels whose difference is zero show up white rather than black, which would be harder to visualize.

Matching Criteria

Let x, y denote the position of the current macroblock. The pixels of the current macroblock can then be denoted by C(x+k, y+l), while the pixels of a candidate block in the reference frame, displaced by (i, j), can be denoted by R(x+i+k, y+j+l). We can now define a cost function based on the Mean Absolute Error (MAE), also called Mean Absolute Difference (MAD), as:

MAE(i, j) = (1 / MN) Σ_{k=0..M-1} Σ_{l=0..N-1} | C(x+k, y+l) - R(x+i+k, y+j+l) |

where M, N denote the size of the macroblock, -p ≤ i ≤ p and -p ≤ j ≤ p, with [-p, p] being the search region described above. The matching block is the candidate for which the MAE is minimized, and the minimizing (i, j) defines the motion vector.
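The cost function above can be sketched directly in Python; frames are assumed to be row-major 2-D lists of luminance values, and the function name is illustrative:

```python
def mae(C, R, x, y, i, j, M, N):
    """Mean Absolute Error between the current M x N block at (x, y) in C
    and the block in R displaced by the candidate vector (i, j)."""
    total = 0
    for k in range(M):          # horizontal offset within the block
        for l in range(N):      # vertical offset within the block
            total += abs(C[y + l][x + k] - R[y + j + l][x + i + k])
    return total / (M * N)
```

The motion estimator evaluates this cost for every candidate (i, j) in the search region and keeps the displacement with the smallest value.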

It is important to notice that the MAE is not the only option for a matching criterion; one can use the Mean Square Error and other expressions as well. MAE is, however, often selected due to its computational simplicity. It is also worth mentioning that when the minimum MAE is higher than a given threshold, the block is said to be “not found”. That means that motion estimation failed and that the block is to be encoded without exploiting temporal redundancy with regard to the reference frame R. We can observe this phenomenon when there is a scene change, when a new object is inserted into the scene between the reference frame and the current frame, or when motion goes beyond the search area [-p, p].

Motion estimation does not have to be performed against a single reference frame. In some cases bi-directional motion estimation is performed: other than looking for a macroblock in a previous reference frame R, the block is also sought in a reference frame F in the future. That is very useful when a new object is inserted into the scene, as it will be found in the future reference frame F even though it is not present in the previous reference frame R. More generally, multiple frames in the past and future can be used by a motion estimator, but most often we use either a single frame in the past, or one in the past and one in the future, as described herein.


Motion estimation is the most computationally intensive procedure in standard video compression. Seeking a match for a macroblock in a [-p, p] search region leads to (2p + 1)² search locations, each of which requires M × N pixels to be compared, where M, N give the size of the source macroblock. With F the number of reference frames considered in the matching process, the procedure reaches a total of (2p + 1)² × MN × F evaluations of the Mean Absolute Error term:

| C(x + k, y + l) – R(x + i + k, y + j + l) |,

which involves 3 operations per pixel (a subtraction, an absolute value, and an accumulation). Considering a standard 16×16 macroblock with p = 15 and F = 1 in a VGA-sized video stream (640×480 pixels), we have 40×30 = 1200 macroblocks per frame. That leads to a grand total of 961 × 256 × 3 × 1200 ≈ 885.66M operations for a single frame. In a 30-fps video stream that entails about 26.57G operations per second for motion estimation alone; if bi-directional motion estimation is in place we reach roughly 53.14G operations per second. (Remember that we still have to perform the spatial redundancy exploitation stage, which includes a number of steps, as shown in Figure 4.) The scheme described herein is referred to as exhaustive search. If a video need not be processed in real time, exhaustive search may be the best option, as it ensures that we find the best match for every macroblock. In some cases, however, we cannot afford an exhaustive search, and algorithms have been developed to find a sub-optimal match in much less time.
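A minimal exhaustive search can be sketched as follows (Python; frames as row-major 2-D lists, an N×N block anchored at (x, y), candidates that leave the frame skipped). It minimizes the sum of absolute differences, which selects the same match as the MAE since the two differ only by the constant factor 1/MN:

```python
def full_search(C, R, x, y, N, p):
    """Exhaustively search [-p, p] x [-p, p] in reference frame R for the
    best match of the N x N block of C anchored at (x, y); return the
    motion vector (i, j) with the smallest sum of absolute differences."""
    h, w = len(R), len(R[0])
    best = None
    for j in range(-p, p + 1):
        for i in range(-p, p + 1):
            if not (0 <= x + i and x + i + N <= w and
                    0 <= y + j and y + j + N <= h):
                continue  # displaced block would leave the frame
            sad = 0
            for l in range(N):
                for k in range(N):
                    sad += abs(C[y + l][x + k] - R[y + j + l][x + i + k])
            if best is None or sad < best[0]:
                best = (sad, i, j)
    return best[1], best[2]
```

For the 16×16 macroblock, p = 15 setting discussed above, this evaluates up to 961 candidate positions per macroblock, which is exactly where the operation counts come from.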

One example of an algorithm that provides a sub-optimal result in far fewer operations is the two-dimensional logarithmic search, which resembles binary search. The basic idea is to halve the search region at each step, starting with the [-p/2, p/2] area. We select nine positions: (0, 0), plus the four corners and four mid-points of the limit area, as shown in Figure 5 (left, marked as 1). The MAE is then calculated for each of the nine positions, and the one with the smallest MAE is selected as the center for the next pass. We then select a [-p/4, p/4] region around the selected position (marked as 2 in Figure 5) and repeat the process until we search a one-pixel region around the previously selected pixel (marked as 4 in Figure 5). The position with the smallest MAE in that last pass is the end point of the motion vector.
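A sketch of the two-dimensional logarithmic search, under the same conventions as above (Python, row-major frames, N×N block at (x, y)); the step starts near p/2 and is halved after each pass until a one-pixel neighborhood has been searched:

```python
def log_search(C, R, x, y, N, p):
    """Two-dimensional logarithmic search: at each pass, evaluate the center
    and eight surrounding positions at the current step size, recenter on
    the cheapest one, then halve the step; stop after the step-1 pass."""
    h, w = len(R), len(R[0])

    def sad(i, j):  # sum of absolute differences for candidate (i, j)
        return sum(abs(C[y + l][x + k] - R[y + j + l][x + i + k])
                   for l in range(N) for k in range(N))

    ci, cj = 0, 0
    step = max(p // 2, 1)
    while True:
        best = (sad(ci, cj), ci, cj)
        for dj in (-step, 0, step):
            for di in (-step, 0, step):
                i, j = ci + di, cj + dj
                if (i, j) == (ci, cj):
                    continue
                if 0 <= x + i and x + i + N <= w and \
                   0 <= y + j and y + j + N <= h:
                    cost = sad(i, j)
                    if cost < best[0]:
                        best = (cost, i, j)
        ci, cj = best[1], best[2]
        if step == 1:
            return ci, cj
        step //= 2
```

Unlike exhaustive search, this visits only a handful of candidates per pass, so it can settle on a local (sub-optimal) minimum of the cost surface.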

Sub-pixel Motion Estimation

Some video encoders allow motion estimation to go one step further than the whole-pixel domain, allowing sub-pixel motion estimation. Half-pixel motion estimation, for instance, is found in video codecs such as H.263. Sub-pixel motion estimation consists of creating imaginary pixels between the existing ones, as some sort of average of the neighboring pixels. This extra level of pixels allows a motion vector that stops in between real pixels, giving better matches when an object did not move by a whole number of pixels. Figure 5 (right) shows the four extra positions between pixels, which are interpolated and considered in an extra last step of the motion estimation process.
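For illustration, half-pixel samples can be created by bilinear averaging, the scheme used in codecs such as H.263 (exact rounding behavior varies between codecs; this sketch rounds halves up, and the function name is illustrative). Coordinates (r2, c2) index a half-pel grid twice as dense as the pixel grid:

```python
def half_pel(frame, r2, c2):
    """Sample the half-pel grid of `frame` (a 2-D list of integer pixels).
    Even coordinates land on original pixels; odd coordinates fall between
    pixels and are bilinearly interpolated from 2 or 4 integer neighbors."""
    r, c = r2 // 2, c2 // 2
    if r2 % 2 == 0 and c2 % 2 == 0:            # integer position
        return frame[r][c]
    if r2 % 2 == 0:                            # horizontal half position
        return (frame[r][c] + frame[r][c + 1] + 1) // 2
    if c2 % 2 == 0:                            # vertical half position
        return (frame[r][c] + frame[r + 1][c] + 1) // 2
    return (frame[r][c] + frame[r][c + 1] +    # diagonal half position
            frame[r + 1][c] + frame[r + 1][c + 1] + 2) // 4
```

A half-pixel motion estimator refines the best whole-pixel vector by also evaluating the eight interpolated positions around it.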

Detection of Camera Motion, Motion Detection and Object Tracking

Motion estimation and the resulting motion vectors can be used for a number of scene analyses, through which it is possible to detect what happens in a video sequence. For instance, consider a video scene where the camera is panning right, such as that shown in Figure 6. If we consider a pair of consecutive frames, we can verify that as the camera pans right the objects in the scene move leftwards until they eventually disappear at the left edge of the frame. Considering the macroblocks that compose the frame, we can then deduce that most of the motion vectors should be pointing leftwards, as that is where the objects are moving in the video sequence. If there is no motion in the scene but the camera is moving, the motion vectors will be basically aligned, with intensity and direction similar to those of neighboring vectors. For a scene where both the objects and the camera move, we would still have a large number of motion vectors pointing in the direction opposite to the camera movement, at least in the macroblocks that compose the background. We can, hence, analyze the orientation of the motion vectors of a given frame (or a given set of consecutive frames) to identify camera movements.
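A toy version of this analysis: average the x-components of a frame's motion vectors and classify the pan direction from the sign of the mean (the function name, threshold, and labels are assumptions made for illustration):

```python
def detect_pan(vectors, threshold=2.0):
    """Classify a horizontal camera pan from a list of (i, j) motion
    vectors: content drifting left suggests a rightward pan, and vice
    versa; small mean displacement is treated as a static camera."""
    if not vectors:
        return "static"
    mean_x = sum(i for i, _ in vectors) / len(vectors)
    if mean_x <= -threshold:
        return "pan right"   # objects move left across the frame
    if mean_x >= threshold:
        return "pan left"    # objects move right across the frame
    return "static"
```

A fuller classifier along the lines of Figure 7 would also examine the vertical component (tilt) and whether the vectors point toward or away from the frame center (zoom).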

Similarly we can verify other camera movements, as depicted in Figure 7, such as camera panning left (A), panning right (B), tilting down (C), tilting up (D), zooming out (E) or zooming in (F).

A scene change, i.e. an indoor scene changing to an outdoor scenario for instance, can be detected through the failure to estimate motion from one frame to the next, as typically no good matches will be found.

When we put all the camera-movement and scene-change detection together, we can build a comprehensive set of procedures for the classification and analysis of video sequences based purely on motion estimation and its resulting motion vectors.

Going one step further, if we have a scene that does not change much (for instance, a surveillance camera pointed at a safe inside a bank), motion estimation should overall find excellent matches right at the position where each macroblock was located in the previous frames. If someone approaches the safe, there will be a group of macroblocks with motion vectors tracking that individual's movement, which could be used to trigger a security alarm. To prevent the alarm from being triggered when a small animal (a rat, perhaps) moves around the room, one can model the system to only trigger the alarm if the number of macroblocks with detected movement is larger than a given threshold typical of a human being.
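The size-threshold rule described above can be sketched as a simple count of moving macroblocks (the function name and threshold value are illustrative assumptions, not taken from any standard):

```python
def motion_alarm(vectors, min_moving_blocks=20):
    """Raise the alarm only when the number of macroblocks with a non-zero
    motion vector reaches a size threshold, so that small moving objects
    (a rat rather than a person) are ignored."""
    moving = sum(1 for i, j in vectors if (i, j) != (0, 0))
    return moving >= min_moving_blocks
```

The threshold would in practice be calibrated to the camera geometry, so that it corresponds to the on-screen size of a human being.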

Going yet another step further, in the surveillance application described above, the macroblocks that compose the movement of an object in the room delimit the shape of that object. Analyzing this shape may allow the object to be identified.


Even though more commonly linked to lossy video compression, motion estimation is in fact a technique that goes beyond it and enables video processing and computer vision algorithms and applications. It allows a computer to detect movement as well as to perform comprehensive video sequence analysis, identifying scenes, camera movements and object movements. A video compression standard such as MPEG-4, which allows objects to be coded separately, requires such objects to be found in the first place; motion estimation is one technique that allows for a simple, yet effective, object identification scheme. Since motion estimation is one of the most computationally intensive tasks linked to video compression, it is fairly common to find dedicated hardware that performs at least part of the motion estimation.

