Video Content Analysis Using Machine Learning Tools
Yihong Gong
NEC Laboratories America, Cupertino, USA
Definition: Latest breakthroughs in machine learning methodologies have made it feasible to accurately detect objects and to model complex events with interrelated objects.
Introduction
The explosive growth in digital videos has sparked an urgent need for new technologies able to access and retrieve desired videos from large video archives with both efficiency and accuracy. Content-based video retrieval (CBVR) techniques developed in the past decade strive to accomplish this goal by using low level image features, such as colors, textures, shapes, and motions. However, as there is a huge semantic gap between these data representations and real video contents, CBVR techniques generally suffer from poor video retrieval performance.
The key to strengthening video retrieval capabilities lies in higher level understanding and representation of video contents. Video content understanding based on machine learning techniques is one of the promising research directions to accomplish this goal. Machine learning techniques are superior in discovering implicit, complex knowledge from low level data sets. Latest breakthroughs in machine learning methodologies have made it feasible to accurately detect objects and to model complex events with interrelated objects. In recent years, machine learning applications have made impressive achievements in classifying video clips into predefined scene categories (e.g., indoor, outdoor, city view, landscape) and in detecting events of interest from sports videos. Compared to other approaches in the literature, machine learning-based methods often excel in detection accuracy and in modeling complex events.
In this article, we focus on the state of the art of probabilistic data classification methods, an important subfield of machine learning, and their applications. Section 2 first provides an overview of probabilistic data classifiers by presenting two kinds of categorizations. Section 3 elaborates on the Maximum Entropy Model (MEM), which is a representative discriminative data classifier. Section 4 describes its application to baseball game highlight detection. Section 5 reports performance evaluation results of the MEM-based system, and compares it with a system based on the Hidden Markov Model (HMM), which is a representative generative data classifier. Finally, Section 6 summarizes the article.
Overview of Machine Learning Techniques
The tasks of scene classification and of object and event detection can all be translated into a probabilistic data classification problem. Probabilistic data classification is an important sub-field of machine learning, and can be defined as the problem of classifying an example, based on its feature vector, into one of the predefined classes. Let x be the random variable representing feature vectors (observations) of input video clips, and y be the random variable denoting class labels. Data classifiers strive to learn from a set of training examples a probability function $p(y|x)$, which indicates the probability of x belonging to class y. There are many possible ways to categorize a classifier. In the remaining part of this section, we describe two kinds of categorizations which characterize various data classifiers from two different viewpoints.
Models for Simple Data Entities vs. Models for Complex Data Entities
Many data entities have simple, flat structures that do not depend on other data entities. The outcome of each coin toss, the weight of each apple, and the age of each person are examples of such simple data entities. In contrast, there exist complex data entities that consist of sub-entities that are strongly related to one another. For example, a beach scene is usually composed of a blue sky on top, an ocean in the middle, and a sand beach at the bottom. In other words, a beach scene is a complex entity that is composed of three sub-entities with certain spatial relations. Similarly, in TV broadcast baseball game videos, a typical home run event usually consists of four or more shots: it starts from a pitcher’s view, is followed by a panning outfield and audience view in which the video camera tracks the flying ball, and ends with a global or close-up view of the player running to home base. A home run event is thus a complex data entity that is composed of a unique sequence of sub-entities.
Popular data classifiers that aim to model simple data entities include Naive Bayes, Gaussian Mixture Model (GMM), Neural Network, Support Vector Machine (SVM), etc. Neural Network, if the network architecture is appropriately designed for the problem at hand, has great potential to accomplish high data classification accuracies. The network design, however, usually relies heavily on the designer’s experiences and craftsmanship, and the lack of uniform, practically proven design principles has certainly hampered wide applications of Neural Networks. SVM, although relatively new compared to other popular data classification models, is emerging as the new favorite because of its ease of implementation and data generalization capability. In recent years, there have been numerous research studies that report the SVM’s superiority over traditional models, especially in the area of text classification.
For modeling complex data entities, popular classifiers include Bayesian Networks, Hidden Markov Models (HMM), Maximum Entropy Models (MEM), Markov Random Fields (MRF), Conditional Random Field Models (CRF), etc. HMM has been commonly used for speech recognition, and has become a de facto standard for modeling sequential data. MEM and CRF are relatively new methods that are quickly gaining popularity for classifying sequential or interrelated data entities. Compared to HMM, MEM and CRF take the opposite approach to deriving the conditional probability, which is elaborated in the following subsection.
Generative Models vs. Discriminative Models
Probabilistic data classifiers typically map an input example to one of the predefined classes through a conditional probability function $p(y|x)$ derived from a set of training examples. In general, there are two ways of learning $p(y|x)$. Discriminative models strive to learn $p(y|x)$ directly from the training set without attempting to model the observation x. Generative models, on the other hand, compute $p(y|x)$ by first modeling the class-conditional probability $p(x|y)$ of the observation x, and then applying Bayes’ rule as follows:

$$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)} \propto p(x|y)\,p(y)$$

Because $p(x|y)$ can be interpreted as the probability of generating the observation x by class y, classifiers exploring $p(x|y)$ can be viewed as modeling how the observation x is generated, which explains the name “generative model”.
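The Bayes' rule step used by generative classifiers can be illustrated with a minimal Python sketch. The class names and probability values below are hypothetical and chosen only for illustration; the function name `posterior` is likewise an assumption, not something defined in this article.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule: p(y|x) = p(x|y) p(y) / sum over y' of p(x|y') p(y')."""
    joint = {y: likelihoods[y] * priors[y] for y in priors}
    evidence = sum(joint.values())  # p(x), the normalizing term
    return {y: joint[y] / evidence for y in joint}

# Hypothetical numbers for a two-class scene classifier.
priors = {"indoor": 0.5, "outdoor": 0.5}          # p(y)
likelihoods = {"indoor": 0.25, "outdoor": 0.75}   # p(x|y) for one observed x

post = posterior(priors, likelihoods)
best = max(post, key=post.get)  # class with the highest posterior probability
```

A discriminative classifier would skip the `likelihoods` and `priors` tables altogether and estimate the posterior directly from the training data.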
Popular generative models include Naive Bayes, Bayesian Network, GMM, and HMM, while representative discriminative models include Neural Network, SVM, MEM and CRF. Generative models have been traditionally popular for data classification tasks because modeling $p(x|y)$ is often easier than modeling $p(y|x)$, and there exist well-established, easy-to-implement algorithms, such as the EM algorithm and the Baum-Welch algorithm, to efficiently estimate the model through a learning process. The ease of use and the theoretical beauty of generative models, however, do come with a cost. Many complex data entities, such as a beach scene or a home run event, need to be represented by a vector x of many features that depend on each other. To make the model estimation process tractable, generative models commonly assume conditional independence among all the features comprising the feature vector x. Because this assumption is made for the sake of mathematical convenience rather than as a reflection of reality, generative models often have limited accuracy in classifying complex data sets.
Discriminative models, on the other hand, typically make very few assumptions about the data and features, and in a sense, let the data speak for themselves. Recent research studies have shown that discriminative models outperform generative models in many applications such as natural language processing, webpage classifications, baseball highlight detections, etc.
Maximum Entropy Model
MEM is a representative discriminative model that derives the conditional probability directly from the training data. The principle of MEM is simple: model all that is known and assume nothing about what is unknown. In other words, given a collection of facts, MEM chooses a model which is consistent with all the facts, but otherwise is as uniform as possible.
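To express each feature that describes the input data, MEM makes use of a feature indicator function $f_{ij}(x, y)$ that is defined as follows:

$$f_{ij}(x, y) = \begin{cases} 1 & \text{if feature } i \text{ is present in } x \text{ and } y = j \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

The expected value of $f_{ij}$ with respect to the empirical distribution $\tilde{p}(x, y)$ is defined as:

$$\tilde{E}[f_{ij}] = \sum_{x,y} \tilde{p}(x, y)\, f_{ij}(x, y)$$

where $\tilde{p}(x, y)$ can be computed from the training data by counting the number of times that $(x, y)$ occurs together in the training data. On the other hand, the expected value of $f_{ij}$ with respect to the model $p(y|x)$ is

$$E[f_{ij}] = \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_{ij}(x, y)$$

where $\tilde{p}(x)$ is the empirical distribution of x in the training data. With the above notations, the MEM principle can be mathematically expressed as follows:

Maximize the entropy

$$H(p) = -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x)$$

satisfying

$$E[f_{ij}] = \tilde{E}[f_{ij}] \;\; \text{for all } i, j, \qquad \sum_{y} p(y|x) = 1 \;\; \text{for all } x$$

This is a typical constrained optimization problem that can be solved by the method of Lagrange multipliers. The Lagrange function for the above problem is defined as:

$$J(p, \lambda) = H(p) + \sum_{i,j} \lambda_{ij} \left( E[f_{ij}] - \tilde{E}[f_{ij}] \right) \tag{4}$$

where $\lambda_{ij}$ are the Lagrange multipliers. Fixing $\lambda$, the conditional probability that maximizes Eq. (4) is obtained by differentiating J with respect to p and setting it to zero:

$$p_{\lambda}(y|x) = \frac{1}{Z_{\lambda}(x)} \exp\left( \sum_{i,j} \lambda_{ij}\, f_{ij}(x, y) \right) \tag{5}$$

where $Z_{\lambda}(x) = \sum_{y} \exp\left( \sum_{i,j} \lambda_{ij}\, f_{ij}(x, y) \right)$ is a normalizing constant determined by the constraint $\sum_{y} p_{\lambda}(y|x) = 1$. The dual function of the above problem is obtained by replacing p in Eq. (4) using Eq. (5). Written with the sign convention under which it is maximized (it then equals the log-likelihood of the training data under $p_{\lambda}$), it is:

$$\Psi(\lambda) = -\sum_{x} \tilde{p}(x) \log Z_{\lambda}(x) + \sum_{i,j} \lambda_{ij}\, \tilde{E}[f_{ij}]$$

The final solution is defined by the $\lambda^{*}$ that maximizes the dual function $\Psi(\lambda)$. $\lambda^{*}$ can typically be obtained using the Improved Iterative Scaling algorithm.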
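The training procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system described in this article: it uses plain gradient ascent on the training log-likelihood in place of Improved Iterative Scaling (both climb the same concave objective), it treats every feature as a binary "present in x" indicator, and the toy feature names, class labels, step count and learning rate are all hypothetical.

```python
import math

def predict(weights, features, classes):
    """Exponential-form model: p(y|x) proportional to exp(sum of w[i, y] for features i in x)."""
    scores = {y: math.exp(sum(weights.get((i, y), 0.0) for i in features))
              for y in classes}
    z = sum(scores.values())  # normalizing constant Z(x)
    return {y: s / z for y, s in scores.items()}

def train(data, classes, steps=200, rate=0.5):
    """Gradient ascent on the average log-likelihood of (features, label) pairs."""
    weights = {}
    n = len(data)
    for _ in range(steps):
        grad = {}
        for features, label in data:
            probs = predict(weights, features, classes)
            for i in features:
                for y in classes:
                    # Empirical count of f_ij minus its expectation under the model.
                    observed = 1.0 if y == label else 0.0
                    grad[(i, y)] = grad.get((i, y), 0.0) + (observed - probs[y]) / n
        for key, g in grad.items():
            weights[key] = weights.get(key, 0.0) + rate * g
    return weights

# Hypothetical toy data: sets of present features, each with a highlight label.
classes = ["home run", "walk"]
toy_data = [
    ({"cheer", "pan"}, "home run"),
    ({"cheer", "pan"}, "home run"),
    ({"speech"}, "walk"),
    ({"speech", "pan"}, "walk"),
]
weights = train(toy_data, classes)
probs = predict(weights, {"cheer", "pan"}, classes)
```

The gradient for each weight is the difference between the empirical and model expectations of the corresponding indicator function, so it vanishes exactly when the expectation constraints of the maximum entropy problem are met.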
Baseball Highlight Detection Based On Maximum Entropy Model
In this section, we develop a unique MEM-based framework to perform the statistical modeling of baseball highlights. Our goal is to automatically detect and classify all major baseball highlights, which include home run, outfield hit, outfield fly, infield hit, infield out, strike out, and walk. Traditionally, the HMM is the most common approach for modeling complex and context-sensitive data sequences. The HMM usually assumes that the features describing the input data sequence are independent of each other to make the model estimation process tractable. It also needs to first segment and classify the data sequence into a finite set of states, and then observe the state transitions during its data modeling process. In contrast, our MEM-based framework needs neither to make the independence assumption, nor to explicitly classify the data sequence into states. This not only offers the potential to improve the data classification accuracy, but also considerably simplifies the training data creation task and the data classification process.
Baseball videos have well-defined structures and domain rules. Typically, the broadcast of a baseball game is made by a fixed number of cameras at fixed locations around the field, and each camera has a certain assignment for broadcasting the game. This TV broadcasting technique results in a few unique views that constitute most parts of baseball plays. Each category of highlights typically consists of a similar transitional pattern of these unique views. The limited number of unique views and similar patterns of view transitions for each type of highlights have made it feasible to statistically model baseball highlights. The detailed description of the MEM-based framework is provided as follows.
Multimedia Feature Extraction
We use the following 6 types of multimedia features to capture and distinguish the baseball view patterns.
- Color distribution : A baseball field mainly consists of grass and base areas. We represent each scene shot by three keyframes: the first, the middle, and the last frames of the shot. Each keyframe is divided into 3×3 blocks, and its color distribution is composed of nine data pairs $(g_i, S_i)$, where $g_i$ and $S_i$ are the percentages of green and soil colors in block i, respectively. The color distribution of the entire shot is then derived by averaging the color distributions of the three keyframes. With this feature, we can roughly figure out which part of the field the shot is displaying.
- Edge distribution : This feature is useful for distinguishing field views from audience views. The edge distribution of a shot is computed in a manner similar to the color distribution. First, edge detection is conducted for each keyframe to obtain edge pixels. Next, the frame is divided into 3×3 blocks, and the percentages of edge pixels in the nine blocks are computed. Finally, the edge distribution of the entire shot is derived by averaging the distributions of the three keyframes of the shot.
- Camera motion : Camera motion becomes conspicuous and intense in highlight scenes because cameras track either the ball or the players’ motions to capture the entire play. We apply a simplified camera motion model to estimate camera pan, tilt and zoom, which are the most commonly used camera operations in TV broadcasting.
- Player detection : Within the playfield, we discover the areas that have non-green, non-soil colors and higher edge densities. These areas are good candidates for baseball players. Among all the candidates, false candidates and outliers can be further discovered by tracking each candidate within the scene shot because genuine candidates possess stable image features (e.g. size and color) and consistent trajectories while false candidates do not.
- Sound Detection : Certain sounds, such as cheers, applause, music, speech, and mixtures of music and speech, provide important clues for highlight detection. Our special sound detection module consists of two stages. In the training stage, we construct a model for each of the special sounds listed above using annotated training data. The mel-cepstral coefficients are used as the input feature vectors of the sound models, and the Gaussian mixture model (GMM) is used to model the distributions of the input vectors. In the detection stage, we first partition the audio stream into segments, each of which possesses a similar acoustical profile, and then provide each audio segment as the input to all five sound models, each of which outputs the likelihood of the audio segment being a particular sound. These five likelihoods are used as part of the multimedia features in forming the feature vector of a scene shot.
- Closed Caption : Informative words from closed captions often provide the most direct and abstracted clues to the detection/classification process. We extract informative words based on the mutual information metric between a word and a highlight. From the training data, we have identified a list of 72 informative words for the major highlights, which include: field, center, strike out, base, double out, score, home run, etc. In forming the multimedia feature vector of a scene shot, we use 72 binary numbers to indicate the presence/absence of the 72 informative words.
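The block-based color and edge distributions described in the list above share one computation: per-block percentages on a 3×3 grid, averaged over the keyframes of a shot. A minimal Python sketch follows. It assumes each keyframe has already been reduced to a 2-D boolean mask (for example "pixel is green" or "pixel is an edge") with at least three rows and three columns; the function names and the mask representation are illustrative assumptions, not part of the original system.

```python
def block_fractions(mask, rows=3, cols=3):
    """Fraction of True pixels in each block of a rows x cols grid over a 2-D mask."""
    height, width = len(mask), len(mask[0])
    fractions = []
    for r in range(rows):
        for c in range(cols):
            top, bottom = r * height // rows, (r + 1) * height // rows
            left, right = c * width // cols, (c + 1) * width // cols
            block = [mask[y][x] for y in range(top, bottom)
                     for x in range(left, right)]
            fractions.append(sum(block) / len(block))
    return fractions

def shot_distribution(keyframe_masks):
    """Average the per-block fractions over the keyframes of one shot."""
    per_frame = [block_fractions(m) for m in keyframe_masks]
    return [sum(values) / len(values) for values in zip(*per_frame)]

# Hypothetical 6x6 masks: one with the top-left block fully set, one empty.
corner = [[(y < 2 and x < 2) for x in range(6)] for y in range(6)]
empty = [[False] * 6 for _ in range(6)]
distribution = shot_distribution([corner, corner, empty])
```

Running the same function on a green mask and a soil mask gives the nine pairs of the color distribution, and running it on an edge mask gives the edge distribution.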
MEM Model Construction
To cope with the asynchronous nature of different features, we set a time window of $T_w$ seconds. For a shot $S_k$, we include all the image features extracted within the time interval of the shot and all the audio and text features detected within the interval determined by the time window $T_w$ to construct the multimedia feature vector of $S_k$. Then, we combine the feature vectors of n consecutive shots to form an input vector $x_k$ to the MEM engine. For each feature i in $x_k$, we introduce a feature indicator function $f_{ij}$ defined by Eq. (1). During the training process, the MEM iteratively adjusts the weight $\lambda_{ij}$ for each feature function until all the weights converge. Consequently, features i that play a dominant role in identifying the highlight j will be assigned a large weight, while features i that are either unimportant or unrelated to the highlight j will receive a very small or zero weight. If we know for sure that certain features i are independent of the highlight j, we can set the corresponding feature functions $f_{ij}$ to zero to reduce the number of parameters to be estimated. Otherwise, we can simply assume that every feature i is present in every highlight j, and let the learning process automatically determine the appropriate weight $\lambda_{ij}$ for each feature function $f_{ij}$.
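Forming the input vectors from consecutive shots can be sketched as follows. The text only says that the feature vectors of n consecutive shots are combined; this sketch assumes simple concatenation over a sliding window that advances one shot at a time, and the function name `build_inputs` is hypothetical.

```python
def build_inputs(shot_vectors, n):
    """Concatenate the feature vectors of every n consecutive shots."""
    inputs = []
    for k in range(len(shot_vectors) - n + 1):
        combined = []
        for vector in shot_vectors[k:k + n]:
            combined.extend(vector)  # append this shot's features in order
        inputs.append(combined)
    return inputs

# Hypothetical two-dimensional feature vectors for four shots.
shots = [[1, 2], [3, 4], [5, 6], [7, 8]]
inputs = build_inputs(shots, n=3)
```

Because each input vector spans several shots, the classifier sees the view transition pattern directly, without a separate step that assigns each shot to a state.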
Experimental Evaluations
We collected 10 baseball videos totaling 32 hours for training and testing purposes. These games were obtained from five major TV stations in the U.S. and consist of 16 teams playing in 9 stadiums. All the games were manually labeled by three human operators who were not familiar with our baseball highlight detection/classification system. We used seven games as the training data and the remaining three games as the testing data. The labeled highlights in the testing data were used as the ground truth to evaluate the highlight detection/classification accuracy of our system.
On average, the recall and precision for highlight classification are 70.0% and 62.60%, respectively. Table 1 details the performance results for each type of highlights. The precisions for infield hit and infield out are relatively low because these two types of highlights usually have quite similar view transitional patterns which often lead to misclassifications. We missed some home runs due to the fact that there are not enough training samples.
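For reference, recall and precision for one highlight category can be computed as in the following Python sketch. It assumes that the detected and ground-truth labels have already been aligned one-to-one per candidate segment, which simplifies the real evaluation; the label lists are hypothetical and are not the experimental data.

```python
def precision_recall(predicted, actual, label):
    """Precision and recall of one class from aligned label sequences."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == label and a == label)  # correct detections
    fp = sum(1 for p, a in pairs if p == label and a != label)  # false alarms
    fn = sum(1 for p, a in pairs if p != label and a == label)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical aligned labels for four candidate segments.
predicted = ["home run", "walk", "walk", "strike out"]
actual = ["home run", "walk", "strike out", "strike out"]
walk_precision, walk_recall = precision_recall(predicted, actual, "walk")
```

In this toy case one of the two segments labeled as a walk is wrong, which lowers precision for walk while its recall stays perfect.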
For performance comparisons, we implemented another baseball highlight detection system based on the Hidden Markov Model (HMM), a technique that is widely used for event detection. The HMM system uses the same set of image, audio, and text features as described in Section 4.1, and is evaluated using the same training and testing data sets as well. The training data set, however, needs to be re-labeled so that for each highlight sequence, the starting and ending points and the category of each constituent view are labeled as well. This training data labeling task is much more arduous and time consuming.
The HMM system consists of seven unique HMM’s each of which models a particular type of highlights. For each HMM, we define the following items:
- State V: is one of the seven unique views making up most parts of the baseball highlights.
- Observation M : is the multimedia feature vector created for a single scene shot. It is different from the feature vector x k used by the MEM-based system in that x k is created by combining the feature vectors of n consecutive shots.
- Observation probability $p(M|V)$ : is the probability of observing the feature vector M given the state V. We use the Bayes rule to compute $p(M|V)$ from the training data.
- Transition probability $p(V_{t+1}|V_t)$ : is the probability that state $V_t$ transits to state $V_{t+1}$ at the next time instant. Given the class of highlights, the state (view) transition probability can be learned from the training data by the HMM learning algorithms.
- Initial state distribution $\pi$ : can also be learned from the training data.
The above five items uniquely define the HMM. Given the HMM $H_k$, the probability of observing the sequence $M = (M_1, \ldots, M_T)$ can be obtained as:

$$P(M|H_k) = \sum_{V} \pi(V_1)\, p(M_1|V_1) \prod_{t=2}^{T} p(V_t|V_{t-1})\, p(M_t|V_t)$$

where $V = (V_1, \ldots, V_T)$ represents a possible state sequence. When a new video clip $M_x$ (which consists of 3-5 shots depending on the HMM model) arrives, we compute the probability $P(M_x|H_k)$ using each HMM $H_k$. If the largest of these probabilities exceeds the predefined threshold, $M_x$ will be classified into the highlight class $h = \arg\max_k P(M_x|H_k)$.
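The sum over all state sequences is normally evaluated with the forward algorithm rather than by enumeration. The Python sketch below illustrates that computation and the thresholded arg max decision. It assumes discrete observation symbols with tabulated emission probabilities, whereas the system described here computes observation probabilities from multimedia feature vectors; the dictionary layout, state names and numbers are hypothetical.

```python
def sequence_probability(hmm, observations):
    """Forward algorithm: P(observations | HMM), summed over all state sequences."""
    initial = hmm["initial"]
    transition = hmm["transition"]
    emission = hmm["emission"]
    states = list(initial)
    # alpha[v]: probability of the observations so far, ending in state v.
    alpha = {v: initial[v] * emission[v][observations[0]] for v in states}
    for obs in observations[1:]:
        alpha = {v: emission[v][obs] * sum(alpha[u] * transition[u][v] for u in states)
                 for v in states}
    return sum(alpha.values())

def classify(hmms, observations, threshold):
    """Pick the HMM with the highest probability, or None if it is not above threshold."""
    scores = {name: sequence_probability(h, observations) for name, h in hmms.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None

# Two hypothetical HMMs over the views "pitch" and "field", symbols "a" and "b".
strict = {
    "initial": {"pitch": 1.0, "field": 0.0},
    "transition": {"pitch": {"pitch": 0.0, "field": 1.0},
                   "field": {"pitch": 0.0, "field": 1.0}},
    "emission": {"pitch": {"a": 1.0, "b": 0.0}, "field": {"a": 0.0, "b": 1.0}},
}
uniform = {
    "initial": {"pitch": 0.5, "field": 0.5},
    "transition": {"pitch": {"pitch": 0.5, "field": 0.5},
                   "field": {"pitch": 0.5, "field": 0.5}},
    "emission": {"pitch": {"a": 0.5, "b": 0.5}, "field": {"a": 0.5, "b": 0.5}},
}
hmms = {"strict": strict, "uniform": uniform}
decision = classify(hmms, ["a", "b"], threshold=0.5)
```

The forward recursion costs time proportional to the sequence length times the square of the number of states, instead of growing exponentially with the sequence length.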
For ease of comparison, we have placed the highlight classification accuracies of the HMM system side by side with those of the MEM system in Table 1. The MEM produced better performance than the HMM on all highlight categories, and this advantage is most pronounced for the categories of strike out and walk. This difference can be explained by two facts: the HMM uses Naive Bayes to calculate the observation probability $p(M|V)$, which assumes that all the features extracted from each shot are independent of each other, and the HMM is unable to handle the combined features of consecutive shots. These limitations reduce the system’s ability to model the correlations among the multimedia features as well as among consecutive shots. For the highlight categories that have relatively short view transitional patterns, such as strike out and walk, the HMM might not be able to compensate for errors in the observation probabilities because less contextual information is contained within the sequence.
Summary
In this article, we focused on the state of the art of probabilistic data classification methods and their applications. We elaborated on the MEM, which is a representative discriminative data classifier, applied it to baseball highlight detection and classification, and compared it with the HMM, which is a representative generative data classifier. Because the MEM-based framework needs neither to make the feature independence assumption, nor to explicitly classify the data sequence into states, it not only offers the potential to improve the data classification accuracy, but also considerably simplifies the training data creation task and the data classification process. Our experimental evaluations have confirmed the advantages of the MEM over the HMM in terms of highlight detection and classification accuracy. Discriminative data classifiers are becoming the new favorite over generative data classifiers because of their superior data modeling abilities and classification accuracy.