
Multiple Source Alignment for Video Analysis


Nevenka Dimitrova (1) and Robert Turetsky (1, 2)
(1) Philips Research, Briarcliff Manor, NY, USA
(2) Columbia University, New York, NY, USA

Definition: High-level semantic information, which is otherwise very difficult to derive from the audiovisual content, can be extracted automatically using both audiovisual signal processing as well as screenplay processing and analysis.


Multimedia content analysis of video data has so far relied mostly on the information contained in the raw visual, audio and text signals. In this process, the fact that film production starts with the original screenplay is usually ignored. Using the screenplay, however, is like using the recipe book for the movie. We demonstrated that high-level semantic information that is otherwise very difficult to derive from the audiovisual content can be extracted automatically using both audiovisual signal processing and screenplay processing and analysis.

Here we present the use of the screenplay as a source of ground truth for automatic speaker/character identification. Our speaker identification method consists of screenplay parsing, extraction of a time-stamped transcript, alignment of the screenplay with the time-stamped transcript, audio segmentation and audio speaker identification. As the screenplay alignment will not be able to identify all dialogue sections within any film, we use the segments found by alignment as labels to train a statistical model in order to identify unaligned pieces of dialogue. Character names from the screenplay are converted to actor names based on fields extracted from the Internet Movie Database (IMDB). We find that on average the screenplay alignment was able to properly identify the speaker in one third of the lines of dialogue. However, with additional automatic statistical labeling for audio speaker ID on the soundtrack, our recognition rate improves significantly.


Current practice in film production relies on the screenplay as the crucial element of the overall process. The screenplay provides a unified vision of the story, setting, dialogue and action of a film, and gives the filmmakers, actors and crew a starting point for bringing their creative vision to life. For researchers involved in content-based analysis of movies, the screenplay is a currently unexploited resource: a textual description of important semantic objects within a film, coming straight from the filmmakers. Hundreds of copies of a screenplay are produced for any film production of scale. The screenplay can be reproduced for hobbyist or academic use, and thousands of screenplays are available online.

The difficulty in using the screenplay as a shortcut to content-based analysis is threefold: First, the screenplay follows only a semi-regular formatting standard, and thus needs robust parsing to be a reliable source of data. Second, there is no inherent correlation between text in the screenplay and a time period in the film. Third, lines of dialogue or entire scenes in the movie can be added, deleted, modified or shuffled. We address these difficulties by parsing the screenplay and then aligning it with the time-stamped subtitles of the film. Statistical models can then be generated based on properly aligned segments in order to estimate segments that could not be aligned.

Our test-bed for this framework is character/speaker identification. Unsupervised (audio) speaker identification on movie dialogue is a difficult problem, as speech characteristics are affected by changes in the emotion of the speaker, different acoustic conditions, ambient noise and heavy activity in the background. Patel and Sethi have experimented with speaker identification on film data for use in video indexing/classification, but require that training data be hand-labeled and that all dialogues be hand-segmented. Salway et al. describe the association of temporal information in a movie with collateral texts (audio scripts). Wachman et al. have used script information from situation comedy for labeling and learning-by-example in interactive sessions.

The remainder of this article will proceed as follows. First, we introduce the content and structure of the screenplay. Subsequently, we detail extracting information from the screenplay and the alignment process. We present a quantitative analysis of alignments, and preliminary results of automatically trained audio speaker ID.

Screenplays in movie production practice

The screenplay is the most important part of making a movie. It describes the story, characters, action, setting and dialogue of a film. Additionally, some camera directions and shot boundaries may be included, but these are generally ignored. The screenplay generally undergoes a number of revisions, with each rewrite potentially bearing little resemblance to prior drafts (see Minority Report). After the principal shooting of a film is complete, the editors assemble the different shots together in a way that may or may not respect the screenplay.

The actual content of the screenplay generally follows a (semi) regular format. Figure 1 shows a snippet of a screenplay from the film Contact. The first line of any scene or shooting location is called a slug line. The slug line indicates whether a scene is to take place inside or outside (INT or EXT), the name of the location (e.g. ‘TRANSPORT PLANE’), and can potentially specify the time of day (e.g. DAY or NIGHT). Following the slug line is a description of the location. Additionally, the description will introduce any new characters that appear and any action that takes place without dialogue. Important people or objects are made easier to spot within a page by capitalizing their names.

The bulk of the screenplay is the dialogue description. Dialogue is indented in the page for ease of reading and to give actors and filmmakers a place for notes. Dialogues begin with a capitalized character name, optionally followed by (V.O.) or (O.S.) to indicate that the speaker is off-screen (V.O. stands for "voice-over" and O.S. for "off-screen"). Finally, the actual text of the dialogue is full-justified to a narrow band in the center of the page.

The continuity script, a shot-by-shot breakdown of a film, is sometimes written after all work on a film is completed. A method for alignment of the continuity script with closed captions was introduced by Ronfard and Thuong. Although continuity scripts from certain films are published and sold, they are generally not available to the public online. This motivates analysis of the screenplay, despite its imperfections.

System Architecture

One reason why the screenplay has not been used more extensively in content-based analysis is because there is no explicit connection between dialogues, actions and scene descriptions present in a screenplay, and the time in the video signal. This hampers our effectiveness in assigning a particular segment of the film to a piece of text. Another source of film transcription, the closed captions, has the text of the dialogue spoken in the film, but it does not contain the identity of characters speaking each line, nor do closed captions possess the scene descriptions, which are so difficult to extract from a video signal. We get the best of both worlds by aligning the dialogues of screenplay with the text of the film’s time stamped closed captions.

A second difficulty is that lines and scenes are often incomplete, cut or shuffled. In order to be robust in the face of scene re-ordering, we align the screenplay to the closed captions one scene at a time and filter out potential false positives through median filtering.

Finally, it is not possible to correlate every sentence in the screenplay with every piece of dialogue. Thus, it is important to take information extracted from the time-stamped screenplay, combined with multimodal segments of the film (audio/video stream, closed captions, information from external websites), to create statistical models of events not captured by the screenplay alignment.

A system overview of our test-bench application, which includes pre-processing, alignment and speaker identification throughout a single film, is shown in Figure 2. First we parse the text of a film's screenplay, so that scene and dialogue boundaries and metadata are entered into a uniform data structure. Next, the closed caption and audio streams are extracted from the film's DVD. In the most important stage, the screenplay and closed caption texts are aligned. The aligned dialogues are now time-stamped and associated with a particular character. These dialogues are used as labeled training examples for generic machine learning methods (in our case we have tested neural networks and GMMs) which can identify the speaker of dialogues that were not labeled by the alignment process.

In our experiments, we worked towards very high speaker identification accuracy despite the difficult noise conditions. It is important to note that while we perform this identification using supervised learning methods, the ground truth is generated automatically, so there is no need for human intervention in the classification process.

a. Screenplay parsing

From the screenplay, we extract the location, time and description of a scene, the individual lines of dialogue and their speakers, the parenthetical and action directions for the actors, and any transition suggestions (cut, fade, wipe, dissolve, etc.) between scenes. We used the following grammar for parsing most screenplays:

A similar grammar was generated for two other screenplay formats which are popular on-line.
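Since the grammar itself is format-specific, a minimal parser can be sketched from the formatting conventions described above (INT/EXT slug lines, capitalized character cues followed by dialogue). The regular expressions and field names below are our own illustrative simplifications, not the grammar actually used in the system.

```python
import re

# Illustrative patterns for the screenplay conventions described above.
# A slug line: INT or EXT, a location, and optionally a time of day.
SLUG_RE = re.compile(
    r"^(INT|EXT)\.?\s+(?P<location>[^-]+?)(?:\s*-\s*(?P<time>DAY|NIGHT))?\s*$")
# A character cue: a fully capitalized name, optionally followed by
# (V.O.) or (O.S.) to mark an off-screen speaker.
CUE_RE = re.compile(
    r"^(?P<name>[A-Z][A-Z .']+?)(?:\s*\((?P<offscreen>V\.O\.|O\.S\.)\))?\s*$")

def parse_screenplay(lines):
    """Group screenplay lines into scenes with (speaker, dialogue) pairs."""
    scenes, current, speaker = [], None, None
    for raw in lines:
        line = raw.strip()
        if not line:
            speaker = None          # a blank line ends a dialogue block
            continue
        m = SLUG_RE.match(line)
        if m:                       # new scene starts at the slug line
            current = {"location": m.group("location").strip(),
                       "time": m.group("time"), "dialogues": []}
            scenes.append(current)
            speaker = None
        elif current and CUE_RE.match(line) and line.upper() == line:
            speaker = CUE_RE.match(line).group("name").strip()
        elif speaker and current:
            current["dialogues"].append((speaker, line))
    return scenes
```

A real parser must also handle page headers, parentheticals and transitions, but the scene/cue/dialogue skeleton above captures the structure the alignment stage needs.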

b. Extracting time-stamped subtitles and metadata

For the alignment and speaker identification tasks, we require the audio and time stamped closed caption stream from the DVD of a film. In our case, four films, Being John Malkovich, Magnolia, L.A. Confidential and Wall Street, were chosen from a corpus of DVDs. When available, subtitles were extracted from the User Data Field of the DVD. Otherwise, OCR (Optical Character Recognition) was performed on the subtitle stream of the disc.

Finally, character names from the screenplay are converted to actor names based on fields extracted from the Internet Movie Database (IMDB). If no match for a character's name can be found in IMDB's listing of the film's credit sequence (e.g. 'Hank' vs. 'Henry'), we match the screenplay name with the credit sequence name by matching quotes from the "memorable quotes" section with dialogues from the screenplay.
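The name conversion can be sketched as a fuzzy match against credited character names. The helper below is an illustrative assumption (in practice the credits fields would be scraped from IMDB); note that a mismatch like 'Hank' vs. 'Henry' falls below the similarity cutoff, which is exactly the case that requires the quote-matching fallback.

```python
import difflib

def match_character_to_actor(screenplay_name, credits):
    """credits: dict mapping a credited character name to an actor name,
    e.g. as scraped from IMDB. Returns the best fuzzy match, or None."""
    # Screenplays often use first names only ("CRAIG"), so compare
    # against the first token of each credited character name.
    first = {name.split()[0].lower(): name for name in credits}
    hit = difflib.get_close_matches(screenplay_name.lower(),
                                    list(first), n=1, cutoff=0.6)
    return credits[first[hit[0]]] if hit else None
```

For example, with the real Being John Malkovich credits {"Craig Schwartz": "John Cusack", "Maxine Lund": "Catherine Keener"}, the cue "CRAIG" resolves to John Cusack, while an unmatched name returns None and would be handed to the quote-matching step.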

c. Dialogue screenplay alignment

The screenplay dialogues and closed caption text are aligned by using dynamic programming to find the "best path" across a similarity matrix. Alignments that properly correspond to scenes are extracted by applying a median filter across the best path. Dialogue segments of reasonable accuracy are broken down into chunks the size of a closed caption line, which means that we can directly translate dialogue chunks into time-stamped segments. Below, each component is discussed.

The similarity matrix is a way of comparing two different versions of similar media. In our similarity matrix, every word i of a scene in the screenplay is compared to every word j in the closed captions of the entire movie. In other words, we populate a matrix:

SM(i, j) = 1 if screenplay(scene_num, i) == subtitle(j), otherwise 0

That is, SM(i, j) = 1 if word i of the scene is the same as word j of the closed captions, and SM(i, j) = 0 if they are different. Screen time progresses linearly along the diagonal i = j, so when lines of dialogue from the screenplay line up with lines of text from the closed captions, we expect to see a solid diagonal line of 1's. Figure 3 shows an example similarity matrix segment.
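A binary similarity matrix of this kind takes only a few lines to build; the `similarity_matrix` helper and its word normalization below are an illustrative sketch, not the system's exact implementation.

```python
import numpy as np

def similarity_matrix(scene_words, subtitle_words):
    """Binary similarity matrix: SM[i, j] = 1 when word i of the
    screenplay scene equals word j of the closed captions."""
    norm = lambda w: w.lower().strip(".,!?\"'")  # case/punctuation-insensitive
    scene = [norm(w) for w in scene_words]
    subs = [norm(w) for w in subtitle_words]
    sm = np.zeros((len(scene), len(subs)), dtype=np.uint8)
    for i, wi in enumerate(scene):
        for j, wj in enumerate(subs):
            if wi == wj:
                sm[i, j] = 1
    return sm
```

When a screenplay line survives into the film, its words appear as a run of 1's along a diagonal of this matrix, which is what the dynamic programming step searches for.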

In order to find the diagonal line which captures the most likely alignment between the closed captions and the screenplay scene, we perform dynamic programming. Figure 4 visualizes the successful alignment of scene 30 of Magnolia. The three diagonal regions indicate that much of this scene is present in the closed captions, but there are a few dialogues which were written into the screenplay but were not present in the finished film, and vice versa.

The missing dialogues can be removed by searching for diagonal lines across the DP's optimal path. We use an m-point median filter on the slope of the optimal path, so that in addition to lining up with the rest of the scene, a dialogue must consistently match at least (m+1)/2 (e.g. 3) out of m (e.g. 5) words to be considered part of a scene. In the figure, proper dialogues are shown in the shaded regions. The bounds of where the alignment is found to be proper are used to segment aligned scenes from omitted ones. We can then warp the words of the screenplay to match the timing of the closed captions.
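The path search and filtering can be sketched as follows. This is a simplified illustration (scoring only diagonal matches, and median-filtering the match indicator along the path rather than the path slope itself), not the exact dynamic program used in the system.

```python
import numpy as np

def best_path(sm):
    """Dynamic programming over the binary similarity matrix: the path
    may move right, down, or diagonally; diagonal moves over matching
    words score 1, so the best path traces the alignment diagonals."""
    n, m = sm.shape
    score = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i, j] = max(score[i - 1, j - 1] + sm[i - 1, j - 1],
                              score[i - 1, j], score[i, j - 1])
    # Backtrack from the corner, preferring diagonal steps.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        if score[i, j] == score[i - 1, j - 1] + sm[i - 1, j - 1]:
            path.append((i - 1, j - 1)); i, j = i - 1, j - 1
        elif score[i, j] == score[i - 1, j]:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def aligned_mask(sm, path, m=5):
    """Median filter along the path: a step is kept only if at least
    (m+1)//2 of the m words around it are matches."""
    matches = np.array([sm[i, j] for i, j in path])
    half = m // 2
    padded = np.pad(matches, half)
    return np.array([np.median(padded[k:k + m])
                     for k in range(len(matches))]) >= 1
```

Runs where the mask is true correspond to the shaded "proper dialogue" regions; their start and end indices give the time bounds used to warp the screenplay onto the closed caption timeline.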

Performance evaluation

In order to analyze the alignment quantitatively, we define the coverage of the alignment as the percentage of lines of dialogue in the film for which the alignment was able to identify the speaker. The accuracy of the alignment is the percentage of speaker IDs generated by the alignment that actually correspond to dialogue spoken by the tagged speaker. Accuracy is a measure of the purity of the training data, and coverage is a measure of how much data will need to be generated by classification.
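Under these definitions the two metrics reduce to simple ratios; the function and variable names below are our own illustration, assuming alignment output as a mapping from line index to predicted speaker.

```python
def coverage(aligned_labels, total_lines):
    """Fraction of dialogue lines for which alignment produced a speaker label."""
    return len(aligned_labels) / total_lines

def accuracy(aligned_labels, ground_truth):
    """Fraction of alignment-generated labels that match the true speaker."""
    correct = sum(1 for line, spk in aligned_labels.items()
                  if ground_truth.get(line) == spk)
    return correct / len(aligned_labels)
```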

Table 1 presents lines of dialogue that were identified as belonging to a main character in Being John Malkovich. We were able to achieve a high degree of accuracy in labeling the speaker for each segment.

While the alignment process affords a high level of confidence in terms of the accuracy (approximately 90%) of segment speaker label generation, the liquid nature of the screenplay means we are unable to label most of the film. Table 2 presents a measure of how much dialogue the alignment is able to cover in each of our four films. This motivates creating a speaker-identification system based on statistical models generated from the segment labeling as found by alignment.

Automatic Speaker ID using Statistical Models

Our speaker identification system examines the behavior of audio features over time. We performed extensive testing with various combinations of audio features reported to have high discriminability (MFCC, LSP, RASTA-PLP), incorporating mean subtraction and deltas. In our case we have a good deal of training data so we can use simple classifiers. The goal of our classifier is to allow for different clusters in feature-space that correspond to the voice characteristics of an actor under different emotional and acoustic conditions.

While our method is still undergoing large-scale benchmarking, initial results are promising. Table 3 presents frame accuracy on unlabeled data for the main characters of Being John Malkovich. Here we use the first 13 MFCC components at 12.5 msec intervals, stacked across a 0.5 sec time window. We should note that speech segments in movies have a mean length of 0.5 seconds, whereas the best performing speaker ID systems use a signal length of 2-5 seconds. Principal Component Analysis was used to reduce the dimensionality of feature-space, and classification was performed using an 8-component Gaussian Mixture Model. Note that the table demonstrates that identification accuracy is highly speaker dependent.
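As a simplified stand-in for the MFCC/PCA/GMM pipeline, the sketch below fits a single diagonal-covariance Gaussian per speaker on alignment-labeled feature frames and classifies unlabeled frames by maximum log-likelihood. The class name and data shapes are our own assumptions; a real system would use multi-component mixtures per speaker as described above.

```python
import numpy as np

class DiagonalGaussianSpeakerID:
    """One diagonal-covariance Gaussian per speaker, trained on frames
    labeled by the screenplay alignment."""

    def fit(self, features, labels):
        # features: (n_frames, n_dims); labels: per-frame speaker names.
        labels = np.array(labels)
        self.models = {}
        for spk in set(labels):
            x = features[labels == spk]
            # Small floor on the variance avoids division by zero.
            self.models[spk] = (x.mean(axis=0), x.var(axis=0) + 1e-6)
        return self

    def log_likelihood(self, x, mean, var):
        return -0.5 * np.sum(np.log(2 * np.pi * var)
                             + (x - mean) ** 2 / var, axis=-1)

    def predict(self, features):
        speakers = list(self.models)
        ll = np.stack([self.log_likelihood(features, *self.models[s])
                       for s in speakers], axis=-1)
        return [speakers[k] for k in np.argmax(ll, axis=-1)]
```

Frames labeled by the alignment play the role of the automatically generated training set; frames from unaligned dialogue are then scored against each speaker's model.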


High-level semantic information can be used to automatically create models for automatic content analysis. The screenplay contains data about the film that is either not extractable at all by audiovisual analysis or extractable only with very low reliability. These high-level concepts are closer to human understanding of the film and to the potential methods of searching audiovisual content. We used screenplay information for speaker ID, which has limited coverage of about 30%. We then used the same framework for generating labels for a statistical approach to audio speaker ID. There are limitless potential applications of the alignment, such as extraction of semantic scene descriptions, affective descriptions, mood analysis and others.

