
Facial Animation and Affective Human-Computer Interaction - Affective Human-Computer Interaction, Expressing Emotion, Facial Animation in MPEG-4, Conclusions


Kostas Karpouzis and Stefanos Kollias
Image, Video and Multimedia Systems Laboratory, National
Technical University of Athens, Athens, Greece

Definition: Affective Human-Computer Interaction (HCI) systems utilize multimodal information about the emotional state of users.

Affective Human-Computer Interaction

Although everyday human-to-human communication is often thought to rest on vocal and lexical content, people seem to base both expressive and cognitive capabilities on facial expressions and body gestures. Related research in both analysis and synthesis tries to recreate the way the human mind works when it recognizes emotion. Because this process is inherently multimodal, robust results require taking into account features such as speech, facial and hand gestures, and body pose, as well as the interactions between them. In the case of speech, features can come from both linguistic and paralinguistic analysis; facial and body gestures convey messages in a much more expressive and definite manner than wording, which can be misleading or ambiguous, especially when users are not visible to each other. While a lot of effort has been invested in examining each of these aspects of human expression individually, recent research has shown that even single-modality approaches can benefit from multimodal information.

In general, Human-Computer Interaction (HCI) systems that utilize multimodal information about the emotional state of users are presently at the forefront of interest in the computer vision and artificial intelligence communities. Such interfaces give less technology-aware individuals, as well as people with disabilities, the opportunity to use computers more efficiently. In this process, the real-world actions of a human are transferred into the virtual environment through a representative, an Embodied Conversational Agent (ECA), while the virtual environment recognizes these actions and responds via system ECAs. Virtual malls are one example of this enhanced HCI scheme. Business-to-client communication via the web is still poor and mostly based on the exchange of textual information. What most clients actually look for is a human or human-like salesperson who would smile at them and adapt to their personal needs, thus enhancing the human aspects of e-commerce, interactive TV or online advertising applications. ECAs can also be used more extensively in real-time, peer-to-peer multimedia communication, providing means of expression that are missing from text-based communication. They can express emotions using human-like expressions and gestures not only during a web chat or a teleconference but also when broadcasting news, which becomes more attractive when delivered in a human-like way.

Expressing Emotion

Facial and hand gestures, as well as body pose, constitute a powerful means of non-verbal human communication. Analyzing such multimodal information is a complex task involving low-level image processing, pattern recognition, machine learning and psychological studies. Research in facial expression analysis and synthesis has mainly concentrated on primary or universal emotions. In particular, sadness, anger, joy, fear, disgust and surprise are the categories of emotions that usually attract most of the interest in human-computer interaction environments. Very few studies explore non-primary emotions; this trend may be due to the great influence of the works of Ekman and Friesen and of Izard, who proposed that the universal emotions correspond to distinct facial expressions that are recognizable across cultures. More recent psychological studies have investigated a broader variety of intermediate or blended emotions.

Although exploiting the results obtained by psychologists is far from straightforward, computer scientists can draw some hints from them for their own research. The MPEG-4 standard indicates an alternative, measurable way of modeling facial expressions, strongly influenced by neurophysiological and psychological studies. For example, the Facial Animation Parameters (FAPs) used in MPEG-4 for facial animation are strongly related to the Action Units (AUs), which constitute the core of the Facial Action Coding System (FACS).

One psychological study that can be useful to researchers in computer graphics and machine vision is that of Whissel, who suggested that emotions are points in a space with a relatively small number of dimensions; to a first approximation there are only two, activation and evaluation (see Figure 1). In this framework, evaluation seems to express the internal feelings of the subject, while activation is related to the magnitude of facial muscle movement and can be more easily estimated from facial characteristics.
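
As a minimal sketch of what "emotions as points in a two-dimensional space" can look like in practice, the following Python fragment places the six universal emotions on an evaluation/activation plane and looks up the label nearest to an arbitrary point. The coordinates, the [-1, 1] range and the function names are assumptions made for illustration only; they are not values taken from the psychological literature.

```python
import math

# Hypothetical (evaluation, activation) coordinates in [-1, 1].
# Illustrative placements only, not measured values.
EMOTION_POINTS = {
    "joy": (0.8, 0.5),
    "surprise": (0.3, 0.9),
    "anger": (-0.7, 0.8),
    "fear": (-0.8, 0.6),
    "disgust": (-0.7, 0.2),
    "sadness": (-0.7, -0.5),
}


def nearest_emotion(evaluation, activation):
    """Return the labelled emotion closest (Euclidean) to the given point."""
    point = (evaluation, activation)
    return min(
        EMOTION_POINTS,
        key=lambda name: math.dist(EMOTION_POINTS[name], point),
    )


def polar(evaluation, activation):
    """Return (distance from the neutral origin, angle in degrees)."""
    strength = math.hypot(evaluation, activation)
    angle = math.degrees(math.atan2(activation, evaluation)) % 360
    return strength, angle


if __name__ == "__main__":
    print(nearest_emotion(0.9, 0.4))
    print(polar(0.0, 1.0))
```

The polar form is one convenient way to read such a plane: the distance from the origin can be taken as the strength of the emotion and the angle as its type, although that reading is itself an assumption of this sketch.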

Facial Animation in MPEG-4

From early in life, people are instinctively able to recognize and interpret human faces and to distinguish subtle expressive differences. It is therefore imperative that synthetic face images be as expressive and as faithful to the original as possible in order to achieve the desired effect. Once a generic 3-D head model is available, new views can be generated by adapting it to a real head and then transforming it. Regarding animation, two different approaches have been proposed: the clip-and-paste method, where prominent facial feature templates are extracted from actual frames and mapped onto the 3-D shape model, and algorithms based on the deformation of 3-D surfaces.

In order to meet the demands of structured multimedia information, the MPEG-4 standard introduces the fundamental concept of media objects, such as audio, visual, 2D/3D, natural and synthetic objects, which, along with content-based access and behavioral modeling, make up a multimedia scene. The main aspects of the standard deal with hybrid coding and compression of media objects, universal content accessibility over various networks and provisions for end-user interactivity. The spatial and temporal relationships between scene objects are defined in a dedicated representation called BIFS (Binary Format for Scenes), which inherits the scene graph concept, as well as animation procedures, from VRML. In terms of functionalities related to ECAs, both standards define a specific set of nodes in the scene graph to allow for a representation of an instance. However, only the MPEG-4 specifications deal with streamed ECA animations.

In addition, MPEG-4 makes special provisions for the semantic representation of information pertinent to facial animation. The face object specified by MPEG-4 is structured so that visemes, i.e. the visual manifestations of speech, are intelligible, facial expressions allow the recognition of the speaker's mood, and the reproduction of a real speaker is as faithful and portable as possible. To fulfill these objectives, MPEG-4 specifies three types of facial data: Facial Definition Parameters (FDPs), used to adapt a generic 3D facial model available at the receiver side; Facial Animation Parameters (FAPs), which allow for animating the 3D facial model; and the FAP Interpolation Table (FIT), which allows the definition of rules for interpolating FAPs at the decoder. The 3D model is then animated using the transmitted FAPs and the FAPs interpolated according to the FIT.

FAPs are closely related to face muscle actions and were designed to allow the animation of faces, reproducing movements, expressions and speech-related deformation. The chosen set of FAPs represents a complete set of basic facial movements, allowing terminals to represent most natural facial actions as well as, by exaggeration, some non-human-like actions that are useful, for example, in cartoon-like animations. The complete FAP set consists of 68 FAPs: two high-level parameters (FAPs 1 and 2), associated with visemes and expressions, respectively, and 66 low-level parameters associated with the lips, jaw, eyes, mouth, cheek, nose, etc. While low-level FAPs are associated with movements and rotations of key facial features and the relevant areas (see Figure 2), expressions and visemes represent more complex actions, typically associated with a set of FAPs. Although the encoder knows the reference feature point for each low-level FAP, it does not know precisely how the decoder will move the model vertices around that feature point, i.e. the FAP interpretation model. The expression FAP enables the animation of a universal expression (joy, sadness, anger, fear, disgust and surprise), while the viseme FAP provides the visual analog to phonemes and allows visemes to be rendered efficiently for better speech pronunciation, as an alternative to representing them with a set of low-level FAPs.
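
The split between high-level and low-level parameters can be sketched as a small data structure. The Python fragment below is only an illustration of the numbers given above (68 FAPs in total, of which FAPs 1 and 2 are high-level); the class names, field names and the integer codes assigned to the six expressions are hypothetical and are not the bitstream syntax of the standard.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, Optional

NUM_FAPS = 68              # total number of FAPs, as stated in the text
HIGH_LEVEL_FAPS = (1, 2)   # the two high-level parameters
LOW_LEVEL_FAPS = range(3, NUM_FAPS + 1)  # the remaining 66 parameters


class Expression(IntEnum):
    """The six universal expressions; the numeric codes are assumed."""
    JOY = 1
    SADNESS = 2
    ANGER = 3
    FEAR = 4
    DISGUST = 5
    SURPRISE = 6


@dataclass
class FapFrame:
    """One animation frame: an optional expression plus low-level FAPs."""
    expression: Optional[Expression] = None
    expression_intensity: int = 0
    low_level: Dict[int, int] = field(default_factory=dict)

    def set_low_level(self, fap_number: int, value: int) -> None:
        """Store a low-level FAP value, rejecting high-level FAP numbers."""
        if fap_number not in LOW_LEVEL_FAPS:
            raise ValueError(f"FAP {fap_number} is not a low-level FAP")
        self.low_level[fap_number] = value


if __name__ == "__main__":
    frame = FapFrame(expression=Expression.JOY, expression_intensity=40)
    frame.set_low_level(3, 120)  # hypothetical displacement value
    print(frame)
```

Keeping the expression as a single high-level field mirrors the point made above: an encoder can either send one compact expression parameter or spell the same movement out through many low-level FAPs.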

Figure 3 shows some examples of animated profiles. Figure 3(a) shows a particular profile for the archetypal expression anger, while Figures 3(b) and (c) show alternative profiles of the same expression. The difference between them is due to FAP intensities. A difference in FAP intensities is also shown in Figures 3(d) and (e), both of which illustrate the same profile of the expression surprise. Finally, Figure 3(f) shows an example of a profile of the expression joy. In the case of intermediate or blended expressions, one may use the above-mentioned representations, along with their position in the activation-evaluation space, to come up with relevant profiles. Details and results of this approach are shown in.
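
To make the idea of profiles that differ only in FAP intensity more concrete, the sketch below treats a profile as a mapping from FAP number to value, scales it by a common factor, and linearly mixes two profiles. Linear mixing is an assumption chosen for simplicity and is not necessarily the method used to produce the profiles in Figure 3; the FAP numbers and values are likewise hypothetical.

```python
def scale_profile(profile, intensity):
    """Scale every FAP value in a profile by a common intensity factor."""
    return {fap: value * intensity for fap, value in profile.items()}


def blend_profiles(profile_a, profile_b, weight_b):
    """Linearly mix two profiles; a FAP missing from one counts as 0."""
    if not 0.0 <= weight_b <= 1.0:
        raise ValueError("weight_b must lie in [0, 1]")
    faps = sorted(set(profile_a) | set(profile_b))
    return {
        fap: (1.0 - weight_b) * profile_a.get(fap, 0.0)
        + weight_b * profile_b.get(fap, 0.0)
        for fap in faps
    }


if __name__ == "__main__":
    # Hypothetical profiles: FAP number -> value.
    profile_one = {3: 100.0, 31: 60.0}
    profile_two = {3: 0.0, 33: 40.0}
    print(scale_profile(profile_one, 0.5))
    print(blend_profiles(profile_one, profile_two, 0.5))
```

In such a scheme the mixing weight could be derived from where the target emotion sits between two archetypal emotions in the activation-evaluation space, but how that weight is chosen is left open here.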


Conclusions

A straightforward and portable facial animation representation is a crucial factor when catering for affect-aware systems. In this framework, the MPEG-4 standard proposes a set of tools for designing and animating human face models, which enables accurate renderings on a variety of devices and network conditions; in order for these tools to be effective, they must be coupled with a relevant emotion representation in both the analysis and synthesis elements, so as to avoid cartoon-like, unrealistic results. Synthetic facial animation is expected to enhance the usability of a wide variety of applications, especially in the light of powerful dedicated hardware and the presence of structured information standards.
