Fig 2 - uploaded by Sezer Ulukaya
AAM-based tracking of landmark coordinates for subject 52 in the CK+ database (©Jeffrey Cohn). From left to right: onset (1st frame), half-apex (5th frame) and apex.

Source publication
Article
Full-text available
When the goal is to recognize the facial expression of a person given an expressive image, there are mainly two types of information encoded in the image that we have to deal with: identity-related information and expression-related information. Alleviating the identity-related information, for example by using an image of the same person with a ne...

Contexts in source publication

Context 1
... $m$ represents the number of parameters in the statistical model and $L_m(K)$ denotes the maximized value of the log-likelihood function. Several $K$ values are tested, and the value that minimizes expression (8) is selected. Once a GMM is fit to the set of neutral face shapes, the neutral face shape dictionary is formed from the mean vectors $\mu_k$, $k = 1, \dots, K$, of the $K$ Gaussian mixture components. The covariance matrices $\Sigma_k$ represent the variation of the face shapes around the mean shapes. In summary, the steps of the dictionary estimation, whose flowchart is shown in Fig. 6, are as follows.
1: Input: all neutral frames in the database, and the CBF features that represent the shape of each face.
2: Perform rigid alignment on all neutral face shapes to minimize in-plane rotation, scale and translation differences.
3: Fit a Gaussian mixture model (GMM) to the aligned shapes, where the number of mixture densities $K$ is determined automatically using Akaike's information criterion.
4: Output: the mean vectors ($\mu_k$) and the covariance matrices ($\Sigma_k$) of the $K$ mixture densities form the dictionary of neutral face shapes.
In this section we first explain how the best-fitting neutral face shape is selected from the pre-trained dictionary for a given expressive face. Then, the estimated neutral face shape is used to extract the motion vectors of facial landmarks, which serve as the geometric features for facial expression recognition. Next, we propose an efficient method to fuse the geometric features with appearance-based features, the details of which are presented below. Given an expressive face image, such as the last frame of a video clip shown in Fig. 2, we want to decompose the shape vector of the expressive face, which is shown by the black landmarks, into two components. That is, we assume that an expressive shape vector $s_{n,i}$, belonging to image $i$ of the $n$th sequence, can be decomposed as ...
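The dictionary estimation summarized above lends itself to a short sketch. The following Python fragment is a minimal illustration, not the authors' code: it assumes the rigidly aligned neutral shapes are already available as a NumPy array of concatenated x and y landmark coordinates, and it uses scikit-learn's GaussianMixture together with its aic() method to select the number of components.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_neutral_shape_dictionary(neutral_shapes, k_candidates=range(1, 11)):
    """Fit GMMs for several K and keep the one minimizing Akaike's information criterion.

    neutral_shapes: (N, 2*M) array of rigidly aligned neutral face shapes,
                    one row per sequence (x and y coordinates concatenated).
    Returns the mean vectors (the dictionary of neutral shapes) and covariance matrices.
    """
    best_gmm, best_aic = None, np.inf
    for k in k_candidates:
        gmm = GaussianMixture(n_components=k, covariance_type='full', random_state=0)
        gmm.fit(neutral_shapes)
        aic = gmm.aic(neutral_shapes)          # plays the role of expression (8) in the text
        if aic < best_aic:
            best_gmm, best_aic = gmm, aic
    return best_gmm.means_, best_gmm.covariances_   # mu_k and Sigma_k, k = 1..K

# Example with synthetic data standing in for the aligned CK+ neutral shapes:
# shapes = np.random.randn(327, 136)   # 68 landmarks -> 136 coordinates per shape
# means, covs = build_neutral_shape_dictionary(shapes)
```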
Context 2
... convey when we are trying to express ourselves with words. Therefore, facial expressions are very important in human-to-human communication to enhance our messages and to display emotions. It is expected that facial expression based emotion recognition will be a part of many human–computer interaction scenarios in the future [1]. Below, we first give a brief overview of the state of the art in facial expression recognition. Then, we state and motivate the problem that we address in this paper. There is a plethora of recent works on automatic facial expression and emotion recognition, some of which have been highlighted in several survey papers [2–6]. Automatic facial expression recognition is still a challenging computer vision problem. The main issues to tackle while developing a facial expression recognition algorithm are: i) choosing an emotion model, ii) extracting suitable facial features, iii) choosing a classification method, and iv) choosing a suitable database on which to report facial expression recognition results. Below, we give a brief overview of each of them.

There are two emotion models used when recognizing emotions from facial expressions: the categorical model and the dimensional model. In the categorical model, the facial expression is recognized as one of a set of distinct emotion labels. A popular categorization consists of six basic emotions: happiness, sadness, anger, fear, surprise and disgust, which were suggested by Ekman and Friesen and have been shown to be universal [7]. Recently, dimensional and continuously labeled emotion recognition, which uses a space with three dimensions, has also become popular [4]. The first dimension (evaluation or valence) measures the pleasure–displeasure (positive–negative) attribute of the emotion. The second dimension (activation or arousal) measures the likelihood that a person will take an action in the emotional state, and it ranges from sleepiness to extreme excitement [4]. The third dimension (dominance or power) represents the sense of control over the emotion. Using the dimensional model, it is possible to annotate a facial expression on a continuous scale rather than a categorical scale, describing the wide range of emotions that we encounter in our daily lives. On the other hand, the dimensional model is not as intuitive as the categorical model [8]. In this paper, we address the problem of categorical emotion recognition.

The facial features that can be extracted from a face image generally fall into one of two categories: geometrical features or appearance-based features. Geometrical features deal with the shape and motion of facial landmarks located around salient regions such as the eyelids, eyebrows and lips. Appearance-based features, on the other hand, quantify texture variations due to facial expressions, such as wrinkles between the eyebrows and on the forehead. It has been shown that using geometric and appearance features together improves facial expression recognition performance [6]. A varying number of landmarks on the face can be detected and tracked to serve as geometrical features. Various methods have been used for this task, including active appearance models [9,10], constrained local models [11–13], GentleBoost [14] and mixtures of trees [15].

Some appearance-based features that have been shown to be effective for facial expression recognition (FER) are Gabor filters [14], local binary patterns (LBP) [16–19], local phase quantization features [17,20], and the scale-invariant feature transform (SIFT) [21,22]. Many different classifiers have been utilized for FER. Support vector machines (SVMs) [14,17,18,23,24] are among the most popular ones. Other classifiers used for FER include neural networks [25], sparse representation classifiers [26,27], hidden Markov models [28], conditional random fields [29,30] and Bayesian networks [31]. Some efforts have also been reported that propose novel classifiers. In [32], Chew et al. propose a novel supervised classifier, namely the modified correlation filter (MCF), which is shown to have superior generalization performance due to its robustness to noisy training examples.

In order to test and compare automatic facial expression recognition algorithms, emotional databases that are open to researchers are needed [6]. There are a number of 2D [6,10,33–35] and 3D [36–39] facial expression databases in the literature. In this work, we utilize three widely used facial expression databases: CK+ [10], MMI [34] and eNTERFACE [40]. The Cohn–Kanade (CK) database [41] has been a very popular one; it consists of facial clips containing the six basic emotions, as shown in Fig. 1. The CK database has recently been extended to include more subjects, a new facial expression class (contempt) and facial tracking data, and is called the CK+ database [10]. The face tracking data provided with the CK+ database consist of the locations of 68 facial landmarks, which can be seen in Fig. 2. In the CK+ database, there are a total of 123 subjects and 327 sequences with emotion labels. The emotions were posed in a laboratory by participants whose ages ranged from 18 to 50; 69% of them were female, 81% Euro-American, 13% Afro-American and 6% from other ethnic groups. The first frame of each sequence in the CK+ database shows a neutral expression (onset frame) and the last frame shows an expression at its apex (peak frame), as can be observed in Fig. 2. The seven emotion categories in the database are happiness, sadness, anger, fear, disgust, surprise, and contempt.

Another database that we used in this study is the MMI database [33,34]. This database contains recordings of full temporal patterns of facial expressions covering the neutral, onset, apex and offset phases. The facial expressions consist of the six basic emotions as well as expressions with a single action unit, which were posed in a laboratory environment using either natural lighting or two high-intensity lamps with reflective umbrellas. Some naturalistic expressions have also been added to the database [34]. There are currently 75 subjects in the whole database, aged between 19 and 62, who have European, Asian or South American ethnic backgrounds. The third database that we used in this work is eNTERFACE'05 [40], which is an audio–visual database. We used this database to test the generalization potential of the proposed method to less constrained data containing lip motion due to speech. The eNTERFACE database contains a total of 1287 audio–visual clips of 44 subjects (81% men, 19% women) from 14 different nationalities. The database contains recordings of sentences uttered in English in an acted way, reflecting the six basic emotions, i.e., happiness, sadness, anger, fear, disgust and surprise.

Each clip is expected to reflect one of these emotions from the beginning to the end. Therefore, a neutral expression does not exist in the database. The visual component of the eNTERFACE dataset is quite challenging since the actors are not professional and some emotions are very hard to detect visually, even for humans (e.g. fear). Moreover, 31% of the subjects wear glasses and 17% have a beard.

An expressive face image carries information about both the facial expression and the identity (including age and gender) of that person. If shape-based features are used to describe the locations of salient points (around the eyes, lips, etc.), suppressing the identity-related information has been shown to increase the facial expression recognition rate. One way to do this is to subtract the baseline [4,42], that is, the neutral face shape of the person, from the expressive shape and to work with the "motion" information. This solution is feasible only if the neutral face image (or shape) is available, as in the CK+ database, where it is always the first frame of a sequence [10]. If the neutral face image is not available, one approach [43] averages all the frames of a sequence, assuming that the average gives an estimate of the neutral face shape. However, this assumption does not hold if all the frames of a video are strongly expressive.

In this work, we propose a general solution to the baseline problem described in the previous paragraph. That is, we present a general method for estimating the neutral (i.e. expressionless) face shape for a given expressive face. First, a dictionary of neutral face shapes is generated. Then, the best-fitting dictionary entry is selected as the estimated neutral shape for a given expressive face, which is expected to model the identity-related component of the expressive face. We also propose an efficient method of fusing geometric and appearance-based features. Stand-alone and cross-database experiments on the CK+, MMI and eNTERFACE databases demonstrate that the proposed method increases the accuracy of facial expression recognition significantly when the actual neutral face image is unavailable. In Section 2, the method proposed for estimating a neutral face shape dictionary using Gaussian mixture models is presented. In Section 3, the facial expression recognition method, which uses the neutral shape dictionary, is explained in detail. The fusion of geometric and appearance-based features is also explained in Section 3. Experimental results are presented in Section 4, and conclusions are given in Section 5.

In this section, we give the details of the estimation of a neutral face shape dictionary using Gaussian mixture models (GMMs). First, we describe the geometrical facial features that we use, then give the details of the GMM fitting method. The geometrical features we use are based on the x and y coordinates of 68 landmark points on the face. Examples of these 68 points can be seen in Fig. 2 and Fig. 3(a). In the CK+ database, these points have been manually marked at keyframes, and were tracked automatically for the remaining frames using an Active Appearance Model ...
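To make the baseline idea concrete, the sketch below (an illustration under our own assumptions, not the authors' implementation) forms geometric motion features either by subtracting a known neutral shape, as is possible in CK+ where the first frame is neutral, or by falling back to the per-sequence average shape in the spirit of [43] when no neutral frame is available.

```python
import numpy as np

def motion_features(sequence_shapes, neutral_shape=None):
    """Compute geometric 'motion' features for the peak (last) frame of a sequence.

    sequence_shapes: (T, 2*M) array of aligned shapes, one row per frame.
    neutral_shape:   (2*M,) neutral shape of the same person, if available
                     (e.g. the first frame of a CK+ sequence).
    """
    peak = sequence_shapes[-1]                 # apex frame
    if neutral_shape is None:
        # Fallback in the spirit of [43]: approximate the neutral shape by the
        # sequence average. This breaks down when every frame is strongly expressive.
        neutral_shape = sequence_shapes.mean(axis=0)
    return peak - neutral_shape                # landmark displacement vectors
```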
Context 3
... In the CK+ database, these points have been manually marked at keyframes, and were tracked automatically for the remaining frames using an Active Appearance Model based approach [9,10,44]. In the MMI database, we track these landmark points fully automatically, using an improved version of the Constrained Local Models based approach [11] (see Fig. 3(b)). The Face Tracker implemented by Saragih et al. [11] is a near real-time, generic face and facial feature tracker based on deformable model fitting by regularized landmark mean-shift. Before the geometrical feature vectors are formed for each frame of a video clip, face shapes as described by ...
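The paper relies on the CLM-based Face Tracker of Saragih et al. for landmark tracking; since that tracker's API is not shown in this excerpt, the snippet below is only an illustrative stand-in that uses dlib's generic 68-point shape predictor (the model file name and its availability are assumptions) to obtain comparable landmark coordinates from a single frame.

```python
import dlib
import numpy as np

# Assumed: dlib's pre-trained 68-landmark model has been downloaded separately.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(gray_image):
    """Return a (68, 2) array of (x, y) landmark coordinates for the first detected face."""
    faces = detector(gray_image, 1)            # upsample once to find smaller faces
    if not faces:
        return None
    shape = predictor(gray_image, faces[0])
    return np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])
```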
Context 4
... the tracked landmark points are aligned to minimize the scale, rotation and translation differences that may exist between different subjects and also between different frames of a video clip. For this purpose, the landmarks that are expected to be affected the least by facial expressions and are robust to track are chosen, such as the inner corners of the eyes and the nose tip [45]. Translation differences between frames are eliminated by moving the nose tip to the origin. In-plane head rotation variations are eliminated by making the line connecting the inner corners of the eyes parallel to the x-axis. Scale differences are minimized by normalizing the inter-ocular distance to a constant value. An example of face alignment is shown in Fig. 4, which shows the fearful and neutral facial expressions of subject 32 before alignment (a) and after alignment (b). We can see that after the alignment, the nose, the inner corners of the eyes, and the sides of the cheeks are aligned well. As can be observed in Fig. 4(c) and (d), the alignment step minimizes the global scale, rotation and translation differences between all frames in the database, similar to affine registration methods in the literature [46,47], while keeping the variations due to facial expressions.
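The alignment steps just described translate almost directly into code. The sketch below is one plausible reading of the procedure; the landmark indices chosen for the nose tip and the inner eye corners, and the target inter-ocular distance, are assumptions and are not taken from the paper.

```python
import numpy as np

NOSE_TIP, RIGHT_EYE_INNER, LEFT_EYE_INNER = 30, 39, 42   # assumed 68-point indexing
TARGET_IOD = 100.0                                        # assumed constant inter-ocular distance

def rigid_align(landmarks):
    """Remove translation, in-plane rotation and scale from a (68, 2) landmark array."""
    pts = landmarks - landmarks[NOSE_TIP]                 # move the nose tip to the origin
    eye_vec = pts[LEFT_EYE_INNER] - pts[RIGHT_EYE_INNER]
    angle = np.arctan2(eye_vec[1], eye_vec[0])            # rotate the eye line onto the x-axis
    c, s = np.cos(-angle), np.sin(-angle)
    pts = pts @ np.array([[c, -s], [s, c]]).T
    iod = np.linalg.norm(pts[LEFT_EYE_INNER] - pts[RIGHT_EYE_INNER])
    return pts * (TARGET_IOD / iod)                       # normalize the inter-ocular distance
```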
The coordinate based features (CBF) are formed by concatenating the x and y coordinates of the $M$ aligned landmark points in the peak frame of an image sequence. If the landmark points of the person-specific neutral facial expression are available (as in the first frame of a CK+ sequence), they are subtracted from the peak frame, yielding the coordinate based features with neutral subtraction (CBF-NS). Several examples of the CBF-NS features are shown in Fig. 5; they are basically the motion vectors of the facial landmark points. Our goal is to estimate the unknown neutral face shape for any given expressive face, aiming to improve facial expression recognition performance with its help. Therefore, we model the density of the space of neutral face shapes using Gaussian mixture models (GMMs), expecting that the mean of each Gaussian component will represent a typical face shape that might be expected in a population, reflecting personal differences (e.g. round face, thin face, etc.). The set of mean vectors of the Gaussian components gives us a dictionary of neutral face shapes. In order to fit a GMM to the space of neutral face shapes, we use the neutral images in the CK+ database from 123 subjects, which are the first frames of the sequences. Let $\chi = \{s_{n,1}\}$, $n = 1, \dots, N$, represent the neutral shape data set, where $s_{n,1} = [p^{n}_{1,1}, p^{n}_{2,1}, \dots, p^{n}_{M,1}]$ represents the face shape in the first frame of image sequence $n$, consisting of the $M$ normalized coordinates of the landmark points after alignment, i.e., $p^{n}_{i,1} = (x^{n}_{i,1}, y^{n}_{i,1})$. For example, there are $M = 68$ facial points in the CK+ database. The notation will be simplified as $\chi = \{s_1, \dots, s_N\}$ to represent the set of neutral face shapes (i.e. the sample). One way of modeling the distribution of neutral face shapes is to use a mixture of densities ...
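Combining the dictionary with an aligned expressive shape, the following sketch selects the best-fitting neutral shape and forms CBF-NS-style motion features. The selection criterion shown (per-component Gaussian log-likelihood) is an assumption made for illustration; the paper's exact fitting rule is given in the full text, which this excerpt truncates.

```python
import numpy as np
from scipy.stats import multivariate_normal

def estimate_neutral_and_motion(expressive_shape, dict_means, dict_covs):
    """Pick the dictionary entry that best explains an expressive shape.

    expressive_shape: (2*M,) aligned shape of the expressive (peak) frame.
    dict_means, dict_covs: GMM means (K, 2*M) and covariances (K, 2*M, 2*M).
    Returns the estimated neutral shape and the motion (displacement) features.
    """
    scores = [multivariate_normal.logpdf(expressive_shape, mean=m, cov=c, allow_singular=True)
              for m, c in zip(dict_means, dict_covs)]
    neutral_hat = dict_means[int(np.argmax(scores))]    # best-fitting neutral shape
    return neutral_hat, expressive_shape - neutral_hat  # CBF-NS-style geometric features
```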

Similar publications

Article
Full-text available
In this paper we demonstrate how genetic programming can be used to interpret time-dependent facial expressions in terms of emotional stimuli of different types and intensities. In our analysis we have used video records of facial expressions made during the Mars-500 experiment, in which six participants have been isolated for 520 days to simulate...
Conference Paper
Full-text available
Affective technologies enable the automatic recognition of human emotional expressions and non-verbal signals which play an important part in effective communication. This paper describes the use of user-centred design techniques to establish display designs suitable for feeding back recognised emotional and social signals to trainees during commun...

Citations

... Reference [5] designed an improved wave physics model based on depth wave field inference for speech emotion recognition. Reference [6] used a Gaussian mixture model fitting method to design a neutral profile dictionary to solve the baseline problem. Reference [7] collected emotional physiological data sets under four induced emotions, and a group-based IRS model improved emotion recognition performance. ...
Article
Full-text available
The change in lifestyle over time has also prompted the reform of many art forms (including musicals). Nowadays, the audience can not only enjoy the wonderful performances of offline musicals but also feel the charm of musicals online. However, how to convey the emotional integrity of musicals to the audience is a technical problem. In this paper, a deep learning music emotion recognition model based on musical stage effects is studied. Firstly, there is little difference between the emotional results identified by the CRNN model test and the actual feelings of people, and the coincidence of emotional responses is as high as 95.68%. Secondly, the final recognition rate of the model is 98.33%, and the final average accuracy rate is as high as 93.22%. Finally, compared with other methods on the CASIA emotion set, the CRNN-AttGRU achieves only 71.77% WAR and 71.60% UAR, and only this model has the highest recognition degree. This model still needs iterative updates and other learning methods for learning at different levels, so that it can be widely used and bring more complete enjoyment to the audience.
... Using a facial image or video, not only the identity but also the age [1], gender [2] and race information can be determined. The emotional and mental state of the person can also be inferred [3][4][5][6][7] from changes in facial expressions over time. ...
Article
Full-text available
Although recent deep‐learning‐based face recognition methods give remarkable accuracies on large databases, their performance has been shown to degrade under adverse conditions (e.g. severe illumination and contrast variations; blur and noise). Under such conditions, soft‐biometric features such as facial dynamics are expected to increase the performance if they are used together with appearance‐based features. We propose a novel hybrid face recognition, which uses appearance‐based features extracted using deep convolutional networks and statistical facial dynamics features extracted from facial landmark positions during smile expression. We evaluated the performances of three different state‐of‐the‐art pre‐trained deep convolutional neural networks (DCNNs) under a variety of severe image distortions with different parameters. The experimental results show that, although the face recognition performance using only DCNN‐based features drops significantly under adverse conditions, the utilization of facial dynamics features together with DCNN‐based features can compensate for the performance loss and increase the accuracy significantly. We believe the proposed system can be useful when face recognition is performed using videos obtained from systems, which may contain blurry and noisy images with a wide range of illumination variations.
... For better feature representation, the blend of geometrical and appearance based features known as coordinate based features with neutral face subtraction (CBF-NS) was used [22]. The dictionary of the neutral faces was learned through GMM along with the CBF-NS features. ...
... In Tables V, VI, and VII, the proposed method of using dynamic kernels for expression recognition is compared with state-of-the-art approaches on the MMI, BP4D, and AFEW datasets, respectively. Existing techniques use low-level features like HOG [5], SIFT [22], and geometric features [39] that cannot handle the variations encountered in expression videos in unconstrained environments [40]. As these features are extracted at every frame, they capture only spatial information, which is not sufficient to analyze facial expressions in videos. ...
... However, CNNs and LSTMs need large annotated expression datasets in order to generalize well across facial expressions. Hence, the state-of-the-art methods employ a combination of CNN and HOG features with traditional sequential models like HMMs (reported accuracies: [32] 53.5, SIFT + SVM [22] 62.5, LDA + NN [42] 67.4, 3DCNN + DAP [41] 62.2, neutral face + sparsity [43] 70.1, CNN + joint fine-tuning [24] 70). In order to further improve the recognition performance, other modalities such as audio features are also fused with video features to integrate complementary information [44]. It can be clearly observed that MIK performs better than the state-of-the-art with the help of the temporal information embedded in the MBH features. ...
Article
Full-text available
Recognition of facial expressions across various actors, contexts, and recording conditions in real-world videos involves identifying local facial movements. Hence, it is important to discover the formation of expressions from local representations captured from different parts of the face. So in this paper, we propose a dynamic kernel-based representation for facial expressions that assimilates facial movements captured using local spatio-temporal representations in a large universal Gaussian mixture model (uGMM). These dynamic kernels are used to preserve local similarities while handling global context changes for the same expression by utilizing the statistics of uGMM. We demonstrate the efficacy of dynamic kernel representation using three different dynamic kernels, namely, explicit mapping based, probability-based, and matching-based, on three standard facial expression datasets, namely, MMI, AFEW, and BP4D. Our evaluations show that probability-based kernels are the most discriminative among the dynamic kernels. However, in terms of computational complexity, intermediate matching kernels are more efficient as compared to the other two representations.
... In recent years, numerous investigations of observable expression recognition have been made [1–3]. However, involuntary happiness, anger, sadness, or other emotions can be concealed, easily causing misinterpretation of emotions. ...
Article
Full-text available
Currently, research on emotion recognition is gaining increasing attention. Inner emotions or thought activity can be determined by analyzing facial expressions, behavioral responses, audio, and physiological signals; facial expressions are one of the forms of non-verbal interactions. We constructed emotion-specific activation maps to establish infrared thermal facial image sequences, which is an alternative approach for determining the correlation between emotional triggers and changes in facial temperature. During the testing process, data stored in The International Affective Picture System were used to create emotional clips that triggered three different types of emotions in the subjects, and their infrared thermal facial image sequences were simultaneously recorded. For processing, an image calibration protocol was first employed to reduce the variance produced by irregular micro-shifts in the faces of the subjects, followed by independent component analysis and statistical analysis protocols to create the facial emotional activation maps. The test results showed that we resolved the problem of selecting local regions when analyzing frame temperature. Emotion-specific facial activation maps provide visualized results that facilitate the observation and understanding of information.
... Clearly also, other forms of robust PCA [50–52] and M-estimators [53,54] might also be used to deal with the problem of outliers. Finally, future research will attempt to extend existing single-level probabilistic methods of modeling shape and/or appearance (e.g., mixture models [30,55] and extensions of Bayesian methods used in ASMs or AAMs [56,57]) to multilevel formulations and to active learning [58]. The use of schematics such as Figure 1 will hopefully prove just as useful in visualizing these models as they have for mPCA. ...
Article
Full-text available
Single-level principal component analysis (PCA) and multi-level PCA (mPCA) methods are applied here to a set of (2D frontal) facial images from a group of 80 Finnish subjects (34 male; 46 female) with two different facial expressions (smiling and neutral) per subject. Inspection of eigenvalues gives insight into the importance of different factors affecting shapes, including: biological sex, facial expression (neutral versus smiling), and all other variations. Biological sex and facial expression are shown to be reflected in those components at appropriate levels of the mPCA model. Dynamic 3D shape data for all phases of a smile made up a second dataset sampled from 60 adult British subjects (31 male; 29 female). Modes of variation reflected the act of smiling at the correct level of the mPCA model. Seven phases of the dynamic smiles are identified: rest pre-smile, onset 1 (acceleration), onset 2 (deceleration), apex, offset 1 (acceleration), offset 2 (deceleration), and rest post-smile. A clear cycle is observed in standardized scores at an appropriate level for mPCA and in single-level PCA. mPCA can be used to study static shapes and images, as well as dynamic changes in shape. It gave us much insight into the question “what’s in a smile?”.
... Furthermore, the mPCA method uses averages of covariance matrices (e.g., over all subjects in the population or over specific subgroups), and robust averaging of these matrices might also be beneficial. Clearly also, other forms of robust PCA [48–50] and M-estimators [51,53] might also be used to deal with the problem of outliers. Finally, future research will attempt to extend existing single-level probabilistic methods of modeling shape and/or appearance (e.g., mixture models [29,53] and extensions of Bayesian methods used in ASMs or AAMs [54,55]) ...
Preprint
Full-text available
Single-level Principal Components Analysis (PCA) and multi-level PCA (mPCA) methods are applied here to a set of (2D frontal) facial images from a group of 80 Finnish subjects (34 male; 46 female) with two different facial expressions (smiling and neutral) per subject. Inspection of eigenvalues gives insight into the importance of different factors affecting shapes, including: biological sex, facial expression (neutral versus smiling), and all other variations. Biological sex and facial expression are shown to be reflected in those components at appropriate levels of the mPCA model. Dynamic 3D shape data for all phases of a smile made up a second dataset sampled from 60 adult British subjects (31 male; 29 female). Modes of variation reflected the act of smiling at the correct level of the mPCA model. Seven phases of the dynamic smiles are identified: rest pre-smile, onset 1 (acceleration), onset 2 (deceleration), apex, offset 1 (acceleration), offset 2 (deceleration), and rest post-smile. A clear cycle is observed in standardized scores at an appropriate level for mPCA and in single-level PCA. mPCA can be used to study static shapes and images, as well as dynamic changes in shape. It gave us much insight into the question “what’s in a smile?”
... The feature extraction and its application to the classifier is time-consuming and, in contrast, the elongation described above increases the accuracy by only 0.12%. In [13], a combination of geometric and appearance features is used for facial expression recognition. In this method, appearance features are calculated by a transform-based descriptor, and geometric features are calculated as the direct difference (displacement), in the x and y directions, of important points of the face between the two frames of normal and emotional expressions, where the accuracy reached 93.88%. ...
... However, we no longer need the middle frames, but there is still a need for a normal-expression frame, and this is a big challenge. Thus, [13] solves this problem by creating a set of generic normal (neutral) expression shapes from normal expression images and uses it to calculate the differential geometric features. The recognition accuracy of this method reached 90.36% (by combining geometric and appearance features), and the resulting loss in accuracy is reasonable. ...
... In [5], the difference in Euclidean distance between important points on the face and their reference points across two frames is used. For a better comparison of the differential geometric feature extraction of the proposed algorithm with the two methods [5,13], the algorithms of [5,13] were implemented for differential feature extraction on the CK+ database, and the results are given in Table 5. The results presented in Table 6 show that the differential geometric feature extraction using the proposed method is more accurate than the other methods; this high accuracy is due to the accurate calculation of the displacement of important facial points from the emotional to the normal expression. ...
Article
In recent years, there has been growing interest in improving human interaction with computers. Hence, automatic recognition of facial expressions has become one of the active research topics. The purpose of this paper is to identify facial expressions by using differential geometric features. In the proposed method, only the first and last images are used, and differential features are extracted from these two images. Differential geometric features are extracted from changes in the important points of the face between the two images. In this method, the distance between the important points of the face and the reference point was calculated in both the x and y directions for the two images, and from the difference between these distances, the differential geometric features between the two images were obtained. Based on the results, with this method, the recognition accuracy for the six facial expressions in the CK+ database was 96.44%.
... The usual choice for obtaining maximum log-likelihood estimates of the parameters is the Expectation–Maximization algorithm [21,17]. Then the GMM is used for fitting the cylinder pressure curves. The results are shown in Figure 3. ...
Article
Effective condition monitoring of diesel engines can ensure the reliability of large-power machines and prevent catastrophic consequences. Cylinder pressure is capable of reflecting the whole combustion process of a diesel engine, and hence can help to identify malfunctions of the diesel engine during operation. In this paper, a graphic pattern feature-mapping method is proposed for graphic pattern feature recognition in data-driven condition monitoring. The graphic feature extraction and recognition are linked by labeled feature-mapping. It is used to identify the running condition of the diesel engine by analyzing its cylinder pressure signal. The different types of malfunctions caused by different parts of the diesel engine, such as the induction system, valve actuating mechanism, fuel system, fuel injection system, etc., can be identified from the cylinder pressure signal alone. A bench experiment on a large-power diesel engine is performed to validate this graphic pattern recognition method. The results show that it has good accuracy on multi-malfunction identification and classification when the engine operates at one speed and one load.
... Inner emotion and cognitive activity can be reflected through facial expressions, behavioral responses, sound, etc. Facial expression represents a type of non-verbal interaction; thus, in recent years, many studies have explored the identification of facial expressions under visible light [3,4]. However, these studies share a common blind spot, in that non-spontaneous expressions of emotion can be camouflaged, resulting in misjudgment of emotions [5]. In addition to this blind spot, recognition of facial expressions is also influenced by environmental illumination and face pose, leading to system identification errors [6]. ...
Article
Full-text available
Background Schizophrenia is a neurological disease characterized by alterations to patients’ cognitive functions and emotional expressions. Relevant studies often use magnetic resonance imaging (MRI) of the brain to explore structural differences and responsiveness within brain regions. However, as this technique is expensive and commonly induces claustrophobia, it is frequently refused by patients. Thus, this study used non-contact infrared thermal facial images (ITFIs) to analyze facial temperature changes evoked by different emotions in moderately and markedly ill schizophrenia patients. Methods Schizophrenia is an emotion-related disorder, and images eliciting different types of emotions were selected from the international affective picture system (IAPS) and presented to subjects during ITFI collection. ITFIs were aligned using affine registration, and the changes induced by small irregular head movements were corrected. The average temperatures from the forehead, nose, mouth, left cheek, and right cheek were calculated, and continuous temperature changes were used as features. After performing dimensionality reduction and noise removal using the component analysis method, multivariate analysis of variance and the Support Vector Machine (SVM) classification algorithm were used to identify moderately and markedly ill schizophrenia patients. Results Analysis of five facial areas indicated significant temperature changes in the forehead and nose upon exposure to various emotional stimuli and in the right cheek upon evocation of high valence low arousal (HVLA) stimuli. The most significant P-value (lower than 0.001) was obtained in the forehead area upon evocation of disgust. Finally, when the features of forehead temperature changes in response to low valence high arousal (LVHA) were reduced to 9 using dimensionality reduction and noise removal, the identification rate was as high as 94.3%. Conclusions Our results show that features obtained in the forehead, nose, and right cheek significantly differed between moderately and markedly ill schizophrenia patients. We then chose the features that most effectively distinguish between moderately and markedly ill schizophrenia patients using the SVM. These results demonstrate that the ITFI analysis protocol proposed in this study can effectively provide reference information regarding the phase of the disease in patients with schizophrenia.
... 2D facial features can be broadly grouped as geometric features and appearance-based features. Geometric features localize the salient facial points and detect the emotion based on the deformation of these facial points [26]. Appearance-based features represent the change in the texture of the expressive face [8,14,20,30]. ...
Article
Full-text available
We present a fully automatic multimodal emotion recognition system based on three novel peak frame selection approaches using the video channel. Selection of peak frames (i.e., apex frames) is an important preprocessing step for facial expression recognition as they contain the most relevant information for classification. Two of the three proposed peak frame selection methods (i.e., MAXDIST and DEND-CLUSTER) do not employ any training or prior learning. The third method proposed for peak frame selection (i.e., EIFS) is based on measuring the “distance” of the expressive face from the subspace of neutral facial expression, which requires a prior learning step to model the subspace of neutral face shapes. The audio and video modalities are fused at the decision level. The subject-independent audio-visual emotion recognition system has shown promising results on two databases in two different languages (eNTERFACE and BAUM-1a).