
Classifying Emotions in Speech with Local Acoustic Features

 

Global feature methods overlook the dynamic variation within an utterance. Local segmental features can therefore be used to improve the accuracy of emotion recognition.

 

 

Using a Hidden Markov Model

There are many ways to derive emotion from the local features of an utterance. Here we look at a Hidden Markov Model (HMM) emotion classifier that uses the spectral features of the utterance, as it is one of the more effective classifiers.

 

 

HMM formulae:

The HMM is explained earlier in the facial gesture section.

Its compact notation is

λ = (A, B, π)

where A = {a_ij} is the state transition probability distribution, B = {b_j(k)} is the observation symbol probability distribution, and π = {π_i} is the initial state probability distribution.
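
As a concrete illustration only (the state and symbol counts and all values below are made up, not from the source), the three parameter sets can be stored as plain arrays in Python:

import numpy as np

# Hypothetical 3-state, 4-symbol HMM with illustrative values.
A = np.array([[0.6, 0.3, 0.1],        # a_ij: state transition probabilities
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])
B = np.array([[0.5, 0.2, 0.2, 0.1],   # b_j(k): observation symbol probabilities
              [0.1, 0.4, 0.4, 0.1],
              [0.2, 0.2, 0.3, 0.3]])
pi = np.array([0.5, 0.3, 0.2])        # π_i: initial state probabilities

# Every row of A and B, and pi itself, must sum to 1.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)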


With this model, the probability of the observation sequence O given the model λ can be calculated as

P(O | λ) = Σ_i α_t(i) β_t(i)     (for any time t)

where the forward variable α_t(i) and the backward variable β_t(i) are defined as

α_t(i) = P(O_1 O_2 … O_t, q_t = S_i | λ)

β_t(i) = P(O_{t+1} O_{t+2} … O_T | q_t = S_i, λ)

These formulae can be solved efficiently by induction.
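
A minimal Python/NumPy sketch of these inductions, assuming a discrete-observation HMM; the toy transition, emission, and initial probabilities are invented for illustration and do not come from the source:

import numpy as np

def forward_backward(A, B, pi, obs):
    # Returns alpha, beta and P(O | lambda) for a discrete-observation HMM.
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))

    # Forward induction: alpha_1(i) = pi_i * b_i(O_1),
    # alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(O_{t+1})
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    # Backward induction: beta_T(i) = 1,
    # beta_t(i) = sum_j a_ij * b_j(O_{t+1}) * beta_{t+1}(j)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    # P(O | lambda) = sum_i alpha_t(i) * beta_t(i) for any t;
    # at t = T this reduces to the sum of the final forward variables.
    return alpha, beta, alpha[-1].sum()

# Toy 2-state, 2-symbol model (illustrative values only).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
alpha, beta, prob = forward_backward(A, B, pi, obs=[0, 1, 0])
print(prob)  # equals (alpha[t] * beta[t]).sum() for every t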

 

 

Application:

To classify emotion with an HMM, the Mel-frequency cepstral coefficients (MFCCs) of the utterance, which represent its power spectrum, are computed, and the utterance is divided into five phoneme classes: vowel, glide, nasal, stop, and fricative sounds. An HMM is created for each of the five phoneme classes.

 
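
A rough sketch of the MFCC extraction step, assuming the librosa library; the file name, sampling rate, and number of coefficients below are placeholder choices, and the grouping into the five phoneme classes comes from the forced alignment described in the steps that follow:

import librosa

# Placeholder path and parameters; 13 MFCCs per frame is a common choice.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
print(mfcc.shape)

# The MFCC frames would then be grouped by phoneme class (vowel, glide,
# nasal, stop, fricative) before training the per-class emotion HMMs.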

The steps:

  1. The speech signal is force-aligned with the reference HMM, i.e. the HMM trained on the database.

  2. For each emotion, the phoneme-class emotion HMMs are applied to the force-aligned data, in the same order in which the phoneme classes occur.

  3. The standard Viterbi algorithm is run to find the maximum-likelihood sequence of phoneme classes, i.e. the new sequence S (phoneme-class sequence) given the observation sequence O, for each emotion.

  4. The final emotion is determined by comparing the probabilities, as shown below.

E* = argmax_i P(S_i | O)

where S_i is the phoneme-class sequence derived for emotion i, i indexes the different emotions, and E* is the final recognized emotion.

 
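
A simplified sketch of this decision rule, assuming one continuous-observation HMM per (emotion, phoneme class) pair trained with the hmmlearn library; the data structures, state count, and the use of hmmlearn are assumptions rather than details from the source, and the Viterbi re-decoding of step 3 is folded into the per-segment scoring:

from hmmlearn.hmm import GaussianHMM

def train_models(class_frames, emotions, phoneme_classes, n_states=3):
    # class_frames[(emotion, phoneme_class)] is an array of MFCC frames
    # (n_frames, n_mfcc) taken from emotion-labelled training data.
    models = {}
    for e in emotions:
        models[e] = {}
        for c in phoneme_classes:
            m = GaussianHMM(n_components=n_states)
            m.fit(class_frames[(e, c)])
            models[e][c] = m
    return models

def classify_emotion(segments, models):
    # segments: list of (phoneme_class, frames) pairs from the forced alignment.
    scores = {
        e: sum(class_models[c].score(frames) for c, frames in segments)
        for e, class_models in models.items()
    }
    # E* = argmax over emotions of the accumulated log-likelihood.
    return max(scores, key=scores.get)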

 

Results and Comparison 

Results from an experiment classifying four emotions show confusion between anger and happiness, and between neutral and sadness. See Table 1.


                                                                         Table 1

 

Nevertheless, the accuracy is still higher than that of classification based on global prosodic features using an SVC classifier, which is known to be more accurate than a KNN classifier. See Table 2.


                                                                            Table 2

 

 
