AFFECTIVE COMPUTING
Classifying Emotions in Speech with Global Acoustic Features
Some global acoustic features are:
Feature set A: mean, standard deviation, minimum, maximum, range, and slope of the pitch signal, and speaking rate.
Feature set B: related to rhythm - speaking rate, average length between voiced regions, number of slopes
from the smoothed pitch signal - minimum, maximum, median, standard deviation
from the derivative of the smoothed pitch signal - minimum, maximum, median, standard deviation
from individual voiced parts - mean of the minima, mean of the maxima
from individual slopes - mean positive derivative, mean negative derivative
As seen, feature set A is clearly a much weaker set of features
than feature set B. These features are used to classify
the emotion of a given speech sample.
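The set-A statistics can be computed directly from a pitch contour. A minimal sketch, assuming pitch values have already been extracted for the voiced frames; the slope here is taken as the mean frame-to-frame change, which may differ from the original work's definition:

```python
import numpy as np

def global_pitch_features(pitch):
    """Compute the set-A global statistics of a pitch contour.

    `pitch` is a 1-D sequence of pitch values (e.g. in Hz)
    for the voiced frames of an utterance.
    """
    pitch = np.asarray(pitch, dtype=float)
    # Mean frame-to-frame change as a simple stand-in for "slope".
    slope = np.mean(np.diff(pitch)) if len(pitch) > 1 else 0.0
    return {
        "mean": pitch.mean(),
        "std": pitch.std(),
        "min": pitch.min(),
        "max": pitch.max(),
        "range": pitch.max() - pitch.min(),
        "slope": slope,
    }
```

Speaking rate is not included, since it needs segment timing rather than the pitch contour alone.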
Figure 1. This figure shows the pitch
signal, the smoothed pitch signal, and the
derivative of the smoothed pitch signal.
Basic Concept
Here is the basic idea behind emotion recognition from speech. First, sample utterances with known emotions are collected. A set of features is extracted from each sample, and together these form a database for later use. Given a new speech signal, the same set of features is extracted from it. We then use a classifier such as K-Nearest Neighbour (KNN), Maximum Likelihood Bayes (MLB), or Kernel Regression (KR) to classify the emotion of the signal from its features, using the information in the database.
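The pipeline above can be sketched in a few lines. The database contents and feature values below are purely illustrative; in practice each vector would come from the feature extraction step:

```python
import numpy as np

# Hypothetical database: feature vectors extracted from utterances
# with known emotion labels (values are made up for illustration,
# e.g. [mean pitch, pitch std]).
database = {
    "happy": [np.array([220.0, 45.0]), np.array([210.0, 50.0])],
    "sad":   [np.array([140.0, 15.0]), np.array([150.0, 12.0])],
}

def classify(features):
    """Assign the label of the nearest stored example (1-NN)."""
    best_label, best_dist = None, float("inf")
    for label, examples in database.items():
        for ex in examples:
            d = np.linalg.norm(features - ex)  # Euclidean distance
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label
```

A new utterance is classified by extracting the same features and calling `classify` on the resulting vector.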
Studies have shown that KNN is the best emotional speech classifier of the three: MLB shows an increase in error when stronger features are used, and KR shows a smaller reduction in error than KNN, as shown in Table 1.
Table 1
KNN classifier
The KNN classifier is an algorithm that stores all available cases in a space where each case's position is determined by its features. It classifies new cases based on the closest existing cases nearby.
The value of K determines how many of the closest existing cases
are considered. For example, if K = 3, we look at
the 3 existing cases closest to the new case and classify
the new case as the class that occurs most often among those 3 cases.
In Figure 2, where K = 3, the new case is closest to 2 blue cases and
1 orange case, so it is classified as blue. The axes
are the features used in the classification.
Figure 2
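The K-vote rule can be written compactly. A minimal sketch, assuming Euclidean distance over the feature axes:

```python
from collections import Counter
import math

def knn_classify(new_case, cases, k=3):
    """Classify `new_case` by majority vote among its k nearest cases.

    `cases` is a list of (feature_vector, label) pairs, where each
    feature vector is a sequence of numbers.
    """
    # Sort the stored cases by Euclidean distance to the new case
    # and keep the k closest.
    nearest = sorted(cases, key=lambda c: math.dist(new_case, c[0]))[:k]
    # Return the label that occurs most often among those k cases.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Replaying the Figure 2 situation: with K = 3, two blue neighbours and one orange neighbour yield "blue".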
Feature Selection
Feature selection is an essential part of speech emotion classification, as some features are more important than others, and some features are not relevant at all.
A simple feature selection process can be done as follows:
- Order the features by priority: test each feature by classifying sets of utterances with the KNN classifier using that feature alone. From the error of each classification we can determine which features are more relevant than others.
- Combine the more important features to find the optimal combination, the one which yields the lowest error.
- The combining of the features can be done by promising first selection (PFS) or forward selection (FS).
- PFS simply adds the next most relevant feature and observes the resulting error, and keeps adding features until the error stops decreasing.
- FS checks every unused feature and adds the one that gives the best overall result. This gives better accuracy, but more computational work is needed. It also stops when the error starts increasing.
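Forward selection can be sketched as a greedy loop. Here `error_of(subset)` is an assumed callable that returns the classification error using only that feature subset (e.g. cross-validated KNN error); its definition is up to the caller:

```python
def forward_selection(features, error_of):
    """Greedy forward selection (FS).

    `features` is the list of candidate features; `error_of` maps a
    feature subset to its classification error (an assumption here).
    """
    selected, best_err = [], float("inf")
    remaining = list(features)
    while remaining:
        # Try adding each unused feature and keep the best candidate.
        trials = [(error_of(selected + [f]), f) for f in remaining]
        err, f = min(trials)
        if err >= best_err:
            break  # stop once the error stops decreasing
        selected.append(f)
        remaining.remove(f)
        best_err = err
    return selected, best_err
```

PFS would differ only in trying features in a fixed relevance order instead of re-checking every unused feature each round.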
Table 2 shows the decrease in error after using feature selection.
Drawback of Feature Selection
Each time the feature selection process adds another feature into consideration, the dimensionality increases. Although the error decreases, each extra feature affects the reliability of the classification. This is because each time the dimensionality increases, the volume of the space grows by a very large amount, making the available data sparse relative to the space.
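This sparsity effect is easy to demonstrate: with a fixed number of stored cases, the average distance to the nearest neighbour grows rapidly with dimension. A small simulation sketch (the point counts and trial counts are arbitrary):

```python
import random

def mean_nearest_neighbour_distance(n_points, dim, trials=200):
    """Average distance from a random query point to its nearest
    neighbour among n_points uniform samples in the unit hypercube."""
    random.seed(0)  # fixed seed so the estimate is reproducible
    total = 0.0
    for _ in range(trials):
        pts = [[random.random() for _ in range(dim)]
               for _ in range(n_points)]
        q = [random.random() for _ in range(dim)]
        # Euclidean distance to the closest stored point.
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            for p in pts
        )
    return total / trials
```

With 50 stored cases, the mean nearest-neighbour distance increases markedly from 2 to 10 to 50 dimensions, which is why each added feature makes the KNN database effectively sparser.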