Bimodal Emotion Recognition from Speech and Text

This paper presents an approach to emotion recognition from speech signals and textual content. In the analysis of speech signals, thirty-seven acoustic features are extracted from the speech input. Two different classifiers, a Support Vector Machine (SVM) and a BP neural network, are adopted to classify the emotional states. In the text analysis, we use a two-step classification method to recognize the emotional states. The final emotional state is determined from the emotion outputs of the acoustic and textual analyses. We use two parallel classifiers for the acoustic information and two serial classifiers for the textual information, and the final decision is made by combining these classifiers through decision level fusion. Experimental results show that the emotion recognition accuracy of the integrated system is better than that of either of the two individual approaches.


I. INTRODUCTION
With the advent of the information age and the popularity of the Internet, more and more kinds of information enter our lives. The telephone has become the main means of daily communication and follow-up contact. We often call a customer service line to ask for information about a product, and after the call we are usually asked to rate the operator's service attitude so that the business can assess the service quality of its staff. However, manual evaluation often suffers from problems of objectivity and authenticity. Automatic emotion recognition is one of the key techniques of human-computer interaction [1].
In recent years, several research works have focused on emotion recognition. Hoch et al. [2] presented a method to recognize three kinds of emotional states in the automotive environment from speech and facial expression information. Busso et al. [3] analyzed the complementarity of speech emotion recognition and facial expression recognition, and presented a multi-modal emotion recognition method using both feature level fusion and decision level fusion. Wangner et al. [4] combined four physiological signals (electromyogram, ECG, skin resistance and respiration) to recognize emotional states and obtained a recognition rate of 92%. However, few approaches have focused on emotion recognition from textual input. Text is another important communication medium and can be retrieved from many sources, such as books, newspapers, web pages and e-mail messages. It is not only the most popular communication medium but is also rich in emotion. With the help of natural language processing techniques, emotions can be extracted from textual input.
In this paper, a bimodal emotion recognition method is used to extract emotion information from both speech and text input. The classifiers recognize emotions according to two simple types: positive and non-positive. We design two parallel classifiers for the acoustic information and two serial classifiers for the textual information, and the final decision is made by combining these classifiers through decision level fusion. This method can be applied to the dialogue system of a telephone service center to recognize customers' negative emotions, such as anger and impatience, so that the call can be transferred automatically from the automated answering service to a human operator to avoid losing customers.

II. EMOTIONAL SPEECH CORPUS
At present, there is still no public database for Chinese speech emotion recognition research. Generally there are two ways to obtain an emotional speech corpus: a) recording; b) clipping. The recording method is more customizable and can produce emotional speech that meets requirements on speaker, text, emotion category and so on. Following the general rules of corpus building, four college students of about 20 years of age with good emotional expressiveness (2 females, 2 males) were invited to take part in the recording. After perception experiments with five listeners who had not taken part in the recording, we removed nearly 40% of the samples, whose emotion could not be identified with certainty. Finally we selected a total of 600 usable samples, 300 positive and 300 non-positive, where non-positive includes anger, sadness, fear and other negative emotions.

III. PREPROCESSING
The purposes of voice and text preprocessing are different. Voice preprocessing aims to obtain clean speech by eliminating interference from various factors. Text preprocessing aims to obtain a relatively clean data set by filtering out noisy data.

A. Speech Signal Preprocessing

1) Pre-emphasis
Because the speech signal is affected by glottal excitation and by radiation from the mouth and nose, its high-frequency part is attenuated. Pre-emphasis boosts the high-frequency part so that the signal spectrum becomes flatter over the entire frequency band.
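
Pre-emphasis is commonly implemented as a first-order high-pass filter, y[n] = x[n] - a·x[n-1]. The sketch below assumes that form and a coefficient a = 0.97, a conventional choice that is not stated in this paper; the function name `pre_emphasis` is hypothetical.

```python
def pre_emphasis(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1] to a list of samples.

    alpha = 0.97 is an assumed, conventional value, not taken from the paper.
    The first sample has no predecessor and is passed through unchanged.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out


if __name__ == "__main__":
    # A constant (low-frequency) signal is strongly attenuated after the first sample.
    print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
```

A constant input is reduced to a small residual, while rapid sample-to-sample changes pass through largely intact, which is the high-frequency boost described above.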

2) Window Function
The window functions commonly used in speech processing are the rectangular window and the Hamming window.
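
As an illustration, the two windows can be written with their standard textbook definitions: the rectangular window is 1 over the frame, and the Hamming window is w[n] = 0.54 - 0.46·cos(2πn/(N-1)). The function names and the frame values below are hypothetical.

```python
import math


def rectangular_window(length):
    """Rectangular window: every sample in the frame has weight 1."""
    return [1.0] * length


def hamming_window(length):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def apply_window(frame, window):
    """Multiply a frame of samples by a window of the same length."""
    return [x * w for x, w in zip(frame, window)]


if __name__ == "__main__":
    frame = [1.0, 1.0, 1.0, 1.0, 1.0]  # hypothetical frame of samples
    print(apply_window(frame, hamming_window(len(frame))))
```

The Hamming window tapers the frame edges toward 0.08, which reduces spectral leakage compared with the abrupt cut of the rectangular window.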

B. Text Preprocessing
First we use a Chinese automatic word segmentation system to segment the text into words, and then we remove the stop words from the target text. This yields a relatively clean data set. The text preprocessing procedure is shown in Figure 1.
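
The stop-word step can be sketched as follows. The segmentation system is not named in the paper, so the sketch assumes the text has already been segmented into a list of tokens; the function name, the tokens and the stop-word list are all hypothetical.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word list, preserving order."""
    stop = set(stop_words)
    return [t for t in tokens if t not in stop]


if __name__ == "__main__":
    # Hypothetical output of a word segmenter for a short customer remark.
    tokens = ["这个", "服务", "的", "态度", "很", "好"]
    stop_words = ["这个", "的"]  # hypothetical stop-word list
    print(remove_stop_words(tokens, stop_words))
```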

A. Acoustic Features
A speech signal is short-time stationary, so features are calculated on short time frames: short-time amplitude energy, short-time pitch and the first three formants. For the whole utterance, every feature is calculated as a one-dimensional sequence. However, these sequences cannot be used directly as a feature vector for pattern recognition; a common way to solve this problem is to calculate statistics of each sequence, such as its mean and slope.
Speech emotion recognition based on prosodic features has strong robustness and adaptability. Statistical features better reflect the rhythmic structure of speech. On the basis of previous experiments, we chose 37 recognition features, which are shown in Table 1.
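
A minimal sketch of turning frame-level sequences into a fixed-length vector is given below. It computes only the two statistics named above (mean and slope), with the slope taken as an assumed least-squares fit against the frame index; it does not reproduce the 37 features of Table 1, and the contour names and values are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def slope(values):
    """Least-squares slope of the values against frame index 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def feature_vector(contours):
    """Concatenate [mean, slope] of each frame-level contour, in key order."""
    vector = []
    for name in sorted(contours):
        vector.extend([mean(contours[name]), slope(contours[name])])
    return vector


if __name__ == "__main__":
    # Hypothetical per-frame contours for one utterance.
    contours = {"energy": [0.2, 0.4, 0.6, 0.8], "pitch": [210.0, 205.0, 200.0, 195.0]}
    print(feature_vector(contours))
```

Whatever the utterance length, the resulting vector has a fixed size, which is what a classifier such as an SVM or a BP network requires.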

B. Textual Features

1) Feature Extension
Text orientation classification differs from general classification in that words or phrases with a semantic orientation or emotional tendency play a crucial role. In this paper we use three-step feature extension [5] to reconstruct the data sets, which extends their features by using lists of tendency words, negative words and degree adverbs. This method enhances the expressive power of the textual features by adding words or phrases with semantic orientation to the feature sequence.

2) Feature Selection
Commonly used feature selection methods include document frequency, mutual information, information gain, expected cross-entropy and the chi-square statistic. However, these methods are not well suited to text orientation classification. In this paper we chose the document frequency feature selection formula presented in [6], which takes the tendentiousness of words into account.
The feature score, denoted $DF\_SEN(t, c)$, is defined in terms of the following quantities: $DF_t$ is the number of documents in class $c$ that contain feature $t$; $N_c$ is the total number of documents in class $c$; two further feature-dependent terms give the orientation intensity of $t$ and the number of words that feature $t$ contains; and a weighting coefficient can be adjusted in experiments. When selecting features, we set a threshold $DF\_SEN_{min}$. If the score of a feature is less than this threshold, the feature is deleted. In this paper, based on experiments, we select 0.04 as the threshold. When a feature word appears in multiple classes, we select it according to its feature score in each category. If the absolute difference between its feature scores in two different categories is more than 0.12, the word is selected as a textual feature.
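
The two selection rules can be sketched as follows, taking the per-class feature scores as given inputs. The sketch makes two assumptions that the text does not spell out: the 0.04 threshold is applied to each per-class score first, and a word that survives in only one class is kept. The function name and the example scores are hypothetical.

```python
MIN_SCORE = 0.04  # score threshold reported in the text
MIN_GAP = 0.12    # minimum score difference between categories reported in the text


def select_features(scores):
    """Select feature words from {word: {class_label: score}}.

    Assumed reading of the rules: per-class scores below MIN_SCORE are
    discarded; a word left in one class is kept; a word left in several
    classes is kept only if its largest and smallest scores differ by more
    than MIN_GAP.
    """
    selected = []
    for word, by_class in scores.items():
        kept = [s for s in by_class.values() if s >= MIN_SCORE]
        if not kept:
            continue
        if len(kept) == 1 or max(kept) - min(kept) > MIN_GAP:
            selected.append(word)
    return selected


if __name__ == "__main__":
    # Hypothetical scores for the two classes used in this paper.
    scores = {
        "good": {"positive": 0.30, "non-positive": 0.05},
        "phone": {"positive": 0.20, "non-positive": 0.18},
        "rare": {"positive": 0.01},
        "angry": {"non-positive": 0.25},
    }
    print(select_features(scores))
```

In this example "phone" is dropped because it scores almost equally in both classes and so carries little orientation information.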

V. BIMODAL FUSION RECOGNITION ALGORITHM
Currently, there are two ways to combine different pieces of information: a) feature level fusion; b) decision level fusion. The problem with feature level fusion is the risk of running into the curse of dimensionality as the dimension of the input feature vector grows. In our case, we have two parallel classifiers for the acoustic information and two serial classifiers for the textual information, and the final decision is made by combining the outputs of these classifiers through decision level fusion.

A. Classification of the Acoustic Set
In this paper, we use two parallel classifiers for the acoustic information: a support vector machine (SVM) and a BP neural network.

1) Support Vector Machines
Support Vector Machines (SVMs) have recently attracted great interest for classification. They tend to show high generalization capability because their training is oriented toward structural risk minimization. Nonlinear problems are solved by using a mapping function to transform the input feature vectors into a feature space, generally of higher dimension, in which linear separation is possible. Maximum discrimination is obtained by optimally placing the separating plane between the borders of the two classes.

2) BP Neural Nets
The BP neural network was proposed by a team of scientists led by Rumelhart and McClelland, and it is one of the most widely used neural network models. A BP network can learn and store a large number of input-output mappings without the mathematical equations that describe these mappings being specified in advance.

B. Classification of the Textual Set
In this paper, we use the two-step classification method proposed in [7]. We construct two serial classifiers, CF1 and CF2, both of which use equation (4) to select features. CF1 is used first for classification. For the unreliable part of its classification results, CF2 is used for a second classification.
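
The serial arrangement can be sketched as a simple cascade. The text here does not state how a CF1 result is judged unreliable, so the sketch assumes a hypothetical confidence value returned by CF1 and a hypothetical threshold `min_confidence`; both classifiers are passed in as plain callables and stand in for the actual CF1 and CF2.

```python
def two_step_classify(features, cf1, cf2, min_confidence=0.8):
    """Classify with cf1 and fall back to cf2 when cf1 is not confident.

    cf1(features) is assumed to return (label, confidence);
    cf2(features) is assumed to return a label.
    min_confidence is a hypothetical reliability threshold.
    """
    label, confidence = cf1(features)
    if confidence >= min_confidence:
        return label
    return cf2(features)


if __name__ == "__main__":
    # Toy stand-ins for the two classifiers, for illustration only.
    def toy_cf1(features):
        score = features.get("good", 0) - features.get("bad", 0)
        label = "positive" if score >= 0 else "non-positive"
        return label, min(1.0, abs(score) / 2)

    def toy_cf2(features):
        return "non-positive" if features.get("not", 0) else "positive"

    print(two_step_classify({"good": 3}, toy_cf1, toy_cf2))            # cf1 decides
    print(two_step_classify({"good": 1, "not": 1}, toy_cf1, toy_cf2))  # cf2 decides
```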

1) To construct classifier CF1
CF1 is a Naive Bayes classifier. A text $d$ is expressed as $d = (t_1, t_2, \ldots, t_n)$, where $t_i$ is a feature item of the text. The Naive Bayes classifier is then expressed by formula (4).

2) To construct classifier CF2
Classifier CF2 is constructed as described in [7].

C. Fusion Algorithm at the Decision Level
In this paper we construct two parallel classifiers for the acoustic information and two serial classifiers for the textual information, and the final decision is made by combining these classifiers through decision level fusion. The flow chart is shown in Figure 2.
The idea of the weighted score voting strategy is as follows. If the three classifiers give the same recognition result, the sample is identified as that class. If two of the three classifiers give the same result, the weights of these two classifiers are summed and compared with the weight of the classifier that gives a different result; the sample is then identified as the class associated with the larger weight.

In this paper, we use a 2×2 confusion matrix to evaluate the emotion recognition algorithm. The element in row i and column j is the proportion of samples whose real emotional state is i that are recognized as j. That is, the greater the values on the diagonal of the matrix, the better the emotion recognition algorithm performs.

Experiment 1: the recognition rate of the single-mode SVM classifier based on acoustic features is shown in Table 2.

Experiment 4: the recognition rate of the decision level fusion algorithm is shown in Table 5.

From the experimental results we can see that the bimodal fusion algorithm presented in this paper achieves the expected effect. The advantage of the decision level fusion algorithm is that the classifiers are independent of each other: when the emotional data in one channel are unavailable or of low quality, decision level fusion can still recognize the emotional state, so it has good robustness.
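
The weighted score voting rule and the row-normalized confusion matrix described in this section can be sketched as follows. The classifier weights are not given in the text, so the values below are hypothetical, as are the function names and the example labels.

```python
from collections import defaultdict

CLASSES = ("positive", "non-positive")


def weighted_vote(predictions, weights):
    """Return the label with the largest summed classifier weight.

    predictions: {classifier_name: label}; weights: {classifier_name: weight}.
    With three classifiers this matches the rule in the text: a unanimous
    result wins outright, otherwise the two agreeing weights are summed and
    compared with the dissenting weight. Ties go to the label seen first.
    """
    totals = defaultdict(float)
    for name, label in predictions.items():
        totals[label] += weights[name]
    return max(totals, key=totals.get)


def confusion_matrix(true_labels, predicted_labels, classes=CLASSES):
    """Element [i][j] is the proportion of class-i samples recognized as class j."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in counts]


if __name__ == "__main__":
    weights = {"svm": 0.3, "bp": 0.3, "text": 0.5}  # hypothetical weights
    sample = {"svm": "positive", "bp": "positive", "text": "non-positive"}
    print(weighted_vote(sample, weights))  # the two acoustic votes outweigh the text vote

    truth = ["positive", "positive", "non-positive", "non-positive"]
    predicted = ["positive", "non-positive", "non-positive", "non-positive"]
    print(confusion_matrix(truth, predicted))
```

Because each classifier contributes only a label and a weight, a channel with missing data can simply be left out of `predictions`, which reflects the robustness argument made above.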

VII. CONCLUSION AND PROSPECTS
This paper presents an approach to bimodal emotion recognition from speech signals and textual content. We construct two parallel classifiers for the acoustic information and two serial classifiers for the textual information, and the final decision is made by combining these classifiers through decision level fusion. Experimental results show that the emotion recognition accuracy of the integrated system is better than that of either of the two individual approaches.
Emotion recognition can combine not only speech and text but also physiological characteristics such as heart rate, blood pressure and skin current, and it can be applied to polygraphs, entertainment and many other areas. Affective computing will help artificial intelligence become more and more humanized.
Both the training data sets and the testing data sets contain two kinds of emotion: positive and non-positive. Each training set contains 200 speech samples and 200 text samples per emotion, and each testing set contains 100 speech samples and 100 text samples per emotion. In the experiments, we use the same training sets and testing sets to test every single-mode classifier.

From Table 3 we can see that the recognition rate of the single-mode BP neural network classifier based on speech signals is around 77%. From Table 4 we can see that the recognition rate of the single-mode classifier based on textual features is around 89%. From Table 5 we can see that the recognition rate of the decision level fusion algorithm proposed in this paper is around 93%, which is better than that of all the single-mode classifiers.