Deep Study of CRF Models for Speech understanding in Limited Task

In this paper, we evaluate in depth CRF (Conditional Random Fields) models for speech understanding in a limited task. To this end, we design several models that differ in the level of integration of local dependencies within the same turn, and we evaluate them on data at different levels of preprocessing. We perform our study on a corpus where turns are not segmented into utterances: the whole turn is used as one unit during training and testing of the CRF models, which reflects the natural way of conversation. The language used in this work is the Tunisian Arabic dialect. The obtained results show the robustness of CRF models when dealing with raw data: they are able to detect the semantic dependency between words in the same speech turn. The best results are obtained when the CRF models take into account deeper dependencies between the words of the same turn and are trained on more thoroughly preprocessed data.


I. INTRODUCTION AND RELATED WORKS
Spoken Language Understanding is an important component of spoken dialogue systems. It aims to extract concepts from an utterance in order to clarify its meaning. The key step in an automatic understanding process is therefore the correspondence between the set of words in the utterance and the set of semantic concepts. To resolve this correspondence, the first research works in this field exploited linguistic formalisms such as regular grammars and context-free grammars. More recent works have turned to machine learning models for concept detection; these models are now widely used for the semantic annotation of speech utterances.
Our overview of the literature showed that learning models are the dominant approach to speech understanding because of the performance they achieve, particularly in restricted domains. These models have several advantages, reported by [1] [2] [3]. The intervention of a human expert is limited to the labeling of data, which is an easier task than the modeling of grammars or patterns. Moreover, these models offer better portability since they are domain and language independent. However, their effectiveness is sensitive to the corpus used to estimate their parameters, which must be representative and large [4].
Machine learning models are classified into generative and discriminative models [2], and both are widely applied to speech understanding. HMMs (Hidden Markov Models) are an example of generative models; they were used by [5] for speech understanding in Spanish on the DIHANA corpus, whose task deals with requests for information about railway services. That work applies HMMs in the most realistic situation, where dialogues are not segmented into utterances, and reports an F-measure of 92%. This good result is attributed to the large size of the corpus used for training the models.
In the literature, several studies show that discriminative models perform better than generative ones [6] [2] [7]. CRF (Conditional Random Fields) models, as an example of discriminative models, have been widely exploited in many natural language processing tasks such as semantic annotation and syntactic analysis [8] [9] [10]. CRF models stand out in particular, with performance exceeding that of other models [11] [2] [12]. It is important to note that CRF models can integrate correlated features, which makes it possible to take into account the local context of an utterance. All of these observations encouraged us to exploit these models for speech understanding of the Arabic dialect.
Several works have shown the robustness of CRF models for information requests in French using the MEDIA corpus [12]. The MEDIA corpus is manually annotated with semantic concepts of tourist information. Turns in this corpus are segmented into utterances, which simplifies speech understanding. Raymond et al. [12] used CRF models together with domain knowledge expressed as a set of handwritten rules. This reduced the concept error rate (CER) from 11.2% to 10.9% and raised the F-measure of the system to 92%. These results illustrate the benefit of segmenting turns into utterances and of a large training corpus.
In addition, CRF models offer two major advantages. On the one hand, they allow segmentation and conceptual annotation that take into account the local context of the utterance. On the other hand, they guarantee convergence towards the most probable concepts by taking into account all the previous and following observations in the utterance [13]. Indeed, these models can use all the observations of a sequence to predict a conceptual label, which is an interesting distinction compared to HMMs.
In this paper, we evaluate the performance of CRF models. We designed several models that differ in the level of integration of local dependencies within the same turn. We also apply several levels of processing to the corpus. In addition, whereas almost all learning-based understanding methods model speech turns segmented into utterances, we use the turn as a whole unit (not segmented into utterances) to test the performance of CRF models, which reflects the natural way of conversation. This paper is organized as follows. In Section II, we present CRF models for speech understanding. Section III presents the Spoken Dialogue Corpus for Tunisian Dialect. Section IV deals with evaluation metrics. In Section V, we present experiments and discussion. A conclusion is drawn in Section VI.

II. CRF MODELS FOR SPEECH ANNOTATION
Conditional Random Fields (CRFs), introduced by Lafferty et al., are discriminative models that define the conditional probability of a label sequence given an observation sequence [14]. The conditional probability of a label sequence $Y = y_1 \dots y_n$ given an observation sequence $X = x_1 \dots x_n$ is defined as follows:
$$P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j \, t_j(y_{i-1}, y_i, X, i) + \sum_{i=1}^{n} \sum_{k} \mu_k \, s_k(y_i, X, i) \Big) \qquad (1)$$
where:
• $Z(X)$ is a factor that normalizes the probabilities.
• $t_j(y_{i-1}, y_i, X, i)$ represents the transition characteristic function, for an observation sequence, between the labels at positions $i-1$ and $i$.
• $s_k(y_i, X, i)$ represents the state characteristic function of the label at position $i$ for the observation sequence.
• $\lambda_j$ and $\mu_k$ are real values that assign a weight to each characteristic function to specify its importance. These values therefore characterize the discriminating power of the model. They are estimated during the training phase so as to maximize the likelihood of a set of already annotated data.
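As an illustration of Eq. (1), the following minimal Python sketch computes the conditional probability of a labeling on a toy example. It assumes the simplest possible characteristic functions: a state function that fires on a (word, label) pair and a transition function that fires on a pair of adjacent labels. The dictionaries `state_w` and `trans_w`, the function names, and the toy tokens are hypothetical and are not taken from this work; the normalization factor is computed by brute-force enumeration, which is only feasible for very short sequences.

```python
import itertools
import math


def sequence_score(labels, words, trans_w, state_w):
    """Unnormalized log-score: weighted sum of transition and state features."""
    score = 0.0
    for i, (word, label) in enumerate(zip(words, labels)):
        # State function s_k(y_i, X, i): here it fires on a (word, label) pair.
        score += state_w.get((word, label), 0.0)
        if i > 0:
            # Transition function t_j(y_{i-1}, y_i, X, i): fires on a label pair.
            score += trans_w.get((labels[i - 1], label), 0.0)
    return score


def conditional_probability(labels, words, label_set, trans_w, state_w):
    """P(Y | X) = exp(score(Y, X)) / Z(X), with Z(X) summed over all labelings."""
    z = sum(
        math.exp(sequence_score(candidate, words, trans_w, state_w))
        for candidate in itertools.product(label_set, repeat=len(words))
    )
    return math.exp(sequence_score(labels, words, trans_w, state_w)) / z


if __name__ == "__main__":
    # Hypothetical weights and tokens, for illustration only.
    label_set = ["O", "B-City"]
    words = ["to", "tunis"]
    trans_w = {("O", "B-City"): 0.5}
    state_w = {("to", "O"): 1.0, ("tunis", "B-City"): 2.0}
    print(conditional_probability(("O", "B-City"), words, label_set, trans_w, state_w))
```

Because the score is divided by the sum over every possible labeling, the probabilities of all labelings of a given turn sum to one, which is the role of the normalization factor.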
Referring to the model defined in Eq. (1), the most likely sequence of concepts for labeling a sequence of input words is:
$$\hat{Y} = \arg\max_{Y} P(Y \mid X)$$
A CRF is represented by an undirected graphical model (see Fig. 1) that defines a probability distribution over a label sequence Y given an observation sequence X [14]. In this graph, the nodes form two random fields, X and Y, describing respectively the set of observations and the set of annotations. An edge between two variables expresses that one depends on the other. Accordingly, each node $y_t$ depends on the preceding node $y_{t-1}$ and the following node $y_{t+1}$, and implicitly on the variable x. Each variable $y_t$ must therefore be linked to a variable x to guarantee the dependence between the labels on the one hand and the observation sequence on the other.
Training CRF models consists in determining, from the training corpus, the weight vector $\{\lambda_1, \lambda_2, \dots, \mu_1, \mu_2, \dots\}$ of the characteristic functions t and s. After the training step, applying CRF models to new data consists in finding the most probable sequence of states given a new sequence of observations that was not encountered in the training corpus. We perform this with the Viterbi algorithm, as is the case with HMMs.
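The decoding step can be sketched as follows. This is a minimal Python illustration of Viterbi decoding for a linear-chain model, assuming that the characteristic functions reduce to (word, label) state weights and (previous label, label) transition weights. The dictionaries `state_w` and `trans_w` and the toy tokens are hypothetical; real toolkits use richer feature sets.

```python
def viterbi(words, label_set, trans_w, state_w):
    """Return the highest-scoring label sequence for `words`."""
    if not words:
        return []
    # best[i][y]: best score of any labeling of words[:i + 1] that ends in label y.
    best = [{y: state_w.get((words[0], y), 0.0) for y in label_set}]
    back = []
    for i in range(1, len(words)):
        scores, pointers = {}, {}
        for y in label_set:
            # Pick the previous label that maximizes score + transition weight.
            prev = max(label_set, key=lambda p: best[-1][p] + trans_w.get((p, y), 0.0))
            scores[y] = (
                best[-1][prev]
                + trans_w.get((prev, y), 0.0)
                + state_w.get((words[i], y), 0.0)
            )
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(label_set, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    # Hypothetical weights and tokens, for illustration only.
    trans_w = {("O", "B-City"): 0.5}
    state_w = {("to", "O"): 1.0, ("tunis", "B-City"): 2.0}
    print(viterbi(["to", "tunis"], ["O", "B-City"], trans_w, state_w))
```

The dynamic program keeps, for each position and label, only the best-scoring prefix, so its cost grows linearly with the length of the turn rather than exponentially with the number of possible labelings.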

III. SPOKEN DIALOGUE CORPUS FOR TUNISIAN DIALECT
A. Corpus Description
The TuDiCoI corpus, used in this work, consists of spontaneous oral dialogues of railway information requests in the Tunisian Dialect (TD). The purpose of these requests is to consult train timetables, the type of train, the destination of the train, the route taken by the train, the price and types of tickets, and the reservation of tickets. Note that several requests can be combined during a dialogue between tellers and customers [15].
The transcribed part of the TuDiCoI corpus consists of 1825 dialogues representing 12182 turns. These turns consist of 6533 customer turns and 5649 agent turns. We list the features of the TuDiCoI corpus in Table I. Note that, on average, each dialogue consists of three turns for the customer and three turns for the agent. Additionally, each customer turn contains 3.3 words on average. This average is low because of the agglutinative nature of words in the dialect and the frequent use of keywords to request information. Table II shows the characteristics of different dialogue corpora used in other projects in different languages. #D designates the number of dialogues, #T the number of speaking turns, #V the size of the vocabulary, and #W the number of words. A designates the type of corpus (H-H for Human-Human and H-M for Human-Machine). Finally, L indicates the language used (Eng. for English, Esp. for Spanish, Ar. for Arabic, TD for the Tunisian dialect and Fr. for French). These corpora vary in size from a few tens to thousands of dialogues.

B. Annotation Schema
In this work, we propose an annotation scheme for the manual concept annotation step. Table III summarizes the annotation scheme defined for dialogue acts and semantic concepts [16]. Because of the complexity of the annotation task and the effort required for manual verification, we annotated only 1476 dialogues, which represent 5047 customer turns. The characteristics of the annotated corpus are summarized in Table IV.
To define the parts of the TuDiCoI corpus used for our evaluations, we divided the annotated corpus into two parts. The first part is used for training and represents about 80% of the total size, while the second part, the remaining 20%, is used for testing. Table V gives the characteristics of these two parts in terms of number of dialogues, speaking turns and words.
Since we are interested in literal understanding, which does not depend on the dialogical context, we classified all the speaking turns of the test part into three types, following the recommendation of the ARPA community [17], namely sets A, D and X. Table VI presents the characteristics of these sets. This classification gives an overview of the types of turns contained in the test part. The first set (Set A) corresponds to the context-independent customer turns, which have no connection with the history of the dialogue. The second set (Set D) corresponds to the context-dependent turns, which are related to the dialogical context. The third set (Set X) corresponds to out-of-context turns, that is, marginal turns that are not related to the domain. Table VII shows an example for each set.

Set | Transcription / Transliteration / Translation
A | دخلت عالأنترنات وشفت / dxalt ςalÂantirnaAt wušuft / I am connected to the Internet and I saw
We use these different sets of the TuDiCoI corpus to evaluate the CRF-based speech understanding method for the TD.
Almost all speech-understanding methods are interested in modeling speech turns segmented into utterances. The alternative we have proposed is to use the turn as a whole unit for training and testing the performance of CRF models [16]. This represents the natural way of conversation.

C. Preprocessing
To evaluate CRF-based speech understanding, we prepared three versions of the TuDiCoI corpus:
• The first version (version I) is a raw version that is not preprocessed, which increases the complexity of the structure of the dialect turns. In this version, the words do not follow the spelling transcription guide, so a word can be written with different spellings. This version also has morphological problems, such as the agglutination of a particle with the word that follows it. The evaluations carried out on this version of the corpus are used to test the performance of the CRF models on data not processed in advance.
• The second version of the annotated corpus (version II) is partially preprocessed. This version has undergone spelling correction, morphological analysis of verbs and nouns, and synonymy processing.
• The third version of the annotated corpus (version III) improves on the second version by processing the agglutinations of city names, which makes it possible to dissociate the particle, if it exists, from the name attached to it.

D. Tabular Corpus
After the manual concept labeling step, we converted each version of the annotated corpus into a standard representation adopted by CRF models. This representation uses a set of labels called the BIO notation (Begin Inside Outside) [18], in which:
• The label "B-???" indicates the beginning of a conceptual segment.
• The label "I-???" denotes any meaningful word that is part of the conceptual segment.
• The label "O" is assigned to words that do not refer to any conceptual label.
The advantage of the BIO notation is that it can segment a set of words into several conceptual segments and display them one after the other [19] [7]. An example from the TuDiCoI corpus in BIO notation is shown in Fig. 2.
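The conversion to BIO notation can be sketched as follows. This minimal Python example assumes that each annotated turn is available as a list of (words, concept) segments, with `None` marking words outside any concept, and that the tabular output uses one tab-separated "word label" row per line with a blank line between turns, a layout commonly expected by CRF toolkits. The function names, the concept name `Destination`, and the English placeholder words are hypothetical.

```python
def to_bio(segments):
    """Convert (words, concept) segments into (word, BIO-label) rows.

    A concept of None marks words that belong to no conceptual segment.
    """
    rows = []
    for words, concept in segments:
        for position, word in enumerate(words):
            if concept is None:
                label = "O"
            elif position == 0:
                label = f"B-{concept}"  # first word of the segment
            else:
                label = f"I-{concept}"  # any later word of the same segment
            rows.append((word, label))
    return rows


def to_tabular(turns):
    """Render turns as tab-separated rows, one blank line between turns."""
    blocks = [
        "\n".join(f"{word}\t{label}" for word, label in to_bio(turn))
        for turn in turns
    ]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Hypothetical annotated turn: one outside segment and one concept segment.
    turn = [(["I", "want"], None), (["to", "Tunis"], "Destination")]
    print(to_tabular([turn]))
```

Keeping the whole turn as one block matches the choice made in this work of not segmenting turns into utterances: the blank line separates turns, not utterances.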
The tabular corpus is used for training and testing CRF models. Conceptual labeling using CRF models consists in finding the best sequence of states, given a sequence of input observations. This problem is solved using the Viterbi algorithm due to the linear topology of CRF models [20]. This algorithm makes it possible to give the list of n best results.

IV. EVALUATION METRICS
The evaluation measures the conceptual correspondence, that is, the pairing between the words of a turn and a set of semantic concepts. For this, we use the Concept Error Rate (CER), which compares the list of reference semantic concepts with the list of concepts emitted by the system.
We also use other measures to evaluate conceptual labelling: Precision, Recall, and F-measure. Precision is the number of correct concepts found relative to the number of concepts found by the system.
Recall is the number of correct concepts found by the system relative to the number of reference concepts.
The F-measure combines Precision (P) and Recall (R) as their harmonic mean:
$$F = \frac{2 \cdot P \cdot R}{P + R}$$
In our experiments, we used the free tool CRF++ for the training and testing steps. CRF++ performs training with a quasi-Newton method and decoding with the Viterbi algorithm.
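These measures can be sketched in Python as follows. The example assumes the usual definition of the CER, namely the number of substituted, deleted and inserted concepts in a minimum-cost alignment divided by the number of reference concepts, and takes the number of correct concepts to be the reference concepts that are neither substituted nor deleted in that alignment. The function names and the concept labels in the usage example are hypothetical.

```python
def edit_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-cost alignment."""
    n, m = len(reference), len(hypothesis)
    # table[i][j] = (errors, subs, dels, ins) for reference[:i] vs hypothesis[:j].
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        table[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            errors, subs, dels, ins = table[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (errors, subs, dels, ins)
            else:
                diagonal = (errors + 1, subs + 1, dels, ins)
            errors, subs, dels, ins = table[i - 1][j]
            deletion = (errors + 1, subs, dels + 1, ins)
            errors, subs, dels, ins = table[i][j - 1]
            insertion = (errors + 1, subs, dels, ins + 1)
            table[i][j] = min(diagonal, deletion, insertion)
    return table[n][m][1:]


def concept_metrics(reference, hypothesis):
    """Compute Precision, Recall, F-measure and CER for one concept sequence."""
    subs, dels, ins = edit_counts(reference, hypothesis)
    correct = len(reference) - subs - dels
    precision = correct / len(hypothesis) if hypothesis else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    cer = (subs + dels + ins) / len(reference) if reference else 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure, "cer": cer}


if __name__ == "__main__":
    reference = ["Destination", "Departure_Time", "Ticket_Type", "Price"]
    hypothesis = ["Destination", "Train_Type", "Ticket_Type"]
    print(concept_metrics(reference, hypothesis))
```

Note that, under this definition, the CER can exceed 100% when the system inserts many spurious concepts, whereas Precision, Recall and F-measure always stay between 0 and 1.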

V. RESULTS AND DISCUSSION
To evaluate the performance of CRF models, we used several models that differ in the level of integration of local dependencies within the same turn. These dependencies are expressed as unigram (one-word) or bigram (two-word) features around the word to label.
After an initial test phase, we limited the number of models tested to four.
• The first (Model 0) is a model that does not take into account any dependence between the words of the same turn. In this case, CRF models play the role of a simple semantic tagger.
• The second (Model 1) is a model that uses a two-word window taking into account the previous word and the next word in the same turn.
• The third (Model 2) is a model that uses a window involving two words before and two words after the current word.
• The fourth (Model 3) improves the third model by adding two local dependencies: two bigrams that combine the current word with the preceding word and with the next word, respectively.
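The four models above could be expressed as feature templates in the style used by CRF++, where `%x[row,col]` refers to the token at a relative row offset and a column index, and `/` combines two tokens into one feature. The following Python sketch generates such template lines. It is a hypothetical reconstruction from the descriptions above, not the templates actually used in the experiments; in particular, it assumes that the word is in column 0 and that no label-bigram template is added.

```python
def feature_template(model):
    """Return hypothetical CRF++-style unigram template lines for Model 0 to 3."""
    # Relative word offsets covered by each model's window (assumption).
    windows = {
        0: [0],                 # current word only, no context
        1: [-1, 0, 1],          # previous and next word
        2: [-2, -1, 0, 1, 2],   # two words before and after
        3: [-2, -1, 0, 1, 2],   # same window as Model 2, plus bigrams below
    }
    if model not in windows:
        raise ValueError(f"unknown model: {model}")
    lines = [
        f"U{index:02d}:%x[{offset},0]"
        for index, offset in enumerate(windows[model])
    ]
    if model == 3:
        # Two word bigrams: previous/current and current/next.
        lines.append(f"U{len(lines):02d}:%x[-1,0]/%x[0,0]")
        lines.append(f"U{len(lines):02d}:%x[0,0]/%x[1,0]")
    return lines


if __name__ == "__main__":
    for model in range(4):
        print(f"# Model {model}")
        print("\n".join(feature_template(model)))
```

Generating the templates from a single table of window offsets makes the difference between the models explicit: Models 0 to 2 only widen the window, while Model 3 adds conjunctions of adjacent words.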
Then, we use these models for learning the CRF parameters on the different versions of the annotated corpus (version I, II and III). Table VIII, Table IX and Table X give the results of the evaluation of concept labeling in terms of Precision, Recall, F-measure and Concept Error Rate (CER). Based on these experiments, we notice that the CER decreases as the quality of the data used for training improves. This clearly shows that the preprocessing carried out improves speech understanding. We also noticed that the models that take into account the dependence between the different words of the same turn (Model 2 and Model 3) improve speech understanding, as shown by the decrease in the CER and the increase in the F-measure.
These results also confirm the robustness of CRF models on unprocessed data. Table VIII shows that the F-measure is 77.75% for Model 0, which does not take into account underlying dependencies, and reaches 84.40% for Model 3, which introduces the bigram features.
The examination of the errors made by the CRF models led us to carry out further experiments on the same test corpus, this time taking into account the dependence of each turn on the dialogical context through sets A, D and X (Table VI). Based on this classification, we tested the CRF models on these sets for the different preprocessed versions of the corpus. We thus obtained three versions of each set A, D and X, ranging from the raw version to the fully processed version. The conceptual labeling of the different sets relies on the same CRF models (Model 0, Model 1, Model 2 and Model 3) presented previously. Table XI summarizes the results obtained in terms of CER, Precision, Recall and F-measure. From these results, we notice that the CER obtained on the type A speech turns (context-independent set) is the lowest, compared with the results of sets D and X. We can therefore conclude that a large part of the errors is due to the presence of out-of-context turns (Set X) and context-dependent turns (Set D). These results are expected: since this work addresses literal understanding, which does not depend on the dialogical context, the turns that depend on the context increase the error.
Other sources also contribute to the increase in CER, mainly the appearance of terms that were not seen in the training corpus. This is due to certain phenomena linked to spontaneous speech, such as hesitation, which introduce out-of-vocabulary words. In conclusion, CRF models perform well even on unprocessed turns. On the other hand, conceptual labeling based on CRF models fails when dealing with new terms that are not in the training corpus. These terms can be out-of-vocabulary words or domain words. The latter case is mainly due to the reduced size of the corpus used to train the CRF models.

VI. CONCLUSION
In this work, we evaluated in depth the performance of CRF models in the context of speech understanding in dialogue systems. We tested CRFs with different models and on data at different levels of processing. We showed that these models are robust to noisy data and achieve good results for conceptual labeling (F-measure of 86.52%). We also found that CRF models are able to detect task-specific compound words and label them correctly. These findings confirm the performance of these models even for under-resourced languages. As future work, we plan to compare these results, obtained with CRF models, against deep learning models such as CNNs (Convolutional Neural Networks) on the same task.