Voice Recognition Method with Mouth Movement Videos Based on Forward and Backward Optical Flow

A lip reading method with mouth movement videos based on backward optical flow is proposed. Experiments with 10 mouth movement videos show that the proposed lip reading method is superior to the conventional optical flow based method.


INTRODUCTION
Although voice recognition is now widely available worldwide, its performance is not good enough for normal conversation. For instance, the recognition performance of the typical Hidden Markov Model (HMM) based method [1] (referred to as the conventional voice recognition hereafter) with formant features is less than 50% when the signal to noise ratio is below 5 dB. In other words, voice recognition performance is strongly affected by noise. In normal conversation, people use not only voice but also mouth movement for recognition. Analysis of mouth movement videos can therefore improve voice recognition performance considerably. The proposed lip reading method aims to improve voice recognition performance.
Usually, a Hidden Markov Model based method or a neural network based method is used for voice recognition, and optical flow [2]-[9] based analysis is used for the mouth movement videos. The forward direction (from the past to the future) of optical flow is usually used for mouth movement analysis. Voice recognition performance can be improved by adding the backward direction (from the future to the past) of optical flow, which corrects voice recognition errors by confirming the recognized results. In this process, two voice elements are treated as a unit for the proposed backward optical flow, whereas the conventional forward optical flow recognizes voice element by voice element. Treating two voice elements as a unit is an easier and more efficient way to confirm the recognized results, because the transition between one voice element and the next is important for voice recognition. This is the basic idea of the proposed lip reading method.
Experiments are conducted with 10 mouth movement videos acquired from different people. Voice recognition performance is then evaluated and compared to that of the conventional method based on forward optical flow. The experimental results show that the proposed method with backward optical flow is superior to the conventional method.
The following section describes the proposed method, followed by some experiments. Then the conclusion is described together with some discussion.

A. Overview of the Proposed Voice Recognition
The process flow of the proposed voice recognition method is shown in Figure 1. Finally, the recognized results from the moving picture based method and the voice signal based method are compared and checked for consistency, and then the final recognition results are deduced.

B. Optical Flow
Optical flow is defined as a vector representation of object movement in a visual scene. Optical flow can be extracted as vectors from moving pictures, that is, time series of digital images. The conventional methods for extracting optical flow are the block matching method and the gradient method. The block matching method is usually referred to as "block-based methods", which minimize the sum of squared differences or the sum of absolute differences, or maximize the normalized cross-correlation. The gradient method is usually referred to as "differential methods", which are based on partial derivatives of the image signal and/or the sought flow field, and on higher-order partial derivatives. Other than these, there are "phase correlation methods", which use the inverse of the normalized cross-power spectrum between two adjacent images, and "discrete optimization methods", in which the search space is quantized and image matching is addressed through label assignment at every pixel.
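As an illustration of the block-based approach described above, the following is a minimal sketch in Python that finds the displacement of one block by minimizing the sum of squared differences. The function names (`ssd`, `block_flow`), the block size, the search range, and the toy images are assumptions made for this example and are not taken from the paper. Swapping the two images gives the flow in the backward (future to past) direction.

```python
import random


def ssd(a, b, ay, ax, by, bx, size):
    """Sum of squared differences between two size x size blocks."""
    return sum(
        (a[ay + i][ax + j] - b[by + i][bx + j]) ** 2
        for i in range(size)
        for j in range(size)
    )


def block_flow(src, dst, y, x, size=4, search=2):
    """Return the (dy, dx) that best maps the block at (y, x) in src into dst."""
    h, w = len(dst), len(dst[0])
    best_cost, best_vec = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + size > h or xx + size > w:
                continue  # candidate block would leave the image
            cost = ssd(src, dst, y, x, yy, xx, size)
            if best_cost is None or cost < best_cost:
                best_cost, best_vec = cost, (dy, dx)
    return best_vec


# Toy data (hypothetical): a random 12x12 image and a copy of it
# shifted down by 1 row and right by 2 columns.
rng = random.Random(0)
n = 12
prev = [[rng.random() for _ in range(n)] for _ in range(n)]
curr = [[prev[(i - 1) % n][(j - 2) % n] for j in range(n)] for i in range(n)]

forward = block_flow(prev, curr, 4, 4)   # past -> future, gives (1, 2)
backward = block_flow(curr, prev, 5, 6)  # future -> past, gives (-1, -2)
```

The exhaustive search is used here only because it is the simplest correct instance of block matching; it is not claimed to be the extraction method used in the experiments.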

C. Input Data for Dynamic Programming: DP Matching
and d[k] denote distances, and w[k] denotes the weighting coefficient, when the coded edge (K denotes the total number of edges) is represented as shown in Figure 4. Even if some of the coded edges are missing, the similarity between two coded edges can still be calculated, which results in edge image matching between the query image and the current image.

E. Details of Dynamic Programming: DP Matching
The initial condition is assumed to be [...]; then [...] is the minimum distance for [...], where x_i is the input pattern data of voice elements and x_i(l) is the reference voice elements. Then the suffix of the input pattern data is incremented as follows, [...]. There are three possible solutions which minimize the distance between the input pattern data and the reference pattern data.
Meanwhile, [...] is defined as the inner product (dot product) of the [...]. Then the distance between x_i and x_i(l) is as follows,

Where
To find the minimum distance: if d_s is minimum when l = l_0, then the input pattern data is classified to l_0. If the input pattern data is distorted for some reason, then d_s can no longer be calculated with [...]. The reason is that some of the voice elements will be missing, or some voice elements will be inserted accidentally, as shown in Figure 5. Therefore, the distorted input pattern data (modified pattern) has to be represented as follows, [...]. The reference pattern in Figure 5 is defined as the reference pattern for voice elements. In this case, the following function represents the relation between d_s and [...].
where [...]. This is the k-th relation between [...]. Then the distance is rewritten with the following equation, [...], where w_j denotes the k-th weighting coefficient, which allows adjustment, or normalization, of the distance d_s to the range from -1 to 1. Figure 6 shows an enlarged portion of Figure 5. Weighting coefficients can be determined as shown in Figure 6.
Fig. 6. Enlarged portion of Figure 5
There are some conditions for the distance definition: the start and end of the input pattern and the reference pattern correspond to each other; the voice element order has to be the same for both the input and reference patterns; and the corresponding reference pattern exists near the input pattern.
Then, as shown in Figure 6, F is calculated as follows, [...]. Then the distance between the input pattern data of voice elements and the reference voice element pattern is represented as follows, [...], where I and I' denote the numbers of reference voice element patterns, respectively. Thus the input voice element pattern is classified to the reference pattern; namely, if d_s is minimum when l = l_0, then the input pattern data is classified to l_0.
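The general idea of DP matching described in this section can be sketched as follows in Python. This is a generic formulation under stated assumptions, not the exact distance used in this paper: the local distance (0 for equal symbols, 1 otherwise), the normalization by the summed lengths, the function names, and the symbol sequences are all hypothetical. The sketch keeps the three properties stated above: start and end points correspond, the element order is preserved, and each step chooses among three possible predecessors.

```python
def dp_distance(inp, ref, local=lambda a, b: 0.0 if a == b else 1.0):
    """Accumulated DP-matching distance between two non-empty sequences."""
    n, m = len(inp), len(ref)
    if n == 0 or m == 0:
        raise ValueError("both patterns must be non-empty")
    inf = float("inf")
    g = [[inf] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0  # start points of both patterns correspond
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(
                g[i - 1][j],      # input advances alone (extra input element)
                g[i][j - 1],      # reference advances alone (missing input element)
                g[i - 1][j - 1],  # both advance (one-to-one correspondence)
            )
            g[i][j] = local(inp[i - 1], ref[j - 1]) + step
    return g[n][m] / (n + m)  # end points correspond; normalize by length


def classify(inp, references):
    """Return the label l0 whose reference pattern gives the minimum distance."""
    return min(references, key=lambda label: dp_distance(inp, references[label]))


# Hypothetical symbol sequences; the real symbols are defined by the paper's figures.
references = {"a": list("UUDD"), "ra": list("LRUD")}
observed = list("UDD")  # one element of the "a" pattern is missing
label = classify(observed, references)  # "a"
```

Because the reference may advance alone, the observed pattern still matches "a" with zero distance even though one element is missing, which is the behavior the text attributes to DP matching.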

F. Voice Elements
This paper focuses on Japanese language recognition. In particular, the following 40 Japanese voice sounds are considered.

III. EXPERIMENTS
First, the reference patterns of the aforementioned 40 voice sounds are prepared with four different speakers. Both sounds and moving pictures are prepared as the reference patterns.
For the optical flow based voice recognition, the motion vectors of the aforementioned four feature points, the top, bottom, left end, and right end of the mouth, which are extracted from the moving pictures, are used. The features are represented as symbols. A small example of a portion of the time series of symbolized voice elements is shown in Figure 7.
In accordance with the distance, the first (L1), the second (L2), and the third (L3) candidates are determined. From the calculated distance, the likelihood, or probability, is also calculated for each candidate. The probability is calculated voice element by voice element and is evaluated for both vowels and consonants. The proposed method is based on forward and backward optical flow, as explained in the second section. The probability evaluations have been done for the proposed method and compared to the forward optical flow based method as well as the conventional voice recognition method.
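The candidate selection described above can be sketched as follows in Python. The paper does not state how the likelihood is derived from the distance, so the exponential weighting used here is purely an assumption for illustration, as are the function name and the distance values.

```python
import math


def rank_candidates(distances, top=3):
    """Return the `top` labels with the smallest distances and a pseudo-probability each.

    The mapping exp(-d) / sum(exp(-d)) is an assumed, illustrative choice;
    it only guarantees that a smaller distance gives a larger probability.
    """
    weights = {label: math.exp(-d) for label, d in distances.items()}
    total = sum(weights.values())
    ranked = sorted(distances, key=distances.get)[:top]
    return [(label, weights[label] / total) for label in ranked]


# Hypothetical distances between one input pattern and four reference patterns.
distances = {"ka": 0.9, "a": 0.1, "sa": 1.5, "ra": 0.4}
candidates = rank_candidates(distances)
l1, l2, l3 = (label for label, _ in candidates)  # first, second, third candidates
```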
The probability, or likelihood, corresponds to the percent correct classification (PCC). If PCC is evaluated with the first candidate only, then PCC for the conventional voice recognition method is not good, below 43% for vowels and 14.3% for consonant + vowel, while that for the proposed method with forward optical flow is 71.4% for vowels and 57.1% for consonant + vowel. Therefore, it is found that PCC is improved remarkably, by approximately 30%, by taking moving picture analysis with the forward optical flow into consideration. Furthermore, the proposed method with backward optical flow, which confirms and corrects the recognized results obtained with forward optical flow only, is superior to the proposed method with forward optical flow only. This implies that PCC is improved remarkably, by about 20%, by confirming and correcting the recognized results obtained with forward optical flow only. PCC of vowels is always better than that of consonant + vowel. In particular, for the conventional voice recognition method, there is a difference of around 30% between the PCC of vowels and the PCC of consonant + vowel.
Fig. 7. Symbolized voice elements for "a" and "ra"
If PCC is evaluated with the first to the third candidates, both the proposed method with forward optical flow only and that with forward and backward optical flow show 100% PCC. This implies that the effect of considering not only voice signals but also moving pictures on the PCC of voice recognition is significant. As a result, it is found that voice recognition performance can be improved by adding moving picture analysis to the voice signal analysis. The same holds for human to human conversation: by looking at the speaker's mouth movement, voice recognition is helped and the recognized results are reconfirmed at the same time.
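The difference between evaluating PCC with the first candidate only and with the first to the third candidates can be made concrete with a short Python sketch. The function name and the toy results below are made up for illustration and are not the experimental data of this paper.

```python
def pcc(results, k):
    """Percent of utterances whose true label is among the first k candidates.

    `results` is a non-empty list of (true_label, ranked_candidates) pairs.
    """
    hits = sum(1 for truth, cands in results if truth in cands[:k])
    return 100.0 * hits / len(results)


# Toy results (hypothetical): true label and its L1, L2, L3 candidates.
results = [
    ("a", ["a", "o", "e"]),
    ("ra", ["ka", "ra", "sa"]),
    ("ka", ["ta", "sa", "ka"]),
    ("sa", ["sa", "ta", "ka"]),
]
pcc_first = pcc(results, 1)  # 50.0: only "a" and "sa" are correct as L1
pcc_top3 = pcc(results, 3)   # 100.0: every true label is within L1 to L3
```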

Fig. 1. Process flow of the proposed voice recognition
Time series of moving pictures and voice signals are acquired first. Using the conventional HMM based voice recognition method, the time series of voices is recognized. This is referred to as voice based recognition hereafter. On the other hand, lip reading is performed based on forward optical flow with the time series of moving pictures of mouth movement, which are acquired at the same time as the voice signals. This is done voice element by voice element, as usual. Meanwhile, backward optical flow based on two voice elements is applied to the time series of moving pictures of mouth movement. Then the result from the voice element based forward optical flow is corrected by using the results of the two element based backward optical flow. Through these voice element based optical flows, Dynamic Programming (DP) matching based recognition is performed, because the extracted voice elements may have missing elements. Furthermore, recognition needs some insertions of voice elements. DP matching allows insertion and also recognition without some

Figure 2 shows an example of one frame of the moving picture of mouth movements. A time series of images is acquired, and voice elements can be extracted from it. From each piece of the time series of images, four feature points, the top, bottom, right end, and left end of the mouth, are extracted as input data for DP matching.
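How the four feature points yield forward and backward motion vectors can be sketched as follows in Python. The data layout (one dictionary of (x, y) coordinates per frame), the function name, and the coordinates are assumptions for illustration only. The backward vectors are obtained by processing the same frames in reverse time order.

```python
def displacements(track, backward=False):
    """Displacement vectors of the four mouth feature points between adjacent frames.

    `track` is a list of frames, each a dict mapping a feature point name
    to its (x, y) position. With backward=True the frames are traversed
    from the future to the past.
    """
    frames = track[::-1] if backward else track
    flow = []
    for a, b in zip(frames, frames[1:]):
        flow.append(
            {name: (b[name][0] - a[name][0], b[name][1] - a[name][1]) for name in a}
        )
    return flow


# Hypothetical coordinates of a mouth that opens between two frames
# (image y axis pointing down).
track = [
    {"top": (10, 8), "bottom": (10, 12), "left": (5, 10), "right": (15, 10)},
    {"top": (10, 6), "bottom": (10, 15), "left": (6, 10), "right": (14, 10)},
]
forward = displacements(track)                    # past -> future
backward = displacements(track, backward=True)    # future -> past
```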

Fig. 2. Example of a piece of the time series of images of mouth movements

Fig. 4. Coded edge information
The subset summation of s[c[m]] in the numerator of equation (1) is expressed with equation (4) when k = m,

Fig. 5. Relation between reference pattern and input pattern data (Modified Pattern)

TABLE I. PROBABILITY EVALUATION FOR THE FIRST TO THIRD CANDIDATES FOR THE PROPOSED METHOD AND THE METHOD WITH FORWARD OPTICAL FLOW ONLY AS WELL AS THE CONVENTIONAL VOICE