Segment, Track, Extract, Recognize and Convert Sign

This paper summarizes various algorithms used to design a sign language recognition system. Sign language is the language used by deaf people to communicate among themselves and with hearing people. We designed a real-time sign language recognition system that can recognize gestures of sign language from videos under complex backgrounds. Segmentation and tracking of the non-rigid hands and head of the signer in sign language videos is achieved using active contour models. Active contour energy minimization is done using the signer's hand and head skin colour, texture, boundary and shape information. Classification of signs is done by an artificial neural network using the error back propagation algorithm. Each sign in the video is converted into a voice and text command. The system has been implemented successfully for 351 signs of Indian Sign Language under different possible video environments. The recognition rates are calculated for different video environments.


I. INTRODUCTION
Sign language is a vision-based language of hearing-impaired people, which involves the use of the hands, the face and the body. A sign language recognition system works on five essential parameters: hand shapes, hand movement, hand and head orientation, hand and head location, and facial expressions. Of these five parameters, the most fundamental is hand shape. The second most important parameters are hand movements, hand and head orientation and their location in the video frame. These parameters can be incorporated into a sign language recognition system to improve its performance by segmenting and tracking the hands and head of the signer in the video. Segmented hand shapes and their locations are used to design a feature vector. The created feature vector is used to train the neural network. In the last decade extensive research has been done to understand hand gestures from the movement of human hands [1,2]. Compared to tracking rigid objects [3], tracking non-rigid objects such as bare human hands is a very complex and challenging task.
Going back to the early days of sign language, Stokoe et al. [4] showed that signs are made of basic articulatory units. Initially, they were referred to as cheremes. At present, they are referred to as phonemes, in resemblance to words in spoken language.
The major difficulty in sign language recognition, compared to speech recognition, is to recognize [5] different communication attributes of a signer, such as hand and head movement, facial expressions and body poses, simultaneously. A good recognition system has to consider all these attributes.
The second major problem faced by sign language recognition system designers is tracking the signer in the clutter of other information available in the video. This is addressed by many researchers as the signing space [6]. A sign language space can be created with entities such as humans or objects stored in it around a 3D body-centered space of the signer. The entities are executed at a certain location and later referenced by pointing to the space. Another major challenge faced by researchers is to define a model for the spatial information, containing the entities, created during the sign language dialogue. Additional difficulties arise from the background in which the signer is located. Most of the methods developed so far use simple backgrounds [7,8,9,10] in controlled set-ups, special hardware such as data gloves [11,12,13], restricted sets of actions and a restricted number of signers, which results in different problems in sign language feature extraction.
Our research is directed towards recognizing gestures of Indian Sign Language [33] under real-time conditions such as varied lighting and different backgrounds, and towards making the system independent of the signer. For segmentation and tracking, we have employed active contour models, which are capable of segmenting and tracking non-rigid hand shapes and head movements. Features are extracted from the segments of hand and head shapes, including tracking information in the form of hand locations from each video frame. This feature vector is used to train the artificial neural network, whose outputs determine the voice command relevant to the trained sign.
Active contours, popularly known in the research community as 'snakes', are an active research area with applications to image and video segmentation, predominantly to locate object boundaries. They are also used for video object tracking applications. Active contours come under the category of model-based segmentation and tracking methods, which have given good results in the last few years [14,15,16]. Active contours were first introduced by Terzopoulos [17,18]. The basic idea behind active contours is to start with a curve anywhere in the image and move the curve in such a way that it sticks to the boundaries of the objects in the image. Thus it separates the background of the image from its objects. The original "snakes" algorithm was prone to topological disturbances and exceedingly sensitive to initial conditions. However, with the invention of level sets [19], topological changes in the objects are handled automatically.
For tracking non-rigid moving objects, the most popular model-based technique is active contours [20]. Active contours can bend themselves based on the characteristics of the objects in the image frame. Previously created active contour models were only capable of tracking an object in a video sequence with a static background [21]. Jehan-Besson et al. [22] tracked objects in successive frames by matching the object intensity histogram using level sets. Although the tracking error was minimal, it increased when the object experienced intensity transformations due to noise or illumination changes. Almost all the active contour methods discussed above suffer from problems related to cluttered backgrounds, lack of texture information and occlusions from other objects in a video sequence. These problems can cause the performance of tracking non-rigid objects to decrease drastically.
Segmentation and tracking results are fused together to form a feature matrix, which differs from the methods proposed for sign language feature extraction in [23,24,25]. Finally, an artificial intelligence approach is proposed to recognize gestures of sign language as in [26,27,28,29]. Hidden Markov Models (HMM) have been used extensively for classification of sign language [30,31,32].
The rest of the paper is organized as follows: Sect. 2 introduces the proposed system for recognizing gestures of Indian Sign Language (INSLR), Sect. 3 presents the proposed segmentation and tracking using active contours, Sect. 4 gives the creation of the feature matrix, Sect. 5 covers pattern classification using a neural network, Sect. 6 discusses experiments under various test conditions, and Sect. 7 briefly concludes with the future scope of our research.

II. PROPOSED SYSTEM FOR GESTURE RECOGNITION
The proposed system has four processing stages, namely hand and head segmentation, hand and head tracking, feature extraction and gesture pattern classification. Fig. 1 shows the process flow diagram of the proposed method.
Fig. 1 gives the overall working process of the system. The video segmentation stage divides each frame into the hand and head segments of the signer. The hand and head tracking module gives the location of each hand and of the head in each video frame. Shape features extracted from segmentation and location information from tracking are fused into a feature matrix for each sign and are optimized before being saved to the database. This process is repeated for all the available signs using a video of a signer. The sign language database can be restructured to add new signs. Testing is done by giving a sign video sequence that was not used for training and does not have a feature vector in the database.
Unlike American Sign Language or British Sign Language, Indian Sign Language does not have a standard database that is available for use. Hence, we have created our own video database of Indian signs in collaboration with the Indian Deaf Society [33] and Shanthi Ashram School for deaf children. We have created 480 signs of alphabets, numbers, words and sentences by multiple signers of Indian signs, and the list is growing. A total of twenty different signers volunteered under different conditions, giving a total of 4800 gesture videos for the 480 signs. To test the robustness of our system we also used sign language videos from YouTube, which have different video backgrounds and real environments, taking the total video database count to 6800. The neural network is trained using the feature vectors from the database and tested using the feature vector from the input of the system. The output of the neural network results in a voice and text message corresponding to the input sign.

A. Active Contours-Level Sets
James A. Sethian and Stanley Osher [34] represented the boundaries of a curve $C$ implicitly and modelled their propagation using appropriate partial differential equations. The boundary is given by the level sets of a function $\phi(x)$. In the level set method, the interface boundary is characterized by the zero level set of a function $\phi(x)$ that is defined for all values of $x$:

$$C = \{x \mid \phi(x) = 0\} \qquad (1)$$

The sign of $\phi(x)$ defines whether $x$ is inside the contour or external to it, giving the sets $\{x \mid \phi(x) < 0\}$ and $\{x \mid \phi(x) > 0\}$. The level set evolves based on the curvature of the image objects, assuming the curve moves along the outward normal $\vec{n}$, defined in terms of $\phi$ as

$$\vec{n} = \frac{\nabla\phi}{|\nabla\phi|} \qquad (2)$$

Usually the curve evolution is a time-dependent process, and the time-dependent level set function $\phi(x,t)$ defines the curve as $C(t) = \{x \mid \phi(x,t) = 0\}$. One way to solve this is to approximate the spatial derivatives of the motion and update the position of the curve over time. This method of solving the level sets is prone to instability due to erroneous detection of the position of the curve.
A different approach was proposed from the theory of level sets in [34]. Start with a zero level set $\phi(x) = 0$ of a higher-dimensional function and embed the object curve in it. Initializing the level set function at $t = 0$, we have

$$\phi(x, t = 0) = \pm d \qquad (3)$$

where $d$ is the signed distance function (sdf) from $x$ to the curve of the image object. If $d$ is positive, $x$ is outside the object boundary. If $d$ is negative, $x$ is inside the object boundary. The goal is to construct an equation for the evolution of $\phi(x,t)$ that embraces the object boundaries starting from the zero level set $\phi(x) = 0$. We can propagate the zero level set by solving a convection equation containing the velocity field $\vec{v}$, which propagates all the level sets as

$$\frac{\partial\phi}{\partial t} + \vec{v}\cdot\nabla\phi = 0 \qquad (4)$$

The motion is along the normal of the curve, $\vec{v} = F\,\vec{n}$, where $F$ is the speed of the front and $\vec{n}$ is the normal defined above.
Eq. 6 drives the contour to smooth out the high curvature regions together with a diffusion term.
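To make the propagation concrete, the following is a minimal sketch of a front moving along its normal, i.e. $\partial\phi/\partial t + F|\nabla\phi| = 0$, on a small grid. It is not the implementation used in this paper: the constant speed `speed`, the time step `dt`, the circular initial contour and the plain central differences from `np.gradient` are all illustrative assumptions (a practical solver would use an upwind scheme, a curvature-dependent speed and a narrow band).

```python
import numpy as np

def signed_distance_circle(shape, center, radius):
    """Signed distance to a circle: negative inside, positive outside."""
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    return np.hypot(xx - center[1], yy - center[0]) - radius

def evolve(phi, speed, dt=0.5, steps=10):
    """Explicit Euler update of d(phi)/dt + F * |grad(phi)| = 0 (constant F assumed)."""
    phi = phi.copy()
    for _ in range(steps):
        gy, gx = np.gradient(phi)              # central differences (assumption)
        phi -= dt * speed * np.hypot(gx, gy)   # move the front along its normal
    return phi

# Initial contour: circle of radius 10 in a 64 x 64 frame (hypothetical values).
phi0 = signed_distance_circle((64, 64), center=(32, 32), radius=10.0)
phi1 = evolve(phi0, speed=1.0)                 # positive F pushes the front outward

area0 = int((phi0 < 0).sum())                  # pixels inside the initial contour
area1 = int((phi1 < 0).sum())                  # pixels inside the evolved contour
print(area0, area1)
```

With the sign convention above (negative inside), a positive speed enlarges the region enclosed by the zero level set, which is what the pixel counts show.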

B. Hands and Head Segmentation and Tracking
This section presents the video object segmentation and simultaneous tracking of hands and head across a range of video backgrounds, under different lighting conditions and with diverse signers. A video sequence is defined as a sequence of image frames $I(x, y, t)$, where the images change over time. Alternatively, a succession of image frames can be represented as $I^{(n)}$, where $n$ is the frame index. The basic principle behind our proposed segmentation and tracking technique is to localize, segment and track one or more moving objects in the current frame using the cues available from the objects segmented and tracked in previous frames, so that subsequent contours are available. The sign videos are composed of many moving objects along with the hands and head of the signer. The signer's hands and head are considered the image foreground, and the rest of the objects the image background, of the image $I^{(n)}$ in the video sequence. The foreground contour encloses the hands and head. Our proposed video segmentation algorithm segments the hands and head of signers using colour, texture, boundary and shape information about the signer, given prior knowledge of hand and head shapes from previous frames.

C. Colour and Texture Extraction Modules
Colour plays a vital role in the segmentation of complex images. There are various colour models, but the RGB colour model is the most commonly used for video acquisition. Processing a full RGB colour frame increases the size of the feature vector and thereby makes the tracking process slow. Instead of working with gray scale images, which store intensity information about each pixel, we used each of the three colour planes separately to extract each colour feature vector. This allows us to work with only one plane at a time, depending on the background colour level. We manually choose the colour plane that highlights the human object against a cluttered background. Once a colour plane is identified, texture features are calculated using a correlogram of the pixel neighbourhood [35,36]. Texture is an irregular distribution of pixel intensities in an image. Allam et al. [37] established that co-occurrence matrices (CMs) produce better texture classification results than other methods. The gray level co-occurrence matrix (GLCM) presented by Haralick et al. [38] is the most effectively used algorithm for texture feature extraction for image segmentation.
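Let us consider one colour plane of our original RGB video. The R colour plane is now considered as an R-coded 2D image. The element of the co-occurrence matrix defines the joint probability of a pixel $x_i$ of R colour intensity $r_i$ lying at a distance $d$ and orientation $\theta$ from another pixel $x_j$ of R colour intensity $r_j$:

$$p_{d,\theta}(r_i, r_j) = P\big(R(x_i) = r_i,\; R(x_j) = r_j \;\big|\; |x_i - x_j| = d,\; \angle(x_i, x_j) = \theta\big) \qquad (7)$$

where $|x_i - x_j|$ gives the distance between the pixels. For each co-occurrence matrix, we calculate four statistical properties: contrast (C), correlation (CO), energy (EN) and homogeneity (H), written here in their standard form with $p(i,j)$ the normalized matrix element for intensities $i$ and $j$:

$$C = \sum_{i,j} |i - j|^2\, p(i,j) \qquad (8)$$

$$CO = \sum_{i,j} \frac{(i - \mu_i)(j - \mu_j)\, p(i,j)}{\sigma_i\, \sigma_j} \qquad (9)$$

$$EN = \sum_{i,j} p(i,j)^2 \qquad (10)$$

$$H = \sum_{i,j} \frac{p(i,j)}{1 + |i - j|} \qquad (11)$$

The logic by which the above parameters are used as texture features is as follows. The contrast represents the inertia and variance of the texture. The correlation term gives the correlation between different elements of the GLCM; CO is high for more complex textures. In the correlation term, $\mu_i$ and $\mu_j$ are the mean values along the $i$ and $j$ directions, and $\sigma_i^2$ and $\sigma_j^2$ represent the corresponding variances. The energy term describes the consistency of the texture, and homogeneity is taken as a measure of the coarseness of the texture. We used four different orientations and two distance measures for the calculation of the GLCM.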
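As an illustration of the texture step, the sketch below builds a normalised co-occurrence matrix for one pixel offset and derives contrast, correlation, energy and homogeneity from it. It is a minimal example and not the code used in this work: the function names, the small 4-level test plane and the mapping of an orientation to an offset `(dx, dy)` (for example 0 degrees at distance 1 as `dx=1, dy=0`) are assumptions, and homogeneity is taken in the $1/(1+|i-j|)$ form.

```python
import numpy as np

def glcm(plane, dx, dy, levels):
    """Normalised co-occurrence matrix of one colour plane for the offset (dx, dy)."""
    h, w = plane.shape
    counts = np.zeros((levels, levels), dtype=float)
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            counts[plane[y, x], plane[y + dy, x + dx]] += 1
    return counts / counts.sum()

def texture_features(p):
    """Contrast, correlation, energy and homogeneity of a normalised GLCM."""
    i, j = np.indices(p.shape)
    mu_i, mu_j = (i * p).sum(), (j * p).sum()
    sd_i = np.sqrt((((i - mu_i) ** 2) * p).sum())
    sd_j = np.sqrt((((j - mu_j) ** 2) * p).sum())
    contrast = (((i - j) ** 2) * p).sum()
    correlation = ((i - mu_i) * (j - mu_j) * p).sum() / (sd_i * sd_j)
    energy = (p ** 2).sum()
    homogeneity = (p / (1.0 + np.abs(i - j))).sum()
    return contrast, correlation, energy, homogeneity

# Hypothetical 4-level plane; a real R plane would first be quantised to `levels` values.
plane = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 2, 2, 2],
                  [2, 2, 3, 3]])
p = glcm(plane, dx=1, dy=0, levels=4)          # horizontal neighbour at distance 1
contrast, correlation, energy, homogeneity = texture_features(p)
```

Repeating the call for each orientation and distance and concatenating (or averaging) the four statistics gives one texture descriptor per neighbourhood.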
Finally, a feature vector $f(x)$ is produced, which is a combination of any one or all of the colour planes and the texture vector. Thus the feature vector contains the colour and texture values of each pixel in the image. This is a five-dimensional feature vector containing the first element for any one of the three colour planes and the next four elements for texture. We can also use all three colour planes to represent colour, and then the feature vector becomes a seven-dimensional feature vector.
Most image sequences contain many classes of colour and texture. Hence, we classify the pixels as background and foreground pixels using the mixture of Gaussians (GMM) [39] clustering algorithm.
Given the multi-dimensional feature vector $f(x)$, the GMM algorithm classifies this vector into categories. Thus, we model the object and the background in an image frame using two mixture-of-Gaussians probability density functions (pdf), one for the object and one for the background. The probability of finding a value of the feature vector is estimated in the reference frame, i.e. the first frame $I^{(1)}$. The estimation problem comes down to solving the following equations (16) (17). We assume at this point that the objects in the video sequence remain largely the same compared with the background, which varies due to camera movement or changes in background scenes. This can be handled by periodically updating the background clusters whenever the change between consecutive background frames crosses a specified threshold.
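The mixture estimation can be sketched with a plain expectation-maximisation loop. The example below is a simplified, one-dimensional stand-in for the five- or seven-dimensional feature vector of this section and is not the estimator used in the paper; `fit_gmm_1d`, the quantile initialisation, the iteration count and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_gmm_1d(x, k=2, iters=50):
    """Fit a k-component 1-D Gaussian mixture with EM; returns weights, means, variances."""
    x = np.asarray(x, dtype=float)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial means over the data
    var = np.full(k, x.var() + 1e-6)
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        nk = resp.sum(axis=0)
        w = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var

# Synthetic stand-in for one feature channel: two well-separated clusters.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(0.2, 0.05, 300), rng.normal(0.8, 0.05, 300)])
weights, means, variances = fit_gmm_1d(samples, k=2)
```

In the setting described above, one such mixture would be fitted to the object pixels and another to the background pixels of the reference frame.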
For a pixel $x$ in any frame $I^{(n)}$ after the reference frame, the above procedure is used to calculate the object mixture probability $\bar{p}_o^{(n)}$ and the background mixture probability $\bar{p}_b^{(n)}$ on a neighbourhood of pixels, which is eight in this paper. To track the objects in the current frame $I^{(n)}$, the initial contour position from the previous frame $I^{(n-1)}$ is moved towards the new object boundaries in the current frame $I^{(n)}$ by minimizing the following energy functional of the level set (18), where $p_o$ and $p_b$ are the object and background pdfs in the reference frame $I^{(1)}$, and $\bar{p}_o^{(n)}$ and $\bar{p}_b^{(n)}$ are the object and background pdfs in the current frame $I^{(n)}$. $D(p_o, \bar{p}_o^{(n)})$ and $D(p_b, \bar{p}_b^{(n)})$ represent the symmetric KL distance, which is computed between two pdfs $p$ and $q$ by the following equation

$$D(p, q) = \int_{\Omega} p \log\frac{p}{q}\, dx + \int_{\Omega} q \log\frac{q}{p}\, dx \qquad (19)$$

where the integrals are calculated over the domain $\Omega$. The energy function in eq. 18 is used to track the object boundaries in $I^{(n)}$ by calculating the local statistics of the pixels within the object and separating them from the background. We can implement this by assigning a pixel to the object in the current frame when $D(p_o, \bar{p}_o^{(n)}) < D(p_b, \bar{p}_b^{(n)})$ and to the background otherwise.
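The pixel assignment rule can be illustrated with discrete distributions. This is a sketch under stated assumptions rather than the paper's implementation: the pdfs are replaced by normalised histograms, the symmetric KL distance is taken as the sum of the two directed divergences, and `symmetric_kl`, `label_pixel` and the example histograms are hypothetical.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL distance KL(p||q) + KL(q||p) between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps   # eps avoids log(0) for empty bins
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def label_pixel(local_hist, object_hist, background_hist):
    """Assign the pixel to the model whose reference distribution is closer."""
    d_obj = symmetric_kl(object_hist, local_hist)
    d_bg = symmetric_kl(background_hist, local_hist)
    return "object" if d_obj < d_bg else "background"

# Hypothetical reference histograms and a local neighbourhood histogram.
object_hist = [0.7, 0.2, 0.1]
background_hist = [0.1, 0.2, 0.7]
label = label_pixel([0.6, 0.3, 0.1], object_hist, background_hist)
```

Here the local statistics resemble the object model more than the background model, so the pixel is labelled as object.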

D. Object Boundary Module
In the earlier module the focus was on extracting region information with the objective of minimizing the object contour energy function, which segments the objects of interest in the video sequence. However, poor lighting can strongly affect image region information. Hence, we extracted a boundary edge map of the image objects, which depends only on image derivatives. The solution is to couple the region information, in the form of colour and texture features, with the boundary edge map to achieve good tracking of image objects.
We define the boundary as the pixels that are present in the edge map of the object. The boundary pixels can be calculated by applying a gradient operator to the image. To align the initial contour $C$ from the previous frame with the boundary pixels of the objects in the current frame, we propose to minimize the following energy function

$$E_B(C) = \int_0^{L(C)} g\big(|\nabla I(C(s))|\big)\, ds \qquad (20)$$

where $L(C)$ is the length of the object boundary. The function $g$ is an edge detection function. The boundary energy reaches a minimum when the initial contour aligns itself with the boundary of the objects in the image. The minimization of the energy in eq. 20 also results in a smooth curve during evolution of the contour [41].
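A common choice for an edge detection function in active contour models is $g = 1/(1 + |\nabla I|^2)$, which is close to 1 in flat regions and close to 0 on strong edges. The source does not state which form of $g$ is used, so the sketch below is only an assumed example; `edge_indicator` and the synthetic step image are hypothetical.

```python
import numpy as np

def edge_indicator(image):
    """Edge-stopping map g = 1 / (1 + |grad I|^2); small values mark strong edges."""
    gy, gx = np.gradient(image.astype(float))
    return 1.0 / (1.0 + gx ** 2 + gy ** 2)

# Synthetic frame with a vertical step edge between columns 3 and 4.
frame = np.zeros((8, 8))
frame[:, 4:] = 10.0
g = edge_indicator(frame)
edge_column = int(np.argmin(g[0]))   # column where the contour would be attracted
```

A contour that minimises the boundary energy settles where `g` is smallest, which in this toy frame is at the step edge.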

E. Shape Information Module
Even with the colour, texture and boundary values of the pixels in the image, the greatest challenge comes when object pixels and background pixels share the same colour and texture information. This happens because we are trying to track non-rigid objects, namely the hands and head of the signer along with finger positions and orientations, which change frequently in sign video. The problem influences the propagation of the contour and results in poor tracking of non-rigid objects in video sequences. The active contour can be guided by information regarding the shape of the object computed from the previous frames. The method followed in [42] is used to construct the influence of the shape of non-rigid objects in the image sequence. For the first frame $I^{(1)}$, where prior shape information is not available, we use only the region and boundary information for segmentation. The segmented objects in frame one are used to initialize the contours in the next frame for tracking. For the subsequent frames $I^{(n)}$, the tracking is given by the level set contour that minimizes the energy function (21), where the minimum is calculated over Euclidean similarity transformations, which are a combination of translational and rotational parameters. Minimizing over groups of transformations to achieve rigid object interactions was proposed by Chan and Zhu [33]. We propose to use a non-rigid shape influence term in this paper. Let $C$ indicate the active contour and $C_0$ be the active contour for the shape from the first frame. Let $\phi_C(x)$ be the level set distance function associated with contour $C$, and $\phi_{C_0}(x)$ the level set function associated with contour $C_0$ from the first image frame $I^{(1)}$. For a fixed pixel $x$ in the image space, $\phi_C(x)$ is actually a function of the contour $C$. The initial contour aligns itself with the object contour in the first frame, and the initial contour for each subsequent frame in the video sequence comes from the contour in the previous frame. Hence the shape interaction term proposed in this paper has the form (22). Thus, by applying the shape energy to the level set, we can effectively track hands and head in sign videos and differentiate between object contour modifications due to motion and those due to shape changes.

F. Level Set Energy Function for Segmentation and Tracking: Combining Colour, Texture, Boundary and Shape Modules
By integrating the energy functions from the colour, texture, boundary and shape modules, we formulate the following combined energy functional of the active contour (23), where the weighting parameters provide stability to the contributions from the different energy terms. All terms are positive real numbers. The minimization of the energy function is done with the help of the Euler-Lagrange equations and realized using level set functions. The resultant level set formulation is (24), where (25) (26) (27). The numerical implementation of eq. 24 is computed using the narrowband algorithm [43]. The algorithm approximates all the derivatives in eq. 24 using finite differences. The level set function is reinitialized when the zero level set reaches the boundary of the object in the image frame.

IV. FEATURE VECTOR
After successful segmentation and tracking, a feature vector is created. It is stored in the database at the training stage as a template, or can be used as input to pattern classifiers such as Hidden Markov Models (HMM), neural networks, fuzzy inference systems or decision trees, which have limited memory. Hence, it becomes important to design a feature matrix that optimizes the performance of the classifier.
The feature matrix $f_{Mat}$ derived from a video sequence is a fusion of the hand and head segments representing shapes in each frame along with their locations in the frame. The shape information for the $n$th frame is a binary matrix equal in size to the frame. The tracking active contour gives the locations $(x^{(n)}, y^{(n)})$ of the hand and head contours in each frame.
The temporal dimension is mostly neglected during feature extraction in video processing. In other words, video is considered as an extension of images. In addition to the 2D nature of the image, the temporal dimension is managed using a technique called temporal pooling [44]. Here, the extracted features are temporally pooled into one feature value for the whole video sequence, as illustrated in fig. 2. Temporal pooling is largely done by averaging. Other simple statistics such as the standard deviation, median or minimum/maximum are also employed. Temporal pooling is used to reduce the dimensionality of the feature vector for a particular video sequence.

V. PATTERN RECOGNITION-NEURAL NETWORK
For handling large data such as in our system, it is customary to use solutions that are fast without losing accuracy. One of the few systems that can handle our large data matrices is the feed-forward artificial neural network (ANN). ANNs are extensively used in the detection of cancer [45], classification [46], face recognition [47] and fingerprint recognition [48], to name a few. The network input is the feature matrix $f_{NMat}$. The back propagation algorithm follows the steps discussed in [49] to determine the output. The output of the $j$th unit in the first hidden layer is calculated as (28), where $M$ is the number of neurons in the hidden layer, $w_{ij}$ is the weight connecting the $i$th unit of the input layer to the $j$th unit of the hidden layer, and $\theta_j$ is the threshold value of the $j$th unit in the hidden layer. $f(\cdot)$ is the activation function, which we choose as the sigmoid function, defined as

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (29)$$

The output of the output layer is calculated as (30), where $M$ is the number of neurons in the $j$th layer for the $k$th neuron.
During the back propagation process the error gradient from the output layer is calculated from the relation (31), where $e_k(itr)$ is the error in the output layer, given as

$$e_k(itr) = y_{Tk}(itr) - y_k(itr) \qquad (32)$$

where $y_{Tk}(itr)$ is the desired or targeted output of the system. Adjustment of the weights of the neurons is done iteratively using equation (33), where $\mu$ is the adaptive learning rate of the neural network during the training stage. After the $(k-1)$th iteration and the $k$th feed-forward propagation, the learning rate is updated through a constant $\alpha$ and is controlled by the error obtained during the iterations.
The error gradient for the hidden layer is calculated by propagating the output-layer gradient back through the weights. The above procedure is run iteratively until the expected output is reached.
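The training loop described in this section can be sketched as a small feed-forward network with sigmoid units trained by error back propagation. This is a toy illustration and not the network used in the experiments: the layer sizes, the fixed learning rate `lr` (the paper uses an adaptive rate), the random initialisation and the one-hot toy data are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(x, y_target, hidden=8, lr=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer network by back propagation; returns weights and error history."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0.0, 0.5, (x.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.5, (hidden, y_target.shape[1]))
    b2 = np.zeros(y_target.shape[1])
    errors = []
    for _ in range(epochs):
        h = sigmoid(x @ w1 + b1)                  # hidden layer output
        y = sigmoid(h @ w2 + b2)                  # output layer output
        e = y_target - y                          # output error: target minus output
        errors.append(float(np.mean(e ** 2)))
        delta_out = e * y * (1.0 - y)             # output-layer error gradient
        delta_hid = (delta_out @ w2.T) * h * (1.0 - h)   # hidden-layer error gradient
        w2 += lr * h.T @ delta_out                # weight updates (fixed rate assumed)
        b2 += lr * delta_out.sum(axis=0)
        w1 += lr * x.T @ delta_hid
        b1 += lr * delta_hid.sum(axis=0)
    return (w1, b1, w2, b2), errors

# Toy data: four hypothetical feature vectors, one output neuron per sign.
features = np.eye(4)
targets = np.eye(4)
params, errors = train(features, targets)
```

Plotting `errors` against the epoch index gives the kind of epochs-versus-error training graph referred to in the experiments.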

VI. EXPERIMENTS AND RESULTS
The first frame $I^{(1)}$ is segmented by calculating the feature vector consisting of a single colour plane (R, G or B) and texture information, along with the boundary edge map. Then the proposed method is applied to the remaining frames of the video sequence. The segmentation and tracking result of the previous frame is used as a mask, or initial contour, for the current frame along with the location of the mask, and so on. The energy minimization function in eq. 23 is employed to process the level sets and to produce an optimal contour for segmentation and tracking in each frame. In all video sequences the initial contour is drawn manually and placed near the object of interest.
In the first experiment we started with a video sequence shot in a lab using a web cam with a dark background, with the additional constraint that the signer should also wear a dark shirt. This video sequence is part of the database we have created for the sign language recognition project. Our sign language database consists of 351 signs with 8 different signers. Figure 4 shows the results of our segmentation and tracking algorithm. The object and background are modelled with three and two clusters, respectively.
The experiments are performed in the R colour plane. The same can be done in any colour plane or with full colour videos. The problem with full colour sequences pertaining to sign language is that sign language videos contain long sequences of frames with a lot of information to be extracted. Fig. 4(a) shows four frames from a video sequence of the signer performing the sign for the English alphabet 'X'. This simple sign contains 181 frames. Fig. 4(b) shows the results obtained from our proposed method. The inclusion of the prior shape term in the level sets segments the right finger shape in spite of it being blocked by the finger of the left hand. This segmentation and tracking result helps in good sign recognition. Fig. 4(b) and 4(c) show the effectiveness of the proposed algorithm against the Chan-Vese (CV) model in [50].
Table I gives the results of testing the proposed network. For simple backgrounds the neural network classifies signs with a recognition rate of 97.23%.
We observe occlusions of the hands and head very frequently in sign language videos. Most sign language recognition systems insist that the signer should face the camera directly, largely to avoid occlusions of the hands. This problem is solved using our proposed level set method, as shown in fig. 7. We initialized the contour for only the right hand of the signer in the first frame.
With the left hand coming into the path of the right hand, as can be observed from the original sequence in figure 7(a), it is difficult to segment and track the shape of the right hand. But the results in figure 7(b) show the segmentation and tracking of the right hand only, despite occlusions from the left hand. This result will help in designing more robust sign language systems.
In the next experiment, the advantage of our proposed technique is shown on a video sequence that contains fast moving objects in contrast to the hand and head movements. The video sequence is shot on an Indian road in a natural environment. Fig. 8 shows the original sequence in column (a) along with the results of the CV model [50] in column (b) and our method in column (c).
The recognition rate is defined as the ratio of the number of correctly classified signs to the total number of signs:

$$\text{Recognition rate (\%)} = \frac{\text{Number of correctly classified signs}}{\text{Total number of signs}} \times 100 \qquad (36)$$

Observation of the third row exposes the disadvantage of the CV method, which has no prior shape term in the level set energy for tracking. In this video (see Fig. 8) a background object suddenly appears in the frame; the method in [50] cannot reject it, and its final segmentation and tracking result includes the suddenly appearing object. Providing prior shape information along with the object colour, texture and boundary edge map demonstrates the strength of our method. Also, for the method used in [50] we get unwanted tracks in the form of the leaves of trees in the background near the left hand of the signer in row three, which is not an issue with our method.
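The recognition rate of eq. 36 is a simple percentage and can be computed as in the sketch below; the function name and the example labels are illustrative only and do not correspond to results in this paper.

```python
def recognition_rate(predicted, actual):
    """Percentage of signs whose predicted label matches the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Hypothetical classifier outputs for four test videos.
rate = recognition_rate(["A", "B", "C", "D"], ["A", "B", "C", "X"])
```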
Using sequences such as those in fig. 7 and fig. 8, which represent a more realistic scenario for testing our proposed system, we created a database of around 110 signs. These videos are taken under different background clutter, as in fig. 7 and fig. 8. We have chosen two samples of each sign for training the network and two samples of similar gestures for testing the network. The proposed network is shown in fig. 9 and contains 220 input neurons, 181 hidden neurons and 110 output neurons. The video sequences used here contain continuous gestures with cluttered backgrounds, such as offices, roads and restaurants, and videos from [33]. The neural network is trained with a feature matrix extracted from sign videos with complex backgrounds. The training graph for the network in fig. 9 is shown in fig. 10. We observe that even though the input has two sets of signs for each gesture, the network took a larger number of iterations to train due to occlusions and cluttered backgrounds.
Table II gives the results of testing the proposed network. For complex and variable backgrounds, the neural network classifies signs with a recognition rate of 95.68%, which is slightly less than the recognition rate for simple backgrounds.
In the third and final experiment, the proposed neural network is trained using a combination of two samples from simple backgrounds and two samples from complex backgrounds for each sign. A total of 200 sign samples are used for training, four from each gesture. The neural network employed for this task consists of 200 input neurons, 652 hidden neurons and 50 output neurons, to classify 50 signs and convert them into audio and text. The network architecture for this job is shown in fig. 10. The neural network is trained with a feature matrix combining the feature matrix from simple backgrounds and the feature matrix from complex backgrounds. The training graph produced for this feature matrix is shown in fig. 12.
Table III gives the results of testing the proposed network. For complex and simple backgrounds combined, the resultant recognition rate has decreased slightly due to shape occlusions over longer periods of time. This can be avoided by regularly updating the shape module after shape information is lost during a longer occlusion period. For all our experiments, if the occlusion period is more than 10 frames, we have to reinitialize the contour. The recognition rate for our third experiment is 85%.
This paper carries us slightly closer to building a sign language recognition system that performs well under natural backgrounds. The proposed method effectively combines colour, texture, boundary and prior shape information to produce effective video segmentation and tracking of sign language videos under various harsh conditions such as cluttered backgrounds, poor lighting, fast moving objects and occlusions. The colour and texture information is extracted statistically by creating a feature vector and classifying each pixel in the frame as an object or background pixel. Boundary information is provided by the divergence operator along with the curvature of the object under consideration. Including shape information from the previous frame made a significant difference to the level set minimization, allowing it to segment correctly and track effectively a hand occluded by the other hand and sometimes by the head. We have effectively demonstrated by experimentation of the proposed method by applying it to

Figure 1. Sign Language Recognition System Architecture.

Inserting in eq. 4 we get a level set equation of the form (5). Eq. 5 is a type of Hamilton-Jacobi equation. The speed term is dependent on the object curvature [ | | ], which can be formulated as

Figure 2. Feature Vector Design using Temporal Pooling.

For the $n$th frame in the video sequence, the first row of the feature vector $f_{vect}$ consists of the pixel values within the shape of the active contour, i.e. the pixels representing the segmented head and hand shapes. The second and third rows consist of the $(x^{(n)}, y^{(n)})$ location information of those segmented pixels. The feature vector for each frame is a three dimensional vector $f_{vect}$ representing shape and location information about each segmented hand and head contour. For an entire video in a sign language recognition system, the pixels in each frame leave us with a new reduced feature vector $f_{Nvect}$, which uniquely represents a particular video sequence. Finally, each row in the new feature matrix $f_{NMat}$ consists of the feature vector representing one sign video sequence, stored in the database as a template.
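The per-frame layout described above (one row of pixel values followed by two rows of pixel coordinates) can be sketched as follows. This is only an illustration: the function name `frame_feature`, the representation of a frame as nested lists and the binary segmentation mask are assumptions, and the subsequent reduction to a single vector per video is not shown because its details are not given here.

```python
def frame_feature(frame, mask):
    """Build the three-row feature of one frame.

    frame: 2-D list of pixel intensities (rows of the image).
    mask:  2-D list of the same size, truthy where the active contour
           marks a segmented hand or head pixel (hypothetical input format).
    Returns [values, xs, ys]: pixel values, then x and y locations.
    """
    values, xs, ys = [], [], []
    for y, (frame_row, mask_row) in enumerate(zip(frame, mask)):
        for x, (pixel, inside) in enumerate(zip(frame_row, mask_row)):
            if inside:
                values.append(pixel)
                xs.append(x)
                ys.append(y)
    return [values, xs, ys]


# Toy 2 x 3 frame with three segmented pixels.
frame = [[10, 20, 30],
         [40, 50, 60]]
mask = [[0, 1, 0],
        [1, 1, 0]]
feature = frame_feature(frame, mask)
```

For the toy frame, the first row of `feature` holds the three segmented pixel values and the remaining two rows hold their column and row indices, in raster scan order.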

The feature matrix $f_{NMat}$ is given as input for training the feedforward neural network shown in fig. 3. Generally, a feed-forward neural network is a combination of three layers of neurons: input layer, hidden layer and output layer. The neurons in these layers are activated by using a sigmoid activation function. The input to the neural network is derived from the feature matrix $f_{NMat}$, where M and N denote the number of columns and rows of $f_{NMat}$, and itr is the number of iterations, called epochs in neural network terminology. The neural network outputs are denoted by ,

Figure 4. Experiment one showing our proposed segmentation algorithm on sign language videos under laboratory conditions. Frames 10, 24, 59 and 80 are shown. Row (a) shows four original frames, row (b) shows the results from the proposed algorithm and row (c) shows the results of the CV algorithm in [50].

Figure 5. Neural Network used for testing sign videos with simple backgrounds.

The neural network shown in fig. 5 is trained for simple backgrounds using four sets of sign videos for each gesture, performed by different signers. The epochs versus error graph for training the network is shown in fig. 6.

Figure 6. Neural Network Training graph for simple backgrounds with four samples per gesture.

Figure 7. Showing the influence of prior shape knowledge. Here only the occluded right hand is segmented and tracked. Column (a) shows the original image sequence and column (b) the segmentation and tracking result.

Figure 8. Frames of a sign video sequence on an Indian road and under natural environment. Column (a) is the original sequence of frames 39, 54, 79 and 99. Column (b) shows the method used in [50] and column (c) our proposed tracking method.

Figure 9. Neural Network Architecture for cluttered backgrounds with two samples per gesture.

Figure 10. Neural Network Training graph for cluttered backgrounds with two samples per gesture.

TABLE II. RESULTS OF TESTING USING GESTURES WITH CLUTTERED BACKGROUNDS
Number of input neurons: 220
Number of hidden neurons: 652
Number of output neurons: 110
Activation function: sigmoid
Learning rate: 0.05
Momentum factor: 0.9
Error tolerance: 0.0001
Number of training samples used: 220
Number of testing samples: 220

Figure 12. Neural Network Training graph for cluttered as well as simple backgrounds with two samples each per gesture.

TABLE I
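The recognition rate is defined as the ratio of the number of correctly classified signs to the total number of signs, expressed as a percentage:

$$\text{Recognition rate} = \frac{\text{number of correctly classified signs}}{\text{total number of signs}} \times 100\%$$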
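As a quick check of this definition, the ratio can be computed with a few lines of code. The function name `recognition_rate` is hypothetical, and the count of 425 correct signs is an illustrative value chosen only because it corresponds to an 85% rate on 500 test samples; it is not a figure reported in the experiments.

```python
def recognition_rate(correct, total):
    """Percentage of correctly classified signs out of all tested signs."""
    if total <= 0:
        raise ValueError("total number of signs must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    return 100.0 * correct / total


# Illustrative counts only: 425 of 500 test signs classified correctly.
rate = recognition_rate(425, 500)
```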

Table I gives the results of testing the proposed network. For simple backgrounds, the neural network classifies signs with a recognition rate of 97.23%.

TABLE III. RESULTS OF TESTING USING GESTURES WITH SIMPLE AND CLUTTERED BACKGROUNDS
Number of input neurons: 200
Number of hidden neurons: 652
Number of output neurons: 50
Activation function: sigmoid
Learning rate: 0.05
Momentum factor: 0.9
Error tolerance: 0.0001
Number of training samples used: 200
Number of testing samples: 500
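The settings of Table III can be collected in a small configuration record, which also makes some simple arithmetic on the table explicit. The class `ExperimentConfig` and its methods are hypothetical names. The weight count ignores bias terms, and the per-sign sample counts assume that the samples are spread evenly over the 50 signs, which the table itself does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    n_input: int
    n_hidden: int
    n_output: int
    learning_rate: float
    momentum: float
    error_tolerance: float
    n_train: int
    n_test: int

    def weight_count(self):
        """Connection weights between the layers, bias terms not counted."""
        return self.n_input * self.n_hidden + self.n_hidden * self.n_output

    def samples_per_sign(self):
        """(training, testing) samples per sign, assuming an even split."""
        return self.n_train // self.n_output, self.n_test // self.n_output


# Values copied from Table III.
table_iii = ExperimentConfig(
    n_input=200, n_hidden=652, n_output=50,
    learning_rate=0.05, momentum=0.9, error_tolerance=0.0001,
    n_train=200, n_test=500,
)
```

With these values, the four training samples per sign agree with the description of the third experiment (two from simple and two from complex backgrounds for each sign).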