The Performance of Individual and Ensemble Classifiers for an Arabic Sign Language Recognition System

The objective of this paper is to compare different classifiers’ recognition accuracy for the 28 Arabic alphabet letters gestured by participants as Sign Language and captured by two depth sensors. The accuracy results of three individual classifiers: (1) the support vector machine (SVM), (2) random forest (RF), and (3) nearest neighbour (kNN), using the original gestured dataset were compared with the accuracy results using an ensemble of the results of each classifier, as recommended by the literature. SVM produced higher overall accuracy when running as an individual classifier regardless of the number of observations for each letter. However, for letters with fewer than 65 observations each, which created a far smaller dataset, RF had higher accuracy than SVM did when using the ensemble approach. Although RF produced higher accuracy results for classes with limited class observation data, the difference between the accuracy results of RF in phase 2 and SVM in phase 1 was negligible. The researchers conclude that such a difference does not warrant using the ensemble approach for this experiment, which adds more processing complexity without a significant increase in accuracy.

Keywords: Ensemble; Stacking; Support vector machine; SVM; Random forest; RF; Nearest neighbour; kNN; ArSL recognition system; Depth sensors


I. INTRODUCTION
Researchers in the Arab world, as well as researchers worldwide, are always investigating the use of assistive communication tools that could help the hearing-impaired in their daily lives when using their local languages and dialects. Although research has been done on using sign language recognition systems, limited research has addressed gesture recognition of Arabic Sign Language (ArSL). Also, few attempts have been made to develop a recognition system that can use a machine learning approach to interpreting ArSL letters [1].
Machine learning is “an evolving branch of computational algorithms that are designed to emulate human intelligence by learning from the surrounding environment” [2] [3]. Machine learning is more than just calculating averages or performing data manipulation; it involves creating predictions about observations based on previous information [3]. Using machine learning in gesture recognition involves four steps: 1) choosing appropriate sensors for collecting the gestured letters; 2) analysing and extracting features from the data, which are values related to describing the gestured letters; 3) classifying the data by recognising and interpreting the gestures using one or multiple algorithms; and 4) displaying the recognised gesture's name as text or audio [4]. Also, machine learning can use either supervised or unsupervised learning to translate sign language gestures into text format [5]. The term “supervised learning” refers to the fact that the algorithm is fed a dataset in which the correct answers are given; the dataset is divided into two subsets: a “training dataset,” which is used to build predictive models, and a “testing dataset,” which is used to assess the performance of the model built in the training step [6]. On the other hand, in unsupervised learning, the machine is not provided with knowledge about the model. The implemented algorithms classify the data according to the instantaneous incoming hand or finger features [5].
In classifying segments, the observed gestured letters are placed into different classes based on the same or related values [7]. The collected data are divided into two sets: a training set and a testing set [7]. Therefore, classification is the process of assigning a new gestured letter to a specific class on the basis of training set values.
Many classifier algorithms exist, such as the neural network, support vector machine (SVM), nearest neighbour (kNN), and random forest (RF). Each has a different method for predicting or choosing the set to which a particular observation belongs [5].
Classifying data in machine learning can use either raw data with one algorithm or a combination of the results (predictions) of multiple algorithms, called “an ensemble,” which is fed into an algorithm. Different ensemble models are available, with the most popular being majority voting, bagging, boosting, and stacking [5].
Majority voting, considered the simplest, is a decision rule that chooses the alternative that has the most votes [8] [9]. Bagging is a method of decreasing the variance of a prediction, boosting is a method of decreasing the bias of a predictive model and improving the predictive force, and stacking is similar to boosting in that it applies several models to the original data [9]. However, stacking takes the final prediction using functions such as the sum, the average, or a weighting of the predictions that other algorithms have generated [10].
Different types of stacking exist: some types use the original data together with classifiers as input to an ensemble model, whereas some do not. In addition, some use hard labels from classifiers, whereas others use probabilities [10]. Although using the ensemble approach adds mathematical complexity, it may increase the accuracy of the recognition. To classify gestures, one can use an individual classifier or an ensemble of the output of multiple classifiers. Results within the literature of classification using multiple learning algorithms or an ensemble model usually had higher accuracy rates, yielding better predictive performance than those obtained from the individual learning algorithms [5]. Despite the complexity, the possible reasons for using an ensemble approach are that the data volume is too large or too small, that a divide-and-conquer strategy is needed, or that data fusion is required [11]. Therefore, in some cases, if the data to be analysed are too large, the use of one classifier may not effectively process the data. Similarly, ensemble systems can be used to address the exact opposite problem of not having enough data [12].
Analogically speaking, creating an additional step by feeding a classifier an ensemble of the data is like seeking a second and third opinion when it comes to a medical consultation: it increases reliability and reduces the risk of a wrong diagnosis [12].
The research methodology of Al-Masre and Al-Nuaim for gesture recognition used only one classifier (SVM) as a supervised machine learning hand-gesturing model [13] to classify the 28 letters (considered classes) of the Arabic alphabet “Figure 1.” In addition, to overcome the time complexity of interpreting the data for their model, the researchers used the principal component analysis (PCA) algorithm to simplify the large dataset by reducing features. Recognition results were at 86% for the ArSL letters tested in their experiment [13].
Like Al-Masre and Al-Nuaim [13], this research used SVM to classify the 28 ArSL letters. However, to overcome the limitation of using the PCA algorithm, the proposed model included all of the features of the collected data and added a classification step, as recommended by the literature, to produce higher recognition accuracy. The extra step used the same classifiers that were applied to the original dataset to classify the combined results (ensemble). Therefore, it is the objective of this research to compare the recognition accuracy of three different popular individual classifiers using the original gestured dataset with the accuracy results of the same three classifiers using an ensemble of the results of the same classifiers.
In an attempt to investigate whether adding a classification step produces higher accuracy, this research combined the results from three individual classifiers that used raw gestured data. The extra step would classify the combined (ensemble) data using the same three classifiers that used the original data.
The rest of the paper is organised as follows: Sections 2 and 3 present the literature review, covering relevant work and the three classification algorithms used. Section 4 presents the research design and methodology used to complete the experiment. Finally, Section 5 discusses the results and presents the conclusion.
Ensemble learning has attracted considerable attention due to its good generalisation performance. The main issues in constructing a powerful ensemble include training a set of diverse and accurate base classifiers and effectively combining their outputs [12].
The ensemble margin, computed as the difference between the number of votes that the correct class received and the number received by the other class with the most votes, is widely used to explain the success of ensemble learning. This definition of the ensemble margin does not consider the classification confidence of base classifiers [12].
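The vote-count margin described above can be illustrated with a minimal sketch. The function name `ensemble_margin` and the example votes are hypothetical, and the margin is left unnormalised here; some formulations divide by the number of base classifiers.

```python
from collections import Counter

def ensemble_margin(votes, true_label):
    """Votes for the correct class minus the most votes received by any other class."""
    counts = Counter(votes)
    correct = counts.get(true_label, 0)
    others = [n for label, n in counts.items() if label != true_label]
    return correct - max(others, default=0)

# Hypothetical votes from three base classifiers for one gestured letter.
votes = ["Ba", "Ba", "Ta"]
print(ensemble_margin(votes, "Ba"))  # positive: the ensemble vote is correct
print(ensemble_margin(votes, "Ta"))  # negative: the ensemble vote is wrong
```

A positive margin means the majority vote picks the correct class; the larger the margin, the more base classifiers would have to change their vote before the ensemble errs.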
Other ensemble algorithms appeared within the literature and were used in the machine learning field, such as boosting, AdaBoost, bagging, a mixture of experts, and stacked generalisation [16].
Using the stacking method, one can train a learning algorithm to combine the predictions of other learning algorithms. First, all of the algorithms used are trained on the original data. Then, one makes a final prediction using all the predictions of the other algorithms (re-sampling) as inputs. The re-sampling method can be one of the following: sum, maximum, minimum, or weighted majority voting of the predictions that the other algorithms have generated as extra inputs [17].
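As a rough illustration of the combination rules named above (sum, maximum, minimum, and weighted majority voting), the following sketch fuses per-class scores from three classifiers. The function `combine`, the scores, and the weights are all hypothetical; the source does not specify how these rules were implemented.

```python
def combine(score_rows, rule="sum", weights=None):
    """Fuse per-classifier class scores with a fixed rule and return the winning label."""
    weights = weights or [1.0] * len(score_rows)
    fused = {}
    for label in score_rows[0]:
        scores = [row[label] for row in score_rows]
        if rule == "sum":
            fused[label] = sum(scores)
        elif rule == "max":
            fused[label] = max(scores)
        elif rule == "min":
            fused[label] = min(scores)
        elif rule == "weighted_vote":
            # Each classifier votes for its top-scoring label with its own weight.
            fused[label] = sum(w for w, row in zip(weights, score_rows)
                               if max(row, key=row.get) == label)
        else:
            raise ValueError(f"unknown rule: {rule}")
    return max(fused, key=fused.get)

# Hypothetical class scores from three classifiers for one observation.
rows = [
    {"Ba": 0.60, "Ta": 0.40},
    {"Ba": 0.45, "Ta": 0.55},
    {"Ba": 0.30, "Ta": 0.70},
]
for rule in ("sum", "max", "min", "weighted_vote"):
    print(rule, combine(rows, rule))
print("weighted_vote with weights", combine(rows, "weighted_vote", weights=[3, 1, 1]))
```

With equal weights, all four rules agree on this example; giving the first classifier a weight of 3 lets its single vote outweigh the other two.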
The basis of ensemble methodology is simply creating a predictive model by integrating multiple models. It can be used to improve prediction performance; for example, researchers from various disciplines, such as statistics, computer vision, and artificial intelligence, can use it [12].
Li, Hu, Wu, and Yu (2014) explored the influence of the classification confidence of the base classifiers in ensemble learning and drew several conclusions. First, they extended the definition of an ensemble margin based on the classification confidence of the base classifiers. Then, an optimisation objective was designed to compute the weights of the base classifiers by minimising the margin-induced classification loss. They attempted several strategies to use the classification confidences and the weights. They observed that weighted voting based on classification confidence is better than simple voting if all of the base classifiers are used [17].
Farooq and Sazonov (2016) studied the ensemble performance of three classifiers (logistic regression, linear discriminant analysis, and decision trees) using three different ensemble approaches: (1) boosting, (2) stacking, and (3) bagging. According to their results, the ensemble performance was enhanced by 4% compared to the individual algorithms [18].
In addition, Woźniak, Graña, and Corchado (2014) presented the idea of creating a multiple classifier system (MCS). They stated that no single classifier modelling approach is optimal for all pattern-recognition tasks. Thus, an MCS exploits the strengths of the different classifier models to create a high-quality compound recognition system, thereby outperforming the separate classifiers [19].
Ensembling is also known under various other names, such as multiple classifier systems, a mixture of experts, or a committee of classifiers [11]. Ensemble systems have been shown to perform better than a single classifier in many applications [11].
Most of the ensemble methods use a special mathematical model. Moreover, in applying the stacking method, researchers can use different types or scenarios, for example, combining the results of classifiers as a class label name, combining them as class prediction values, or combining the original dataset with class prediction values [20].

A. Support Vector Machine (SVM)
The SVM algorithm is used to classify data by drawing a clear line between observation data, which are actually points on a plane. The margin space around the line should be as wide as possible to avoid misclassifying the values of a testing set [21]. In addition, SVMs can efficiently perform non-linear classification using what is called the kernel function, implicitly mapping their inputs into high-dimensional feature spaces [22].
Predicting the values and setting the kernel function parameters with correct values are the main objectives of the SVM learning algorithm. Many statistical packages, such as RStudio, establish those parameters to give the best prediction [23].
Using SVM requires choosing the parameter C (cost function), or penalty term. It is needed because the SVM must trade off a wide margin against errors on the training data when choosing the decision boundary. If the value of C is very large, the margin becomes narrow and the decision boundary lies close to the support vectors, the data points nearest to it. Conversely, the probability of misclassifying training points increases as the value of C decreases [23].
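The role of C can be made concrete with the standard textbook soft-margin objective for a linear, two-class SVM, which is not taken from the source: half the squared norm of the weights plus C times the sum of hinge losses. The weights, bias, and points below are hypothetical.

```python
def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum(max(0, 1 - y_i * (w . x_i + b))) for labels in {-1, +1}."""
    regulariser = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(
        max(0.0, 1.0 - yi * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, yi in zip(X, y)
    )
    return regulariser + C * hinge

# Hypothetical data: the second point lies inside the margin of the boundary x1 = 0.
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
y = [1, 1, -1]
w, b = (1.0, 0.0), 0.0
print(soft_margin_objective(w, b, X, y, C=1.0))
print(soft_margin_objective(w, b, X, y, C=10.0))
```

The same margin violation costs ten times more when C is ten times larger, which is why a large C pushes the optimiser towards a narrower margin with fewer training errors.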

B. k-Nearest Neighbour (kNN)
The nearest neighbour (NN) learning algorithm works on numeric feature values: it treats them as coordinates and uses a standard distance metric between instances [24]. The k-nearest neighbours algorithm (kNN) is a non-parametric method used for classification where the input consists of the k closest training examples in the feature space [25]. As a classifier, kNN allocates a pattern to the class of the nearest pattern value [26]. It starts with every observation in the training set as a prototype and then successively merges any two nearest patterns of the same class as long as the recognition rate is not reduced [27].
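A minimal sketch of the basic kNN decision rule follows, assuming Euclidean distance and a simple majority vote among the k nearest training examples; the prototype-merging step from [27] is not included. The function name and the two-feature training data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=1):
    """Label x by majority vote among its k nearest training examples."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical angle features (degrees) for two letters.
train_X = [(10.0, 80.0), (12.0, 78.0), (60.0, 20.0), (62.0, 25.0)]
train_y = ["Ba", "Ba", "Ta", "Ta"]
print(knn_predict(train_X, train_y, (11.0, 79.0), k=1))
print(knn_predict(train_X, train_y, (58.0, 22.0), k=3))
```

With k=1 the prediction is simply the label of the single closest training observation, which is the setting reported later in the experiment.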

C. Random Forest (RF)
The term “random forest” refers to a collection of many decision trees (forest) where, when building at each node, there is some randomness in selecting the attribute to split. Thus, the RF breaks down a dataset into smaller and smaller subsets while an associated decision tree is incrementally developed at the same time [28]. To build a decision tree, two types of entropy need to be calculated using frequency tables. Entropy measures the impurity of the class distribution in a set of observations, and the reduction in entropy produced by a split is the information gain. Thus, the main RF algorithm steps in Biau [29] show that after calculating the entropy of the observations, the dataset is then split on the different attributes. The attribute with the largest information gain is chosen as the decision node (root); a branch with an entropy of 0 becomes a leaf node, whereas the remaining branches require further splitting. Thus, the algorithm is run recursively on the non-leaf branches until all data are classified [29].
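The entropy and information-gain calculation that drives the choice of a decision node can be sketched as follows. This follows the entropy-based description above (other split criteria exist), and the labels and candidate splits are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of the class distribution in a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

parent = ["Ba", "Ba", "Ta", "Ta"]
clean_split = [["Ba", "Ba"], ["Ta", "Ta"]]    # both children have entropy 0 (leaves)
useless_split = [["Ba", "Ta"], ["Ba", "Ta"]]  # children are as mixed as the parent
print(information_gain(parent, clean_split))
print(information_gain(parent, useless_split))
```

The attribute whose split yields the larger gain would be chosen as the decision node; the children with zero entropy need no further splitting.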
Various methods exist for evaluating the quality of algorithm prediction to guarantee the selection of the best-performing classification algorithm. Among these are [30]:
• Confusion matrix (CM): shows the number of accurate and inaccurate predictions that the classification model makes compared to the actual outcomes (actual value) in the dataset.
• Receiver Operating Characteristic (ROC): also used for evaluation. ROC is a chart that shows the false positive rate (1-specificity) on the X-axis against the true positive rate (sensitivity) on the Y-axis.
• The area under the curve (AUC): determined by calculating the area under the ROC curve; it measures the quality of the classification model and should lie between 0.5 and 1. When the area is close to one, the classifier performance is acceptable; if the area is 0.5 or less, the classifier performance is unacceptable because the classifier cannot distinguish between classes [31].
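The three evaluation measures listed above can be sketched in a few lines. The AUC here is computed through its rank interpretation (the probability that a randomly chosen positive observation is scored above a randomly chosen negative one), which equals the area under the ROC curve; the labels and scores are hypothetical.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(cm):
    correct = sum(n for (a, p), n in cm.items() if a == p)
    return correct / sum(cm.values())

def auc(scores, is_positive):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

actual = ["Ba", "Ba", "Ta", "Ta"]
predicted = ["Ba", "Ta", "Ta", "Ta"]
cm = confusion_matrix(actual, predicted)
print(dict(cm), accuracy(cm))

# One-vs-rest AUC for the letter "Ba" from hypothetical classifier scores.
scores_for_ba = [0.9, 0.2, 0.3, 0.1]
print(auc(scores_for_ba, [a == "Ba" for a in actual]))
```

For a multi-class problem such as the 28 letters, one such one-vs-rest AUC can be computed per letter.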

A. Hardware and Software
Applying machine learning to classification has become easier with the development of depth cameras and sensors that identify the individual body parts of a natural-looking human more accurately [32]. Sign language relies on different body parts, which necessitates the use of multiple sensors. In this research, Kinect™ and Leap Motion Controller (LMC) sensors were used to create a model for recognising ArSL gestures. Microsoft Kinect Version 2.0 has a Red Green Blue (RGB) depth camera and a human skeletal tracking algorithm that offers information about human body joints [33]. Meanwhile, LMC Version 2.0 provides a skeletal-tracking algorithm that offers information about hands and fingers as well as overall hand-tracking data, even if the hands cross over each other. “Figure 2” presents the 11 joints that needed to be retrieved via Kinect and the 12 points that needed to be retrieved via LMC in this research. The Microsoft Kinect and LMC open-source software development kit (SDK) libraries were used to develop the proposed prototype with options for reading and managing visual depth information [34]. Visual Studio 2013 with C# was also used to calibrate the two devices, and the SQL Server Management Studio 2010 was used to create a relational database.

B. Data Collection
A prototype system was developed to collect data using the two sensors. The main window interface in the prototype provides real-time joint detection by representing the user's joint points as well as a histogram to give visual sign indications.
As participants gesture each letter, they can individually click a button to save the body pose for each gesture.
“Figure 3” provides an example of a three-dimensional (3D) human skeleton in which a line (vector) was drawn between each pair of corresponding points. To standardise the distance or depth metrics between the two devices, the length of each vector was converted from metres, which Kinect uses, to millimetres, which LMC uses.
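The unit standardisation described above amounts to building a vector between two joint points and scaling Kinect coordinates by 1000. A minimal sketch, with hypothetical function names and joint coordinates:

```python
import math

METRES_TO_MM = 1000.0

def bone_vector_mm(joint_a, joint_b, units="m"):
    """Vector from joint_a to joint_b, expressed in millimetres."""
    scale = METRES_TO_MM if units == "m" else 1.0
    return tuple((b - a) * scale for a, b in zip(joint_a, joint_b))

def length(vec):
    return math.sqrt(sum(c * c for c in vec))

# Hypothetical joints: Kinect reports metres, LMC reports millimetres.
kinect_bone = bone_vector_mm((0.0, 0.0, 1.0), (0.3, 0.4, 1.0), units="m")
lmc_bone = bone_vector_mm((10.0, 20.0, 30.0), (10.0, 20.0, 80.0), units="mm")
print(length(kinect_bone), length(lmc_bone))
```

After conversion, lengths from both devices are directly comparable.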

C. Feature Extraction
A feature represents a piece of information in any multimedia type, such as image, text, and video. It could be the direction of a certain object, such as the hand bones' direction [39]. For this research, the depth values that the two sensors captured were used to create two feature types, as seen in “Figure 4.” Type one was denoted as “H” in the database; it included three angles for each hand bone, which were the angles between the bone and the three axes of the coordinate system (X, Y, Z). Type two was denoted as “A” in the database; it included one angle between each pair of bones. These angles are the main factor for a comparison between two gestures.

Then, the prototype was considered ready to use in the experimental environment, as seen in “Figure 5.” Twenty participants were asked to gesture the 28 Arabic alphabet letters. Each participant stood in front of the devices, which were connected to a personal computer, and he or she made around 28 to 40 gestures and mimicked sign gestures spanning seven days. Around 200 right gestures were collected daily for different letters from different participants. Therefore, the number of gestured letters (observations) also varied between participants; for example, some participants gave five or more gestures for a specific letter. Table 1 shows the number of observations for each class (letter) in descending order.

The collected dataset had 235 features, presented in “Figure 6” as columns: the values of H0 to H180 were from type one, and the values of A1 to A54 were from type two. The dataset was reduced by selecting the body parts on which each gesture relied while removing all values that would not affect the interpretation of the ArSL letters. For example, the feature “A1” was an angle between the shoulder and right hand and would not affect the recognition of any ArSL letter depending on the hand bones only (at this point, the features became 102 values). In addition, the features with zero variance were removed; for example, when the variance of all values in feature “A9” was calculated, the result was zero, so that feature did not affect the recognition either (the features became 90 values).
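The two angle feature types described in this section (three bone-to-axis angles for type “H”, one bone-to-bone angle for type “A”) can be sketched with direction cosines and the dot product. The source does not state the exact formulas used, so this is an assumption, and the bone vectors below are hypothetical.

```python
import math

def norm(vec):
    return math.sqrt(sum(c * c for c in vec))

def axis_angles(bone):
    """Angles (degrees) between a bone vector and the X, Y and Z axes."""
    n = norm(bone)
    return tuple(math.degrees(math.acos(c / n)) for c in bone)

def angle_between(bone_a, bone_b):
    """Angle (degrees) between two bone vectors."""
    dot = sum(a * b for a, b in zip(bone_a, bone_b))
    cosine = dot / (norm(bone_a) * norm(bone_b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

print(axis_angles((0.0, 0.0, 50.0)))                     # a bone pointing along Z
print(angle_between((1.0, 0.0, 0.0), (1.0, 1.0, 0.0)))   # two bones 45 degrees apart
```

Because angles do not depend on bone length, features of this kind are unaffected by differences in hand size between participants.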
The dataset observations are presented in “Figure 6” as rows, which include 1456 observations. Certain observations were removed as well, such as: 1) the rows that had the same values and 2) the rows that had multiple missing values (null values, where the device did not capture observation values well). After cleaning, the dataset contained 90 features and 1398 observations.
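The cleaning steps described above (dropping zero-variance features, duplicate rows, and rows with multiple missing values) can be sketched as follows. The threshold `max_missing=1` is an assumed reading of “multiple missing values”, and the rows and values are hypothetical.

```python
import statistics

def drop_zero_variance(rows, features):
    """Keep only the features whose observed values are not all identical."""
    return [f for f in features
            if statistics.pvariance([r[f] for r in rows if r[f] is not None]) > 0]

def drop_bad_rows(rows, features, max_missing=1):
    """Remove exact duplicate rows and rows with more than max_missing null values."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(row[f] for f in features)
        if sum(v is None for v in key) > max_missing or key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept

rows = [
    {"H1": 10.0, "A9": 0.0, "A12": 35.0},
    {"H1": 10.0, "A9": 0.0, "A12": 35.0},   # duplicate of the first row
    {"H1": None, "A9": 0.0, "A12": None},   # two missing values
    {"H1": 42.0, "A9": 0.0, "A12": 61.0},
]
features = drop_zero_variance(rows, ["H1", "A9", "A12"])
cleaned = drop_bad_rows(rows, features)
print(features, len(cleaned))
```

In this toy example the constant feature “A9” is dropped, and two of the four rows survive.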
“Figure 6” shows the dataset structure, where each observation was considered a letter from a specific participant and contained many features.
• RF, which many researchers recommend for its high accuracy [36]
• kNN, which is commonly used for its ease of interpretation and low processing time [25]
The results of the three classifiers were combined, and the results were reused as a new dataset to train the same classifiers. The result of this combination is called an “ensemble schema dataset.” Therefore, the training datasets were classified as an original dataset and an ensemble schema dataset.
The stacking schema was used for this research with only the classifiers' predictions (class labels were the letter names) as input for the ensemble model, without the original data, as seen in “Figure 7.”
1) The database was divided into two sets: a training set and a testing set.
2) The training set was fed into classifiers to train them to recognise the class labels (letters).
3) The testing set was used to evaluate the classifiers' prediction ability (if it could recognise letters in the testing set accurately).
4) The CM showed the number of accurate and inaccurate predictions that the classifier made compared to the actual outcomes (actual value) in the testing set. Then, all classifiers' performance was evaluated by calculating the area under the ROC curves.
The implementation details of each phase are as follows in “Figure 8” and “Figure 9”:
Phase 1: The raw database of 1456 observations (considered the letters) became 1398 after removing the rows that had the same values. The dataset was split 75% to 25% into a training set with 1047 observations and a testing set with 351 observations. This training set was divided once more with the same splitting ratio into observations, such that:
• 730 observations, together with their correct letters, were used to train each model individually; and
• 317 observations were used by the model to predict (classify) letters using the SVM, kNN, and RF algorithms.
Then, the prediction results from all three algorithms (317 predictions for each) were combined to become the training set of the three classifiers in phase 2. In addition, by using 351 observations, the model had to predict (classify) letters using the SVM, kNN, and RF algorithms as well. Then, the prediction results were combined to become the testing set of the three classifiers in phase 2.
Phase 2: The three classifiers were trained on the prediction data produced in phase 1 and then tested on the combined predictions for the 351 testing observations.
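The two-phase schema above can be sketched end to end using the reported subset sizes (730 + 317 + 351 = 1398). Everything else in this sketch is an assumption for illustration: the three base learners are 1-nearest-neighbour models with different distance metrics standing in for SVM, kNN, and RF, the phase 2 learner is a simple label-lookup table, and the data are synthetic.

```python
import math
import random
from collections import Counter, defaultdict

def fit_1nn(metric):
    """Build a trainer for a 1-nearest-neighbour predictor under the given metric."""
    def fit(X, y):
        return lambda x: y[min(range(len(X)), key=lambda j: metric(X[j], x))]
    return fit

def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def chebyshev(a, b):
    return max(abs(p - q) for p, q in zip(a, b))

# Hypothetical stand-ins for SVM, kNN and RF: any three fit/predict learners fit this schema.
BASE_LEARNERS = [fit_1nn(math.dist), fit_1nn(manhattan), fit_1nn(chebyshev)]

def fit_label_lookup(rows, labels):
    """Phase 2 stand-in: map each tuple of predicted labels to its most common true label."""
    table = defaultdict(Counter)
    for row, label in zip(rows, labels):
        table[row][label] += 1
    def predict(row):
        votes = table[row] if row in table else Counter(row)  # unseen tuple: majority vote
        return votes.most_common(1)[0][0]
    return predict

def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def two_phase(X, y, n_fit=730, n_meta=317, n_test=351):
    a, b = n_fit, n_fit + n_meta
    # Phase 1: train each base learner on the first subset of the training set.
    models = [fit(X[:a], y[:a]) for fit in BASE_LEARNERS]
    stack = lambda rows: [tuple(m(x) for m in models) for x in rows]
    ens_train = stack(X[a:b])            # predictions forming the phase 2 training set
    ens_test = stack(X[b:b + n_test])    # predictions forming the phase 2 testing set
    y_test = y[b:b + n_test]
    phase1 = [accuracy([row[i] for row in ens_test], y_test) for i in range(len(models))]
    # Phase 2: train on predicted labels only (no original features), then test.
    meta = fit_label_lookup(ens_train, y[a:b])
    phase2 = accuracy([meta(row) for row in ens_test], y_test)
    return phase1, phase2

random.seed(0)
centres = {"Ba": (0.0, 0.0), "Ta": (4.0, 0.0), "Kaf": (0.0, 4.0)}  # synthetic 2-D "gestures"
y = [random.choice(list(centres)) for _ in range(1398)]
X = [tuple(random.gauss(c, 1.0) for c in centres[label]) for label in y]
phase1, phase2 = two_phase(X, y)
print("phase 1 accuracies:", phase1, "phase 2 accuracy:", phase2)
```

The key design point is that the phase 2 learner never sees the original features: its input rows are the three predicted labels only, which matches the stacking variant described in the text.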

E. Classification Results
The results of the classification in phase 1 for each classifier in detail are (Table 2):
1) kNN's parameter k was assigned a value equal to the square root of the available total number of observations. The value of k could be adjusted from 1 to 10. The value of k=1 was chosen for less computation and an accuracy of 85.484%.
2) SVM's two parameters, cost and gamma, were set to 2 and 0.01, respectively, to get the highest accuracy. In addition, the kernel was set to “radial” because it uses curves instead of straight lines to separate the different labels; accuracy was 88.803%.
3) RF's two parameters: n(tree) (total number of trees to build) was set to 2000 and the node size (maximum children each tree can have) was set to five, which achieved an accuracy of 86.809%.

The results of the classification in phase 2 for each classifier in detail are (Table 2):
1) kNN had an accuracy of 87.151%, where the parameter k=1.
2) SVM had an accuracy of 86.880%, where the kernel = “linear” and SVM's two parameters, cost and gamma, were set to 1 and 0.01, respectively.
3) RF had an accuracy of 88.048%, with RF's two parameters, n(tree) and node size, set to 200 and 1, respectively.

The classifiers' performance in the two phases was evaluated using AUC for individual letter accuracy; these results are shown in “Figure 10” and “Figure 11.”

SVM achieved optimum results for this experiment when trained on the original dataset and not on the ensemble schema dataset, and this could be attributed to the variation in the number of observations for each class (Table 1). The devices had a slow response compared to human movement and low precision in capturing the frames of a specific gestured letter. This was especially true for complex letters such as the following: ذ (Thal), ط (Tah), and ظ (Thah), where fingers overlapped, and the participant had to repeat the gesture or drop it altogether. This ultimately resulted in the variance between the numbers of observations for each class.
The variations in observation numbers were examined to assess whether they affected the results. The discrepancy between the overall results of the algorithms was investigated when they were trained on the original dataset and on the ensemble schema dataset. The researchers proposed that the SVM could have achieved better results when applied to the original dataset due to the variance between the numbers of observations for each class. For some classes, such as ذ (Thal), ط (Tah), and ظ (Thah), which had fewer than 20 observations, this number was sometimes less than 10 in the training set. Running the three classifiers on these observations could have affected the overall results.
The three classifiers, kNN, SVM, and RF, were run on classes that had more than 65 observations each. The selection of 65 as a number is statistically justified because the observations for each class were divided into training and testing sets, with the former requiring no fewer than 50 observations so that the model, which was based on the training set, would be satisfactory. In this particular case, it covered the eight classes (letters) with the most observations, which are as follows: ت (Ta) with 79 observations, ك (Kaf) with 74 observations, ب (Ba) with 71 observations, ج (Jiem) with 70 observations, س (Sien) with 70 observations, ق (Qaf) with 68 observations, ل (Lam) with 68 observations, and ر (Ra) with 68 observations. Moreover, the three classifiers (kNN, SVM, RF) were also re-run on the remaining 20 classes with fewer than 65 observations. Table 3 and Table 4 demonstrate the discrepancy noted earlier, which shows how the classifiers have changed in their overall accuracy results.
The eight ArSL letters that had more than 65 observations each were analysed (Table 3). It was concluded that the performance of all of the classifiers was enhanced when using a high number of observations. The accuracy results in phase 1 for kNN, SVM, and RF were 93.566%, 96.119%, and 93.846%, respectively. The results in phase 2 for kNN, SVM, and RF were 95.524%, 94.336%, and 95.699%, respectively. The remaining 20 classes of the ArSL alphabet, which had fewer than 65 observations for each letter, were also analysed (Table 4). The accuracy results in phase 1 for kNN, SVM, and RF were 85.216%, 88.221%, and 86.178%, respectively, and in phase 2, the results were 87.163%, 87.500%, and 88.413%, respectively.

Recognition accuracy results for each phase are as follows (Table 5):
1) Among individual classifiers, overall, SVM had higher accuracy in phase 1.
2) For the ensemble approach, overall, RF had higher accuracy in phase 2.
3) For all classes and classes with more than 65 observations, SVM had a higher accuracy in phase 1 than RF did in phase 2.
4) RF achieved higher accuracy in phase 2 for classes with fewer than 65 observations compared to SVM in phase 1, but the difference was negligible.

This research used two depth sensors to capture all upper human skeleton joints, upon which most sign-language gestures rely. The supervised machine learning algorithms of kNN, SVM, and RF classified the depth values of gestures representing all ArSL letters.
It is essential to enhance the recognition accuracy of ArSL when using a supervised machine-learning approach while avoiding a complex schema that requires more computation time (the ensemble needs results from the three classifiers to classify the dataset).
The classification was performed using R packages. Three classifiers, SVM, kNN, and RF, were used in a two-phase classification process to recognise and interpret incoming gestures. In phase 1, the three classifiers were trained on the original dataset, whereas in phase 2 they were trained on an ensemble schema dataset that combined the phase 1 results of the three classifiers, in order to classify the classes again. In addition, the varying numbers of observations for each letter were analysed to check whether they affected the classifiers' accuracy.
As shown in Table 5, the recognition accuracy results differed among the three classifiers, between the two phases, and across the different numbers of observations (classes with all observations, classes with fewer than 65 observations, and classes with more than 65 observations).
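The comparison can be checked directly from the accuracies reported in the text. The sketch below finds the best classifier in each phase for each subset and the gap between the best phase 2 and best phase 1 result; the dictionary layout and function name are illustrative only.

```python
# Accuracies (%) as reported in the text: classifier -> (phase 1, phase 2).
REPORTED = {
    "all classes": {"kNN": (85.484, 87.151), "SVM": (88.803, 86.880), "RF": (86.809, 88.048)},
    "more than 65 observations": {"kNN": (93.566, 95.524), "SVM": (96.119, 94.336), "RF": (93.846, 95.699)},
    "fewer than 65 observations": {"kNN": (85.216, 87.163), "SVM": (88.221, 87.500), "RF": (86.178, 88.413)},
}

def best_and_gap(acc):
    """Return the best phase 1 classifier, the best phase 2 classifier, and their gap."""
    best1 = max(acc, key=lambda c: acc[c][0])
    best2 = max(acc, key=lambda c: acc[c][1])
    return best1, best2, round(acc[best2][1] - acc[best1][0], 3)

for subset, acc in REPORTED.items():
    best1, best2, gap = best_and_gap(acc)
    print(f"{subset}: phase 1 best = {best1}, phase 2 best = {best2}, gap = {gap:+.3f} points")
```

Only the subset with fewer than 65 observations shows a positive gap in favour of the ensemble, and it is under a fifth of a percentage point.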
The researchers concluded that the implementation of SVM produced a higher overall accuracy when running as an individual classifier, no matter the number of observations. However, for small datasets, RF's ensemble approach could be used, as it had higher accuracy than SVM did in phase 1.
Although RF produced higher accuracy results for classes with limited class observation data, the difference between the accuracy results of RF in phase 2 and SVM in phase 1 was negligible. Such a difference does not warrant using an ensemble approach, which adds more processing complexity. With such a result, SVM used as an individual classifier would be the more efficient choice because it produces higher recognition accuracy with less complexity. Future work on this subject could address how this prototype can be used to collect and classify dynamic gestures (multiple frames) that represent the sign of one word or phrase.

Fig. 1. The 28 letters of the Arabic Sign Language alphabet

II. LITERATURE REVIEW
Many researchers have investigated the combination of voting schema since 1998, such as Kearns and Valiant [14], Rob Schapire, and others [15]. Schapire (1999) came up with an algorithm to apply such a combination called boosting, which is used with machine learning [15].

Fig. 2. Depth sensors' joint points detected based on the Cartesian coordinate system

Fig. 3. Window in the prototype

Windows Media 3D from the Microsoft Developer Network (MSDN) was used to visualise the captured data in the 3D space of human body joints by drawing one skeleton from the details retrieved from the two devices.

Fig. 4. Example of three angles for one joint and one angle between two bones

Fig. 6. Original dataset structure

D. Classification Implementation
A dataset of 1456 gestured letters (observations) of the ArSL was collected. This original dataset was passed through three individual classifiers:
• SVM, which gave the highest accuracy results of ArSL letter classification in the experiments [13]

Fig. 7. Diagram of ensemble using the stacking concept

In stacking's simplest form, the results from three different classifiers generated a new dataset named the “ensemble schema dataset.” Classification passed through two phases implementing the following steps:

TABLE I. NUMBER OF OBSERVATIONS IN EACH CLASS