Indonesia Sign Language Recognition using Convolutional Neural Network

In daily life, the deaf use sign language to communicate with others. However, the non-deaf experience difficulties in understanding this communication. To overcome this, sign recognition via human-machine interaction can be utilized. In Indonesia, the deaf use a specific language, referred to as Indonesia Sign Language (BISINDO). However, only a few studies have examined this language. Thus, this study proposes a deep learning approach, namely, a new convolutional neural network (CNN), to recognize BISINDO. There are 26 letters and 10 numbers to be recognized. A total of 39,455 data points were obtained from 10 respondents under varying lighting and perspective: specifically, bright and dim lighting, and first- and second-person perspectives. The architecture of the proposed network consisted of four convolutional layers, three pooling layers, and three fully connected layers. This model was tested against two common CNN models, AlexNet and VGG-16. The results indicated that the proposed network is superior to a modified VGG-16, with a loss of 0.0201. The proposed network also had a smaller number of parameters than a modified AlexNet, thereby reducing the computation time. Further, the model was evaluated on the test data and achieved an accuracy of 98.3%, precision of 98.3%, recall of 98.4%, and F1-score of 99.3%. The proposed model could recognize BISINDO in both dim and bright lighting, as well as signs from the first- and second-person perspectives.

Keywords: Indonesia sign language (BISINDO); recognition; CNN; lighting


I. INTRODUCTION
Humans use language to communicate with others. However, a communication disorder may occur because of various factors that cause an impairment in understanding oral speech [1]. Such factors can arise from a hearing disorder or deafness. Thus, deaf people use sign language or hand gestures to communicate. However, most non-deaf people experience difficulties in understanding sign language. A computerized sign recognizer could be employed as an important tool to enable mutual understanding between deaf and non-deaf people.
Various studies have proposed methods to recognize hand gestures or sign languages in different countries, because each country has its own sign language. Examples include American Sign Language [2], Arabic Sign Language [3], Bengali Sign Language [4], Peruvian Sign Language [5], and Chinese Sign Language [6], each recognized using various methods.
Indonesia has two sign languages: Indonesia Sign Language System (SIBI) and Indonesia Sign Language (BISINDO). In 1994, SIBI became the language used in formal schools for students with impairments. However, the deaf prefer to use BISINDO instead of SIBI in their daily lives.
Several studies have addressed the recognition of SIBI. Hand gesture recognition approaches can be divided into vision-based and sensor-based [7]. In vision-based approaches, images are acquired through a video camera. Meanwhile, sensor-based recognition needs an instrument to capture the motion, position, or velocity of the hands. Studies in Indonesian sign languages implemented the vision-based approach. A. Anwar et al. used a leap motion controller to recognize Indonesian sign language using features extracted from hand movement [8]. In [9], a Myo Armband tool was used, which has five sensors, namely accelerometer, gyroscope, orientation, orientation Euler, and electromyography (EMG). Both vision-based and sensor-based approaches require data acquisition and classification stages. Various classification methods have been proposed to recognize patterns carried by input data. The k-nearest neighbor classification method was used to recognize SIBI [10]. In that study, the distance between the coordinates of each distal bone and the position of the palm was measured using Euclidean distance. Meanwhile, Khotimah et al. implemented weighted k-nearest neighbor classification for dynamic sign language recognition [11]. Rosalina et al. used artificial intelligence to recognize SIBI [12]. Other studies utilized Hidden Markov Model [13] and Naïve Bayes [14] methods. Meanwhile, [15] used the generalized learning vector quantization model to recognize BISINDO, and [16] utilized the Scale-Invariant Feature Transform (SIFT) algorithm to recognize Indonesian Sign Language numbers. Iqbal et al. used Discrete Time Warping on a mobile device to recognize SIBI [17].
Most of the studies above discussed SIBI; however, BISINDO is the most common sign language used by the deaf in Indonesia. Thus, this study aims to convert hand gestures to text in BISINDO to improve communication between deaf and non-deaf people. In addition, the methods used in other studies depended on feature extraction. To improve performance, this study proposes a method to recognize BISINDO using a convolutional neural network (CNN), which uses the convolution layer as the feature extraction layer [18]. In other studies, a CNN was used by [2] to recognize American Sign Language. They employed a CNN to extract the features from the sign images, and the classifier used was a multiclass support vector machine. Hayani et al. also utilized a CNN coupled with an Adam optimizer to recognize Arabic sign language [3]. Hossen et al. used a deep convolutional neural network to recognize Bengali sign language [4].
However, few previous works have addressed converting BISINDO sign language to text. Furthermore, there is a need to develop CNN models that have lower computation costs for converting sign language to text. This study addressed both needs by developing a new CNN architecture that converts BISINDO hand gestures to text and that reduces computation costs by using fewer parameters than the common CNN architectures. The experimental research objective of this study was to compare the BISINDO recognition performance of this simplified CNN model with that of AlexNet and VGG-16, which are commonly used CNN architectures. We tested the performance using BISINDO standard hand signs recorded by a webcam under bright and dim lighting, and from first- and second-person perspectives.
This paper is organized as follows: Section II provides a brief summary of BISINDO, followed by a description of the CNN architecture in Section III. The methods used in this study are described in Section IV. The results and discussion are presented in Section V. Finally, the paper is concluded in Section VI.

II. INDONESIAN SIGN LANGUAGE
Sign language is a language that is expressed using body gestures and facial expressions as a symbol of the meaning of spoken language [19]. The sign languages of Indonesia can be categorized into two types: SIBI and BISINDO. SIBI was adopted from American Sign Language and is used as the formal sign language in schools for deaf students. However, the deaf prefer to use BISINDO instead of SIBI owing to its better applicability. The signs for the letters [20] and numbers [21] in the BISINDO language are shown in Fig. 1.

III. CONVOLUTIONAL NEURAL NETWORK (CNN)
A CNN is typically used to detect or recognize images. It has an architecture that consists of a feature extraction layer and a fully connected layer. The feature extraction layer comprises a convolution layer and pooling layer. The general architecture of the CNN is illustrated in Fig. 2. The convolution layer extracts the features of images. This results in a linear transformation from the input, which is suitable for the spatial information of the filter. The weights in this layer determine the kernel convolution. Thus, kernel convolution can be trained based on the CNN input. The pooling layer comprises a filter with a stride and a certain size that passes through the path in the feature map. It aims to reduce image size. There are two types of pooling layers: max pooling and average pooling. In this study, max pooling was utilized by determining the maximum value in the vector dimension. After passing the convolution and pooling layers, the output of this process is used as the input to the fully connected layer. However, before this process, the input must be converted into one dimensional data. Finally, the process is performed using Softmax. Softmax calculates the probabilities for all target classes to determine the classes based on the input [22].
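The pipeline described above (convolution, max pooling, flattening, Softmax) can be illustrated with a minimal pure-Python sketch. The toy image, the 2 × 2 kernel, and all function names are assumptions made for illustration and are not taken from the study. The fully connected layers are omitted, so Softmax is applied directly to the flattened values only to show the mechanics.

```python
import math

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products.
    As in most CNN libraries, this is cross-correlation: the kernel is not flipped."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool2d(fmap, size=2, stride=2):
    """Keep the maximum value inside each pooling window."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    return [
        [
            max(fmap[r * stride + i][c * stride + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def flatten(fmap):
    """Convert a 2-D feature map into one-dimensional data."""
    return [value for row in fmap for value in row]

def softmax(scores):
    """Turn raw scores into probabilities; subtracting the peak avoids overflow."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 5x5 "image" and 2x2 filter (illustrative values only).
image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [1, 3, 2, 0, 1],
]
kernel = [[1, 0], [0, -1]]

features = conv2d_valid(image, kernel)  # 4x4 feature map
pooled = max_pool2d(features)           # 2x2 after 2x2 max pooling
vector = flatten(pooled)                # one-dimensional input for later layers
probs = softmax(vector)                 # probabilities that sum to 1
```

In a trained network, the kernel weights would be learned rather than fixed by hand, and the flattened vector would pass through the fully connected layers before Softmax.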

IV. METHODS
This section describes the steps used in our methods. This study was performed using primary data obtained from people who had no prior knowledge of sign language. The steps were as follows. A webcam was used to gather sets of hand sign data from ten people to use as training data. The data were obtained by considering two conditions: lighting and perspective of the person. Then, a new CNN model, named model C, was designed and trained. For comparison, we trained modified versions of AlexNet and VGG-16. Finally, the three models were tested and evaluated against the test data.

A. Data
The data used in this study were obtained using a Logitech C922 webcam with a resolution of 1080p at 30 fps. Ten respondents were asked to perform hand gestures, which consisted of 26 letters and the numbers from 1 to 10, adhering to the BISINDO standard. Data were acquired 30 cm from the camera, as shown in Fig. 3. A green screen was placed as a background to minimize noise. Data were obtained by considering two conditions: lighting and perspective of the person.

B. Architecture of CNN
The CNN architecture used in this study consisted of three architectures, namely, models A, B, and C. Model A was a modified version of AlexNet [23]. The original AlexNet has 24,884,005 parameters, whereas the modified one has 1,432,261. Model B was a modified architecture of the VGG-16 [24]. It was modified to 2,140,405 parameters from its original value of 33,748,837. AlexNet and VGG-16 were chosen because they are the most common architectures used in CNNs. The architectures of models A and B are shown in Fig. 4 and 5, respectively.
This study proposed a new architecture, namely model C. Model C is a simpler architecture that consists of convolutional layer 1, max pooling 1, convolutional layer 2, convolutional layer 3, max pooling 2, convolutional layer 4, max pooling 3, flattened layer, and 3 fully connected layers. The visualization of model C is shown in Fig. 6.
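To make the layer order of model C concrete, the following sketch traces how the spatial size of a 60 × 60 input would shrink through the convolution and pooling stages, and shows the standard formulas for counting layer parameters. The 3 × 3 kernels, the absence of padding, the 2 × 2 pooling windows, and the filter counts in the usage lines are assumptions for illustration; the text does not state them, so the numbers below are not expected to reproduce the reported parameter total.

```python
# Layer order of model C up to the flattened layer, as described in the text.
MODEL_C_LAYERS = ["conv", "pool", "conv", "conv", "pool", "conv", "pool"]

def trace_spatial_size(size, layers, kernel=3, pool=2):
    """Return the feature-map side length after each layer.
    Assumes unpadded convolutions with stride 1 and non-overlapping pooling."""
    sizes = [size]
    for layer in layers:
        if layer == "conv":
            size = size - kernel + 1
        elif layer == "pool":
            size = (size - pool) // pool + 1
        else:
            raise ValueError(f"unknown layer type: {layer}")
        sizes.append(size)
    return sizes

def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter in a convolutional layer."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, units):
    """Weights plus one bias per unit in a fully connected layer."""
    return (inputs + 1) * units

sizes = trace_spatial_size(60, MODEL_C_LAYERS)

# Hypothetical filter and unit counts, chosen only to exercise the formulas.
first_conv = conv_params(kernel=3, in_channels=3, filters=32)
last_dense = dense_params(inputs=64, units=36)  # 36 classes: 26 letters + 10 numbers
```

Counting parameters this way shows why a shallower network with fewer filters has a smaller total than the modified AlexNet and VGG-16 baselines.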

C. Evaluation
This study utilized accuracy, precision, recall, and F1 scores to evaluate the performance of the three models. Accuracy was calculated as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

where true positive (TP) is the number of positive data correctly predicted as positive, true negative (TN) is the number of negative data correctly predicted as negative, false positive (FP) is the number of negative data incorrectly predicted as positive, and false negative (FN) is the number of positive data incorrectly predicted as negative.

In addition, precision and recall were utilized as evaluation parameters. These can be calculated as

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

and

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

The balance between precision and recall is determined using the F1-score, which is obtained as follows:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
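The four evaluation measures can be computed directly from the TP, TN, FP, and FN counts, as in the short sketch below. The counts in the usage lines are made-up values for a single class and are not results from this study. For a multi-class problem, such counts would be computed per class and then averaged, and the averaging scheme is an assumption the text does not specify.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that were found."""
    return tp / (tp + fn)

def f1_score(prec, rec):
    """Harmonic mean of precision and recall."""
    return 2 * prec * rec / (prec + rec)

# Illustrative counts for one class (not taken from the paper).
tp, tn, fp, fn = 90, 95, 5, 10

acc = accuracy(tp, tn, fp, fn)
prec = precision(tp, fp)
rec = recall(tp, fn)
f1 = f1_score(prec, rec)
```

Because the F1-score is a harmonic mean, it always lies between the precision and recall values it combines.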

V. RESULTS AND DISCUSSION

B. Data Preprocessing
Before using the data in the CNN, the image data were preprocessed. This stage was performed by resizing the image and scaling the features. The image was resized to the same size of 60 × 60 pixels. Thereafter, feature scaling was performed by dividing the values at each point in the image by 255 such that the data value interval in the image was 0-1. Fig. 8 shows the preprocessed results of the image data.
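The two preprocessing steps can be sketched as follows. The text does not state which interpolation method was used for resizing, so nearest-neighbor sampling is an assumption here, as are the function names and the synthetic single-channel input image.

```python
def resize_nearest(image, out_h=60, out_w=60):
    """Resize a 2-D image by nearest-neighbor sampling (assumed method)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

def scale_to_unit(image):
    """Divide every 8-bit pixel value by 255 so that values lie in [0, 1]."""
    return [[pixel / 255 for pixel in row] for row in image]

# Synthetic 90x120 grayscale image with values in 0..255 (illustrative only).
raw = [[(r + c) % 256 for c in range(120)] for r in range(90)]

prepared = scale_to_unit(resize_nearest(raw))
```

Scaling to the 0-1 interval keeps all inputs in a common range, which generally helps gradient-based training converge.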

C. Data Split
The preprocessed data were then fed as input to the CNN. In total, 39,455 data points were obtained, which were divided using the stratified shuffle split method into three parts: 60% training data, 20% validation data, and 20% test data, as shown in Fig. 9.
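A stratified shuffle split keeps the class proportions the same in every subset. The following dependency-free sketch shows the idea for a 60/20/20 division. The function name, the fixed seed, and the toy labels are assumptions, and the study's actual tooling is not stated in the text.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle the indices of each class separately, then cut each class
    into training, validation, and test parts. The test part receives
    whatever remains after the first two fractions are taken."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test

# Toy data set: three classes with ten samples each (illustrative only).
labels = [cls for cls in "ABC" for _ in range(10)]

train, val, test = stratified_split(labels)
```

With ten samples per class, each class contributes six training, two validation, and two test samples, so no class is missing from any subset.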

D. Training Process
The training process was conducted using the CNN algorithm. The training parameters for the three models are listed in Table I.

TABLE I. PARAMETERS OF TRAINING

Parameter       Value
Image size      60 × 60
Optimizer       Adam
Epoch           100
Learning Rate   0.001

The loss and accuracy of the training results using models A, B, and C are shown in Fig. 10, 11, and 12, respectively.
As shown in Fig. 10, Model A exhibited training and validation losses of 0.011 and 0.096, respectively, and training and validation accuracies of 0.997 and 0.984, respectively. The loss curve fluctuates, indicating some instability. Nevertheless, the model learned the patterns: the loss values tend toward zero over the epochs and the accuracy improves. In contrast, Model B has a high loss value and low accuracy, as shown in Fig. 11, which implies that it could not learn the patterns in the hand gestures. Fig. 12 shows that Model C has training and validation losses of 0.020 and 0.079, respectively, and training and validation accuracies of 0.995 and 0.984, respectively. Thus, Model C can learn the hand gestures, because the loss value approaches zero and the accuracy increases. A comparison of these models is shown in Table II.

As shown in Table II, Models A and C have low loss values and high accuracy compared to Model B. Overall, Model A has the lowest training loss, and high training and validation accuracy. Model C has the lowest validation loss, and high training and validation accuracy. In addition, Model C used 177,383 parameters in total, whereas Model A had 1,432,261 parameters. Therefore, the computation time of Model C was the smallest among the models. Although Model C still exhibited fluctuations in validation loss and validation accuracy (Fig. 12), they were smaller than those of Model A (Fig. 10). Thus, Model C has a more stable validation loss. Based on these results, Model C exhibited the best performance among the three models. The models were then evaluated on the test data to determine whether they generalize to unseen data.

E. Testing
Testing was performed after training to determine the ability of the model to predict the class of hand gestures. The test results are shown in Table III. As shown in Table III, Model A has an accuracy of 0.986, F1 score of 0.996, precision of 0.987, and recall of 0.987. The test results of Model C are very similar to those of Model A, with an accuracy of 0.983, F1 score of 0.993, precision of 0.983, and recall of 0.984. Since Model B failed to learn, its test results were very low. Thus, Models A and C obtained the best results. However, Model C has fewer parameters, thereby requiring less time to predict the data than Model A. The average prediction time per sample for Model C was half that of Model A: 0.0001 s for Model C and 0.0002 s for Model A. Therefore, Model C is twice as fast as Model A at prediction while achieving near-equivalent performance.

1) Test results by lighting:
This study used two lighting conditions: bright and dim. The performances for both conditions are shown in Table IV. As shown in Table IV, both Models A and C could recognize the testing data in the two different lighting conditions, and they both had high performance. Meanwhile, Model B performed poorly in recognizing the signs.

2) Test results by perspective:
This study used the first- and second-person perspectives. The camera was positioned either in the direction of the signer's own view (first-person perspective) or in the direction of others who observe the hand gesture (second-person perspective). The performances for both conditions are shown in Table V. Table V shows that Models A and C can recognize the signs in both the first- and second-person perspectives with high performance levels. There was a slight improvement with the second-person perspective.

F. Hand Gesture Prediction Results
The performance of the proposed model for predicting hand gestures was evaluated as well. Each class of hand gestures was performed, and the results obtained are shown in Fig. 13. The proposed model can recognize new data. Further, hand gestures in the dim condition yielded a higher accuracy than in the bright condition for the first-person perspective. In contrast, the second-person perspective exhibited the same performance under both dim and bright conditions. Selected samples of hand gesture recognition are listed in Table VI. As shown in Table VI, the proposed CNN model C works well in predicting hand gestures that were not included in the training data. This implies that the CNN can be implemented to recognize hand gestures. However, prediction errors occurred in certain classes, such as 2, M, N, V, and J. These errors occurred because the hand gestures for these classes are identical or very similar to those of other classes. The number 2 and the letter V have the same hand gesture; thus, the CNN made errors when predicting these classes. From the first-person perspective, no difference was observed between the hand gestures for the letters M and N, thereby resulting in prediction errors. Further, the hand gesture for the letter J is not static; thus, a prediction error occurred wherein the letter I was predicted, because the initial movement of the sign for the letter J resembles that for the letter I.

VI. CONCLUSION
The results of this study demonstrated that our new simplified CNN model exhibited good performance in recognizing BISINDO hand gestures. The CNN architecture used was a simple architecture consisting of convolutional layer 1, max pooling 1, convolutional layer 2, convolutional layer 3, max pooling 2, convolutional layer 4, max pooling 3, flattened layer, and 3 fully connected layers. The training parameters were the Adam optimizer, 100 epochs, and a learning rate of 0.001. During the training process, the last epoch resulted in a training loss of 0.0201, a validation loss of 0.0785, a training accuracy of 0.9948, and a validation accuracy of 0.9839. Hand sign recognition testing using the CNN model on the test data yielded an accuracy of 98.3%. Thus, this new simplified CNN model can recognize BISINDO hand gestures well under dim and bright lighting and from the first- and second-person perspectives.
In the future, we will improve Model C to address the remaining prediction errors, such as those for similar or non-static hand gestures. We also expect to acquire data with different backgrounds and to conduct further research on real-time implementations of BISINDO hand gesture recognition.