A Deep Learning based Approach for Recognition of Arabic Sign Language Letters

Abstract: The deaf-mute community faces communication problems in daily life, and advances in artificial intelligence over the past few years have begun to lower this communication barrier. The principal objective of this work is to create an Arabic Sign Language Recognition (ArSLR) system for recognizing Arabic letters. The ArSLR system combines our image pre-processing method, which extracts the exact position of the hand, with a proposed deep Convolutional Neural Network (CNN) architecture using depth data. The goal is to make it easier for people who have hearing problems to interact with hearing people. Based on user input, our method detects and recognizes hand-sign letters of the Arabic alphabet automatically. The proposed model reaches an accuracy of 97.07% in the recognition of Arabic sign language letters. We also conducted a comparative study to evaluate the proposed system; the results show that this method recognizes static signs with greater accuracy than similar studies evaluated on the same dataset.


I. INTRODUCTION
According to statistics issued by the World Health Organization in 2021, people with hearing disabilities and deaf persons represent more than 5% of the world's population [1]. It is also predicted that this number will double in the next 30 years. As a result, governments and researchers place a high priority on helping these individuals participate in their societies. These individuals are the primary users of sign language, which they typically use to communicate both inside and outside of their community.
In this context, sign language matters as a form of non-verbal communication based on signs and gestures, and Sign Language Recognition (SLR) systems have therefore received attention recently. Although movement expressions in sign language are the most structured [2], the difficulty of sign language recognition lies in the way the language is expressed. There are two main, complementary ways to represent words in sign language. The first uses particular body movements (e.g. hands and arms to convey meanings) [3] and facial expressions (e.g. eyebrow raising and mouth shaping) [4]-[6], whereas the second uses a fingerspelling approach (e.g. orientation, hand posing, and trajectory) [7]. Unfortunately, there is no universal sign language. Sign language typically differs from one country to another and from one language to another, down to the level of the alphabet, where each letter has its own conventional sign. Recent advances in sign recognition have nevertheless prompted researchers to create robust recognition systems for various sign languages, for instance Brazilian Sign Language [8], British Sign Language [9], Chinese Sign Language [10], and American Sign Language [11]. However, while Asian, English, and Latin sign languages have all been the subject of substantial research, Arabic has received relatively little attention. This may be due to the complexity of the Arabic language or to the absence of a widely used Arabic sign language dataset available to academics. As a result, researchers have been forced to build their own datasets, which is a tedious task.
It is crucial to create systems that can translate sign languages into text or speech so that people who are not deaf or mute can communicate with those who are. Typically, a sign language recognition system can be built using an image-based approach or a sensor-based technique [12]. Sensor-based systems connect a glove to a number of sensors to read the gesture, which the system can then recognize; such solutions suffer from a lack of usability. Image-based solutions alleviate this issue by translating signs captured with cameras. Ideally, such systems can reduce the need for human intervention in the exchange of information between hearing and deaf people [13]. Image-based solutions relying on image processing can be carried out in two stages: detection and classification. During the detection phase, each image is first pre-processed, after which the regions of interest are found. The classification process can then be carried out on the results of the preceding stage. During the recognition stage, a set of features is extracted from each segmented hand sign and used in the recognition process. These features make it possible to distinguish between the different signs.
Despite extensive research in this field, and as indicated in [14], each study has limitations in terms of image processing, cost, and sign classification. In general, system performance depends on the accuracy of the image pre-processing stage, which separates the hand region from the full image. The main focus of this work is therefore the recognition of Arabic Sign Language (ArSL) alphabets based on fingerspelling [15]. To improve recognition accuracy for images of hand gestures, we propose a novel ArSL recognition system based on a deep convolutional neural network. It incorporates a novel pre-processing stage that can detect sign gestures in images of hand gestures performed under varied lighting and orientation. This stage can also partly alleviate the problem of strong similarities within the sign language alphabet, where some letters present highly similar visual properties, for example the letter pairs DELL/DHELL, RAA/ZAY, and Ayn/Ghayn, among others. Using a dataset of 28 classes of different images of Arabic signs, the objective of this research is to demonstrate how image pre-processing can aid the representation of gestures in feature extraction and how the proposed architecture can offer improved accuracy in comparison to other existing methodologies. This work makes the following contributions:
• Create a new database of 5000 RGB images and combine it with the existing Arabic Sign Language ArSL dataset as training data for the proposed model.
• Integrate the proposed pre-processing stage to enhance the representation of gestures.
• Train the proposed architecture to interpret static Arabic sign letters and compare it with other works.
• Analyze the performance of the proposed model in terms of Arabic sign language interpretation.
The rest of the paper is organized as follows. An overview of related work is given in Section II. The proposed methodology for recognizing hand signs is presented in Section III. The experimental results are described in Section IV. Finally, conclusions and perspectives are presented in Section V.

II. RELATED WORK
Recently, several research papers have focused on ArSLR systems. Gesture-based ArSL recognition can be divided into three tasks: recognizing the Arabic alphabet, recognizing individual words, and recognizing entire sentences. In this research we focus on recognizing the Arabic alphabet, and we summarize some of the methods previously employed in this field. Automatic sign language recognition aims to replace sign language interpreters. Many studies have been carried out in this direction, and several techniques have been developed, such as pre-processing to extract the hand gesture, feature extraction to reduce the input data to relevant features, and classification to identify the class of each hand sign. In [16], the authors summarized previous work, distinguishing dynamic sign language from isolated, static, alphabetic sign languages (both Arabic and non-Arabic), and distinguishing recognition methods based on traditional machine learning from those based on deep learning.
An automatic system for recognizing numbers from 0 to 10 and Arabic letters was implemented in [17]. The authors used a dataset with 7869 images overall. Seven layers composed the suggested model, which was trained repeatedly using various training-testing configurations. Finally, the authors offered a comparison with various strategies based on KNN and SVM, demonstrating the benefit of the suggested model. An automated method for translating Arabic sign language was suggested by the authors in [18]. This system relies on the building of two datasets for Arabic alphabet gestures. In order to extract Arabic sign gestures from images or videos, this system suggests a hand coverage-based manual detection approach. It then uses a range of statistical classifiers, compares the results, and produces a more accurate classification.
Using Microsoft Kinect, Hisham et al. [19] introduced a dynamic Arabic sign language recognition system. Two machine learning methods, Bayesian Network and Decision Tree, are used for recognition, and the Ada-Boosting methodology is then used to improve system recognition. In [20], the authors used deep transfer learning to develop a robust recognition system for Arabic sign language. To prevent overfitting and improve overall system performance, they used data augmentation. Several network architectures were studied for the target recognition task, and the public Arabic sign language dataset (ArSL2018) was used in this experiment. Other researchers, such as Saleh et al. [21], enhanced the accuracy of Arabic sign language hand gesture recognition through the use of deep CNNs and transfer learning. A novel ArSLR system was proposed in [22] to recognize and classify Arabic alphabetical letters; it uses microscopic images along with an unsupervised deep learning algorithm built on a deep belief network (DBN). In [23], the authors developed a system that automatically recognizes 28 letters in Arabic Sign Language using a CNN model with a grayscale image as input. In [24], the authors applied ontology to the sign language domain in order to address some sign language problems. They used simple static signs composed of Arabic alphabets, and a CNN architecture was trained and evaluated using a collected dataset together with a pre-made dataset for Arabic sign language. A new framework for the automatic recognition of Arabic sign language was proposed by Duwairi et al. [25]; it is based on transfer learning applied via popular deep learning models (AlexNet, VGGNet, and Inception Net) for image processing. They proposed using the VGGNet architecture, which performed better than the other pre-trained models for automatically recognizing Arabic alphabets in sign language.
The authors in [26] suggested a system that can be used by hearing-impaired people. The proposed system converts hand gestures in Arabic sign language into vocalized speech. They employed DG5-V hand gloves to capture the hand movements in the dataset, used a deep convolutional network to extract features from the data gathered by the sensor devices, and finally applied a CNN for classification. Recent advances in sign language recognition have produced a number of models that are suitable for a variety of tasks; however, none of them actually possesses the necessary generalization capability.

III. PROPOSED METHODOLOGY
Using two-dimensional images, we seek to recognize the hand gestures of the ArSL and convert them into the corresponding alphabet letters in the LSA language. The main objective of this research is to build a deep CNN capable of accurately recognizing the ArSL alphabets. An overall view of the system is presented in Fig. 1. The general architecture of the proposed system comprises two stages: image pre-processing and a proposed model for recognizing Arabic sign language. The following subsections cover these stages.

A. Proposed Pre-processing
After the Arabic sign language images are captured by the camera, they are subjected to a number of processes during the pre-processing stage. The data are depth images that contain a hand. To extract the exact position of the hand, we used the MediaPipe framework to detect the hand [27]. MediaPipe offers a number of techniques for detection and tracking, one of which is MediaPipe Hands. It consists of two models working together. The first, the Palm Detection Model, operates on the whole image and produces an oriented hand bounding box; because it targets rigid objects (e.g. palms and fists), it is able to detect occluded and self-occluded hands. The second, the Hand Landmark Model, produces high-fidelity 3D hand keypoints from the cropped image area defined by the palm detector. After the localization phase, we replace each image with a black image that contains only the hand landmarks and the lines connecting them, and then resize the images to 240 × 240 (Fig. 2). Finally, we divided our dataset into three parts (training, validation, and test) for training our model. Fig. 2(a) shows the input image and Fig. 2(b) the outcome of the image pre-processing with edge detection.
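The rendering step described above can be illustrated with a minimal sketch. It assumes the landmark detector has already returned 21 normalized (x, y) keypoints per hand, as MediaPipe Hands does, and it draws them with a simple pure-Python line rasterizer. The function names, the white-on-black pixel values, and the demo landmarks are hypothetical and are not taken from the paper; the connection list follows the usual 21-point hand topology.

```python
# Sketch only: helper names and demo data are assumptions, not the authors' code.
SIZE = 240  # output resolution stated in the text (240 x 240)

# Landmark index pairs joined by a line (wrist = 0, four points per finger).
HAND_CONNECTIONS = [
    (0, 1), (1, 2), (2, 3), (3, 4),          # thumb
    (0, 5), (5, 6), (6, 7), (7, 8),          # index finger
    (5, 9), (9, 10), (10, 11), (11, 12),     # middle finger
    (9, 13), (13, 14), (14, 15), (15, 16),   # ring finger
    (13, 17), (17, 18), (18, 19), (19, 20),  # little finger
    (0, 17),                                 # outer palm edge
]


def to_pixel(point, size=SIZE):
    """Map a normalized (x, y) in [0, 1] to clamped integer (row, col)."""
    x, y = point
    col = min(size - 1, max(0, int(round(x * (size - 1)))))
    row = min(size - 1, max(0, int(round(y * (size - 1)))))
    return row, col


def draw_line(canvas, start, end, value=255):
    """Rasterize a straight segment by linear interpolation between pixels."""
    (r0, c0), (r1, c1) = start, end
    steps = max(abs(r1 - r0), abs(c1 - c0), 1)
    for i in range(steps + 1):
        t = i / steps
        canvas[round(r0 + t * (r1 - r0))][round(c0 + t * (c1 - c0))] = value


def render_skeleton(landmarks, size=SIZE):
    """Return a size x size black image (list of rows) with the hand skeleton."""
    if len(landmarks) != 21:
        raise ValueError("expected 21 hand landmarks")
    canvas = [[0] * size for _ in range(size)]
    pixels = [to_pixel(p, size) for p in landmarks]
    for a, b in HAND_CONNECTIONS:
        draw_line(canvas, pixels[a], pixels[b])
    return canvas


# Made-up landmarks lying on a diagonal, only to exercise the function.
demo_landmarks = [(0.1 + 0.04 * i, 0.9 - 0.04 * i) for i in range(21)]
image = render_skeleton(demo_landmarks)
```

Because the output keeps only the skeleton, the classifier sees the hand pose and not the background, skin tone, or lighting.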

B. Proposed CNN Architecture
The convolutional neural network (CNN) model used in this study is composed of several layers. Fig. 3 shows the proposed architecture of the CNN, whose input layer accepts images with a size of 240 × 240 × 3 pixels.
The feature extraction section consists of three convolutional layers (Conv2d, Conv2d_1, Conv2d_2) in which each convolution filter has dimensions of 3 by 3. There are 16 filters for Conv2d, 32 filters for Conv2d_1, and 64 filters for Conv2d_2. Each convolution operation is followed by an activation layer that uses the rectified linear unit function (ReLU). After each activation, we use a max-pooling layer with a size of 3 × 3 to reduce the size of the feature maps without modifying their important characteristics. This significantly reduces the computational power required by the model. Next, we use a flatten layer followed by a dropout layer that disables 25% of the neurons. This type of layer helps to reduce overfitting by randomly disabling neurons in the network. Finally, we use an activation layer with the ReLU function followed by a dense classification layer with the softmax activation function. An overview of the parameters in each layer and the overall parameters in the proposed network is shown in Table I.
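To make the layer description concrete, the short sketch below traces feature-map sizes and convolution parameter counts through the three blocks. It rests on assumptions the text does not state: convolutions with stride 1 and no padding, and pooling with a stride equal to the 3 × 3 pool size. The function names are hypothetical, and the resulting numbers may differ from Table I if the authors used other settings.

```python
# Sketch under stated assumptions: stride-1 unpadded convolutions and
# non-overlapping 3x3 max-pooling. Not the authors' implementation.

def conv_out(size, kernel=3, stride=1):
    """Spatial size after an unpadded convolution."""
    return (size - kernel) // stride + 1


def pool_out(size, pool=3):
    """Spatial size after non-overlapping max-pooling."""
    return (size - pool) // pool + 1


def conv_params(in_channels, filters, kernel=3):
    """Weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def trace(input_size=240, channels=3, filters=(16, 32, 64)):
    """Return per-block (filters, output size, parameters) and the flatten length."""
    size, rows = input_size, []
    for n_filters in filters:
        size = pool_out(conv_out(size))
        rows.append((n_filters, size, conv_params(channels, n_filters)))
        channels = n_filters
    return rows, size * size * channels


rows, flat = trace()
for n_filters, size, params in rows:
    print(f"{n_filters:>2} filters -> {size}x{size} map, {params} parameters")
print("flattened features:", flat)
```

Under these assumptions the 240 × 240 input shrinks to a 7 × 7 × 64 map, that is, 3136 values entering the dropout and dense layers.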

IV. EXPERIMENTAL RESULTS

A. Dataset
To evaluate the proposed system, we combined the Arabic Sign Language ArSL dataset [28] with our own dataset, captured with a professional camera from different signers and under different lighting intensities.
The Arabic Sign Language ArSL dataset is composed of two folders: the first contains 1160 images, each with a size of 416×416 pixels, and the second contains 1160 text files, each describing the content of the corresponding image in the first folder. The second dataset was created with 28 classes, each containing about 135 images of an Arabic letter. To combine the two datasets, the format of the Arabic Sign Language ArSL dataset had to be modified. We grouped the images of the same letter under the same folder, which removed the need for text files and unified the format of both datasets. This made it easier to combine the two datasets later on. Note that in the Arabic Sign Language ArSL dataset the images of the letter "Noun" were missing from their appropriate class, which we corrected manually by moving them to the correct class.
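The regrouping of images into one folder per letter might look like the sketch below. The annotation format is not given in the text, so the code assumes that each image `name.jpg` has a text file `name.txt` whose first whitespace-separated token is an integer class index; this layout, the function name, and the paths in the usage comment are all hypothetical.

```python
import shutil
from pathlib import Path


def group_by_class(image_dir, label_dir, out_dir, class_names):
    """Copy each labeled image into out_dir/<class name>/ and return the count.

    Assumption (not from the paper): the first token of each label file is
    an integer index into class_names.
    """
    image_dir, label_dir, out_dir = Path(image_dir), Path(label_dir), Path(out_dir)
    copied = 0
    for image_path in sorted(image_dir.iterdir()):
        if not image_path.is_file():
            continue
        label_path = label_dir / (image_path.stem + ".txt")
        if not label_path.is_file():
            continue  # leave unlabeled images for manual review
        tokens = label_path.read_text().split()
        if not tokens:
            continue  # empty annotation, also left for manual review
        target = out_dir / class_names[int(tokens[0])]
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(image_path, target / image_path.name)
        copied += 1
    return copied


# Hypothetical usage:
# group_by_class("arsl/images", "arsl/labels", "combined", class_names)
```

Copying and skipping, instead of moving and failing, keeps the original dataset intact and leaves mislabeled or unlabeled images available for the kind of manual correction described above.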

B. Results
As previously mentioned, the dataset used for training contained more than 5000 images, distributed in a unified format over 28 classes of Arabic Sign Language gestures. The dataset was divided into three distinct sets, with 60% of the data used for training, 20% for testing, and 20% for validation. The proposed technique involves the following steps. We start with the pre-processing stage, which outputs black images that contain only the hand landmarks and the lines connecting them. We then feed the data into the proposed model, which is trained over several epochs; at the end of each epoch, accuracy values are reported. After the last training epoch, the final accuracy is shown and training is finished. The training and validation accuracy are shown in Fig. 4(a), and the training and validation loss in Fig. 4(b). The CNN was trained for 20 epochs with a batch size of 32. At the 10th epoch, the model reached a validation accuracy of 97%, and the 20th epoch saw the highest accuracy of 97.07%. Fig. 4(a) also shows how the model improved steadily across training iterations, while the loss decreased (see Fig. 4(b)). From these results, we observe no sign of overfitting.
The confusion matrix (CM) shows how well the system performs in terms of correctly and incorrectly classified samples. The CM obtained by the CNN model using our methodology is shown in Fig. 5. The results show that almost all gestures are correctly classified by our model. The classification report in Fig. 6 confirms what the confusion matrix indicates: the misclassification rate is low, and the majority of classes are properly categorized. The accuracy for the letters ("THA", "KHAA", "SEEN", "SAD", "ZAY") is high and ranges between 95% and 100%, whereas the accuracy for other gestures, such as the letters ("FAA", "JEEM", "RAA", "DEEL", "MEEM"), ranges between 88% and 94%. The principal reason for this difference is that each letter requires a different finger position. This confirms that accurate results depend greatly on how gestures are represented when features are extracted.
According to the obtained results, the proposed method reduces the total number of incorrectly recognized signs to about 32. As noted earlier, the misclassification rate is related to the fact that some letters have strong similarities in their gestures (RAA/DELL, DELL/DHELL, RAA/ZAY, etc.), which has probably led the proposed model to extract similar features and made their classification more difficult. Fig. 7 displays a sample of misclassified images.
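The 60/20/20 partition can be sketched as follows. The text does not say whether the split was shuffled, seeded, or stratified by class, so this example assumes a seeded random shuffle without stratification; the function name and the integer stand-ins for image paths are hypothetical.

```python
import random


def split_dataset(items, train=0.6, test=0.2, seed=0):
    """Shuffle items and return (train, validation, test) lists.

    Assumption (not from the paper): a simple seeded shuffle, no stratification.
    The validation set receives whatever remains after train and test.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_test = round(len(items) * test)
    train_set = items[:n_train]
    test_set = items[n_train:n_train + n_test]
    val_set = items[n_train + n_test:]
    return train_set, val_set, test_set


# Integers stand in for image paths in this illustration.
train_set, val_set, test_set = split_dataset(range(5000))
print(len(train_set), len(val_set), len(test_set))
```

With 28 classes of roughly equal size, a stratified split would keep the class proportions identical across the three sets; whether the authors did so is not stated.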
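As an illustration of how a confusion matrix and per-class scores of this kind are derived, the sketch below builds both from lists of true and predicted labels. The three-letter label set and the predictions are toy data invented for the example and do not reproduce the figures reported in this paper.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows index the true class, columns the predicted class."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_report(matrix, classes):
    """Precision and recall for every class, computed from the matrix."""
    report = {}
    for i, name in enumerate(classes):
        tp = matrix[i][i]
        actual = sum(matrix[i])                  # samples that truly are class i
        predicted = sum(row[i] for row in matrix)  # samples predicted as class i
        report[name] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return report


# Toy data (assumption): three visually similar letters, eight samples.
classes = ["RAA", "ZAY", "DELL"]
y_true = ["RAA", "RAA", "RAA", "ZAY", "ZAY", "DELL", "DELL", "DELL"]
y_pred = ["RAA", "RAA", "ZAY", "ZAY", "ZAY", "DELL", "RAA", "DELL"]

matrix = confusion_matrix(y_true, y_pred, classes)
report = per_class_report(matrix, classes)
# Off-diagonal entries are the misclassified samples.
errors = sum(map(sum, matrix)) - sum(matrix[i][i] for i in range(len(classes)))
```

Reading the off-diagonal cells pair by pair is what reveals confusions between similar letters, such as a RAA sample predicted as ZAY.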

C. Comparative Study
To show the performance of the proposed system, we compared our method with results from recent literature [18], [29], and [30]. We tested all methods on the same dataset used in this work. Table II reports the results of the three methods under consideration. Compared with the models listed above, the best result was reached with our suggested architecture, which proved effective in the recognition of sign language with a validation accuracy of 97.07%.

V. CONCLUSION
In this paper, we used deep learning techniques to perform Arabic sign language recognition based on the hands of deaf people in real time, in order to translate the signs into alphabet letters and improve communication between deaf and non-deaf people.
We started by collecting data and creating our own dataset, and then applied a pre-processing step that crops the hand from the image and extracts its shape. We then evaluated four CNN models with different architectures, three taken from three different scientific papers and one proposed by ourselves, and applied them to our datasets in a comparative study in order to choose the best model. The comparative study allowed us to conclude that our architecture performs best, with a score of 97.07%.
In future work, we would like to extend the dataset, for example by increasing the variety of images in terms of noise, orientation, etc., to implement new techniques for converting hand movements into written text, and to use Natural Language Processing (NLP) techniques to process the obtained texts and present them in a comprehensible and adequate way.