Abstract

Deaf and dumb people struggle with communicating on a day-to-day basis. Current advancements in artificial intelligence (AI) have made it possible to reduce this communication barrier. As a result of this effort, a letter recognition system for Arabic sign language (ArSL) has been developed. The ArSL recognition system uses a deep convolutional neural network (CNN) structure to process depth data and improve the ability of hearing-impaired people to communicate with others. The proposed model automatically recognizes and identifies letters of the hand-sign alphabet and the corresponding Arabic alphabet based on user input, and it identifies ArSL with an accuracy of 97.1%. To test our approach, we carried out a comparative study and found that it differentiates between static signs with higher accuracy than prior studies achieved on the same dataset.

1. Introduction

The majority of deaf and hearing-impaired people continue to prefer sign language as their primary mode of communication and as a natural means of communicating with nondeaf people. It is also the only formal means by which dumb and hearing-impaired people can express themselves through body language and facial expressions, and it is a well-organized code based on hand gestures [1]. The primary difficulty is that the majority of people are unable to comprehend sign language, while the great majority of deaf persons have limited command of the spoken form of their native language and of reading and writing it. This creates an imbalance between the deaf community and the hearing world, limiting access to services and preventing deaf people from achieving their full potential in education and other areas.

In recent years, tremendous advancement has been achieved in the area of human-computer interaction (HCI), which has opened new avenues for human communication. As a result of this trend, technology has contributed greatly to social applications that assist people with special needs, such as automatic sign language translation. The availability of sign language interpretation determines deaf and dumb people’s access to various aspects of life and education. In the field of serving special needs with modern technological means, Arabic research is severely lacking. In addition, the Arabic computer applications that help people with special needs in education, training, and communication remain limited, and there is a lack of automation in ArSL. In general, there is a scarcity of Arabic research in this field [2].

The development of fully automated sign language interpretation systems is in its infancy. All translation services are currently human-based, which makes them extremely costly due to the personal expertise required. This highlights the need to advance automatic sign language recognition in order to assist and serve the deaf and dumb in understanding their mother tongue. These systems act as an interpreter between persons who are deaf or dumb and hearing people, allowing them to communicate more effectively [3]. To convey information, sign language recognition systems primarily use hand gestures. Hand gesture recognition is a rapidly expanding field with numerous applications in areas such as entertainment, computer games, object grasping, and command description.

The majority of gestures are made with the hand, but they can also be made with the face and the body. A hand gesture is created not only by the shape of the hand but also by the gesture of the hand and where it is positioned in relation to other parts of the body. Communication using hand gestures is the most significant component of sign language since they are employed in every element of human communication. For example, they can be used to accompany speech or to communicate on their own in a setting with a lot of background noise because signers communicate the majority of their information with their hands [4]. So, in order to incorporate all members of the community, regardless of their abilities, we intend to construct a robust automatic sign language translation and recognition system to address these issues. In addition, the proposed automatic translation of ArSL to Arabic text is intended to increase the quality of Arabic text translation, which is determined by pattern recognition theories.

According to the World Health Organization (WHO) and its 2021 report, more than 466 million people worldwide suffer from disabling hearing loss (roughly 5% of the world’s population), 32 million of whom are children, and 1.1 billion young adults between 12 and 35 years old are at risk of hearing loss due to noise from music. According to the report, that number is expected to double over the next 30 years, and according to the statistics of the International Federation of the Deaf, the number of deaf people may reach 90 million by 2050. About 80% of deaf and dumb people live in developing countries, so governments give priority to helping these people integrate and participate in society [5, 6].

Even ArSL has regional and linguistic variations that make it difficult to generalize about how it is used. This study focuses on the most widely understood variety of ArSL among Arabs. Figure 1 shows the expectations of an increase in the number of people with disabling hearing loss (DHL) during the next thirty years [5, 6].

In Arab society, understanding deaf and dumb people requires learning their language and designing deep models for ArSL so that this language can be taught, through these models, to wider circles of the deaf community and to the family members, friends, and neighbors who help them integrate into society [1]. The deep models are designed to complement the proposed system, which aims to substitute human interpreters to some extent and to translate video frames or images of sign language into Arabic text, thereby enhancing communication.

The motivations of this research are to help deaf and dumb people to understand Arabic text, help interested people to learn the sign language and interact with deaf and dumb people, and translate sign language to Arabic text without a human translator. In response to this problem, a new sign language recognition approach is proposed to improve the quality of Arabic sign translation to Arabic text. In addition, the proposed system is based on the theory of pattern recognition to construct a translation system using computer vision techniques, digital image processing, and machine learning [1].

Due to its central social importance and inherent ambiguity, developing a framework for deaf people to understand and interpret sign language is a difficult task. Some of the problems and difficulties associated with ArSL are discussed below.

In this situation, sign language is essential for communicating with hearing-impaired persons, and sign language recognition systems are receiving growing attention; proposing a new sign language recognition model and considering alternative methods are therefore crucial. Any sign language recognition system may use a sensor-based or an image-based approach [7]: sensor-based systems interpret gestures by connecting many sensors to a glove, whereas image-based systems identify signs via image processing. This work employs image-based analysis.

Sign language has the most structured movement expressions of any language [8], yet its unique expressive mode makes it especially challenging to learn and use. There are two major, complementary methods of demonstrating words in sign language. The first employs body movements (such as using the hands and arms to convey meaning) [9] and facial expressions (such as eyebrow lifting and mouth shaping) [10–14], whereas the second utilizes fingerspelling, which takes into account orientation, hand pose, and trajectory [15–18].

Developing tools that translate sign language into text or voice is crucial to help deaf and dumb individuals communicate with nondeaf people. Many sign language recognition systems are based on Indian, Japanese, British, Brazilian, Chinese, Australian, and American sign languages [19–30]. However, sign language research has also focused on Asian [31], English, Turkish [32], and German sign languages [33], as well as Arabic, albeit without achieving the desired results in terms of system performance. Unfortunately, no sign language is universal; the alphabet, where each letter has its own symbol, varies by nation and language. Researchers may lack a reliable ArSL database because the Arabic language is complex, and they have had to create datasets manually, which is cumbersome. Table 1 provides further information on several international sign languages.

Sign language recognition systems automate communication between hearing-impaired people and ordinary people [48], and ArSL recognition consists of detection and classification. The detection phase preprocesses images and locates the regions of interest, while the recognition phase classifies each segmented hand sign based on its characteristics; these features are what distinguish one sign from another. Each study faces constraints in terms of cost, image preprocessing, and sign classification [49, 50]. To recognize ArSL, a deep convolutional neural network is fed hand gesture images captured from different angles and under different lighting conditions.

A substantial collection of sign language videos or images tagged with sign language labels is required to construct an ArSL recognition system. Classification-stage machine learning models may be trained and tested on this dataset. Figure 2 depicts two possible methods of data collection for ArSL recognition: (1) a vision-based approach and (2) a sensor-based approach [2].

A vision-based approach to ArSL recognition uses computer vision methods to analyze and interpret sign language motions from image or video data. Gesture detection, feature extraction, and classification are the common intermediate phases in this process. To detect gestures, images or videos are analyzed to identify and track the hands or other relevant body parts, using methods such as skin color segmentation, hand shape classification, and optical flow analysis. Once the hands and other relevant body parts have been located, feature extraction methods are employed to extract information from the data, such as the hands’ shape, motion, and orientation, using techniques such as CNNs, the histogram of oriented gradients (HOG), and local binary patterns (LBPs). Finally, the extracted features are used by machine-learning methods, including support vector machines (SVMs), decision trees (DTs), and artificial neural networks (ANNs), to classify the sign language gesture [4].
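As an illustration of this generic vision-based pipeline (not the method proposed in this paper), the following Python sketch pairs HOG features with a linear SVM using scikit-image and scikit-learn; the grayscale hand crops and their integer letter labels are assumed to be already segmented and loaded.

import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


def extract_hog(gray_image):
    """Resize a grayscale hand crop and compute its HOG descriptor."""
    patch = resize(gray_image, (64, 64), anti_aliasing=True)
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))


def train_hog_svm(hand_crops, labels):
    """Train a linear SVM on HOG features and report held-out accuracy."""
    features = np.stack([extract_hog(img) for img in hand_crops])
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, random_state=42, stratify=labels)
    clf = LinearSVC(C=1.0, max_iter=10000)
    clf.fit(X_tr, y_tr)
    return clf, clf.score(X_te, y_te)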

Using a vision-based technique for ArSL detection is complicated by the fact that sign language motions may seem quite different depending on the signer and the situation. Methods including data augmentation, multimodal fusion, and transfer learning, which may boost the recognition system’s resilience and generalizability, have been investigated by researchers as potential solutions to this problem [2].

Using sensors to detect and record information about sign language motions is at the heart of the sensor-based method to ArSL recognition. There are a number of benefits and drawbacks to using this method instead of or in addition to vision-based methods. In low-light or obstructed environments, for example, sensor data may be more trustworthy than that obtained using vision-based methods. Sensors can record data that the human eye cannot, such as pressure, force, or electromyography (EMG) signals [52].

There are two primary types of systems: the first [52] is based on gloves, while the second [7] is based on Microsoft Kinect. Gloves [52] may be used in systems that read hand gestures because they include electronic and mechanical components. Although this method may generate satisfactory outcomes [53], wearing a glove attached to a variety of sensors in order to collect information may be unpleasant for signers who have hearing or speech impairments. Kinect sensors are used to identify the signs in the second group. These sensor devices were first developed by Microsoft for its Xbox console as an alternative to using a controller to play video games [54]. Recently, the application of this technology has expanded to encompass additional recognition tasks, such as identifying sign language, and this trend is expected to continue in the near future.

This research utilizes images of the 28 Arabic letter signs to illustrate how image preprocessing may assist in describing gestures during feature extraction and how our proposed architecture outperforms previous techniques. The contributions are as follows:
(i) About 7,057 Arabic alphabet sign language images are needed to train our model. In partnership with the Resala Association in Egypt, we developed a database of 10,000 red-green-blue (RGB) images and combined it with a 5,600-image collection from our previous work [50, 51].
(ii) Our ArSL comprehension technique was evaluated.
(iii) Our architecture and others were trained and compared for static Arabic sign letter interpretation.
(iv) The produced datasets were used to create a new algorithm for Arabic alphabet sign identification. The recognition step uses 70% of the dataset for training and 30% for testing, on which the performance evaluation is based.
(v) The model was tested using the ArSL alphabet alone.

The remaining parts of this work are divided as follows: In Section 2, we give a literature overview of the related studies. Recognition of the hand signs is shown using the method described in Section 3. Section 4 presents the results of the experiments. The authors conclude and provide some final analysis in Section 5.

2. Related Work

This section reviews ArSL recognition methods. An ArSL recognition system aims to replace human sign language interpreters. Work in this area has established the preprocessing, feature extraction, and classification stages of sign language recognition.

In [55], the authors evaluated classification algorithms used in recognition, such as traditional machine learning and deep learning, and discussed previous work on differentiating between static alphabetic and dynamic sign languages for Arabic and non-Arabic sign languages. In [56], the authors developed a fully automated method for recognizing 28 Arabic signs for letters and numbers. There are a total of 7869 images used. Several cycles of training and testing on various combinations of training and testing data are used to perfect the 7-layer model. In conclusion, the authors provided evidence that the proposed model is superior to KNN- and SVM-based approaches. In [50], the idea of using computers to translate between Arabic sign language and spoken language is presented. For the ATASAT system to function, it needs two datasets of Arabic alphabet movements. Arabic sign movements are extracted from images or movies using manual detection based on hand coverage, then statistical classifiers are used, and the results are compared to further refine the classification.

In [51], the authors offered a machine-learning-based alphabet recognition system for ArSL. Ten students produced images of the 28 different letters, yielding 2,800 pictures in total (28 × 100). Features are extracted from a hand shape description, and classification is accomplished via k-nearest neighbors (KNN) and multilayer perceptron (MLP) algorithms. The precision is 97.548%. In [57], the authors proposed a deep transfer learning-based recognition system for ArSL. Specifically, they used data augmentation to lessen the likelihood of model overfitting and boost the system’s overall performance. To accomplish the job of target identification, several network architectures are explored. The dataset for ArSL used in this study is made accessible to the public (ArSL2018).

According to the research of Hisham and Hamouda [58], ArSL may be identified in real time using Microsoft Kinect. The system’s recognition capabilities were substantially improved using a decision tree, a Bayesian network, and AdaBoost. In [59], the authors provided a novel ArSL recognition approach that makes direct use of microscopic images and is based on an unsupervised deep-learning method, the deep belief network (DBN); it has been used to recognize and classify the Arabic alphabet. Saleh and Issa [60] improved the accuracy of detecting 32 hand gestures in ArSL by using transfer learning and fine-tuning deep CNNs. Aldhahri et al. [61] used a CNN model fed with grayscale images to create a system that can automatically recognize 28 letters in ArSL.

In order to address the challenges specific to sign language, the authors of [62] applied ontology to this field of study. They employed simple, static signs composed of Arabic letters, and both a preexisting and a newly collected ArSL dataset are used to train and test the deep CNN architecture. ArSL is difficult to recognize automatically; however, a new framework suggested by Duwairi and Halloush [63] may help. Popular deep-learning models (such as AlexNet, VGGNet, and Inception Net) are used in this framework through transfer learning for image processing. As the VGGNet architecture outperformed the other pretrained models, they proposed using it to automatically recognize Arabic alphabet signs in sign language.

There might be many positive outcomes for society as a whole as a result of the development of this system, some of which are summarized as follows:
(i) There are not many people who are fluent in ArSL, so creating a system like this would greatly help the hearing-impaired community communicate with people who do not know the language.
(ii) The deaf community could be expanded so that deaf people are able to converse with hearing people and take part in their activities; communicating with deaf and hard-of-hearing people would become simpler for those who can hear; and the quality and availability of educational options for the deaf community would increase.

3. Proposed Methodology

This section explains the procedure we have developed for recognizing the static alphabet of ArSL. The primary objective of this research is to produce a deep CNN capable of recognizing the ArSL alphabet with a high degree of precision. Figure 3 presents a conceptual overview of the system. The proposed system has a basic workflow made up of two stages: the first is image preprocessing, and the second is the proposed model for classification and text generation of ArSL. The following subsections discuss these phases.

The proposed approach for recognizing the ArSL alphabet is depicted in Figure 3. In this approach, gestures of the ArSL alphabet are recognized. The first step, covered in the following subsection, is to gather the dataset for the system and perform the necessary image preprocessing [64, 65]. In preparing the dataset, the high-fidelity hand and finger tracking framework Mediapipe is utilized; it uses multiple machine-learning methods to infer the 21 3D landmarks of a hand from a single frame in real time.

3.1. Collect Dataset and Image Preprocessing Phase

Once the ArSL images have been captured by the camera, they are sent to the preprocessing phase, which comprises a number of primary processes [66, 67]. The input data are depth images featuring a hand. We used the Mediapipe framework to detect the hand and determine its precise location [68]. Mediapipe offers a variety of detection and tracking methods; one of them is Mediapipe Hands, which is made up of two models that work together. The first is the palm detection model (PDM), which analyzes the whole image to produce an oriented bounding box around the palm or fist; this model can recognize both self-occluded and other-occluded hands. The second is the hand landmark model (HLM), which uses the image region cropped by the palm detector to generate high-quality 3D hand keypoints. During the localization step, we scale the images to 240 × 240 pixels and then convert each image into a black image containing only the hand landmarks and the lines linking them (Figure 4). Finally, we split our dataset into three parts, training, validation, and testing, for the purpose of training our model. Figure 4 shows the input image in Figure 4(a) and the result of the image preprocessing, including edge detection, in Figure 4(b).
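A minimal sketch of this preprocessing step is shown below; it assumes the input is a BGR image read with OpenCV and uses the standard Mediapipe Hands and drawing utilities, so it is an illustration rather than the authors' exact implementation.

import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils


def to_landmark_image(bgr_image, size=240):
    """Detect one hand and draw only its landmarks and connections
    on a black size x size canvas, as in the localization step above."""
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        result = hands.process(cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB))
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    if result.multi_hand_landmarks:
        # Landmark coordinates are normalized, so they can be rendered
        # directly onto the resized black canvas.
        mp_drawing.draw_landmarks(canvas,
                                  result.multi_hand_landmarks[0],
                                  mp_hands.HAND_CONNECTIONS)
    return canvas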

3.2. Mediapipe

Mediapipe uses machine learning to recognize the key points of the user’s hand. Its machine-learning pipeline consists of two models: a palm detection model that analyzes the whole image and returns an oriented bounding box, and a hand landmark model that produces highly accurate 3D hand keypoints from the cropped image region supplied by the palm detector [68]. In addition, the pipeline uses the hand landmarks from the previous frame to crop the hand, and palm detection is employed to localize the hand only when the landmark model fails. Mediapipe’s hand landmark detector can identify 21 different points on the user’s hand, as shown in Figure 5.

For an understanding of how Mediapipe can be used for hand gesture detection, see the work of Alvin et al. [69], in which hand movements are identified using Mediapipe and the k-nearest neighbors technique. Figure 6 shows the detection of hand landmarks and the drawing of the corresponding points.

The machine-learning solutions provided by Mediapipe are open source, cross-platform, and fully configurable. In this research, we use Mediapipe to record the hand keypoints and TensorFlow to train and run the machine-learning model. Mediapipe can run on either the central processing unit (CPU) or the graphics processing unit (GPU) and does not require any additional computing resources.

3.3. Classification and Text Generation Phase by the Proposed Model Architecture

In this research, we use a CNN model made up of numerous layers. The proposed CNN design is shown in Figure 7; it comprises an input layer that accepts images 64 pixels wide by 64 pixels high with 3 color channels. These dimensions define the input size that the network accepts and are indicated in the figure.

Each convolution filter in the feature extraction portion has dimensions of 3 × 3, and the feature extraction section comprises three convolutional layers (Conv2d_1, Conv2d_2, and Conv2d_3) with sixteen filters in Conv2d_1, thirty-two filters in Conv2d_2, and sixty-four filters in Conv2d_3. Each convolution operation is followed by an activation layer that applies the rectified linear unit (ReLU) function. After each activation, we employ a max-pooling layer of size 3 × 3 to reduce the spatial dimensions of the feature maps without losing the essential features that distinguish them, which greatly reduces the amount of computation the model requires. Next, we use a flatten layer followed by a dropout layer that deactivates 25% of the neurons; by randomly turning off neurons during training, this layer helps prevent the network from overfitting. Finally, we use a dense layer with ReLU activation followed by a dense classification layer with the softmax activation function. Table 2 summarizes the parameters of each layer of the proposed network, as well as the total number of parameters.
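In Keras, the architecture described above can be sketched as follows; the width of the dense ReLU layer, the optimizer, and the loss function are assumptions, since they are not stated in the text (Table 2 gives the exact parameter counts).

from tensorflow.keras import layers, models


def build_arsl_cnn(num_classes=28):
    """Sketch of the CNN described above: three 3 x 3 convolution blocks
    with 16, 32, and 64 filters, each followed by ReLU and 3 x 3 max
    pooling, then flatten, 25% dropout, a dense ReLU layer, and softmax."""
    model = models.Sequential([
        layers.Input(shape=(64, 64, 3)),
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(3, 3)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(3, 3)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(3, 3)),
        layers.Flatten(),
        layers.Dropout(0.25),
        layers.Dense(128, activation="relu"),  # width assumed; see Table 2
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",                  # assumed optimizer
                  loss="categorical_crossentropy",   # assumed loss
                  metrics=["accuracy"])
    return model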

4. Experimental Results

4.1. Datasets

To evaluate our system, we combine the ArSL dataset [41] with 7,057 Arabic alphabet sign language images. In partnership with the Resala Association in Egypt, we combined these with a 5,600-image collection from our previous work [50, 51]. Our dataset was compiled with professional cameras using data from a variety of signers under varying levels of luminance. These data are then used to train our model.

The ArSL dataset is primarily made up of two folders: the first contains 7,057 images, each 240 × 240 pixels in size, and the second contains 7,057 text files that describe the content of the corresponding image in the first folder. The second dataset is organized into 28 classes, each containing around 252 images depicting an Arabic letter. To integrate the two datasets, certain adjustments to the structure of the ArSL dataset are required. To remove the need for text files and unify the structure of both datasets, we group the images of the same letter together and place them in the same folder, which makes it easier to merge the two datasets later. It is important to note that the ArSL dataset had an issue in which images of the letters “Thal” and “Jeem” were not placed in the proper classes; this problem was corrected manually by moving the images to the classes where they belong. Figure 8 provides a visual breakdown of the total number of samples into the classes corresponding to the letters of the Arabic alphabet.
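A hypothetical sketch of this reorganization step is given below; the image extension and the assumption that each text file simply contains the letter name are illustrative, since the exact label format is not described here.

import shutil
from pathlib import Path


def group_images_by_letter(images_dir, labels_dir, out_dir):
    """Copy each image into a folder named after its letter, based on the
    per-image text file, producing one folder per class."""
    out = Path(out_dir)
    for img_path in sorted(Path(images_dir).glob("*.jpg")):
        label_file = Path(labels_dir) / (img_path.stem + ".txt")
        # Assumption: each text file simply contains the letter name.
        letter = label_file.read_text(encoding="utf-8").strip()
        target = out / letter
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(img_path, target / img_path.name)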

Studying the databases of Arabic alphabet signs in sign language, we notice that some characters have strongly similar gestures (Beh, Teh, Theh, …), which probably influences our model during feature extraction and complicates the classification process. Table 3 shows the similarity between some ArSL characters according to Mediapipe’s hand landmarks from the authors’ point of view. It also describes the similarity between each character’s representation in sign language and its written form in Arabic.

4.2. Results and Discussion

The dataset used for training consists of more than 7,000 images, organized into 28 different classes of ArSL gestures with a consistent file structure, as described in the previous section. The dataset is divided into a training set, a validation set, and a testing set, with the training set receiving 60% of the data, the validation set 20%, and the remaining 20% reserved for testing. The proposed method then proceeds as follows: we first perform the preprocessing stage, which yields black-and-white images consisting of only the hand landmarks connected by lines; we then feed the data into our model, which is trained over a number of iterations (epochs), with accuracy values reported at the end of each epoch; after all training epochs have been completed, the final accuracy is calculated and reported.
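A minimal sketch of this training procedure is given below, assuming the preprocessed images X (of shape N × 64 × 64 × 3) and one-hot labels y are already loaded and that a compiled Keras model (for example, the build_arsl_cnn sketch in Section 3.3) is passed in; the batch size of 32 and the 20 epochs match the values reported in the following paragraphs.

from sklearn.model_selection import train_test_split


def train_and_evaluate(model, X, y):
    """Split the preprocessed data 60/20/20, train the CNN, and return
    the test loss and accuracy."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.4, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=42)

    model.fit(X_train, y_train,
              validation_data=(X_val, y_val),
              epochs=20, batch_size=32)
    return model.evaluate(X_test, y_test)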

Figure 9 displays the training and validation curves: Figure 9(a) shows the training and validation loss, and Figure 9(b) shows the training and validation accuracy.

The CNN architecture is trained with a batch size of 32 over 20 epochs. By the 10th epoch, the model achieved a validation accuracy of 94.66%, and at the 20th epoch, it achieved a record-high 97.1% accuracy. Figure 9(b) shows the development of the model throughout training, revealing that its accuracy improves with each training iteration, while the loss decreases, as shown in Figure 9(a). Based on these results, we do not see any evidence of overfitting.

Table 4 shows the classification report and indicates that the error rate is quite low and that the vast majority of classes have been assigned to the correct groups. Letters such as “Alef,” “Beh,” “Seen,” and “Noon” have a high accuracy of 95–100%, while letters such as “Jeem,” “Dal,” “Reh,” and “Feh” have a lower accuracy of 88%–94%. One major cause of this variation is how distinct the finger positions of each letter are from those of the other letters. This demonstrates the significance of gesture representation in feature extraction for obtaining reliable outcomes.
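A per-class report such as Table 4 can be generated with scikit-learn as sketched below; y_test is assumed to be one-hot encoded, and letter_names is assumed to hold the 28 Arabic letter labels in class order.

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix


def per_class_report(model, X_test, y_test, letter_names):
    """Print a per-letter precision/recall/F1 report and return the
    confusion matrix for a trained Keras model."""
    y_pred = np.argmax(model.predict(X_test), axis=1)
    y_true = np.argmax(y_test, axis=1)
    print(classification_report(y_true, y_pred, target_names=letter_names))
    return confusion_matrix(y_true, y_pred)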

According to the findings, around 32 signs are not correctly recognized overall. The high rate of misclassification can be attributed to the fact that some letters have strongly similar gestures (RAA/DELL, LAAM/NOON, ALIF/BAA, …). This has probably prompted our model to extract comparable features, making the classification process more difficult. A sample of the incorrectly labelled images is displayed in Figure 10.

4.3. Comparative Study

We compare our findings to those found in the current literature to demonstrate the efficacy of the proposed model. The outcomes of our model and many others in the literature are shown in Table 5.

Compared with the aforementioned models, our proposed architecture obtains the best result and proves effective for sign language recognition, achieving a validation accuracy of 97.1%.

5. Conclusion

We use deep learning to recognize ArSL from deaf people’s hands in real time, transforming signs into alphabet letters and improving communication between deaf and hearing people. First, we collect data and create our own dataset. Next, we apply a preprocessing step that includes clipping the hand from the image and retrieving its shape. Finally, we evaluate four distinct CNN models, each with its own architecture: three from the literature and one of our own design. We apply them to our datasets to compare them and choose the best model. Our architecture performs best in the comparative study, scoring 97.1%. In the future, we hope to apply new techniques to convert hand gestures into written form, employ natural language processing (NLP) to process and display the text, and extend the dataset’s variety with regard to noise, orientation, and so on.

Data Availability

The datasets used and/or analyzed during the current study are available from the corresponding author, Abdelmoty M. Ahmed (Email: [email protected]), upon reasonable request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work through Large Groups Research.