Abstract

Mobile technology is growing rapidly, yet little of this development and improvement targets Deaf-mute people. Existing mobile applications use sign language as the only option for communicating with them. Prior to this work, no application in the mobile market used the disrupted speech of Deaf-mutes for the purpose of social connectivity. The proposed application, named vocalizer to mute (V2M), uses automatic speech recognition (ASR) to recognize the speech of a Deaf-mute person and convert it into a form of speech recognizable by a hearing person. In this work, mel frequency cepstral coefficient (MFCC) based features are extracted from each training and testing sample of Deaf-mute speech. The hidden Markov model toolkit (HTK) is used for the speech recognition process. The application is also integrated with a 3D avatar that provides visualization support: the avatar performs sign language on behalf of a person with no awareness of Deaf-mute culture. The prototype application was piloted in a social welfare institute for Deaf-mute children with 15 participants aged between 7 and 13 years. The experimental results show an accuracy of 97.9% for the proposed application. The quantitative and qualitative analysis of the results also revealed that the face-to-face socialization of Deaf-mutes improved with the intervention of mobile technology. The participants suggested that the proposed application can act as a voice for them and help them socialize with friends and family.

1. Introduction

Historically, the term deaf-mute referred to a person who was either deaf and used sign language for communication or both deaf and unable to speak. The term continues to be used for people who are deaf but have some degree of speaking ability [1]. In the deaf community, the word deaf is spelled in two separate ways: the lowercase "d" deaf refers to a person's audiological level of hearing without association with other members of the deaf community, whereas the capital "D" Deaf indicates culturally Deaf people who use sign language for communication [2].

According to the World Federation of the Deaf (WFD), over 5% of the world's population (≈360 million people) has disabling hearing loss, including 328 million adults and 32 million children [3]. The degree of hearing loss is categorized into mild, moderate, severe, or profound [4]. A person's hearing loss has a direct impact on his/her speech and language development. People with severe or profound hearing loss have higher voice handicap index (VHI) scores than those with mild hearing loss [5]. A person with mild hearing loss has fewer problems in speech development: he/she might not be able to hear certain sounds, but speech clarity is not affected that much. A person with severe or profound hearing loss can have severe problems in speech development and usually relies on sign language as a source of communication.

Deaf people face many irritations and frustrations that limit their ability to do everyday tasks. Research indicates [6] that Deaf people, especially Deaf children, have high rates of behavioral and emotional issues related to different methods of communication. Most people with such disabilities become introverted and resist social connectivity and face-to-face socialization. The inability to speak with family and friends can cause low self-esteem and may result in the social isolation of a Deaf person. Not only do they lack social interactions, but communication is also a major barrier to Deaf-mute healthcare [7]. In such conditions, it becomes difficult for a caretaker to interact with a deaf person.

Different medical treatments are available to the deaf community to address deafness, but these treatments are expensive [8]. A 2017 report of the World Health Organization (WHO) [9] states that several types of costs are associated with hearing loss: (1) direct costs, incurred by healthcare systems for hearing loss, including education support for affected children; (2) indirect costs, the loss of productivity of individuals who are unable to contribute to the economy; and (3) intangible costs, the stigma experienced by families affected by hearing loss. The report concludes that unaddressed hearing loss poses substantial costs to the healthcare system and to the economy as a whole.

Many communication channels are available through which Deaf-mute people can deliver their messages, e.g., notes, helper pages, sign language, books with letters, lip reading, and gestures. Despite these channels, Deaf-mutes and hearing people encounter many problems during communication. The problem is not confined to the Deaf-mute person being unable to hear or speak; another problem is the lack of awareness of Deaf culture among hearing people. The majority of hearing people have little or no knowledge or experience of sign language [10]. There are also more than 300 sign languages, and it is hard for a hearing person to understand and become accustomed to them [11]. These problems can be addressed with assistive technology, which can act as an interpreter, converting sign language into text or speech for better communication between the Deaf community and hearing individuals [12]. Other technologies, such as speech technologies, can help people with hearing loss in different ways by improving their autonomy [13]. A common example of speech technology is speech recognition, also termed automatic speech recognition (ASR): the process of converting a speech signal into a sequence of words with the help of an algorithm [14]. The ASR process comprises three steps: (1) feature extraction, (2) acoustic model generation, and (3) recognition [15, 16]. For feature extraction, MFCC is the most commonly used technique [17, 18]. The success of MFCC makes it the standard choice in state-of-the-art speech recognizers such as HTK [19].

The main purpose of this research is to use mobile-based assistive technology to provide a simple and cost-effective solution for Deaf-mutes with partial or complete speech development. The proposed system uses an HTK-based speech recognizer to identify the speech of Deaf-mutes and provide a communication platform for them. The next two sections explain the related work and the proposed methodology of our system. Section 4 presents the experimental setup and results of the proposed system.

2. Related Work

The Deaf community is not a monolithic group; it comprises a diversity of groups, which are as follows [20, 21]:
(1) Hard-of-hearing people: they are neither fully deaf nor fully hearing, also known as culturally marginal people [22]. They can obtain some useful linguistic information from speech.
(2) Culturally deaf people: they might belong to deaf families and use sign language as the primary source of communication. Their voice (speech clarity) may be disrupted.
(3) Congenital or prelingual deaf people: they are deaf from birth or become deaf before they learn to talk and are not affiliated with Deaf culture. They might or might not use sign-language-based communication.
(4) Orally educated or postlingual deaf people: they were deafened in childhood but developed speaking skills.
(5) Late-deafened adults: they have had the opportunity to adjust their communication techniques as their hearing loss progressed.

Each group of the Deaf community has a different degree of hearing loss and uses a different source of communication. Table 1 details the Deaf community groups together with their degree of hearing loss and their source of communication with others.

Hearing loss or deafness has a direct impact on communication, educational achievement, and social interaction [23]. Lack of knowledge about Deaf culture is documented in society as well as in healthcare environments [24]. Kuenburg et al. also indicated that there are significant communication challenges between healthcare professionals and Deaf people [25]. Improvement in healthcare access for Deaf people is possible by providing sign-language-supported visual communication and by implementing communication technologies for healthcare professionals. Some of the technology-based approaches implemented to provide Deaf-mutes with easy-to-use services are as follows.

2.1. Sensor-Based Technology Approach

Sensors based assistance can be used for solving the social problems of Deaf-mute by bridging the communication gap. Sharma et al. used wearable sensor gloves for detecting the hand gestures of sign language [26]. In this system, flex sensors were used to record the sign language and to sense the environment. The hand gesture of a person activates glove, and flex sensors on glove convert those gestures into electrical signals. The signals are then matched from the database and converted into corresponding speech and displayed on LCD. The cost-effective sensor-based communication device [27] was also suggested for Deaf-mute people to communicate with the doctor. This experiment used a 32-bit microcontroller, LCD to display the input/output, and a processing unit. The LCD displays different hand sign language based pictures to the user. The user selects relevant pictures to describe the illness symptoms. These pictures then convert into patterns and pair with words to make sentences. Vijayalakshmi and Aarthi used flex sensors on the glove for gesture recognition [28]. The system was developed to recognize the words of American Sign Language (ASL). The text output obtained from sensor-based system is converted into speech by using the popular speech synthesis technique of hidden Markov model (HMM). The HMM-based-text-to-speech synthesizer (HTS) was attached to the system for converting the text obtained from hand gestures of people into speech. The HTS system involved training phase for extraction of spectral and excitation parameters from the collected speech data and was modeled by context-dependent HMMs. The synthesis phase of HTS system was used for the construction of HMM sequence by concatenating context-dependent HMMs. Similarly, Arif et al. used five flex sensors on a glove to translate ASL gestures for Deaf-mute into the visual and audio output on LCD [29].

2.2. Vision-Based Technology Approach

Many vision-based technology interventions are used to recognize the sign languages of Deaf people. For example, Soltani et al. developed a gesture-based game for Deaf-mutes using Microsoft Kinect, which recognizes gesture commands and converts them into text so that they can enjoy an interactive environment [7]. The voice for the mute (VOM) system was developed to take fingerspelling as input and convert it into corresponding speech [30]. Images of fingerspelling signs are retrieved from the camera; after noise removal and image processing, the fingerspelling signs are matched against the trained dataset, linked to the appropriate text, and the text is converted into the required speech. Nagori and Malode [31] proposed a communication platform that extracts images from video and converts these images into corresponding speech. Sood and Mishra [32] presented a system that takes images of sign language as input and produces speech as output. The features used in vision-based approaches for speech processing are also used in various object recognition applications [33–39].

2.3. Smartphone-Based Technology Approach

Smartphone technology plays a vital role in helping people with impairments to interact socially and overcome their communication barriers. The smartphone-based approach is more portable and effective than sensor- or vision-based technology. Many new smartphones are equipped with advanced sensors, fast processors, and high-resolution cameras [40]. A real-time emergency assistant, "iHelp" [41], was proposed for Deaf-mute people to report any kind of emergency situation. The user's current location is obtained through the smartphone's built-in GPS. Information about the emergency is sent to the management via SMS and then passed on to the closest suitable rescue unit, so the user can be rescued through iHelp. MonoVoix [42] is an Android application that acts as a sign language interpreter: it captures signs from the mobile phone camera and converts them into corresponding speech. Ear Hear [43] is an Android application for Deaf-mute people that uses sign language to communicate with hearing people through speech-to-sign and sign-to-speech technology. When a hearing person speaks to a Deaf-mute, the spoken input is processed and a corresponding sign language video is played, through which the mute person can easily understand. Bragg et al. [44] proposed a sound detector app that detects alert sounds and notifies the deaf-mute person by vibrating and showing a popup notification.

3. Proposed Methodology

Nowadays, many technology devices, such as smartphone-enabled devices, favor speech interfaces over visual ones. The research in [49] highlighted that off-the-shelf speech recognition systems cannot be used to detect the speech of deaf or hard-of-hearing people because these systems exhibit a high word error rate; it recommended using human-based computation to recognize deaf speech and text-to-speech functionality for speech generation. In this regard, we proposed and developed an Android-based application named vocalizer to mute (V2M). The proposed application acts as an interpreter and encourages two-way communication between a Deaf-mute and a normal person. We refer to a normal person as one who has no hearing or vocal impairment or disability. The main features of the proposed application are listed below.

3.1. Normal to Deaf-Mute Person Communication

This module takes the text or spoken message of a normal person as input and outputs a 3D avatar that performs sign language for the Deaf-mute person. ASL-based animations of the avatar are stored in a central database of the application, and each animation file is given 2–5 tags. The steps of normal to Deaf-mute person communication are as follows (a tag-matching sketch is given after the list):
(1) The application takes the text/speech of the normal person as input.
(2) The application converts the speech message of the normal person into text using the Google Cloud Speech Application Programming Interface (API), as this API detects normal speech better than Deaf persons' speech.
(3) The application matches the text to any of the tags associated with an animation file and displays the avatar performing the corresponding sign for the Deaf-mute.
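As an illustration of step (3), the sketch below matches recognized text against animation tags. The tag sets and file names are hypothetical examples, not the application's actual database entries.

```python
# Illustrative only: matching recognized text to an avatar animation file.
# The tags and file names below are hypothetical examples.
ANIMATIONS = {
    "greet.anim":  {"hello", "hi", "good morning"},
    "thanks.anim": {"thank you", "thanks"},
    "luck.anim":   {"good luck", "best of luck"},
}

def find_animation(recognized_text):
    """Return the animation file whose tags match the recognized text, or None."""
    text = recognized_text.strip().lower()
    for animation_file, tags in ANIMATIONS.items():
        if text in tags or any(tag in text for tag in tags):
            return animation_file
    return None  # no matching sign; the app could fall back to fingerspelling

print(find_animation("Thank you"))  # -> thanks.anim
```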

3.2. Deaf-Mute to Normal Person Communication

Not everyone has knowledge of sign language, so the proposed application uses the disrupted speech of a Deaf-mute person. This disrupted form of speech is converted into a recognizable speech format by a speech recognition system. HMM-based speech recognition is a growing technology, as evidenced by rapidly increasing commercial deployment, and its performance has already reached a level that can support viable applications [50]. For this purpose, HTK [51] is used to develop the speech recognition system, as this toolkit is primarily designed for building HMM-based speech recognizers.

3.2.1. Speech Recognition System Using HTK

The ASR system is implemented using HTK version 3.4.1. The speech recognition process in HTK follows four steps to obtain the recognized speech of the Deaf-mute: training corpus preparation, feature extraction, acoustic model generation, and recognition, as illustrated in Figure 1.

(a) Training Corpus Preparation. The training corpus consists of recordings of speech samples obtained from Deaf-mutes in .wav format. The corpus contains spoken English alphabets (A–Z), English digits (0 to 9), and 15 common sentences used in daily routine life, e.g., good morning, hello, good luck, and thank you. The utterances of one participant are kept separate from those of the others because of the variance in speech clarity among Deaf-mute people. The training utterances of each participant are labeled in a simple text file (.lab), which is used in the acoustic model generation phase of the system.

(b) Acoustic Analysis. The purpose of the acoustic analysis is to convert the speech sample (.wav) into a format suitable for the recognition process. The proposed application uses the MFCC approach for acoustic analysis. MFCC is a feature extraction technique in speech recognition [52]. The main advantages of MFCC are (1) low complexity and (2) good performance with high recognition accuracy [53]. The overall working of MFCC is illustrated in Figure 2 [19].

The steps of MFCC feature extraction are described below.

(1) Pre-Emphasis. The first step of MFCC feature extraction passes the speech signal through a filter. The pre-emphasis filter is a first-order high-pass filter responsible for boosting the higher frequencies of the speech signal:

$$y(n) = x(n) - \alpha\, x(n-1),$$

where $\alpha$ represents the pre-emphasis coefficient, $x(n)$ is the input speech signal, and $y(n)$ is the output speech signal with the high-pass filter applied. Pre-emphasis is important because the high-frequency components of speech have small amplitude with respect to the low-frequency components [54]. Silent intervals are also removed in this step by using a logarithmic technique for separating and segmenting speech from noisy background environments [55].
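For illustration, a minimal NumPy sketch of the pre-emphasis filter is given below; the coefficient value of 0.97 is a commonly used default and an assumption here, as the paper does not report the exact value used.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """y(n) = x(n) - alpha * x(n-1): boosts the high-frequency content."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```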

(2) Framing. The framing process splits the pre-emphasized speech signal into short segments. The voice signal is represented by frames of $N$ samples, and the interframe distance (frameshift) is $M$ samples, with $M < N$. In the proposed application, the frame sample size $N$ and the frameshift $M$ are set according to the configuration in Table 2. Given the sampling frequency $f_s$, the frame size and frameshift in milliseconds are calculated as

$$t_{\text{frame}} = \frac{N}{f_s} \times 1000, \qquad t_{\text{shift}} = \frac{M}{f_s} \times 1000.$$
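A minimal framing sketch under the same assumptions (frame length and shift expressed in samples) might look as follows; the concrete values of $N$ and $M$ are left as parameters because the paper specifies them only in its configuration table.

```python
import numpy as np

def frame_signal(signal, frame_len, frame_shift):
    """Split a 1-D signal into overlapping frames of frame_len samples."""
    signal = np.asarray(signal, dtype=float)
    num_frames = 1 + max(0, len(signal) - frame_len) // frame_shift
    frames = np.zeros((num_frames, frame_len))
    for i in range(num_frames):
        chunk = signal[i * frame_shift : i * frame_shift + frame_len]
        frames[i, :len(chunk)] = chunk  # zero-pad the last frame if it is short
    return frames
```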

(3) Windowing. The speech signal is nonstationary, but it can be treated as stationary over a very short period of time. A window function is used to analyze the speech signal and extract the stationary portion of the signal. Two common types of windows are (i) the rectangular window and (ii) the Hamming window.

The rectangular window cuts the signal abruptly, so the proposed application uses the Hamming window, which shrinks the signal values towards zero at the frame boundaries. The Hamming window $w(n)$ is calculated as

$$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1,$$

and the windowed signal at time $n$ is obtained by

$$y(n) = x(n)\, w(n).$$
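Continuing the sketch, NumPy's built-in Hamming window implements the equation above and can be applied to every frame by broadcasting:

```python
import numpy as np

def apply_hamming(frames):
    """Multiply every frame by a Hamming window of the frame length."""
    window = np.hamming(frames.shape[1])  # 0.54 - 0.46*cos(2*pi*n/(N-1))
    return frames * window                # broadcast over all frames
```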

(4) Discrete Fourier Transform (DFT). The most efficient approach for computing the discrete Fourier transform is the fast Fourier transform (FFT) algorithm, which reduces the computational complexity from $O(N^2)$ to $O(N\log N)$. It converts the discrete speech samples from the time domain to the frequency domain:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1,$$

where $X(k)$ is the Fourier transform of $x(n)$ and $N$ is the length of the DFT.
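A corresponding power-spectrum sketch using NumPy's FFT follows; the DFT length of 512 is an assumed, commonly used value rather than one reported in the paper.

```python
import numpy as np

def power_spectrum(frames, nfft=512):
    """Periodogram-style power spectrum of each windowed frame."""
    magnitude = np.abs(np.fft.rfft(frames, n=nfft, axis=1))  # |X(k)|
    return (magnitude ** 2) / nfft
```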

(5) Mel-Filter Bank Processing. The human ear acts as a set of band-pass filters; i.e., it focuses on certain frequency bands and is less sensitive at higher frequencies (roughly above 1000 Hz). A unit of pitch, the mel, is defined such that pairs of sounds that are perceptually equidistant in pitch are separated by an equal number of mels [56]; the mapping from frequency $f$ (in Hz) to mel is calculated as

$$\text{mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right).$$
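The Hz-to-mel mapping and its inverse, used to place the triangular filters of the mel filter bank, can be sketched as below; the choice of 26 filters and an 8 kHz upper edge are assumptions for illustration only.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Example: edges of 26 triangular filters equally spaced on the mel scale up to 8 kHz.
edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 26 + 2))
```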

(6) Log. This step takes the logarithm of each of the mel-spectrum values. The human ear is less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes, and the logarithm makes the frequency estimates correspondingly less sensitive to slight differences in the input.

(7) Discrete Cosine Transform (DCT). The log mel-spectrum is converted back from the frequency domain to the time domain using the DCT. The result of this conversion is known as the mel frequency cepstral coefficients (MFCC) [57]. The mel frequency cepstrum is calculated as

$$c(n) = \sum_{m=1}^{M} \log\!\big(S(m)\big)\cos\!\left[\frac{\pi n}{M}\left(m - \frac{1}{2}\right)\right], \qquad n = 1, 2, \ldots, C,$$

where $S(m)$ denotes the output of the $m$-th mel filter, $M$ is the number of filters, and $C$ is the number of cepstral coefficients. In the proposed methodology, $C = 12$ because a 12-dimensional feature parameter is sufficient to represent the voice features of a frame [17]. The extraction of the cepstrum via the DCT thus yields 12 cepstral coefficients for each frame. These sets of coefficients are called acoustic vectors (.mfcc), and the acoustic vector files are used for both the training and the testing speech samples. The HTK tool HCopy is run to convert the input speech samples into acoustic vectors. The configuration parameters used for MFCC feature extraction of the speech samples are listed in Table 2.
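The final cepstral step can be sketched with SciPy's DCT, keeping 12 coefficients as in the paper; dropping the zeroth coefficient is a common convention and an assumption here, since the paper does not state how it handles $c(0)$. In the actual system, this whole feature-extraction chain is carried out by HTK's HCopy according to the parameters in Table 2.

```python
from scipy.fftpack import dct

def mfcc_from_fbank(log_fbank_energies, num_ceps=12):
    """log_fbank_energies: (num_frames, num_filters) array of log mel energies."""
    cepstra = dct(log_fbank_energies, type=2, axis=1, norm='ortho')
    return cepstra[:, 1:num_ceps + 1]  # keep c(1)..c(12); dropping c(0) is an assumption
```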

(c) Acoustic Model Generation. This phase provides a reference acoustic model against which comparisons are made to recognize the testing utterances. A prototype is used for the initialization of the first HMM; this prototype is generated for each word of the Deaf-mute dictionary. The HMM topology comprises six active states (with observation functions) and two nonemitting states (the initial and final states with no observation function), and the same topology is used for all HMMs. Single Gaussian observation functions with diagonal covariance matrices are used and are described by a mean vector and a variance vector in a text description file known as the prototype. This predefined prototype file, along with the acoustic vectors (.mfcc) of the training data and the associated labels (.lab), is used by the HTK tool HInit to initialize the HMMs.
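The paper builds and initializes its models with the HTK tools (HInit and the related re-estimation tools). Purely as an illustration of the underlying idea, the sketch below fits one diagonal-covariance Gaussian HMM with six emitting states per word to that word's MFCC frames using the hmmlearn package; the package choice, iteration count, and variable names are assumptions, not the authors' implementation.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_word_models(word_to_features):
    """word_to_features: dict mapping each word to a list of (num_frames, 12) MFCC arrays."""
    models = {}
    for word, utterances in word_to_features.items():
        X = np.vstack(utterances)               # stack all training utterances of this word
        lengths = [len(u) for u in utterances]  # frame count of each utterance
        hmm = GaussianHMM(n_components=6, covariance_type="diag", n_iter=20)
        hmm.fit(X, lengths)                     # six emitting states, diagonal covariances
        models[word] = hmm
    return models
```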

(d) Recognition Phase. HTK provides a Viterbi word recognizer called HVite, which is used to transcribe a sequence of acoustic vectors into a sequence of words. HVite applies the Viterbi algorithm to the acoustic vectors of the MFCC model. The testing speech samples are prepared in the same way as the training corpus: in the testing phase, the speech sample is converted into a series of acoustic vectors (.mfcc) using the HTK tool HCopy. These input acoustic vectors, along with the HMM list, the Deaf-mute pronunciation dictionary, and the language model (text labels), are taken as input by HVite to generate the recognized words.
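For isolated words, the recognition step can be illustrated by scoring a test utterance against every word HMM from the previous sketch and returning the best match; this is a conceptual stand-in for HVite's Viterbi search over the word network, not a reproduction of it.

```python
def recognize(test_features, models):
    """Score a (num_frames, 12) MFCC array against every word HMM; return the best word."""
    scores = {word: hmm.score(test_features) for word, hmm in models.items()}
    return max(scores, key=scores.get)  # word with the highest log-likelihood
```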

3.3. Messaging Service for Deaf-Mute and Normal Person

The application also provides a messaging feature to both Deaf-mute and hearing people. A person can choose between an American Sign Language keyboard and an English keyboard for composing messages. The complete flowchart of V2M is illustrated in Figure 3.

4. Experimental Results and Discussions

4.1. Experimental Setup

The experimental setup for the proposed application V2M required a camera, a mobile phone with the V2M app installed, a laptop (acting as a server), and an instructor to guide the Deaf-mute students. The complete scenario is shown in Figure 4.

A total of 15 students from Al-Mudassir Special Education Complex Baharwal, Pakistan, participated in this experiment. The participating students were between the ages of 7 and 13 and had received some speech training at school. The instructor guided all students in using the mobile application. The experiment consisted of two phases.

4.1.1. Speech Testing Phase

In this phase, the instructor selected the "register voice" option from the app menu and entered a word/sentence or question (label) in the text field of the "register sample" dialog box for which the training speech samples of the participants were taken (see Figure 5(b)). The instructor first used sign language to ask the participants to speak a word/sentence or an answer. The system took 2 to 4 voice samples of each word/sentence. Whenever a participant registered his/her voice, the system acknowledged it with visual support (as in Figure 5(c)). For testing, the researcher asked questions via the V2M app, which displayed an avatar performing sign language so that the Deaf-mute participant could understand the questions (see Figure 5(d)). In response, the participant selected the microphone icon (as shown in Figure 5(e)) to speak his/her answer. The app processed and compared the recorded speech sample with the registered samples; after the comparison, it returned the text and spoke out the participant's answer (see Figure 5(f)).

4.1.2. Message Activity Phase

The participants required minimal support from the instructor in this phase. They easily composed and sent messages by selecting the sign language keyboard (see Figure 5(g)).

4.2. Qualitative Feedback

The researchers designed a questionnaire survey to evaluate the effectiveness of the Deaf-mute application. The survey comprised 12 questions; it was kept short so as not to overwhelm the Deaf-mute students with longer interviews, and because these students had no experience of using any Deaf-mute oriented application. The qualitative feedback is summarized into the following categories (paraphrased from the feedback forms).

Familiarity with Existing Mobile Apps. None of the participants had heard of or used any mobile application dedicated to Deaf-mutes.

Ease of Use and Enjoyment. All participants enjoyed using the app and liked the idea of an avatar performing sign language. Of the 15 students, 12 performed the given tasks quite easily; the other 3 had not used or interacted with mobile devices before and initially found the app difficult, but it became easier for them after the app's functions were demonstrated 2-3 times. Overall, they found the app user-friendly and interactive.

Application Interface. Participants liked the interface of the app. They learned the steps of the app quite quickly, and they also liked the avatar performing a greeting gesture on the home screen.

Source of Communication. All participants used sign language as their primary source of communication. They recommended the mobile application as an additional source of communication and acknowledged that it can be used to convey the message of a deaf-mute to a hearing person.

4.3. Results and Comparative Analysis

The application's training and testing corpora were obtained from the speech samples of Deaf-mutes. The training corpus comprises English alphabets (A–Z), English digits (0 to 9), and 15 common sentences used in daily routine life, e.g., good morning, hello, good luck, and thank you. Every participant uttered each alphabet, digit, and statement 2–4 times, giving a total of 2440 training utterances. The HTK speech recognizer, with HMMs at its backend, was used for training and recognition. For testing, each participant was asked 10 questions to answer, giving a total of 390 testing utterances. The application recorded the answer (speech sample), processed it, and displayed the result (text/speech) for the normal person to understand. The accuracy of the proposed application is evaluated using precision and recall: for the V2M app, precision is the fraction of correctly identified speech samples among all samples detected by the system, whereas recall is the fraction of correctly identified samples among all samples actually uttered. Precision, recall, and accuracy are calculated using the following formulas:

$$\text{precision} = \frac{tp}{tp + fp}, \qquad \text{recall} = \frac{tp}{tp + fn}, \qquad \text{accuracy} = \frac{tp + tn}{tp + fp + fn + tn},$$

where
(1) true positive (tp) refers to words that are uttered by the person and detected by the system;
(2) false positive (fp) refers to words not uttered by the person but detected by the system;
(3) false negative (fn) refers to words that are uttered by the person but not detected by the system;
(4) true negative (tn) refers to everything else.
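As a small worked illustration of these formulas (the counts passed in are placeholders, not the paper's per-participant data from Table 3):

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 from raw detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}

print(evaluation_metrics(tp=95, fp=2, fn=3, tn=0))  # placeholder counts only
```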

The experimental results of the proposed methodology in terms of precision, recall, and accuracy parameters are illustrated in Table 3.

It is observed from Table 3 that the number of registered speech samples has a direct impact on the precision and recall of the application. The overall average precision is 56.79% and recall is 46.79% when the registered sample count for all statements is 2 per participant. The average precision rises to 93.16% and recall to 83.19% for a registered sample count of 3, and the average accuracy in terms of precision and recall is above 97% when the registered sample count for all statements is 4 per participant. The F1-score of the best precision and recall is calculated as

$$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}.$$

Hence, it is deduced that the precision of the application decreases when only a limited number of speech samples (two per statement) of the deaf-mute are taken, and the application performs best when the number of speech samples per statement is greater than 2. The speech recognition methodology of the proposed application is compared with other speech recognition systems in Table 4.

5. Conclusion

Deaf people face many irritations and frustrations that limit their ability to do everyday tasks, and Deaf children have high rates of behavioral and emotional issues related to different methods of communication. The main inspiration behind the proposed application is to remove the communication barrier for Deaf-mutes, especially children. The app takes the speech or text input of a normal person and translates it into sign language via a 3D avatar, and it provides a speech recognition system for the distorted speech of Deaf-mutes. The speech recognition system uses the MFCC feature extraction technique to extract acoustic vectors from speech samples, and the HTK toolkit converts these acoustic vectors into recognized words or sentences using a pronunciation dictionary and a language model. The application is able to recognize Deaf-mute speech samples of English alphabets (A–Z), English digits (0 to 9), and 15 common sentences used in daily routine life, e.g., good morning, hello, good luck, and thank you. It also provides a messaging service for both Deaf-mutes and normal people: Deaf-mutes can use a customized sign language keyboard to compose messages, and the app can convert a received sign language message into text for a normal person. The proposed application was tested with 15 children aged between 7 and 13 years and achieved an accuracy of 97.9%. The qualitative feedback from the children also highlighted that it is easy for Deaf-mutes to adopt mobile technology and that the mobile app can be used to convey their messages to a normal person.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors are thankful to Maaz Khalid and Mustabshira Zia for their valuable assistance in this study. The authors also gratefully acknowledge Al-Mudassir Special Education Complex Baharwal, Pakistan, for providing a platform to test the proposed technique. The authors appreciate the hard work and dedication of the instructors and children who participated in this study. This work was financially supported by the Machine Learning Research Group, Prince Sultan University, Riyadh, Saudi Arabia (RG-CCIS-2017-06-02). The authors are grateful for this financial support and for the equipment provided to make this research successful.