Abstract

The use of slang, abusive, and offensive language has become common practice on social media. Although social media companies have censorship policies for slang, abusive, vulgar, and offensive language, this condemnable practice persists because resources and research on automatic abusive language detection for languages other than English are limited. This study proposes USAD (Urdu Slang and Abusive words Detection), a lexicon-based intelligent framework to detect abusive and slang words in Perso-Arabic-scripted Urdu Tweets. Furthermore, because no standard dataset is available, we also design and annotate a dataset of abusive, offensive, and slang words in Perso-Arabic-scripted Urdu as our second significant contribution for future research. The results show that the proposed USAD model correctly identifies 72.6% of abusive Tweets (recall) with an overall accuracy of 73.75%. Additionally, we identify some key factors that can help researchers improve their abusive language detection models.

1. Introduction

The birth of social media entirely revolutionized the ways and purpose of mass communication [1]. In the early days, mass communication media were used with the ethical and moral responsibilities governed by social norms. Besides that, mass communication media were effectively used for education and training. Social media currently allows every connected individual to express their feelings about anything using Twitter, Facebook, Instagram, blogs, or other social media sites [2]. Recent studies about social media show that people’s lack of tolerance turns into aggression, which leads them to use language that may offend others’ feelings [3]. However, most social media sites have policies for content publication and penalties for policy violations.

Nonetheless, in the case of informal and vulgar textual content, the policies’ violation is hard to detect manually [1] due to an immense number of posts. Moreover, these social media sites also allow users to post their textual content in native languages. According to the survey conducted in 2018, the English language is used in only 32% of all Tweets [4]. Twitter is a microblogging and social networking service that allows users to express their views using a Tweet of 280 characters [2].

Hate speech and offensive and abusive language detection on social media is an active research field [5]. Several studies are available on hate speech and abusive language detection for English [6, 7], Danish [8], Arabic [1, 9], Indonesian [10], and others. These studies used different methodologies to detect abusive and offensive language, such as lexicon-based detection for English [11], n-grams for English [12] and Roman Urdu [5], pattern matching [13], blacklists [7], and others. However, to the best of our knowledge, abusive, offensive, and slang word detection for Urdu in Perso-Arabic script has not been performed due to its complexity.

The Urdu language is challenging due to its morphological and syntactical complexity, as it draws grammatical structures and vocabulary from Persian, Sanskrit, Arabic, and Turkish [14]. Because of this complexity, minimal research has been conducted on Urdu text, especially for abusive word detection [5]. Similarly, there is no publicly available standard dataset in Perso-Arabic-scripted Urdu for offensive text detection.

This study proposes a lexicon-based framework to detect abusive, offensive, and slang words in Perso-Arabic-scripted Urdu Tweets. Additionally, because no standard dataset is available, we also design and annotate a dataset composed of abusive, offensive, and slang words in Perso-Arabic-scripted Urdu as our second significant contribution for future research. The results show that the proposed USAD model correctly identifies 72.6% of abusive Tweets (recall) with a precision of 55.21%. The contributions of this work are as follows:
(1) A lexicon-based framework that detects abusive, offensive, and slang words in Perso-Arabic-scripted Urdu Tweets is proposed.
(2) A dataset composed of abusive, offensive, and slang words in Perso-Arabic-scripted Urdu is designed and annotated.

In Section 2, a brief introduction to the Urdu language is given. Related work is discussed in Section 3, while in Section 4, we discuss the USAD model. The experimentation preliminaries are discussed in Section 5, while in Sections 6 and 7, we discuss the results and the conclusions with future recommendations, respectively.

1.1. Motivation

Urdu is one of the main spoken languages in the subcontinent and the 11th most spoken language in the world [15]. Urdu is also the national and official language of Pakistan, and most users in Pakistan use Perso-Arabic-scripted Urdu on social media. As in other countries, users of social media in Pakistan often use slang and vulgar words in their Tweets [5]. As shown in Figure 1, a single Tweet in the Urdu language contains two offensive words. According to Chapter II, Sections 20 and 21 of Pakistan’s Prevention of Electronic Crimes Act, 2016 (PECA ’16), an offence against the dignity and modesty of a person using any communication medium is a punishable act [16]. Unfortunately, no mechanism is available to automatically detect offensive and vulgar Urdu words written in the Perso-Arabic script. Even though the use of abusive, offensive, and vulgar language in Tweets is punishable under Pakistan’s laws, this condemnable act is still practiced due to the nonavailability of automatic detection mechanisms.

2. Urdu and Perso-Arabic Script

Urdu is one of the South Asian region’s popular languages and Pakistan’s national and official language [15]. Urdu belongs to the Indo-Aryan language family, and colloquially, it is mostly mutually intelligible with conversational Hindi [17]. Formal Urdu draws grammatical structures and vocabulary mainly from Persian and, to a lesser extent, from Sanskrit, Arabic, and Turkish [14]. Like Persian and Arabic, Urdu is written from right to left in Perso-Arabic script, and it has more phonic sounds than Arabic and Persian. Urdu has 40 distinct letters called “Huruf-e-Tahaji,” written in various calligraphic styles such as Nastaliq, Naskh, Reqa, Diwani, and others [18].

Similarly, Hindi, which is mutually intelligible with Urdu, is written in Devanagari script [19]. Before the development of the Urdu character set and keyboard, the Roman script was used to write content in Urdu. Urdu written in Roman script is called Roman Urdu [20], and Hindi written in Roman script is called Romanagari [5].

Due to complex morphological and grammatical structures, diacritics [21], and limited linguistic resources, the Urdu language is mostly neglected by the research community. The first ever 8-bit encoding standard for Urdu, “Urdu Zabta Takhti (UZT) 1.01,” was developed and accepted by the Government of Pakistan in 2000 [22]. Several studies on the Urdu language address tasks such as opinion mining, sentiment analysis, text clustering, and classification. In contrast, only a single study is available for offensive language detection, and it targets Roman Urdu [5]. Therefore, the detection of abusive and offensive language in Perso-Arabic-scripted Urdu is still an open issue.

3. Related Work

The computational linguistics community has given increasing attention to the automatic detection of hate speech, slang words, and offensive and abusive language in online social media. Social media is an open forum where people from different countries, races, nationalities, religions, and cultures can share their opinions and comments. These comments may include offensive or abusive words directed at other users [11]. Therefore, detecting and blocking or censoring this condemnable practice is a crucial issue. In this regard, many studies have been conducted in English [6, 7], Arabic [1, 9], Indonesian [10], and Roman Urdu [5]. This section briefly discusses various studies conducted for automatic hate speech and offensive language detection on social media for different languages.

Recently, automatic detection of hate speech and offensive and abusive words in users’ comments has become a trending research topic. Researchers have used different methodologies such as machine learning techniques, lexicon-based techniques, graph-based techniques, and others to detect abusive words automatically.

Watanabe et al. used unigram and patterns with supervised learning algorithms to classify hateful and clean comments in English [6]. Burnap and Williams used supervised machine learning and a statistical model to detect hate speech on Twitter [23]. They used a combination of rule-based, spatial-based, and probabilistic classifiers to detect the hate speech. Lee et al. proposed a model for abusive word detection using a dictionary of abusive and nonabusive words and unsupervised learning techniques for social media comments in the English language [7].

Chen et al. proposed the Lexical Syntactic Feature (LSF) architecture that uses specific bullying content, writing style, and structure as a feature vector to predict a user’s potential for creating obscene content [24]. Park and Fung used a Character-level Convolutional Network, Word-level Convolutional Network, and Hybrid Convolutional Network to detect racist, sexist, and abusive Tweets in the English language [25], while Mishra et al. used a Graph Convolutional Network with the user’s online community structure and linguistic behavior to detect offensive language [26].

While most approaches work on the English language, some studies are available for other languages; for example, Pelle et al. proposed the “Hate2Vec” approach that uses lexicon and bag-of-words classifiers to detect offensive comments in English and Portuguese [27]. Sigurbergsson and Derczynski proposed a Recurrent Neural Network-based offensive language and hate speech detection model for Danish and English [8]. In contrast, Schneider et al. used a Convolutional Neural Network to detect abusive, insulting, and offensive comments in German [28].

Ibrohim and Budi proposed n-gram and supervised learning-based approaches to detect abusive language in Indonesian social media [10]. Alakrot et al. proposed an n-gram-based model to catch even misspelled offensive and obscene Arabic words and phrases in user comments [9]. For this purpose, they also constructed a dataset of abusive words in the Arabic language to detect antisocial behavior [29]. In comparison, Abozinadah proposed a multidimensional analysis model that uses social graph analysis, statistical analysis, and lexical analysis to detect abusive language in Arabic text [1]. Akhter et al. proposed an offensive language detection model for Roman Urdu based on n-grams and supervised machine learning algorithms [5].

Rizwan et al. proposed a CNN-gram-based model to detect hate speech and offensive language in Roman Urdu [30]. They tested their model on the RUHSOLD (Roman Urdu Hate Speech and Offensive Language Detection) dataset and compared its performance against seven baseline models. Abbas performed experiments using multiple machine learning algorithms to detect toxic (offensive) comments in Roman Urdu [31]. He reported that Random Forest gave 96.4% accuracy with the character 4-gram technique. Kausar et al. proposed “ProSOUL,” a framework to identify propaganda in online Urdu content [32]. They proposed a Linguistic Inquiry and Word Count dictionary to detect psycholinguistic features of propaganda in Urdu content.

Offensive language and hate speech detection is an important issue, especially on social media, which can influence users’ behavior and reactions. Unfortunately, most of the research has focused on automatic offensive language detection for resource-rich languages [5] such as English, Danish, and German, while publications on other languages are rare. Therefore, in this research, we propose automatic abusive and slang word detection for Perso-Arabic-scripted Urdu.

4. Urdu Slang and Abusive Word Detection (USAD) Model

Urdu is one of the major languages of the subcontinent and the national language of Pakistan. Most social media users from Pakistan prefer to comment in their native Urdu language in Perso-Arabic script. As in other languages, the use of abusive and offensive phrases is very common in Urdu comments. In this section, we discuss the working of the proposed USAD model.

4.1. Working of the USAD Model

The proposed USAD model is divided into two major phases, i.e., the lexicon building phase and the data testing phase. In the lexicon building phase, we crawled and collected Tweets posted in Perso-Arabic script using the Twitter Application Programming Interface (API). For dataset preparation, the tweets are saved in a text file using UTF-8 (Unicode Transformation Format, 8-bit) encoding and forwarded to the preprocessing module. In the data preprocessing step, stop words, punctuation marks, digits, and nonlanguage characters are removed, and each tweet is tokenized and handled as a single entity. After data cleansing, a lexicon of Urdu abusive and slang words is created manually. The details of lexicon creation are discussed in Section 5.1. In the data testing phase, clean tweets are given as input to the classification module. In the classification module, each word of an input tweet is tested against the abusive words lexicon to classify the tweet as abusive or nonabusive. The architecture of the proposed USAD model is shown in Figure 2.
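The preprocessing step described above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the authors' tool: the stop-word list is a tiny hypothetical sample (the paper does not publish its list), and "language characters" are approximated as letters and combining marks in the Unicode Arabic block (U+0600 to U+06FF).

```python
import unicodedata
from typing import List

# Hypothetical, tiny stop-word sample for illustration only; the paper does
# not publish the stop-word list it used.
STOP_WORDS = {"کا", "کی", "کے", "ہے", "اور", "میں", "سے", "کو"}


def preprocess(tweet: str) -> List[str]:
    """Return the cleaned word tokens of one tweet."""
    kept = []
    for ch in tweet:
        category = unicodedata.category(ch)
        # Assumption: keep only letters (L*) and combining marks (M*) from the
        # Arabic block; digits, punctuation, Latin text, and emoji become spaces.
        if "\u0600" <= ch <= "\u06FF" and category[0] in ("L", "M"):
            kept.append(ch)
        else:
            kept.append(" ")
    tokens = "".join(kept).split()
    # Drop stop words after tokenization.
    return [tok for tok in tokens if tok not in STOP_WORDS]
```

One caveat of this simplification: characters outside the chosen block, such as the zero-width non-joiner that some Urdu keyboards insert inside words, are treated as separators, so a production tool would likely need a more careful character policy.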

5. Experimentation Preliminary

A Python-based lexicon building and testing tool was developed to implement the proposed USAD model for abusive and offensive Urdu tweet detection. The abusive word dictionary is used to classify the tweets into abusive and nonabusive classes. This section explains how the Urdu abusive and slang words lexicon and the testing dataset were created. Furthermore, to evaluate the proposed USAD model’s performance, we used standard machine learning performance evaluation parameters, i.e., precision, recall, F-measure, and accuracy.

5.1. Dataset and Lexicon Creation

For the abusive and slang words lexicon, we collected more than 5000 Tweets and replies posted by famous politicians, journalists, analysts, and intellectuals on different topics between October 2019 and December 2019. The tweets are then saved into a text file with the UTF-8 encoding scheme. After applying the preprocessing and data cleansing steps, an abusive words lexicon is created manually. The abusive word lexicon is composed of 2533 abusive words of the Urdu language posted over this three-month period.

For testing, we built a dataset composed of 1200 Tweets and replies posted on different topics during the same period, i.e., between October 2019 and December 2019. For data cleansing, the same preprocessing steps are used. After data cleansing, we manually annotated the dataset into abusive and nonabusive classes for result comparison. After manual annotation, the testing data are supplied to the classification algorithm, which uses a string-matching method for Tweet classification. The details of the lexicon building dataset and the testing dataset are given in Table 1. Similarly, examples of Tweets with abusive words in Perso-Arabic Urdu are given in Table 2.
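The string-matching classification described above can be sketched in Python as follows. This is a minimal illustration that assumes exact token-level matching against a set; the function name `classify_tweet` and the placeholder lexicon entries are hypothetical and stand in for the manually built lexicon.

```python
from typing import Iterable, List, Set, Tuple


def classify_tweet(tokens: Iterable[str], lexicon: Set[str]) -> Tuple[str, List[str]]:
    """Return the predicted class and the lexicon words that triggered it."""
    # A tweet is labelled abusive as soon as at least one token is in the lexicon.
    hits = [tok for tok in tokens if tok in lexicon]
    label = "abusive" if hits else "nonabusive"
    return label, hits


# Placeholder entries stand in for real lexicon words (assumption for illustration).
lexicon = {"ABUSIVE_WORD_1", "SLANG_WORD_2"}
print(classify_tweet(["some", "ABUSIVE_WORD_1", "text"], lexicon))
print(classify_tweet(["some", "harmless", "text"], lexicon))
```

Storing the lexicon in a set keeps each lookup at roughly constant time, so classifying a tweet is linear in its number of tokens. Returning the matched words alongside the label also makes misclassified tweets easier to inspect.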

5.2. Performance Evaluation Metrics

For performance evaluation of the proposed USAD model for detecting abusive and slang Urdu tweets, we used standard machine learning performance evaluation parameters, i.e., precision, recall, F-measure, and accuracy. Precision shows how many of the tweets identified as abusive are actually abusive, and recall shows how many of the actually abusive tweets are correctly identified as abusive, while F-measure is the harmonic mean of precision and recall [20, 33]. The equations of the selected evaluation parameters are given as follows:

\[ \text{Precision} = \frac{TP}{TP + FP} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \]

\[ \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

where TP stands for true positive, FP stands for false positive, TN stands for true negative, and FN stands for false negative samples.
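As a small illustration, the four metrics can be computed from paired lists of manual and predicted labels as sketched below. The function name `evaluate` and the label strings are assumptions made for this example; the zero-division guards are a convention chosen here, not something the paper specifies.

```python
from typing import Dict, Sequence


def evaluate(actual: Sequence[str], predicted: Sequence[str],
             positive: str = "abusive") -> Dict[str, float]:
    """Compute precision, recall, F-measure, and accuracy for one positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # abusive tweet flagged as abusive
        elif p == positive:
            fp += 1          # nonabusive tweet flagged as abusive
        elif a == positive:
            fn += 1          # abusive tweet missed
        else:
            tn += 1          # nonabusive tweet left unflagged
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy}
```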

6. Results and Discussions

This research proposes a lexicon-based framework to detect abusive, offensive, and slang words in Perso-Arabic-scripted Urdu Tweets. For experimentation, we built a Python-based testing environment for both lexicon building and Tweet classification (Section 5). For lexicon building, we crawled more than 5000 Urdu Tweets and made a lexicon of 1250 abusive and slang words, while for testing, we took 1200 Urdu Tweets and manually annotated them into abusive and nonabusive classes. After manual classification, the dataset is supplied to the testing module for automatic classification using the abusive lexicon. This section discusses the effectiveness of the proposed USAD model in the automatic detection of abusive Urdu Tweets.

The results show that, out of 365 abusive Tweets, the proposed USAD model correctly identified 265 as abusive, while out of 835 nonabusive Tweets, it correctly identified 620 as nonabusive. The actual and predicted Tweet counts are given in Table 3. In terms of precision and recall, USAD correctly identified 72.6% of the abusive Tweets (recall) with a precision of 55.21%. Overall, the proposed USAD model classified 73.75% of all Tweets correctly as abusive or nonabusive (accuracy). The precision, recall, F-measure, and accuracy of the proposed USAD model are depicted in Figure 3.
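The reported percentages can be checked directly from the counts in the paragraph above. The false-negative and false-positive counts are not stated explicitly; they are derived here as 365 − 265 = 100 and 835 − 620 = 215. The F-measure value is computed from these counts for illustration and is not quoted from the text.

```latex
\begin{align*}
TP &= 265, \quad FN = 365 - 265 = 100, \quad TN = 620, \quad FP = 835 - 620 = 215 \\
\text{Precision} &= \frac{TP}{TP + FP} = \frac{265}{480} \approx 55.21\% \\
\text{Recall} &= \frac{TP}{TP + FN} = \frac{265}{365} \approx 72.60\% \\
\text{F-measure} &= \frac{2\,TP}{2\,TP + FP + FN} = \frac{530}{845} \approx 62.72\% \\
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{885}{1200} = 73.75\%
\end{align*}
```

The F-measure line uses the count form, which is algebraically equal to the harmonic mean of precision and recall.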

The USAD model correctly identified 72.6% of abusive Tweets as abusive and 74.3% of nonabusive Tweets as nonabusive. Upon investigation of the misclassified Tweets, it was found that they were misclassified due to the limited abusive words lexicon, proverbs and quotes, contextual abusive words, abusive terms of other languages, and misspelled abusive words. These findings are discussed in this section, with examples in Table 4.

6.1. Limited Abusive Words Lexicon

One significant cause of Tweet misclassification is the limited number of abusive words in the abusive words lexicon. The creation of new slang, abusive, and vulgar terms is an ongoing process in any language, just as in literature. Additionally, some slang words are event-driven, and some are contextual. For this research, we took 5220 Tweets in the Urdu language to develop an abusive language lexicon composed of 1250 abusive words posted in three months (October to December 2019).

6.2. Proverbs and Quotes

Another major cause of misclassification is that the abusive words lexicon has no knowledge of proverbs and quotes. Urdu has a rich history stemming from diverse cultures that coexisted for a long period within the same region. As Urdu also combines various languages of the subcontinent, it has a rich collection of proverbs and quotes from almost every language, especially Persian. Some proverbs and quotes contain words that are listed as abusive but are not abusive in that usage. Because our lexicon is built at the level of individual abusive words, many positive quotes and proverbs are identified as abusive Tweets.

6.3. Contextual Abusive Words

In most languages, abusive words often evolve from an event or a context, such as sarcasm. People typically make fun of the actions or statements of an individual to defame them. Such contextual abusive words are hard to identify because the lexicon carries no contextual information.

6.4. Abusive Terms of Other Languages

Another important reason for misclassification is the use of abusive words from languages other than Urdu or written in Roman script. Many users include abusive words from their native languages, such as Pashto, Punjabi, Sindhi, and Hindko, or abusive English words in their Tweets. As our lexicon is Urdu based, Tweets with abusive words in other languages are identified as nonabusive.

6.5. Misspelled Abusive Words

Currently, the misspelling of words is a prevalent practice on social media. Users are often careless about spelling in their posts because human readers can still infer the intended meaning. A computer that relies on exact string matching, however, cannot identify abusive words that are spelled incorrectly.

7. Conclusions

In this work, we proposed the USAD model for automatic detection of abusive Tweets posted in Perso-Arabic-scripted Urdu. For experimentation, we used a lexicon of abusive Urdu words composed of 1250 words and a testing dataset consisting of 1200 manually annotated Tweets (365 abusive and 835 nonabusive). The results show that the proposed USAD model correctly identifies 72.6% of abusive Tweets (recall) with a precision of 55.21%. Upon investigation of the misclassified Tweets, we found that they were misclassified due to the limited abusive words lexicon, the absence of Urdu proverbs and quotes in the lexicon, contextual abusive words, abusive terms of other languages, and misspelled abusive words.

It is concluded that the proposed USAD model’s performance can be improved with a larger abusive lexicon. Moreover, including abusive terms from other languages used in Pakistan, such as Pashto, Punjabi, English, and others, can also improve the proposed model’s performance, as people usually prefer abusive terms of their mother tongue. Similarly, including likely misspellings of abusive words in the lexicon should also improve the model’s performance. Furthermore, a lexicon of proverbs and quotes used alongside the abusive lexicon can help the model decide the class of a phrase; for this purpose, phrase-level matching will be appropriate. The most challenging task will be handling contextual abusive terms and sarcastic terms, as these terms are based on some event, and usually, the words used in Tweet replies are not directly connected with the base Tweets.
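One way the phrase-level idea above might work is to mask known proverbs and quotes before word-level lexicon matching. The Python sketch below is only an assumption about how such a step could be combined with the lexicon check; the function names and the placeholder tokens are hypothetical, and proverbs are represented as tuples of tokens.

```python
from typing import List, Sequence, Set, Tuple


def mask_phrases(tokens: Sequence[str], phrases: Set[Tuple[str, ...]]) -> List[str]:
    """Drop every token span that exactly matches a known proverb or quote."""
    # Try longer phrases first so a short phrase cannot pre-empt a longer one.
    ordered = sorted(phrases, key=len, reverse=True)
    remaining: List[str] = []
    i = 0
    while i < len(tokens):
        for phrase in ordered:
            if phrase and tuple(tokens[i:i + len(phrase)]) == phrase:
                i += len(phrase)      # skip the whole matched phrase
                break
        else:
            remaining.append(tokens[i])
            i += 1
    return remaining


def classify_with_proverbs(tokens: Sequence[str], abusive_lexicon: Set[str],
                           proverb_phrases: Set[Tuple[str, ...]]) -> str:
    """Word-level lexicon check applied only to tokens outside known phrases."""
    remaining = mask_phrases(tokens, proverb_phrases)
    return "abusive" if any(t in abusive_lexicon for t in remaining) else "nonabusive"


# Placeholder tokens stand in for real Urdu words (assumption for illustration).
proverbs = {("P1", "LISTED_WORD", "P2")}
print(classify_with_proverbs(["P1", "LISTED_WORD", "P2"], {"LISTED_WORD"}, proverbs))
print(classify_with_proverbs(["x", "LISTED_WORD"], {"LISTED_WORD"}, proverbs))
```

Exact phrase matching is brittle for paraphrased or partially quoted proverbs, so this should be read as a starting point rather than a complete solution.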

For future work, we aim to enhance the lexicon with more abusive words, including abusive words from other local languages in both Perso-Arabic and Roman scripts. We also aim to create a lexicon of Urdu proverbs and quotes to improve the model’s performance by differentiating abusive words from quotes or proverbs. To solve the problem of misspelled abusive words, edit distance and n-gram approaches are potential candidates. For the sarcastic and contextual abusive words problem, a semantic graph-based method is a promising direction.
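As a rough illustration of the edit-distance candidate mentioned above, the Python sketch below matches a token against the lexicon while tolerating a small number of character edits. It uses the standard Levenshtein distance; the threshold `max_dist=1`, the function names, and the placeholder words are assumptions for this example, not part of the evaluated model.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_in_lexicon(token: str, lexicon: Iterable[str], max_dist: int = 1) -> bool:
    """True if some lexicon word is within max_dist edits of the token."""
    return any(
        # A length gap larger than max_dist already rules the word out.
        abs(len(token) - len(word)) <= max_dist
        and edit_distance(token, word) <= max_dist
        for word in lexicon
    )


# Placeholder strings stand in for real lexicon entries (assumption).
print(fuzzy_in_lexicon("baddword", {"badword"}))
print(fuzzy_in_lexicon("hello", {"badword"}))
```

A tolerance of even one edit can turn short harmless words into false matches, so the threshold would likely need to scale with word length and be validated on annotated data.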

Data Availability

All the data used to support the findings of the study are available in the manuscript.

Conflicts of Interest

All the authors declare that they have no conflicts of interest related to this study.

Acknowledgments

The authors would like to thank and acknowledge the support provided by King Saud University, Riyadh, Saudi Arabia, through Researchers Supporting Project number RSP-2020/184.