Part-of-speech (PoS) tagging is a core component of many natural language processing (NLP) applications. PoS taggers serve as a preprocessing step in various NLP tasks, such as syntactic parsing, information extraction, machine translation, and speech synthesis. In this paper, we examine the performance of a modern standard Arabic (MSA) based tagger on classical (i.e., traditional or historical) Arabic. We employed the Arabic model of the Stanford tagger to evaluate the tagging of imperative verbs in the Holy Quran. The Arabic model of the Stanford tagger contains 29 tags; however, this work experimentally evaluates just one, namely VB ≡ imperative verb. The testing set contains 741 imperative verbs, which appear in 1,848 positions in the Holy Quran. Although the previously reported accuracy of the Arabic model of the Stanford tagger is 96.26% for all tags and 80.14% for unknown words, the experimental results show an accuracy of only 7.28% for the imperative verbs. This result motivates further research into why the tagging is so inaccurate for classical Arabic. The performance decline might indicate the need to distinguish between classical Arabic and MSA training data for NLP tasks.

1. Introduction

Part-of-speech (PoS) tagging, also known as word-category disambiguation, is the process of determining the tag of each word in a given input text. The tagging process uses the context to label words with syntactic tags, such as noun, adjective, verb, or preposition, which are also known as parts of speech, word classes, grammatical categories, lexical class markers, or syntactic categories. Tagging is performed either manually by linguistic experts or automatically by machine learning algorithms; this work considers the computational track. Word tags are mainly used to describe the words and their roles according to the context for further processing. That is, each word has a particular role based on its position and the adjacent words in the sentence. The tagset is a predefined list that generally includes categories such as nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions, and the definite and indefinite articles (sometimes called “determiners”). The tagset is prepared by linguists of the language to describe its word classes. The size of the tagset is variable and depends on the requirements or the capacity of the applications being developed. In any case, the tagset should best fit and efficiently serve the intended purposes. Hence, there is no predefined tagset for all languages and there is no standard (i.e., unique) tagset for a given language. Rather, it is a matter of debate.

PoS tagging is increasingly becoming a vital factor in natural language processing (NLP) applications. One objective of PoS tagging is to create knowledge base resources (e.g., tag relationships) that can later be used in other NLP tools. PoS tagging also plays many roles in NLP as a basic preprocessing step. For instance, applications based on PoS tagging include syntactic parsing, information extraction, machine translation, speech synthesis, and named entity recognition (NER). This work aims to explore the performance of PoS tagging for classical Arabic using a modern standard Arabic (MSA) tagger, namely the Stanford tagger [1]. Since evaluating the Stanford tagger for all 29 tags would require a large annotated corpus, the Quranic imperative verbs were chosen for the evaluation. The Stanford tagger uses the label VB to mark imperative verbs. That is, this work is restricted to a testing dataset that contains a list of all imperative verbs in the Holy Quran, obtained from [2]. This work is distinguished by presenting an experimental study of tagging performance on classical Arabic using one of the freely available taggers, which makes the results available for comparison purposes. This work also aims to present the tagging problem from different points of view, such as the benefits and challenges of Arabic PoS tagging, tagset capacities, tagging algorithms, and the recent studies in this field.

In spite of the importance of taggers’ performance for both classical Arabic and MSA, few studies have explored the accuracy for classical Arabic. Instead, most of the previous studies focused on the tagsets and the tagging approaches. For instance, one study [3] proposed an Arabic tagset with detailed hierarchical levels of the categories and their relationships (i.e., a tree of different levels). As indicated, this study focuses on the imperative verbs in the Holy Quran. The reason for choosing the imperative is that annotated testing collections are easier to find, owing to the previous efforts of Arabic scholars in Quranic studies. In addition, Arabic has a distinct form for imperative verbs, whereas in English the imperative shares its form with the present tense. For instance, English uses the verb “go” as both an imperative and a present verb, while in Arabic the imperative is “اذهب” and the present is “يذهب”, which differ in both written form and tense.

Even though the documentation of the Stanford tagger [4] indicates that the accuracy of the Arabic model is 96.26% on an MSA test portion, as described in [5], and 80.14% for unknown words, our measurement shows a far lower accuracy. In this work, the Stanford tagger scored only 7.28% accuracy on a collection of Arabic imperative verbs. It is worth noting that the Stanford tagger works at the word level (i.e., the tag is given to the whole word rather than to its parts, such as prefixes, stems, and suffixes, as some other taggers do). Although diacritics play an important role in the tagging process, they are discarded in this work since the Stanford tagger does not consider the diacritics of the input text. However, we do keep the Hamza (e.g., أ) and the Madd (آ) symbols on the corresponding characters. That is, the testing dataset is a nonvocalized Arabic text. The outcome of this work highlights the importance of reinvestigating the tagging problem for the Arabic language, since many previous studies report accuracies in the nineties. Such a reinvestigation covers different aspects: the training data (classical or MSA), the tagsets, the corpora sizes, etc.
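The removal of diacritics while keeping the Hamza and Madd characters can be illustrated with a minimal Python sketch. It is not the authors' program: the function name and the exact set of removed code points (U+064B to U+0652, plus the superscript alef and the tatweel) are assumptions, and Quranic annotation marks in other Unicode ranges may need to be added depending on the source text.

```python
import re

# Combining marks removed (an assumed set): tanween, fatha, damma, kasra,
# shadda, and sukun (U+064B..U+0652), the superscript alef (U+0670),
# and the tatweel/kashida (U+0640).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")


def strip_diacritics(text: str) -> str:
    """Return `text` without Arabic diacritics.

    No Unicode normalization is applied, so precomposed letters such as
    U+0623 (alef with hamza above) and U+0622 (alef with madda) are kept.
    """
    return _DIACRITICS.sub("", text)


# Diacritized "went" (U+0630 U+064E U+0647 U+064E U+0628 U+064E)
# becomes the bare three-letter form.
print(strip_diacritics("\u0630\u064e\u0647\u064e\u0628\u064e"))
```

Because the Hamza and Madd are encoded as part of precomposed letters, leaving the text unnormalized is what preserves them; decomposing the text first would turn them into separate combining marks.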

The rest of this paper is organized as follows. In the next section, we demonstrate the benefits of the tagging for various NLP applications. In Section 3, we present why tagging is a challenging task. We exhibit the literature review in Section 4 followed by the Stanford tagset in Section 5. The proposed method is described in Section 6 and the experimental results in Section 7. Finally, we conclude in Section 8.

2. The Benefits of Tagging

PoS tagging is at the core of many NLP algorithms because of the useful information it gives about a word and its neighbors. NLP applications employ the output of PoS tagging for different purposes, such as checking the correctness of the syntactic structure around a word. For instance, in Arabic the adjective follows the noun, while in English the adjective precedes the noun, as in “A beautiful school مدرسة جميلة”. Similarly, in English the subject precedes the verb, as in “He runs fast”, while Arabic allows both orders, such as “المعلم يكتب الدرس the teacher writes the lesson” and “يكتب المعلم الدرس the teacher writes the lesson.” Accordingly, Google Translate gives the same translation for the two Arabic sentences despite their different word orders. Hence, knowing the syntax of word order is extremely important for some NLP applications since it limits the output candidates and increases the probability of a correct answer. The following are some NLP applications that utilize PoS tagging:
(i) Capturing common syntactical rules: [6] presents a data mining based method to extract the common syntactical rules in the Holy Quran. The study reported that the common relationship between the words’ tags (i.e., the common rule) is tag1=RP tag2=NN tag3=WP 91 ⇒ tag4=VBD 90 accuracy (0.97912). For more information on the tags, the reader is referred to Section 5 of this paper.
(ii) Enhancing the performance of speech recognition: [7] employs PoS tags to generate new words based on the tags of neighboring words. The study compounded nouns that are followed by adjectives, and prepositions that are followed by any word. After recognition, the compound words were split back into their original two parts. This method was reported to enhance performance.
(iii) Named entity recognition (NER): [8] employs a tagger for NER. NER aims at extracting names such as those of people, organizations, locations, cities, or companies. NER is beneficial for certain applications, such as classifying content for news providers, which facilitates categorization and content discovery. NER also speeds up the search process in sizeable data that contains, for instance, millions of articles. Other applications include powering content recommendations, customer support, and research papers.
(iv) Syntactic parsing: [9] employs PoS tagging for syntactic parsing. Syntactic parsing is a process to confirm that the input sentence follows the formal grammar of the language. Figure 1 shows a parsing tree for a simple sentence. The parsing tree represents the syntactic structure of the text and is mainly used for analyzing the input sentence.
(v) Other PoS tagging based applications include semantic role labeling [10], speech synthesis [11], speech recognition [12], information extraction [13], summarization [14], sentiment analysis, also called opinion mining [15], diacritization [16], software engineering [17], question answering [18], translation [19], plagiarism detection [20], key phrase extraction [21], ontology [22], and extracting Arabic noun compounds [23].

3. The Challenge of Tagging

The fact that a word can take different tags makes PoS tagging a challenging task. That is, a word can be labeled with different tags depending on the context. Therefore, the goal of PoS tagging algorithms is to remove such ambiguity and label the words correctly. Table 1 shows some examples of words that take different tags based on the context. As shown in the table, the word “gold ذهب” in sentence 1 is tagged as VBD (verb, past tense) while it is tagged as NN (noun, singular or mass) in sentence 2. Similarly, the word “Said سعيد” in the first sentence is tagged as NNP (proper noun, singular) while it is tagged as JJ (adjective) in sentence 3. This shows how a particular word can have different labels, which is the challenge of the PoS tagging process. Hence, the problem of PoS tagging is to resolve ambiguities by choosing the proper tag in light of the surrounding words. The absence of diacritics in the formal Arabic writing system adds even more ambiguity. For instance, with diacritics it is clear that the word “gold ذَهَبْ” is a noun and that the word “went ذَهَبَ” is a verb. The figure also shows the tagging output for the translated sentences using the English model of the Stanford tagger.

4. Literature Review

Despite the importance of PoS tagging for both MSA and classical Arabic, most of the previous tagging studies have focused on MSA. In addition, the literature shows active research on suitable tagsets that truly reflect the linguistic items of Arabic as one of the morphologically rich languages. In this review, we present up-to-date Arabic tagging research that focuses on the main aspects and components, such as the type of the training text (i.e., MSA, classical, tweets), tagsets, tagging algorithms, unknown words, and stemming. In [30], the study indicated that stemming (i.e., removing prefixes and suffixes) enhances the tagging performance. In [31], the study presented a method to tag tweets, which are usually written without following the formal spelling of the language. In [28], the study considered a method to handle “unknown words”, which are words that did not appear in the training corpus. In [26], the study considered the problem that arises when estimating the transition probabilities from limited amounts of training data. The study proposed a decision tree based method to handle this problem, which generally occurs in the hidden Markov model (HMM) tagging technique. In [32], the study implemented the master-slave technique for PoS tagging, using an HMM as the master tagger and maximum match (MM) and Brill taggers as slaves. There are many approaches to PoS tagging; the most widely used is the statistical approach based on the HMM. Another approach is nonstatistical and relies on a number of hand-crafted disambiguation rules to find the most appropriate tag for each word, as in [33].
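To make the HMM-based statistical approach more concrete, the following Python sketch decodes the most probable tag sequence with the Viterbi algorithm for a bigram HMM. It is only an illustration: the function name, the tiny tagset, and all probability values are invented for this example and are not estimated from any corpus. It is also not the Stanford tagger's method, which is a maximum entropy model.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under a bigram HMM.

    Missing probabilities are replaced by `floor` so that logarithms stay finite.
    """
    if not words:
        return []

    def lg(p):
        return math.log(max(p, floor))

    # best[i][t] = (log-probability of the best path ending in tag t, previous tag)
    best = [{t: (lg(start_p.get(t, 0.0)) + lg(emit_p[t].get(words[0], 0.0)), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p][0] + lg(trans_p[p].get(t, 0.0))) for p in tags),
                key=lambda item: item[1],
            )
            column[t] = (score + lg(emit_p[t].get(words[i], 0.0)), prev)
        best.append(column)

    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Toy model (hypothetical numbers): the first word is ambiguous between VBD and NN.
TAGS = ["VBD", "NN", "NNP"]
START = {"VBD": 0.6, "NN": 0.3, "NNP": 0.1}
TRANS = {
    "VBD": {"VBD": 0.1, "NN": 0.3, "NNP": 0.6},
    "NN": {"VBD": 0.3, "NN": 0.4, "NNP": 0.3},
    "NNP": {"VBD": 0.4, "NN": 0.4, "NNP": 0.2},
}
EMIT = {"VBD": {"ذهب": 0.5}, "NN": {"ذهب": 0.5}, "NNP": {"سعيد": 0.7}}

print(viterbi(["ذهب", "سعيد"], TAGS, START, TRANS, EMIT))
```

In this toy model the emission probabilities for the ambiguous word are equal, so the decision between VBD and NN is made entirely by the start and transition probabilities, which is the role that context plays in an HMM tagger.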

Recent studies of part-of-speech tagging cover different aspects. For instance, [34] developed a part-of-speech tagger for the Arabic heritage and reported an accuracy of 96.22%. They also reported that most of the tagging errors result from segmentation. Reference [35] employs part-of-speech tagging to enhance the performance of Arabic text classification. Reference [36] demonstrates part-of-speech tagging for the Arabic Gulf dialect. For the tagging process, they employ a Support Vector Machine (SVM) classifier and a bidirectional Long Short-Term Memory (Bi-LSTM) network. Reference [37] presents a tagging based study on the identification of Arabic dialects. Reference [38] uses part-of-speech and semantic tagging to extract features for training Neural Machine Translation.

Table 2 presents some information regarding tagging systems, such as tagging algorithms, tagsets, corpora, and accuracies. We are aware that the accuracies are not directly comparable since each work has its own corpus; nevertheless, reporting these measures might give an indication of the overall accuracy of Arabic PoS tagging. Similarly, although the tagset size is important, it is more important to have a training set large enough to cover the tags used; otherwise, zero values might be assigned to the HMM transition probabilities, which raises a tagging problem.
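The zero-probability issue can be illustrated with a short Python sketch that estimates HMM transition probabilities with add-k smoothing, one common remedy. This is a generic illustration rather than a technique taken from the studies in Table 2; the function name and the toy corpus are hypothetical.

```python
from collections import Counter


def transition_probs(tagged_sentences, tagset, k=1.0):
    """Estimate P(next_tag | tag) from tagged sentences with add-k smoothing.

    tagged_sentences: iterable of lists of (word, tag) pairs.
    tagset: list of all tags, including tags never seen in the data.
    """
    bigrams = Counter()
    unigrams = Counter()
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        for current, following in zip(tags, tags[1:]):
            bigrams[(current, following)] += 1
            unigrams[current] += 1
    size = len(tagset)
    return {
        a: {b: (bigrams[(a, b)] + k) / (unigrams[a] + k * size) for b in tagset}
        for a in tagset
    }


# Toy corpus (hypothetical): the tag VB never follows NN in the data.
corpus = [[("w1", "NN"), ("w2", "VBD")], [("w3", "NN"), ("w4", "NN")]]
probs = transition_probs(corpus, ["NN", "VBD", "VB"])
print(probs["NN"])
```

Without smoothing (k = 0), the transition from NN to VB would receive probability zero and any path through it would be ruled out; with k = 1 it receives a small nonzero share, and each row still sums to one.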

5. Stanford Tagset

As indicated in the literature review, many tagsets have been used in previous studies. The tags are mainly divided into two classes (i.e., categories): the closed class and the open class. The closed class has a fixed membership, such as prepositions, while the open class can accept new words, especially in technology fields, such as “to fax”. Table 3 shows the 29 tags of the Arabic model of the Stanford tagger.

6. The Proposed Method

This section presents the steps that we followed to measure the performance of the Stanford tagger on the Quranic imperative verbs. The first step is the tagging process, which produces an annotated text file of all the Quranic sentences. Then, we used a number of Python programs to extract the correctly tagged and the wrongly tagged imperative verbs. The textual version of the Holy Quran is obtained from the website of the Quran Printing Complex, Saudi Arabia [36]. Algorithm 1 summarizes the implemented steps.

The proposed algorithm
1. Obtain the text of the Holy Quran from [24] and remove the diacritics.
2. Install the full version of the Stanford Arabic model tagger from [25].
3. Have the text of the Holy Quran tagged.
4. Obtain a list of all imperative verbs in the Holy Quran from [2].
5. Find all words that have the tag VB ≡ imperative verb.
6. Compare the list obtained in step 5 with the list obtained in step 4 to find the correctly tagged imperative verbs.
7. Find the accuracy based on the information that is obtained in step 6.
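
Steps 5 to 7 of the algorithm can be sketched in Python as follows. This is not the authors' code: the function names are hypothetical, the tagger output is assumed to consist of whitespace-separated word/TAG tokens, and the accuracy is assumed to be the share of distinct reference imperative verbs that received the VB tag at least once, which is only one plausible reading of step 7.

```python
def parse_tagged(text, sep="/"):
    """Split tagger output such as 'word/TAG word/TAG' into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, found, tag = token.rpartition(sep)
        if found and word:  # skip tokens that carry no tag
            pairs.append((word, tag))
    return pairs


def imperative_accuracy(tagged_text, gold_imperatives, tag="VB"):
    """Compare the words tagged `tag` (step 5) with the reference list (step 4).

    Returns (accuracy, correctly tagged verbs, missed verbs), where accuracy
    is computed over distinct reference verbs (an assumption, see above).
    """
    gold = set(gold_imperatives)
    predicted = {word for word, t in parse_tagged(tagged_text) if t == tag}
    correct = predicted & gold
    accuracy = len(correct) / len(gold) if gold else 0.0
    return accuracy, correct, gold - correct


# Tokens taken from the tagger output discussed for Figure 3; the imperative
# verb in this sample was tagged VBD, so it counts as a miss.
sample = "اهدنا/VBD المستقيم/DTJJ الذين/WP أنعمت/VBD"
accuracy, correct, missed = imperative_accuracy(sample, ["اهدنا"])
print(f"{accuracy:.2%}", correct, missed)
```

Counting every occurrence instead of every distinct verb would require position-level reference annotations rather than a plain list of verbs.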

The input testing set is the nondiacritized textual form of the Holy Quran. Figure 2 shows what the testing set looks like. The figure contains the first chapter, or Surah, of the Holy Quran (Sūrat al-Fātiḥah, The Opening) in addition to the first three sentences of the second chapter (Sūrat Al-Baqarah, The Cow). Figure 3 shows the output of the Stanford tagger for the Quranic sentences that appear in Figure 2. As can be observed, Figure 3 shows some correctly tagged words, such as {المستقيم/DTJJ, أنعمت/VBD, الذين/WP, يؤمنون/VBP}. The figure also shows some wrongly tagged words, such as {بسم/NNP, إياك/VBD, اهدنا/VBD}.

The tagger output shown in Figure 3 is the main content that can be used for further analysis of the tagger's behavior. However, measuring the accuracy requires the correct labels of the testing words, which adds more difficulty to this kind of research. In other words, measuring the accuracy for the “entire” Holy Quran would require an annotated version of the whole text, which is difficult to prepare. This is why we chose a subset that contains only the imperative verbs.
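One simple form of such further analysis is to count which tags the tagger assigned to the reference imperative verbs. The following Python sketch assumes the tagger output has already been parsed into (word, tag) pairs; the function name and the placeholder words are hypothetical.

```python
from collections import Counter


def tags_assigned(tagged_pairs, gold_words):
    """Count the tags given to every occurrence of a reference word.

    tagged_pairs: iterable of (word, tag) pairs from the tagger output.
    gold_words: the reference list of words of interest (e.g., imperatives).
    """
    gold = set(gold_words)
    return Counter(tag for word, tag in tagged_pairs if word in gold)


# Placeholder data: two reference verbs and one unrelated word.
pairs = [("verb1", "VB"), ("verb1", "NN"), ("verb2", "VBD"), ("other", "NN")]
for tag, count in tags_assigned(pairs, ["verb1", "verb2"]).most_common():
    print(tag, count)
```

Since Arabic words without diacritics are often homographs, matching by surface form alone may also count occurrences in which the word is not actually used as an imperative, so the counts are only indicative.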

7. The Experimental Results

For the evaluation, we used the full Stanford tagger (129 MB), which is freely available from the website of the Stanford natural language processing group through the link [37]. It is relatively simple to execute the tagger by running the command shown in Figure 4 in the Windows Command Prompt. That is, the tagger does not require a special system; we ran it in the Command Prompt of the Windows 10 Home operating system. The figure shows that 77,749 words are tagged in a very short time.
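The exact command is the one shown in Figure 4 and is not reproduced here. As a rough sketch, the tagger's command-line interface can be driven from Python as follows; the directory layout, the model file name arabic.tagger, the memory setting, and the input and output file names are all assumptions that should be checked against the downloaded distribution and its README.

```python
import os
import subprocess
from pathlib import Path

TAGGER_CLASS = "edu.stanford.nlp.tagger.maxent.MaxentTagger"


def build_command(tagger_dir, model, text_file, memory="-mx300m"):
    """Build the java command line for the tagger (paths are assumptions)."""
    root = Path(tagger_dir)
    classpath = os.pathsep.join(
        [str(root / "stanford-postagger.jar"), str(root / "lib" / "*")]
    )
    return [
        "java", memory,
        "-classpath", classpath,
        TAGGER_CLASS,
        "-model", str(root / "models" / model),
        "-textFile", str(text_file),
    ]


def tag_file(tagger_dir, model, text_file, out_file):
    """Run the tagger and write its standard output to `out_file`."""
    with open(out_file, "wb") as out:
        subprocess.run(
            build_command(tagger_dir, model, text_file), stdout=out, check=True
        )


# Only prints the command; calling tag_file(...) requires Java and the tagger files.
print(" ".join(build_command("stanford-postagger-full", "arabic.tagger", "quran.txt")))
```

Passing the arguments as a list avoids shell quoting issues, and using os.pathsep keeps the classpath separator correct on both Windows and Unix-like systems.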

The experimental results are presented in Table 4. The table reports the information regarding the imperative verbs; however, this work can be expanded to measure the performance for other tags, such as noun or verb. Similarly, the same steps can be followed to measure the performance of the Stanford tagger on the prepositions in the Holy Quran, or to obtain the overall accuracy for all tags. Finally, exploring the performance of the Stanford tagger, as well as of other taggers, should reveal further weaknesses to be avoided in future NLP systems.

8. Conclusions

This work explored the performance of the Stanford tagger for the Arabic language. The experimental results show the importance of distinguishing between types of training data when preparing taggers. That is, a tagger prepared for poetry differs from a tagger prepared for prose. Similarly, a tagger used for old text differs from one prepared for MSA, and tweets also differ from MSA. This is the main observation of this study, as the performance of the MSA based tagger declines sharply on the classical text. The study also shows the differences between the tagsets in the literature, which calls for further work toward a standard tagset that thoroughly covers the language. However, preparing a comprehensive tagset requires carefully checking the transition probabilities between all tags, since zero probabilities might cause errors, especially in HMM based taggers. As future work, it might be worthwhile to combine hand-crafted rules with statistical approaches for PoS tagging. It is also important to consider word segmentation before tagging, as many Arabic words combine several tags, such as a preposition and a noun in the word “بالمدرسة ≡ at school”. Finally, the Arabic language is characterized by a sizeable vocabulary as well as an extremely rich morphology, which requires continuous effort toward optimal NLP systems. It is worth pointing to [39, 40], as they offer a thorough discussion of the challenges of Arabic, as well as some recent Arabic NLP contributions such as stemming, corpora, and classifiers.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

The authors would like to thank the Palestine Polytechnic University (PPU) and the Palestine Technical University–Kadoorie for their support to conduct this research.