Abstract

Arabic has many complex grammar rules that can seem complicated to the average user or learner. Automatic grammar checking systems can improve the quality of the text, reduce the costs of the proofreading process, and play a role in grammar teaching. This paper presents an initiative toward developing a novel and comprehensive Arabic auditor that can handle vowelized texts. We call it the “Arabic Grammar Detector” (AGD-أَجِــدْ). AGD was implemented based on a dependency grammar and a decision tree classifier model. Its purpose is to extract patterns of grammatical rules from a projective dependency graph in order to designate the appropriate syntax dependencies of a sentence. The current implementation covers almost all regular Arabic grammar rules for nonvowelized texts as well as partially or fully vowelized texts. AGD was evaluated using the Tashkeela corpus. It can detect more than 94% of grammatical errors and hint at their causes and possible corrections.

1. Introduction

As with all languages, researchers working on the Arabic language and its applications have made strenuous efforts to advance language processing. These efforts have focused on multiple levels, including morphology [1, 2], which studies and characterizes the structure of words; syntax [3, 4], the grammatical arrangement of words; and semantics [5, 6], which determines a text’s exact meaning. Question answering, grammar detection and correction, machine translation, and automatic summarization are applications that depend on processing at these levels. Some of these applications are extremely important and form the basis for many natural language processing (NLP) applications, such as spelling and grammar auditing systems.

There is a lack of research in the field of Arabic grammar detection and correction. This is due to the specific characteristics of the Arabic language, such as its complexity and richness at the semantic level, which can lead to ambiguity and produce erroneous and incomprehensible text. Moreover, the flexible arrangement of words in a sentence, the properties of agglutination, and diacritics complicate Arabic grammar. All these properties lead to a variety of issues at the morphological, syntactic, and semantic levels. Another important reason for the lack of research in this area is the incomplete infrastructure for the Arabic morphological level. This has led to a lack of breakthroughs in research at the higher levels, namely, syntax and semantics. The emphasis is still placed on studies related to the morphological level, such as spell-checking and correct word analysis.

We attempt here to bridge the research gap by implementing a comprehensive supervised learning system for detecting grammatical errors and hinting at their causes and corrections in diacritic and nondiacritic Arabic text. The dependency grammar of the Arabic grammatical rules adopted in this research enables us to parse the Arabic structure graph to infer the correct pattern of grammatical rules for a sentence based on the properties of its words, which are extracted via a morphological analyzer. As far as supervised learning is concerned, part of the task of the decision tree learning algorithm involves learning how to classify the words of a text according to their grammar parsing status (correct or incorrect). This work is considered the first initiative to automate the dependency grammar rules of the Arabic language. As far as we know, AGD is the first Arabic grammatical auditing system that deals with diacritic Arabic text.

The rest of the paper is structured as follows. Section 2 presents the literature review, Section 3 demonstrates dependency grammar, and Section 4 describes grammatical rules adopted in our system. Section 5 features a demonstration of the syntax dependency and classification model used. Section 6 presents an evaluation of the proposed system. Finally, Section 7 concludes the paper with some future directions.

2. Literature Review

There are numerous specialized research studies for auditing language-specific grammar, such as English [7–16], European languages [17–23], and Asian languages [24–29]. Moreover, there are many well-known grammar checking tools, such as Grammarly [30], LanguageTool [31], and ClearEdits [32]. Yet, none of them supports the Arabic language, which is one of the six official languages of the United Nations. To the best of our knowledge, there is no available Arabic grammar checker so far [33].

From a research point of view, only three studies, to our knowledge, have dealt with grammar auditing in Arabic. The first was conducted by Shaalan [34], who worked on the first Arabic grammar checker, called Arabic GramCheck. The study was divided into two parts: a morphological analyzer and a standard bottom-up parser. The author evaluated this system on personal data. Unfortunately, this checker is no longer available.

The second study on Arabic grammar checkers was presented by Moukrim et al. [35]. Apparently, their system uses the Arabic grammar described in the ontology [36] to generate constraints and sentence rules. Their work is based on two hypotheses. First, from the ontology, all possible sentences can be generated, and second, all correct generated sentences can be compared with the original sentence. Some improvements have been made to this work, as demonstrated in [37], using a Stanford parser to segment and annotate the input sentences. Then, they adopted the same two assumptions they mentioned in [35] to detect grammatical errors. The researchers ignored most cases of grammatical errors in the Arabic language and evaluated only four simple types of errors on nonvowelized texts. These errors include the end-mark of the five nouns (الأسماء الخمسة), the syntactic dependency in the adjective, the adverb, and the permutation with its noun. The grammar check was 94.28% accurate on the four types of errors, using only 100 sentences.

Lastly, a recent preliminary work by Madi and Al-Khalifa [38] examines the use of deep learning to detect grammatical errors in Arabic. They used the corpus of the Qatar Arabic Language Bank (QALB) for their training and assessment stages. The results of this study have not been published yet.

3. Dependency Grammar (DG)

One of the tasks of language studies is to identify the proper grammatical syntax for every sentence within a specific formalism and grammar [39]. Thus, formalism implies the correct structural constraints of the language. However, the grammar of the language is made up of two parts. First is the constituency grammar (النحو المكوناتي), which directly parses the sentence in terms of its components with context-free grammar (CFG). Second is the dependency grammar (النحو العلائقي), which treats the sentence according to the syntactic relationships between words [36].

A DG typically falls into the class of grammars that focus on words rather than constituents. Grammars that are mainly built on constituents are known as phrase structure grammars. Thus, phrase structure grammars are constituent-based, whereas DGs are word-based. Unlike phrase structure grammars, which see sentences and clauses as structured in terms of constituents, DGs presume that the structure of sentences and clauses results from the dependencies between words [40, 41]. Graphically, the dependency grammar describes the structure of a sentence as a tree where nodes represent words and edges represent dependencies. In contrast, the terminals and the nonterminals of the context-free grammar in the constituency grammar are represented by the leaf nodes and internal nodes, respectively, as shown in Figure 1.

Natural language dependency representations are flexible and simple; they use directed graphs to encode words and their syntax dependencies. Figure 2 illustrates an example of a dependency graph for the sentence “The boy reads the book in the morning” (الولد الكتاب صباحا يقرأ). Each edge of this graph represents a unique syntax dependency pointing from a word to its modifier. In this depiction, all edges are tagged with the specific syntax function of the dependency, for example, sub for the subject and obj for the object of a verb. To simplify computation and some important definitions, a dummy token is inserted into the sentence as the rightmost word. It will, therefore, always be the root of the dependency graph.
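The representation described above can be sketched as a small data structure. This is a minimal illustration rather than the AGD implementation: the `Token` record, the English glosses, and the labels `root`, `pred`, and `adv` are assumptions introduced here (only `sub` and `obj` are named in the text). Tokens are stored in reading order, so the dummy root, which is the rightmost word of an Arabic sentence, sits at index 0.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    form: str   # surface form (an English gloss in this sketch)
    head: int   # index of the head token; -1 marks the dummy root
    label: str  # syntax function of the edge from the head to this token


# "The boy reads the book in the morning", verb-first as in Arabic.
# Index 0 is the dummy root (hypothetical labels except sub/obj).
sentence = [
    Token("<ROOT>", -1, "root"),
    Token("reads", 0, "pred"),
    Token("the-boy", 1, "sub"),
    Token("the-book", 1, "obj"),
    Token("in-the-morning", 1, "adv"),
]


def edges(tokens: List[Token]) -> List[Tuple[int, int, str]]:
    """Return (head, dependent, label) triples, one per real word."""
    return [(t.head, i, t.label) for i, t in enumerate(tokens) if t.head >= 0]


def dependents(tokens: List[Token], head_index: int) -> List[int]:
    """Return the indices of the words that modify the given head."""
    return [i for i, t in enumerate(tokens) if t.head == head_index]


for head, dep, label in edges(sentence):
    print(f"{sentence[head].form} --{label}--> {sentence[dep].form}")
```

Because every real word has exactly one incoming edge, the number of edges equals the number of words once the dummy root is in place.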

The dependency graph is an example of a nested or projective graph. Assuming that the root of the graph is the rightmost word in the sentence (Arabic is written from right to left), a projective graph is a graph whose edges can be drawn in the plane above the sentence without two edges crossing each other, as shown in Figure 2. Meanwhile, a nonprojective dependency graph cannot fulfil this property [42]. An example of a nonprojective graph for the sentence “The boy is reading the book in the morning”—الولد يقرأ الكتاب صباحا—is shown in Figure 3. Long-distance dependencies, or dependencies in languages with flexible word order, are the reason for nonprojectivity. An important part of sentences in some languages necessitates a nonprojective dependency analysis. Therefore, the ability to learn and derive nonprojective dependency graphs is an essential step in solving certain problems—for example, multilingual processing of languages or permutation cases (حالات التقديم والتأخير) in the Arabic language.
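The no-crossing-edges property can be checked mechanically. The following is a minimal sketch, assuming a dependency graph encoded as a list of head indices in reading order with the dummy root at index 0 (an encoding chosen here for illustration, not taken from the AGD code). The two example graphs are abstract and do not reproduce Figures 2 and 3.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """Return True if no two dependency edges cross.

    heads[i] is the index of the head of token i; -1 marks the dummy root.
    Two edges (l1, r1) and (l2, r2) cross when l1 < l2 < r1 < r2.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads) if h >= 0]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            if l1 < l2 < r1 < r2:
                return False
    return True


# All words attach to the token right after the root: no crossings.
print(is_projective([-1, 0, 1, 1, 1]))  # True

# Token 2 attaches to token 4 while token 3 attaches to token 1:
# the edges (1, 3) and (2, 4) cross.
print(is_projective([-1, 0, 4, 1, 1]))  # False
```

The quadratic pairwise test is sufficient for sentence-length inputs; a linear-time stack-based check would be an alternative for long inputs.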

Some studies [43, 44] assumed another type of grammar, called a link grammar, as a third type of grammatical formalism. As far as we know, the main difference between the link grammar and the dependency grammar is that the former must have an algorithm to demonstrate how to link words, while the latter adopts the dependency principle to link two or more words.

4. Grammatical Rules

This section introduces the grammatical rules and how our system outputs the proper end-mark for each of them. To the best of our knowledge, there is no work that automates the Arabic grammatical rules, except the Arabic grammar ontology built by لمالكي [36], which has many restrictions. Thus, our work seeks to construct the Arabic grammar rules as a dependency grammar, based on the criteria of grammatical rules defined in متن الآجرومية and its explanations by العثيمين [45], which is considered a basic reference for Arabic grammar. These grammatical rules are laid out in a hierarchy, grounded in the most generic and dependent properties. Each rule has its own properties and constraints. Once the properties and constraints between a head (the current word) and its dependencies (the previous words) are fully satisfied, rules lower in the hierarchy can be applied. For example, the head may be a noun, verb, or particle. The noun is then divided into subject and inchoative rules, and so on. The complete hierarchy of our grammatical rules is shown in Figure 4.

In total, the hierarchy contains 30 main grammatical rules, and some of these rules have one or more subrules. At any moment, this backbone of the hierarchy can be extended (by adding subrules) to include several extensions and enhancements intended to facilitate and improve the usage in certain applications.
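One way to picture such an extensible hierarchy is as a tree of rules, each guarded by a constraint, where a rule's subrules are tried only after its own constraint holds. The sketch below is illustrative only: the `Rule` class, the feature names `pos` and `prev_pos`, and the toy constraints (a noun directly after a verb is a subject, a noun at the start of the sentence is an inchoative) are simplifying assumptions, not the 30 rules of Figure 4.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Features = Dict[str, str]  # hypothetical morphological features of a word


@dataclass
class Rule:
    name: str
    applies: Callable[[Features], bool]        # the rule's constraint
    subrules: List["Rule"] = field(default_factory=list)


def most_specific(rule: Rule, feats: Features) -> List[str]:
    """Return the path from this rule to the deepest rule whose constraints hold."""
    if not rule.applies(feats):
        return []
    for sub in rule.subrules:
        path = most_specific(sub, feats)
        if path:
            return [rule.name] + path
    return [rule.name]


# Toy fragment of a hierarchy: head -> noun / verb / particle -> subrules.
hierarchy = Rule("word", lambda f: True, [
    Rule("noun", lambda f: f.get("pos") == "noun", [
        Rule("subject", lambda f: f.get("prev_pos") == "verb"),
        Rule("inchoative", lambda f: f.get("prev_pos") == "root"),
    ]),
    Rule("verb", lambda f: f.get("pos") == "verb"),
    Rule("particle", lambda f: f.get("pos") == "particle"),
])

print(most_specific(hierarchy, {"pos": "noun", "prev_pos": "verb"}))
```

Extending the backbone then amounts to appending a new `Rule` to the `subrules` list of an existing node, without touching the traversal.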

5. Syntax Dependency and Classification Model

In any sentence, the relationship between two or more words (dependencies) and one word (head) is called a syntax dependency. Our technique for generating syntax dependencies is mainly based on patterns of grammatical rules applied to sentence structure graphs with the aim of constructing a decision tree classifier model. Even though the approach is general, it requires relevant rules for each language and treebank representation. The method for verifying the correct grammatical syntax dependencies, in order to check whether the end-mark (العلامة الإعرابية) is correct, consists of three phases: dependency extraction, dependency synthesis, and prediction of the correct end-mark. Figure 5 illustrates these phases.

The dependency extraction phase is straightforward. First, a sentence is analyzed with a morphological analyzer. A morphological analyzer identifies the attributes of a particular word from the chain of morphological processes that the word has undergone, including the units of the word and how these units combine to form the word overall, the function of these units in the word, and the word's syntactic and semantic behaviour in the sentence. In our case, any morphological analyzer could be used, but in practice, we have used MADAMIRA [46], which is trained on the Penn Arabic Treebank (PATB). MADAMIRA provides high-quality word-level disambiguation of Arabic expressions and presents a high-precision statistical analysis of sentence structure. For more information about the morphological step using MADAMIRA, please see [47]. Second, a root is identified to find the right dependencies in the sentence. In this version of the work, a projective dependency graph is adopted, which means that a word and its descendants form a contiguous substring of the sentence. Therefore, we assume that the root of the dependency graph is the word on the far right of the sentence. This is sufficient to represent most of the rules of Arabic grammar, except the odd permutation cases. The Arabic language is very flexible in terms of word order, and the different word orders lead to many permutation cases. The position of the root changes in these cases; it cannot be assumed to be at the far right of the sentence. For example, in Figure 3, the subject (“the boy”, الولد) precedes the verb (“read”, يقرأ), i.e., it appears to the right of the verb in the Arabic sentence, which is valid syntax in the Arabic language. In this case, if we consider the subject as the root of the sentence, it will lead to wrong results.

After this phase is completed, the word-level processing is finished, and the processing moves to the sentence level in the next phases.

In the second phase, dependency synthesis, we associate each of the extracted dependencies with a grammatical rule (listed in the hierarchy in Figure 4), which is designated as accurately as possible to infer the right syntax dependency. The dependencies are flexible enough to accommodate the syntax of any sentence in the Arabic language. Once each grammatical rule is extracted, we define one or more projective dependency graphs based on the sentence structure. Conceptually, each node in the graph is tested for conformance to the properties and constraints of the extracted grammatical rule, and the graph corresponding to the most specific grammatical rules is considered to be the type, or name, of the syntax dependencies.

In the last phase, the classification model is completed by reaching the leaf nodes, which represent the end-mark cases for the specific grammatical dependency rule produced in the synthesis phase. The correct end-mark can then be inferred by comparing the end-mark of the input word with the correct end-mark extracted from the classifier model, and any errors are presented to the user. It is worth noting that the checker recognizes the grammatical errors and informs the user of their exact cause. To support this feature, extra clarification rules were added to each grammatical rule.
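The comparison step of this last phase can be sketched as follows. This is a deliberately reduced illustration, not the AGD classifier: a plain lookup table stands in for the leaves of the decision tree, the rule names are hypothetical, and only three regular cases for singular nouns are covered (a subject takes damma, an object takes fatha, a noun governed by a preposition takes kasra). Words without a final diacritic are skipped in this sketch.

```python
from typing import Optional

# Arabic short-vowel diacritics (Unicode combining marks).
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
MARK_NAMES = {FATHA: "fatha", DAMMA: "damma", KASRA: "kasra"}

# Stand-in for the decision tree leaves: rule name -> expected end-mark.
# Rule names are hypothetical labels chosen for this sketch.
EXPECTED_END_MARK = {
    "subject": DAMMA,
    "object": FATHA,
    "genitive_noun": KASRA,
}


def end_mark(word: str) -> Optional[str]:
    """Return the final diacritic of a word, or None if it has none."""
    return word[-1] if word and word[-1] in MARK_NAMES else None


def audit(word: str, rule: str) -> Optional[str]:
    """Compare the observed end-mark with the expected one.

    Returns None when nothing can be reported, otherwise a short
    explanation of the error and the suggested correction.
    """
    expected = EXPECTED_END_MARK[rule]
    observed = end_mark(word)
    if observed is None or observed == expected:
        return None
    return f"{rule}: expected {MARK_NAMES[expected]}, found {MARK_NAMES[observed]}"


print(audit("الولد" + DAMMA, "subject"))  # correct subject: no message
print(audit("الولد" + FATHA, "subject"))  # wrong end-mark: explanation
```

Returning a message rather than a boolean mirrors the requirement that the user be told the cause of the error and its correction.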

Thus, if we insert a dummy root for the sentence, then each word (node) in the internal nodes of the decision tree will have a dependency relation. Accordingly, the number of syntax dependencies in the representation equals the number of words in the sentence. In grammar auditors, it is necessary to combine the particle (such as prepositions and conjunctions) with the next word to form one word.
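The particle-merging step mentioned above can be sketched as a small preprocessing pass. The function name, the `(form, pos)` tuple layout, and the `particle` tag are assumptions made for illustration; the source does not specify how AGD performs the merge.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (form, part-of-speech tag), hypothetical layout


def merge_particles(tagged: List[Tagged]) -> List[Tagged]:
    """Attach each particle to the word that follows it.

    The merged unit keeps the tag of the following word, so the
    dependency graph has one node per merged unit.
    """
    merged: List[Tagged] = []
    pending: List[str] = []
    for form, pos in tagged:
        if pos == "particle":
            pending.append(form)
        else:
            merged.append((" ".join(pending + [form]), pos))
            pending = []
    if pending:  # a trailing particle has no following word; keep it as-is
        merged.append((" ".join(pending), "particle"))
    return merged


words = [("wrote", "verb"), ("the-boy", "noun"),
         ("with", "particle"), ("the-teacher", "noun")]
print(merge_particles(words))
```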

6. Evaluation

In this section, we discuss the results obtained by AGD. The evaluation consists of two stages: a preliminary evaluation and an evaluation of inserted errors.

The evaluations were performed on 10 essays that contain 752 sentences. The essays were selected from different collections of partially or totally vowelized Arabic texts, taken from the Tashkeela corpus (https://sourceforge.net/projects/tashkeela/). Table 1 shows the size of each essay, including the type it belongs to.

In the evaluation process, we adopted four symbols to simplify the computation of the accuracy, recall, and precision metrics:

Correct Results:

TT: the word is correct (True), and the system says it is correct (True).

FF: the word is incorrect (False), and the system says it is incorrect (False).

Incorrect Results:

TF: the word is correct (True), and the system says it is incorrect (False).

FT: the word is incorrect (False), and the system says it is correct (True).
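The three metrics can be computed from these four counts as sketched below. The sketch assumes that an erroneous word is the positive class, so that FF counts true positives, TF false positives, FT false negatives, and TT true negatives; this reading is an interpretation of the definitions above. The counts in the example are invented for illustration and are not taken from the tables of this paper.

```python
from typing import Dict


def metrics(tt: int, ff: int, tf: int, ft: int) -> Dict[str, float]:
    """Accuracy, precision, and recall with 'error' as the positive class.

    tt: correct word accepted       (true negative)
    ff: incorrect word flagged      (true positive)
    tf: correct word flagged        (false positive)
    ft: incorrect word accepted     (false negative)
    """
    total = tt + ff + tf + ft
    return {
        "accuracy": (tt + ff) / total if total else 0.0,
        "precision": ff / (ff + tf) if (ff + tf) else 0.0,
        "recall": ff / (ff + ft) if (ff + ft) else 0.0,
    }


# Illustrative counts only (not results from this paper).
print(metrics(tt=90, ff=6, tf=2, ft=2))
```

Under this reading, low precision means that many flagged words were in fact correct, while low recall means that many real errors were accepted.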

Additionally, some errors may go undetected by AGD because of incorrect analysis in the preprocessing operation or because of ambiguity in semantics. These errors are outside the framework of AGD because they arise in the preprocessing stage that precedes the grammar auditor. We call these cases limitations. Accordingly, several limitations were set so that AGD could be evaluated in an ideal situation, i.e., assuming that no errors happened in the preprocessing phases. These limitations are as follows:

(A) A subject that is a hidden pronoun (الفاعل ضمير مستتر). Usually, the subject comes directly after the verb (to its left in the Arabic sentence), yet in some cases, the subject is hidden, and the word after the verb is an object. This case can only be detected at the semantic level. For example, in “Mohammad read the book and wrote the homework” (قرأ محمد الكتاب، وكتب الواجب), the subject of the first clause is Mohammad, while the subject of the second is a hidden pronoun.

(B) An indefinite added to an indefinite (المضاف إلى نكرة). The structures of the Arabic language allow an indefinite word to be added to a definite or an indefinite word. AGD can distinguish the addition to a definite word (e.g., “the grammar book”, كتاب النحو), but not to an indefinite word (e.g., “a grammar book”, كتاب نحو).

(C) Errors in morphological analysis. Morphological analysis errors are common, and they affect the correctness of the grammatical structure, for example, when some words are not analyzed, or when a noun is analyzed as a verb or vice versa.

(D) An odd position of the subject, i.e., the object precedes the subject (e.g., “the book read by Mohammad”, قرأ الكتاب محمد) or the subject precedes the verb (e.g., “Mohammad read the book”, محمد قرأ الكتاب).

Table 2 displays the limitations that were discovered in the texts of tested essays. We should mention that determining the syntax dependencies of Arabic grammatical rules correctly depends on accurate morphological analysis and determination of the semantics of the sentence. In AGD, the semantic understanding (limitations A, B, and D) and the incorrect analysis (limitation C) represent 82.86% and 17.14% of total limitations, respectively.

Before presenting the results of our work in detail, some points must be highlighted, namely:

(1) The evaluated texts were free of spelling errors because they affect the morphological process accuracy, which is a step that precedes the grammar auditing process.

(2) We cannot compare our work with others because none have used a well-known corpus, and none is available for testing and comparing.

(3) The Tashkeela corpus contains more than 75 million words. Our evaluation aims to measure the effectiveness of our approach. Accordingly, random essays were selected from the corpus to fulfil that aim. The corpus includes (totally or partially) vowelized text, as well as nonvowelized texts.

6.1. Stage 1: Preliminary Evaluation

This stage evaluates AGD in detecting and correcting errors that are not known before the auditing process. For comparison, a specialized human auditor was hired to audit the same samples. The results of the human auditor and of AGD were then examined by a professor in the field of Arabic grammar, who judged the cases in which the two differed. This step was taken because of the divergence and complexity of the Arabic language. The results of the human audit and the AGD audit, along with the number of words that were grammatically audited (GA) correctly in each essay, are shown in Tables 3 and 4.

From Tables 3 and 4, most grammatical errors appeared when a dependent was far from the word it relates to, for example, in “The boy wrote the math homework with the teacher in the morning, science and language” (كتب الولد واجب الرياضيات مع المعلم صباحا والعلوم واللغة). In this sentence, science and language are related to math, and they should take its end-mark. The second most common error is identifying a word as an adverb object rather than a general noun; for example, in “December 18, International Day of the Arabic Language” (18ديسمبر, اليوم العالمي للغة العربية), the day is a noun, not an adverb object.

The confusion matrices in Tables 5 and 6 show that 43 words are grammatically audited incorrectly by AGD in its ideal situation, compared with 32 words by human audit. Most of the human audit errors resulted from incorrect discrimination of the end-mark, as well as inaccurate identification of the accusative and jussive cases. The precision and recall in Table 7 show that the AGD has a high recall but low precision, which means it detects most of the errors out of total errors, but some of its predicted errors are incorrect. In contrast, the human audit detects half of the errors out of the total errors, but most of the predicted errors are correct.

6.2. Stage 2: Evaluation of Inserted Errors

This stage evaluates AGD in discovering errors inserted into fully correct essays and in explaining their causes. We measure the performance of AGD in terms of precision and recall. Admittedly, there is no available corpus of grammatical errors in the Arabic language. Accordingly, we manually inserted many grammatical errors into the evaluated texts (e.g., يقرأ الولدُ القرآنَ is changed to يقرأ الولدَ القرآنُ). The errors were embedded by persons who were not members of the AGD programming team. The results of the AGD audit, along with the number of inserted errors in each essay, are shown in Table 8.
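The kind of error shown in the example above, where the end-marks of the subject and the object are exchanged, can be reproduced programmatically as a sketch. In this study the errors were inserted manually; the helper `replace_end_mark` below is a hypothetical illustration of the same transformation and handles only the three short vowels.

```python
# Arabic short-vowel diacritics (Unicode combining marks).
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
SHORT_VOWELS = {FATHA, DAMMA, KASRA}


def replace_end_mark(word: str, new_mark: str) -> str:
    """Replace the final short vowel of a word; leave other words unchanged."""
    if word and word[-1] in SHORT_VOWELS:
        return word[:-1] + new_mark
    return word


# "The boy reads the Quran": subject with damma, object with fatha.
correct = ["يقرأ", "الولد" + DAMMA, "القرآن" + FATHA]

# Inserted error: the two end-marks are exchanged.
corrupted = [
    correct[0],
    replace_end_mark(correct[1], FATHA),
    replace_end_mark(correct[2], DAMMA),
]
print(" ".join(corrupted))
```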

As shown in Tables 9 and 10, only 13 out of 220 errors were not GA correctly by AGD in its ideal situation. In other words, the failure rate of AGD in auditing the grammar of diacritic Arabic texts is only 5.91%. Most of these errors involved the precise identification of the adjective-noun following the descriptor, or the sources of verbs (الاسم المصدر الذي يعامل معاملة الفعل) that take the place of verbs and treat the noun after them according to the properties of the verb, not the noun.
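As a worked check of the arithmetic, the reported percentage follows directly from the two counts given above. Reading its complement as a detection rate is an interpretation, though it is consistent with the figure of more than 94% quoted in the abstract.

```latex
\text{failure rate} = \frac{13}{220} \approx 5.91\%,
\qquad
\text{detection rate} = \frac{220 - 13}{220} = \frac{207}{220} \approx 94.09\%
```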

7. Conclusions

An Arabic grammar auditor is a complex system that requires extensive linguistic resources and research, as well as the help of specialists in the field. An in-depth grammar auditor that adopts a new approach is presented in this paper. It aims to detect grammatical errors and hint at their causes and correction in both vowelized and nonvowelized Arabic texts. This approach is based on the grammar of the projective dependency graph, whose rules are extracted from the hierarchy of grammatical rules that we have constructed. In brief, a synthesis of the dependencies is performed to designate the most specific pattern of grammatical rules, depending on the sentence words’ features, to infer the proper end-mark based on the results of the decision tree classifier model. We have achieved promising results using this approach. In its current stage, AGD provides accurate results compared with human auditing. The appendix gives a sample demonstration of AGD.

The future work of our research includes three aspects. The first aspect is detecting errors related to agreements in the Arabic language, such as the agreement between the verb and the subject’s gender and agreement between the inchoative and the predicate in number and gender. This can be done by extending the backbone of the hierarchy of Arabic grammatical rules adopted in our system. The second aspect is expanding the scope of AGD evaluation by uploading it on the web to be available for public use and benefiting from users’ feedback to enhance the performance of the AGD system. The third aspect is enhancing the AGD by involving the nonprojective dependency graph to deal with the Arabic odd permutation cases.

Appendix

Figure 6 shows a screenshot of the AGD demo. It shows the Arabic grammatical audit of the vowelized input text. The red text color code indicates grammar errors. The other tab panels show the morphological analysis results. Hovering over a word (as is done here on الأطفال) displays a box with an explanation and correction of the error for that word.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors would like to thank the Deanship of Scientific Research at King Saud University for funding and supporting this research through the initiative of DSR Graduate Students Research Support (GSR). The authors would also like to thank Abdullah Alansary, professor in the field of Arabic grammar at Imam Mohammad bin Saud University, for directing the research and helping in the audit process.