Abstract

Cross-lingual plagiarism occurs when the source (or original) text is in one language and the plagiarized text is in another. In recent years, cross-lingual plagiarism detection has attracted the attention of the research community because a large amount of digital text is easily accessible in many languages through online digital repositories, and machine translation systems are readily available, making cross-lingual plagiarism easier to perform and harder to detect. To develop and evaluate cross-lingual plagiarism detection systems, standard evaluation resources are needed. The majority of earlier studies have developed cross-lingual plagiarism corpora for English and other European language pairs. However, for the Urdu-English language pair, the problem of cross-lingual plagiarism detection has not been thoroughly explored, although a large amount of digital text is readily available in Urdu and it is spoken in many countries of the world (particularly in Pakistan, India, and Bangladesh). To fill this gap, this paper presents a large benchmark cross-lingual corpus for the Urdu-English language pair. The proposed corpus contains 2,395 source-suspicious document pairs (540 are automatic translation, 539 are artificially paraphrased, 508 are manually paraphrased, and 808 are nonplagiarized). Furthermore, our proposed corpus contains three types of cross-lingual examples, namely artificial (automatic translation and artificially paraphrased), simulated (manually paraphrased), and real (nonplagiarized), a combination that has not been previously reported in the development of cross-lingual corpora. Detailed analysis of our proposed corpus was carried out using n-gram overlap and longest common subsequence approaches. Using word unigrams, mean similarity scores of 1.00, 0.68, 0.52, and 0.22 were obtained for automatic translation, artificially paraphrased, manually paraphrased, and nonplagiarized documents, respectively. These results show that documents in the proposed corpus are created using different obfuscation techniques, which makes the dataset more realistic and challenging. We believe that the corpus developed in this study will help to foster research in Urdu, an underresourced language, and will be useful in the development, comparison, and evaluation of cross-lingual plagiarism detection systems for the Urdu-English language pair. Our proposed corpus is free and publicly available for research purposes.

1. Introduction

In cross-lingual plagiarism, a piece of text in one (source) language is translated into another (target) language without changing its semantics or content and without citing the origin [1, 2]. Cross-lingual plagiarism detection is a challenging research problem for several reasons. Firstly, free online machine translation systems such as Google Translator (https://translate.google.com/) can translate a document written in one language into another. Secondly, the Web has become a hub of multilingual resources. For example, Wikipedia contains articles in more than 200 languages on the same topics (http://en.wikipedia.org/wiki/wikipedia, last visited: 10-02-2019). Thirdly, people often write in a language other than their native one. Together, these factors create an environment in which cross-lingual plagiarism is easy to commit and difficult to detect.

The task of plagiarism detection can be broadly divided into two categories [3]: (1) intrinsic plagiarism analysis and (2) extrinsic plagiarism analysis. In the former case, a single document is examined to identify plagiarism in terms of variation in the author's writing style. A fragment of text that is significantly different from the other fragments in the document is taken as an indicator of plagiarism. Mostly stylometric features are modeled to detect such plagiarism. In the latter case, we are provided with a document which is suspected to contain plagiarism (the suspicious document) and a source collection. The aim is to identify the fragments of text in the suspicious document which are plagiarized, together with their corresponding source fragments from the source collection. Extrinsic plagiarism can be further divided into (1) monolingual plagiarism, where both source and plagiarized texts are in the same language, and (2) cross-lingual plagiarism, where source and plagiarized texts are in different languages. In the case of cross-lingual plagiarism, a source text can be translated either automatically or manually, and after translation, it can be either used verbatim or rewritten [4].

To develop and evaluate Cross-Lingual Plagiarism Detection (CLPD) methods, standard evaluation resources are needed. The majority of CLPD corpora have been developed for English, European, and some other languages (http://www.webis.de/research/corpora, last visited: 10-02-2019). In addition, none of the existing cross-lingual corpora contains a mix of artificial, simulated, and real examples, which is necessary to make a corpus realistic and challenging. The problem of CLPD has not been thoroughly explored for South Asian languages such as Urdu, which is widely spoken around the globe. Urdu is the first language of about 175 million people and is particularly spoken in Pakistan, India, Bangladesh, South Africa, and Nepal (http://www.ethnologue.com/language/urd, last visited: 20-02-2019). It is written from right to left, like the Arabic script, and usually follows the Nastalique writing style [5]. However, Urdu is an underresourced language in terms of computational and evaluation resources.

The main objectives of this study are threefold: (1) to develop a large benchmark cross-lingual corpus for the Urdu-English language pair, which contains a mix of artificial, simulated, and real examples, (2) to carry out linguistic analysis of the proposed corpus to get insights into the edit operations used in cross-lingual plagiarism, and (3) to carry out detailed empirical analysis of the proposed corpus using n-gram overlap and longest common subsequence approaches to investigate whether the documents in the corpus are created using different obfuscation techniques. There are a total of 2,398 source-suspicious document pairs in our proposed corpus. Source documents are in Urdu, and suspicious ones are in English. The source-suspicious document pairs are divided into two main categories: (1) plagiarized (1,588 document pairs) and (2) nonplagiarized (810 document pairs). The plagiarized documents are created using three obfuscation strategies: (1) automatic translation (540 document pairs), (2) artificial paraphrasing (540 document pairs), and (3) manual paraphrasing (508 document pairs). The documents in our proposed corpus are from various domains including Computer Science, Management Science, Electrical Engineering, Physics, Psychology, Countries, Pakistan Studies, General Topics, Zoology, and Biology, which makes the corpus more realistic and challenging. We also carried out linguistic and empirical analysis of our proposed corpus.

Our proposed corpus will be beneficial for (1) fostering and promoting research in a low-resourced language, Urdu, (2) enabling a direct comparison of existing and new CLPD methods for the Urdu-English language pair, (3) developing and evaluating new CLPD methods for the Urdu-English language pair, and (4) developing a bilingual Urdu-English dictionary. Furthermore, our proposed corpus is free and publicly available for research purposes.

The rest of this paper is organized as follows: Section 2 summarizes the related work on existing corpora for CLPD. Section 3 describes the corpus generation process, including source documents collection, levels of rewriting, creation of suspicious documents, and standardization of the corpus. Section 4 presents the linguistic analysis of our proposed corpus. Section 5 presents a deeper empirical analysis of the corpus. Finally, Section 6 concludes the paper.

2. Related Work

In the literature, efforts have been made to develop benchmark corpora for CLPD. One of the prominent efforts is the series of PAN competitions (http://pan.webis.de/, last visited: 20-02-2019), a forum of scientific events and shared tasks on digital text forensics. A number of frameworks for cross-lingual plagiarism evaluation have also been proposed by researchers for this forum [6, 7]. The main outcome of these competitions is a set of benchmark corpora for mono- and cross-lingual plagiarism detection. The majority of plagiarism cases in these corpora are monolingual (90%), and the remaining 10% are cross-lingual, such as English-Persian, English-Arabic, and other language pairs. Almost 80% of the cross-lingual plagiarism cases in these corpora are generated using automatic translation, and the rest are generated using manual translation. PAN cross-lingual corpora have been developed for two language pairs: English-Spanish and English-German.

The relevant literature presents a number of benchmark CLPD corpora for language pairs such as Indonesian-English [8], Arabic-English [9], Persian-English [10], and English-Hindi [11]. Developing such resources, especially for underresourced languages, is an active research area [12, 13]. Parallel corpora have also been developed and used in [14] for automatic translation in the cross-lingual domain. CLPD systems based on these corpora and other approaches have also been proposed in the literature [15]. Most of these approaches used syntax-based plagiarism detection methods, although semantic-based plagiarism detection approaches have also been applied. Savador et al. used a semantic plagiarism detection approach based on graph analysis for cross-language plagiarism detection. It is a language-independent model for plagiarism detection applied to the Spanish-English and German-English domains [16].

The Cross Language Indian Text Reuse (CLITR) task was designed in conjunction with the Forum for Information Retrieval Evaluation (FIRE) to detect cross-lingual plagiarism for the English-Hindi language pair. The corpus is divided into training and test segments in which source documents are in English and suspicious documents are in Hindi. Both the training and test collections include 5,032 source files in English, while the suspicious files in Hindi number 198 in the training collection and 190 in the test collection (http://www.uni-weimar.de/medien/webis/events/panfire-11/panfire11-web/, last visited:20-08-2018). Corpora have also been developed for performance evaluation of cross-language information retrieval (CLIR) systems [17], while Kishida [18] raised technical issues of this domain. Moreover, different plagiarism detection tasks such as text alignment and source retrieval have been designed based on these corpora, and overviews of these tasks are published yearly by the PAN@CLEF forum [19, 20].

The JRC-Acquis Multilingual Parallel Corpus has been used by Potthast et al. to apply CLPD approaches. The corpus contains as many as 23,564 parallel documents extracted from legal documents of the European Union [21, 22]. Out of the 22 languages in the legal document collection, only 5 (French, German, Polish, Dutch, and Spanish) were selected to generate source-suspicious document pairs, with English used as the source language. The Comparable Wikipedia Corpus is another dataset used for the evaluation of CLPD methods. The corpus contains 45,984 documents.

Benchmark cross-lingual corpora have been developed using two approaches: (1) automatic translation and (2) manual translation. PAN corpora are created using both approaches for English-Spanish and English-German language pairs. However, the majority of cross-lingual cases are generated using automatic translation, and only a small number of them are generated using manual translation.

CLITR Corpus is generated using both automatic and manual translations: Near copy/exact copy documents are created using automatic translation, whereas heavy revision (HR) documents are created using manual paraphrasing of automatic translations of source texts. Again, this corpus only contains 388 suspicious documents, and it is created for English-Hindi language pair.

Two cross-lingual corpora used in the plagiarism detection task are (1) the JRC-EU Corpus and (2) the Fairy Tale Corpus [21, 22]. The JRC-EU cross-lingual corpus consists of 400 documents randomly extracted from the legislation reports of the European Union, comprising 200 English source documents and 200 Czech documents. The Fairy Tale Corpus contains 54 documents: 27 in English and 27 in Czech. Ceska et al. also used these corpora for the CLPD task [23].

In a previous study, we developed a corpus for the PAN 2015 Text Alignment task, which we named the CLUE Corpus [24]. That corpus contains a total of 1,000 documents (500 source documents and 500 suspicious documents). Among the suspicious documents, 270 are plagiarized using 90 source-plagiarized fragment pairs, while the remaining 230 are nonplagiarized. Note that this corpus contains simulated cases of plagiarism, which were inserted into suspicious documents to generate plagiarized documents. The CLUE Corpus can be used for the development and evaluation of CLPD systems for the English-Urdu language pair, but only for the text alignment task as described by the PAN organizers.

To conclude, the relevant literature presents the majority of CLPD corpora for English and other European languages. Moreover, these are mainly created using comparable documents, parallel documents, and automatic translations, which are not realistic examples for cross-lingual plagiarism. This study contributes a large benchmark corpus (containing 2,398 source-suspicious document pairs) for CLPD in Urdu-English language domain. Note that the 270 fragment pairs used in the development of CLUE Corpus are also included in this corpus.

3. Corpus Generation

This section describes the construction of a benchmark corpus for CLPD for the Urdu-English language pair (hereafter called the CLPD-UE-19 Corpus). It covers the collection of source texts, the levels of rewrite used in creating suspicious documents, the creation of suspicious documents, the standardization of the corpus, and the corpus characteristics.

3.1. Collection of Source Texts

Urdu is an underresourced language, as large repositories of digital text in this language are not readily available for research purposes. Urdu newspapers in Pakistan mostly publish news stories in image format, which is not suitable for text processing. Therefore, to collect realistic, high-quality, and diversified source articles for generating the CLPD-UE-19 Corpus, we selected Wikipedia as a source. Wikipedia is a free, publicly available, multitopic, and multilingual resource. Moreover, Wikipedia often contains an article on the same topic in multiple languages, which allows it to be treated as a comparable corpus. A. J. Head investigated the potential use of Wikipedia for course-related search by students [25]. Martinez also investigated cases where Wikipedia is mainly used for copy-and-paste plagiarism [26]. Wikipedia articles have also been taken as source documents for generating a cross-lingual plagiarism detection corpus for the Hindi-English language pair [27].

Plagiarism is a serious problem, particularly in higher educational institutions [28–31]. Therefore, the CLPD-UE-19 Corpus focuses on plagiarism cases generated by university students. Table 1 shows the domains from which Wikipedia (http://ur.wikipedia.org/wiki/urdu) source articles are collected to generate the CLPD-UE-19 Corpus. In addition, the 270 source-suspicious document pairs that were used in the creation of the CLUE Corpus [24] are included.

These domains include Computer Science, Management Science, Electrical Engineering, Physics, Psychology, Countries, Pakistan Studies, General Topics, Zoology, and Biology. As can be noted, these articles are on a wide range of topics, which makes the CLPD-UE-19 Corpus more realistic and challenging.

The amount of text reused for creating a plagiarized document can vary from a phrase, sentence, and paragraph to the entire document. It is also likely that to hide plagiarism, a plagiarist may reuse the texts of different sizes from different sources. Therefore, the size of source documents is varied. The length of a source text may fall into one of the three categories: (1) small (1–50 words), (2) medium (50–100 words), and (3) large (100–200 words).
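The three size bands above can be expressed as a small helper. This is a minimal sketch rather than the procedure actually used to build the corpus: the function name `size_category` is illustrative, words are counted by whitespace splitting, and because the stated ranges share their endpoints (50 and 100), the sketch assumes a boundary value falls into the smaller band.

```python
def size_category(text: str) -> str:
    """Assign a source text to a size band by its whitespace-delimited word count.

    Assumption: boundary counts (50, 100) are placed in the smaller band,
    since the ranges given in the text overlap at their endpoints.
    """
    n_words = len(text.split())
    if n_words == 0:
        raise ValueError("empty text has no size category")
    if n_words <= 50:
        return "small"
    if n_words <= 100:
        return "medium"
    if n_words <= 200:
        return "large"
    # Longer texts fall outside the three bands described above.
    return "out_of_range"
```

Counting by whitespace is only an approximation for Urdu, where word segmentation does not always coincide with spaces, so a proper tokenizer may be preferable in practice.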

3.2. Levels of Rewrite

The proposed corpus contains two types of suspicious documents: (1) plagiarized and (2) nonplagiarized. The details of these are as follows.

3.2.1. Plagiarized Documents

A plagiarized document in CLPD-UE-19 Corpus falls into one of the three categories: (1) automatic translation, (2) artificially paraphrased copy, and (3) manually paraphrased copy. The reason for creating plagiarized documents with three different levels of rewrite is that a plagiarist is likely to use one of the three abovementioned approaches for creating a plagiarized document using existing document(s) for cross-lingual settings.

(a) Automatic Translation. Using this approach, plagiarized documents (in English) are created by automatically translating the source texts (in Urdu) using Google Translator (https://translate.google.com/, last visited: 20-02-2019). Note that Google Translator has been effectively used in earlier research studies [32, 33].

(b) Artificially Paraphrased Copy. This approach creates artificially paraphrased cases of cross-lingual plagiarism in two steps. In the first step, a source text (in Urdu) is translated automatically into English using Google Translator. In the second step, an automatic text rewriting tool is used to paraphrase the translated text, which results in an artificially paraphrased copy of the original text. For this study, we explored various free and publicly available text rewriting tools. Among the available tools, we found that two have the highest number of visitors per day as reported by Alexa, compared to other tools such as http://paraphrasing-tool.com/: (1) the Spinbot text rewriting tool (http://www.spinbot.net/), with an average of 26 k visitors per day, and (2) the Article Rewriter text rewriting tool (http://articlerewritertool.com/), with an average of 45 k visitors per day. Alexa is a ranking system run by alexa.com (a subsidiary of amazon.com) that audits and publishes the frequency of visits to various websites.

(c) Manually Paraphrased Copy. Using this approach, the plagiarized documents were created by manually translating and paraphrasing the original texts.

3.2.2. Nonplagiarized

Wikipedia is a comparable corpus and contains an article in multiple languages. It is notable that these articles are not translations of each other. To generate nonplagiarized cases, similar fragments of texts were manually identified from English and Urdu Wikipedia articles on the same topic.

The assumption is that although English and Urdu Wikipedia articles are written on the same topic, they are independently written by two different authors. Therefore, similar fragments of English-Urdu texts can serve as independently written cross-lingual document pairs.

As far as we are aware, the methods used here for creating artificially paraphrased and nonplagiarized cases have not previously been used for creating cross-lingual plagiarism cases in any other language pair.

3.3. Generation of Suspicious Texts

Crowdsourcing is the process of performing a task through the collaboration of a large number of people, usually working remotely. It can be done with a group of people, small teams, or even individuals. Generating a large benchmark CLPD corpus is not a trivial task. Therefore, we used the crowdsourcing approach to generate suspicious texts with four levels of rewriting. Examples of manually paraphrased copy and nonplagiarized documents were generated by participants (volunteers), who are graduate-level university students (masters and M Phil). All the participants are native speakers of Urdu. As the medium of instruction in universities and colleges is English, the students also have a high level of proficiency in English.

The majority of the participants are from the English department and hence are well aware of paraphrasing techniques. Nevertheless, to improve quality, they were provided with examples of paraphrasing. The plagiarized documents generated by volunteers were manually examined, and low-quality documents were discarded.

3.4. Examples of Cross-Lingual Plagiarism Cases from CLPD-UE-19 Corpus

Figure 1 presents an example of a source-plagiarized document pair from the CLPD-UE-19 Corpus created using the automatic translation approach. As can be noted, the translated text is not an exact copy of the original one. A possible reason is that Urdu is an underresourced language, and machine translation systems for the Urdu-English language pair are not as mature as those for other language pairs. Consequently, the translated text is a near copy of the original text rather than an exact copy. Moreover, it can be observed from the translated document that, for the few words for which Google Translator does not find an equivalent English word, it merely transliterates the word into Latin characters; for instance, تمد ن is replaced with tmd and مثلأئ is replaced with Msly. Overall, the quality of Google Translator's output seems to be good considering the complexity of translating Urdu text into English.

Figure 2 shows an example of a plagiarized document in which the automatic translation of a source document is further altered by an automatic rewriting tool to obtain an artificially paraphrased copy of the source document. It can be observed from this example that the automatic text rewriting tool has replaced words with appropriate synonyms (the words presented in italics are synonyms of the original words). However, the text rewriting tool does not alter the order of the text. The alteration carried out by the rewriting tool further increases the level of rewriting and makes it more difficult to identify similarity between source-plagiarized text pairs.

A sample plagiarized document generated using the manually paraphrased copy approach is shown in Figure 3; the content is very well paraphrased. The participants applied different text rewriting operations to paraphrase the original text, including synonym replacement, sentence merging/splitting, insertion/deletion of text, and word reordering. Consequently, the source-plagiarized text pairs are semantically similar but different at the surface level, which makes the CLPD task even more challenging.

A nonplagiarized source-suspicious document pair from the CLPD-UE-19 Corpus is shown in Figure 4. The text is topically related, but independently written. The inclusion of more introductory sentences and last sentence reflects that both texts are written in different contexts.

3.5. Corpus Characteristics

Table 2 presents the detailed statistics of the proposed corpus. In this table, AT, APC, MPC, and NP represent automatic translation, artificially paraphrased copy, manually paraphrased copy, and nonplagiarized, respectively. There are a total of 2,398 source-suspicious document pairs in the corpus, of which 810 are nonplagiarized and 1,588 are plagiarized. Among the plagiarized document pairs, 540 are automatically translated, 540 are artificially paraphrased, and 508 are manually paraphrased. These statistics show that the corpus contains a large number of documents for both plagiarized and nonplagiarized cases. Also, the documents for the four different levels of rewrite in the proposed corpus are almost balanced. The CLPD-UE-19 Corpus is standardized in XML format and publicly available for research purposes (the CLPD-UE-19 Corpus is distributed under the terms of the Creative Commons Attribution 4.0 International License and can be downloaded from the following link: https://www.dropbox.com/sh/p9e00rxjj9r7cbk/AACj3gtVEy5T74rfP58_BtP6a?dl=0).
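The counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated in this section; the variable names are illustrative.

```python
# Document-pair counts per level of rewrite, as stated in this section.
pairs = {"AT": 540, "APC": 540, "MPC": 508, "NP": 810}

# Plagiarized pairs are every category except NP (nonplagiarized).
plagiarized = sum(count for label, count in pairs.items() if label != "NP")
total = sum(pairs.values())

# Share of each category in the whole corpus, in percent (one decimal place).
shares = {label: round(100 * count / total, 1) for label, count in pairs.items()}

print(plagiarized, total)  # 1588 2398
print(shares)              # {'AT': 22.5, 'APC': 22.5, 'MPC': 21.2, 'NP': 33.8}
```

The three plagiarized categories each account for roughly 21–23% of the pairs, while the nonplagiarized category makes up about a third of the corpus.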

4. Linguistic Analysis of CLPD-UE-19 Corpus

This section presents the linguistic analysis of the CLPD-UE-19 Corpus. As reported in [34, 35], various edit operations are performed on the source text to create plagiarized text, particularly when the source text is reused for paraphrased plagiarism. Below we discuss the various edit operations which we observed while carrying out linguistic analysis on a subset of the CLPD-UE-19 Corpus (Figures 5–9). Note that we used 50 source-suspicious document pairs for the linguistic analysis presented in this section.

4.1. Replacing Pronoun with Noun

In this edit operation, a pronoun in the source document is replaced by the actual name in the suspicious document, or vice versa.

4.2. Order Change with Add/Delete Words

This is also a common edit operation. In this approach, a later part of the source text is quoted first in the suspicious text, or vice versa, with words added or deleted as needed.

4.3. Continuing Sentences: Adding Words

Combining two sentences by means of an additional connecting word is the most frequently used approach in rewriting text.

4.4. Date Completed

This is another approach, in which an event in the source text is rewritten in the suspicious document in the context of the event's date and place.

4.5. Summary

In this category, a condensed description in the suspicious document is used in place of long narrations in the source document.

The corpus also contains a number of examples of order changes and of conversions between active and passive voice and between direct and indirect speech. Such edit operations change the source text so that the result is no longer a verbatim copy, which makes these cases difficult for plagiarism detection.

5. Translation + Monolingual Analysis of CLPD-UE-19 Corpus

For convenience, this section is divided into three subsections: the first describes the experimental setup, and the next two present a detailed and comprehensive analysis of the corpus.

5.1. Experimental Setup

To analyze the quality of artificially and manually paraphrased levels of rewritten cases, we applied translation + monolingual analysis approach on our proposed corpus. Using this approach, we automatically translated source documents (in Urdu) into English using Google Translator. Now, both source and suspicious documents are in the same language, i.e., English. After that, we computed mean similarity scores for source-suspicious document pairs for all four categories (automatic translation copy, artificially paraphrased copy, manually paraphrased copy, and nonplagiarized) using n-gram overlap and longest common subsequence approaches.
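The translation + monolingual analysis described above can be outlined as follows. This is a hedged sketch, not the authors' implementation: `translate` stands in for whatever Urdu-to-English translation step is used (Google Translator in this study) and is passed in as a hypothetical callable, `similarity` is any monolingual measure returning a score in [0, 1], and the `(category, source, suspicious)` tuple layout is an assumption made for illustration.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, List, Tuple

# Assumed layout: (category label, Urdu source text, English suspicious text).
Pair = Tuple[str, str, str]


def mean_similarity_by_category(
    pairs: Iterable[Pair],
    translate: Callable[[str], str],
    similarity: Callable[[str, str], float],
) -> Dict[str, float]:
    """Translate each source text into English, score it against its
    suspicious text, and average the scores within each category."""
    scores: Dict[str, List[float]] = {}
    for category, source_ur, suspicious_en in pairs:
        # After this step both documents are in the same language.
        source_en = translate(source_ur)
        scores.setdefault(category, []).append(similarity(source_en, suspicious_en))
    return {category: mean(values) for category, values in scores.items()}
```

Passing the translator and the similarity measure as parameters keeps the outline independent of any particular translation service, and lets the same loop be reused for n-gram overlap and for longest common subsequence.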

To compute similarity scores between source-suspicious document pairs, we applied containment similarity measure [36] (equation (1)). Using the n-gram overlap approach, similarity score between source-suspicious document pair is computed by counting common n-grams between two documents divided by the number of n-grams in both or any one of the documents. If S(X, n) and S(Y, n) represent word n-grams of length n in source and suspicious document, respectively, then similarity between them using containment similarity measure is computed as follows:
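As an illustration of the measure just described, one common form of containment is c_n(X, Y) = |S(X, n) ∩ S(Y, n)| / |S(Y, n)|. The description above leaves the choice of denominator open, so normalizing by the suspicious document's n-gram set is an assumption of this sketch, as are the lowercasing and whitespace tokenization and the function names `word_ngrams` and `containment`.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of length n (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(source: str, suspicious: str, n: int) -> float:
    """Shared n-grams divided by the n-grams of the suspicious document.

    Assumption: the suspicious document's n-gram set is the denominator.
    """
    s_x = word_ngrams(source, n)
    s_y = word_ngrams(suspicious, n)
    if not s_y:
        # The suspicious text is shorter than n words, so nothing can be contained.
        return 0.0
    return len(s_x & s_y) / len(s_y)


source = "the quick brown fox jumps"
suspicious = "the quick red fox jumps"
print(containment(source, suspicious, 1))  # 0.8
print(containment(source, suspicious, 2))  # 0.5
print(containment(source, suspicious, 3))  # 0.0
```

In this toy pair a single substituted word leaves the unigram score high but removes every shared trigram, which is the same pattern of scores falling as n grows that is reported for the corpus.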

We used another simple and popular similarity estimation model, longest common subsequence (LCS), to compute the mean similarity scores for the four levels of rewrite in the CLPD-UE-19 Corpus. Using the LCS approach, for a given pair of source-suspicious texts (X and Y), we first computed the LCS between the source and suspicious strings and then divided the LCS length by the length of the smaller document to get a normalized score between 0 and 1 (equation (2)). Note that the LCS method is order-preserving, and the LCS score is affected by the edit operations performed on the source text to generate the plagiarized text:

$$\mathrm{LCS}_{\mathrm{norm}}(X, Y) = \frac{|\mathrm{LCS}(X, Y)|}{\min(|X|, |Y|)} \qquad (2)$$
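The normalized LCS score described above can be sketched with the standard dynamic-programming recurrence. Whether the LCS is computed over words or characters is not stated here; this sketch assumes word tokens obtained by lowercasing and whitespace splitting, and the function names are illustrative.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, keeping one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_norm(source: str, suspicious: str) -> float:
    """LCS length divided by the length of the shorter document (in words)."""
    x, y = source.lower().split(), suspicious.lower().split()
    shorter = min(len(x), len(y))
    if shorter == 0:
        return 0.0
    return lcs_length(x, y) / shorter


print(lcs_norm("the quick brown fox", "the brown quick fox"))  # 0.75
```

The example shows the order sensitivity noted above: both texts contain exactly the same four words, so a unigram comparison would see them as identical, but swapping two words lowers the normalized LCS to 0.75.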

5.2. Partial (Domainwise) Analysis

This analysis provides an opportunity for a microlevel, size-oriented examination of individual domains, since size is one of the dimensions of the rewritten cases. For this purpose, a few sample documents from different domains were randomly selected. The automatic translation copy (ATC) of a source document was compared with the artificially and manually paraphrased versions of the same document. Bigram, trigram, and tetragram splits were applied to identify word-level to sentence-level similarity between the different levels of rewritten text. The empirical analysis was carried out for documents from all domains, but results for only three domains, covering all document sizes, are listed here. Almost all results show that n-gram similarity between the two levels of rewrite decreases gradually as the value of n increases.

5.2.1. Discussion

It is observed that the overall average word n-gram similarity for small-sized manually paraphrased copies is lower than that for large- and medium-sized cases. This suggests that small-sized texts are paraphrased more heavily through the different edit operations than suspicious documents of other sizes and hence are more difficult to detect.

In Tables 3–5 and Figure 10, it is noteworthy that the 4-gram value, and even the 3-gram value, approaches zero in most cases. This reflects how thoroughly the source documents have been altered at both the APC and MPC levels of rewrite across the entire corpus. Only a few documents have a high similarity between the source and its MPC version, because in those cases the plagiarists did not use any major paraphrasing techniques to rewrite the source text. In such a large corpus of more than 2,300 documents, however, there are only a few such cases.

To get a better view of the rewriting levels, we also applied an APC- and MPC-wise average n-gram approach, the results of which are presented in Table 6. As shown in Figure 11, the similarity ratio in most APC cases is higher than in MPC cases. This indicates that artificial paraphrasing tools still alter the source text somewhat less thoroughly than manual paraphrasing does.

5.3. Complete (Corpus-Based) Analysis

Table 7 shows the mean similarity scores obtained using the n-gram overlap and LCS approaches. AT refers to automatic translation, APC refers to artificially paraphrased copy, MPC refers to manually paraphrased copy, and NP refers to nonplagiarized. 1-gram refers to mean similarity scores generated using the n-gram overlap approach with n = 1 (i.e., unigram). Similarly, 2-gram refers to mean similarity scores generated using the n-gram overlap approach with n = 2 (i.e., bigram), and so on. Mean similarity scores obtained using the LCS approach are referred to as LCS. Note that the mean similarity score for AT is 1.00 for all methods. The reason is that we used Google Translator both for creating the AT cases of plagiarism (Section 3.2) and for the translation + monolingual analysis presented in this section. Therefore, the two translations are exactly the same, giving a similarity score of 1.00 for AT.

As expected, the similarity score drops as the level of rewrite increases (from AT to NP). This shows that plagiarism becomes harder to detect as the level of rewrite increases. It also shows that suspicious documents in the CLPD-UE-19 Corpus are generated using different obfuscation strategies. For the n-gram overlap approach, mean similarity scores drop as n increases, indicating that it is hard to find long exact matches in the source-suspicious document pairs. For the LCS approach, the score is quite low compared to the 1-gram approach. This highlights the fact that the order of the text in the source and suspicious documents is significantly different, which makes it hard to find longer matches.

6. Conclusion

The main goal of this study was to develop a large benchmark corpus of cross-lingual cases of plagiarism for the Urdu-English language pair at four levels of rewrite: automatic translation, artificial paraphrasing, manual paraphrasing, and nonplagiarized. There are a total of 2,398 document pairs in our proposed corpus: 1,588 are plagiarized and 810 are nonplagiarized. Plagiarized documents are created using three obfuscation strategies: automatic translation (540 documents), artificial paraphrasing (540 documents), and manual paraphrasing (508 documents). Wikipedia articles are used as source texts and categorized into small, medium, and large documents. A crowdsourcing approach has been applied to create our proposed corpus. We also performed linguistic analysis and translation + monolingual analysis of our proposed corpus. Our empirical analysis showed that there is a clear distinction between the four levels of rewrite in our proposed corpus, which makes the corpus more realistic and challenging. As this is an emerging area of research [37], in future we plan to apply cross-lingual plagiarism detection techniques to our proposed corpus.

Data Availability

The authors declare that the data mentioned and discussed in this paper will be provided, if required.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors are thankful to all the volunteers for their valuable contribution to the construction of the CLPD-UE-19 Corpus.