Research Article

Design and Development of a Large Cross-Lingual Plagiarism Corpus for Urdu-English Language Pair

Table 2

Corpus statistics.

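| Size (count of words) | Level name         | Version (total count) | CS  | GT  | Phy | Bio | EE | Zol | Psy | PS  | MS  |
|-----------------------|--------------------|-----------------------|-----|-----|-----|-----|----|-----|-----|-----|-----|
| ≤50                   | (Small)            | NP: 450               | 100 | 50  | 75  |     |    |     | 25  |     |     |
|                       |                    | Plagiarized AT (300)  | 100 | 50  | 99  |     |    |     | 51  |     |     |
|                       |                    | Plagiarized AP (300)  | 100 | 50  | 99  |     |    |     | 51  |     |     |
|                       |                    | Plagiarized MP (290)  | 100 | 50  | 90  |     |    |     | 50  |     |     |
| >50 and ≤100          | Paragraph (medium) | NP: 225               | 50  | 25  |     |     | 20 | 75  |     | 15  | 40  |
|                       |                    | Plagiarized AT (150)  | 50  | 25  |     |     | 15 |     |     | 10  | 50  |
|                       |                    | Plagiarized AP (150)  | 50  | 25  |     |     | 15 |     |     | 10  | 50  |
|                       |                    | Plagiarized MP (148)  | 50  | 25  |     |     | 15 |     |     | 10  | 48  |
| ≥100 and ≤200         | Essay (large)      | NP: 135               | 30  | 15  |     |     |    | 33  |     | 57  |     |
|                       |                    | Plagiarized AT (90)   | 30  | 15  |     | 45  |    |     |     |     |     |
|                       |                    | Plagiarized AP (90)   | 30  | 15  |     | 45  |    |     |     |     |     |
|                       |                    | Plagiarized MP (70)   | 30  | 15  |     | 25  |    |     |     |     |     |
| Total                 |                    |                       | 720 | 360 | 363 | 115 | 65 | 108 | 177 | 102 | 188 |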

CS: Computer Science, GT: General Topics, Phy: Physics, Bio: Biology, EE: Electrical Engineering, Zol: Zoology, Psy: Psychology, PS: Pak Studies, MS: Management Sciences, NP: nonplagiarized. A further 200 nonplagiarized documents are from the countries domain; they are not listed under any of the subject-domain columns.
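The counts in Table 2 can be cross-checked with a few lines of Python. The sketch below is a consistency check, not part of the original article: the assignment of each count to a subject domain is an assumed reading of the table, chosen because it reproduces every stated row total and column total. The names `ROWS`, `STATED_TOTALS`, `column_totals`, and `row_gaps` are hypothetical, and placing the 200 countries-domain documents in the small NP row is an inference from the arithmetic (450 stated versus 250 listed), not something the footnote says.

```python
# Subject-domain columns in the order used by Table 2.
DOMAINS = ["CS", "GT", "Phy", "Bio", "EE", "Zol", "Psy", "PS", "MS"]

# (size level, version, stated total, per-domain counts).
# Blank cells are omitted. The cell-to-domain assignment is an assumed reading.
ROWS = [
    ("small", "NP", 450, {"CS": 100, "GT": 50, "Phy": 75, "Psy": 25}),
    ("small", "AT", 300, {"CS": 100, "GT": 50, "Phy": 99, "Psy": 51}),
    ("small", "AP", 300, {"CS": 100, "GT": 50, "Phy": 99, "Psy": 51}),
    ("small", "MP", 290, {"CS": 100, "GT": 50, "Phy": 90, "Psy": 50}),
    ("medium", "NP", 225,
     {"CS": 50, "GT": 25, "EE": 20, "Zol": 75, "PS": 15, "MS": 40}),
    ("medium", "AT", 150, {"CS": 50, "GT": 25, "EE": 15, "PS": 10, "MS": 50}),
    ("medium", "AP", 150, {"CS": 50, "GT": 25, "EE": 15, "PS": 10, "MS": 50}),
    ("medium", "MP", 148, {"CS": 50, "GT": 25, "EE": 15, "PS": 10, "MS": 48}),
    ("large", "NP", 135, {"CS": 30, "GT": 15, "Zol": 33, "PS": 57}),
    ("large", "AT", 90, {"CS": 30, "GT": 15, "Bio": 45}),
    ("large", "AP", 90, {"CS": 30, "GT": 15, "Bio": 45}),
    ("large", "MP", 70, {"CS": 30, "GT": 15, "Bio": 25}),
]

# The "Total" row of Table 2.
STATED_TOTALS = {"CS": 720, "GT": 360, "Phy": 363, "Bio": 115, "EE": 65,
                 "Zol": 108, "Psy": 177, "PS": 102, "MS": 188}

# Nonplagiarized documents from the countries domain (table footnote).
COUNTRIES_NP = 200


def column_totals(rows):
    """Sum the per-domain counts over all rows."""
    totals = {domain: 0 for domain in DOMAINS}
    for _level, _version, _stated, counts in rows:
        for domain, n in counts.items():
            totals[domain] += n
    return totals


def row_gaps(rows):
    """Return {(level, version): stated total - listed sum} for nonzero gaps."""
    gaps = {}
    for level, version, stated, counts in rows:
        gap = stated - sum(counts.values())
        if gap != 0:
            gaps[(level, version)] = gap
    return gaps


if __name__ == "__main__":
    print("Column totals match:", column_totals(ROWS) == STATED_TOTALS)
    print("Rows with unlisted documents:", row_gaps(ROWS))
    print("Corpus size:", sum(stated for _, _, stated, _ in ROWS))
```

Under this reading, the only row whose listed counts fall short of its stated total is the small NP row, and the shortfall equals the 200 countries-domain documents mentioned in the footnote.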