Research Article
Design and Development of a Large Cross-Lingual Plagiarism Corpus for Urdu-English Language Pair
| Size (count of words) | Level name | Class | Version (total count) | CS | GT | Phy | Bio | EE | Zol | Psy | PS | MS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ≤50 | (Small) | Nonplagiarized | NP (450) | 100 | 50 | 75 | | | | 25 | | |
| ≤50 | (Small) | Plagiarized | AT (300) | 100 | 50 | 99 | | | | 51 | | |
| ≤50 | (Small) | Plagiarized | AP (300) | 100 | 50 | 99 | | | | 51 | | |
| ≤50 | (Small) | Plagiarized | MP (290) | 100 | 50 | 90 | | | | 50 | | |
| >50 and ≤100 | Paragraph (medium) | Nonplagiarized | NP (225) | 50 | 25 | | | 20 | 75 | | 15 | 40 |
| >50 and ≤100 | Paragraph (medium) | Plagiarized | AT (150) | 50 | 25 | | | 15 | | | 10 | 50 |
| >50 and ≤100 | Paragraph (medium) | Plagiarized | AP (150) | 50 | 25 | | | 15 | | | 10 | 50 |
| >50 and ≤100 | Paragraph (medium) | Plagiarized | MP (148) | 50 | 25 | | | 15 | | | 10 | 48 |
| ≥100 and ≤200 | Essay (large) | Nonplagiarized | NP (135) | 30 | 15 | | | | 33 | | 57 | |
| ≥100 and ≤200 | Essay (large) | Plagiarized | AT (90) | 30 | 15 | | 45 | | | | | |
| ≥100 and ≤200 | Essay (large) | Plagiarized | AP (90) | 30 | 15 | | 45 | | | | | |
| ≥100 and ≤200 | Essay (large) | Plagiarized | MP (70) | 30 | 15 | | 25 | | | | | |
| Total | | | | 720 | 360 | 363 | 115 | 65 | 108 | 177 | 102 | 188 |

CS: Computer Science, GT: General Topics, Phy: Physics, Bio: Biology, EE: Electrical Engineering, Zol: Zoology, Psy: Psychology, PS: Pak Studies, MS: Management Sciences (200 nonplagiarized documents are from the countries domain).
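The arithmetic in the table above can be cross-checked with a short script. The following is a minimal sketch, not part of the original article: the names `DOMAINS`, `ROWS`, `STATED_TOTALS`, `domain_totals`, and `row_gaps` are hypothetical, the counts are transcribed by hand from the table, and empty cells are assumed to mean zero documents.

```python
# Cross-check of the corpus statistics table (counts transcribed by hand).
# Assumption: an empty cell in the table means zero documents for that domain.
DOMAINS = ["CS", "GT", "Phy", "Bio", "EE", "Zol", "Psy", "PS", "MS"]

# Each row: (level, version, stated total, counts per domain in DOMAINS order).
ROWS = [
    ("small",  "NP", 450, [100, 50, 75,  0,  0,  0, 25,  0,  0]),
    ("small",  "AT", 300, [100, 50, 99,  0,  0,  0, 51,  0,  0]),
    ("small",  "AP", 300, [100, 50, 99,  0,  0,  0, 51,  0,  0]),
    ("small",  "MP", 290, [100, 50, 90,  0,  0,  0, 50,  0,  0]),
    ("medium", "NP", 225, [ 50, 25,  0,  0, 20, 75,  0, 15, 40]),
    ("medium", "AT", 150, [ 50, 25,  0,  0, 15,  0,  0, 10, 50]),
    ("medium", "AP", 150, [ 50, 25,  0,  0, 15,  0,  0, 10, 50]),
    ("medium", "MP", 148, [ 50, 25,  0,  0, 15,  0,  0, 10, 48]),
    ("large",  "NP", 135, [ 30, 15,  0,  0,  0, 33,  0, 57,  0]),
    ("large",  "AT",  90, [ 30, 15,  0, 45,  0,  0,  0,  0,  0]),
    ("large",  "AP",  90, [ 30, 15,  0, 45,  0,  0,  0,  0,  0]),
    ("large",  "MP",  70, [ 30, 15,  0, 25,  0,  0,  0,  0,  0]),
]

# The "Total" row as printed in the table, in DOMAINS order.
STATED_TOTALS = [720, 360, 363, 115, 65, 108, 177, 102, 188]


def domain_totals(rows):
    """Sum the per-domain counts over all rows (one total per column)."""
    return [sum(counts[i] for _, _, _, counts in rows) for i in range(len(DOMAINS))]


def row_gaps(rows):
    """Return {(level, version): stated - listed} for rows that do not add up."""
    gaps = {}
    for level, version, stated, counts in rows:
        diff = stated - sum(counts)
        if diff != 0:
            gaps[(level, version)] = diff
    return gaps


if __name__ == "__main__":
    print("column totals match:", domain_totals(ROWS) == STATED_TOTALS)
    print("rows with a gap:", row_gaps(ROWS))
    print("documents overall:", sum(stated for _, _, stated, _ in ROWS))
```

Under these assumptions, the column sums agree with the Total row, and the only row whose listed cells fall short of its stated total is the small nonplagiarized (NP) row, by 200 documents. That gap presumably corresponds to the 200 nonplagiarized documents from the countries domain mentioned in the legend, which has no column of its own.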