Research Article

Design and Development of a Large Cross-Lingual Plagiarism Corpus for Urdu-English Language Pair

Table 1

Domains from which Wikipedia source articles were selected to create the proposed CLPD-UE-19 corpus.

Domain | Major topics
Computer science | Free software, binary numbers, open source, database normalization, robotics, artificial intelligence, MSN, Google, Yahoo, WhatsApp, Android, Facebook, Twitter, Ruby language, Dailymotion, HTML, mobile apps, Gmail, Skype, and others
General topics | Globalization, Muhammad Iqbal, global warming, capitalism, mosque, bookselling, Pakistan Air Force, cricket, fashion, Lahore Fort, Badshahi Masjid, and two-nation theory
Electrical engineering | Electricity, magnetism, and conducting materials
Management science | Trade and finance
Physics | Atoms and scientists
Psychology | Neurology, psychological disorders, and enlightenment
Countries | Politics and trade of different countries (mostly African)
Pakistan studies | History of Pakistan and Indo-Pak partition
Zoology | Animals, food, and lifestyles
Biology | Natural organisms, living cells, and DNA