First Y-Short Tandem Repeat Categorical Dataset for Clustering Applications

Seman, Ali; Abu Bakar, Zainab; Isa, Mohamed Nizam

doi:https://doi.org/10.7167/2013/364725

Dataset Papers in Science


Dataset Paper | Open Access

Volume 2013 | Article ID 364725 | https://doi.org/10.7167/2013/364725


Ali Seman,¹ Zainab Abu Bakar,¹ and Mohamed Nizam Isa²

Academic Editors: L. Nanni, V. Grolmusz

Received 09 Oct 2012

Accepted 08 Nov 2012

Published 17 Feb 2013

Abstract

The Y-chromosome short tandem repeat (Y-STR) data presented here were collected mainly for benchmarking the performance of clustering methods. There are six Y-STR dataset items, divided into two categories: Y-STR surname data and Y-haplogroup data. The Y-STR data are categorical and differ from other categorical data in that they are composed of many similar and almost similar objects. This characteristic has caused problems for existing clustering algorithms when clustering them.

1. Introduction

Y-chromosome short tandem repeats (Y-STRs) are the tandem repeats on the Y-chromosome. A Y-STR value represents the number of times an STR motif repeats and is often called the allele value of the marker. Most marker names begin with the prefix D, which stands for DNA, followed by Y for Y-chromosome and S for a single copy sequence, and then by the location on the Y-chromosome, often known as the locus. This nomenclature is based on an international standard body called Human Gene Nomenclature Committee (HUGO; http://www.hugo-international.org/). For example, if the allele value of the DYS391 marker is eight, the STR would look like the following fragments: [TCTA] [TCTA] [TCTA] [TCTA] [TCTA] [TCTA] [TCTA] [TCTA]. The number of tandem repeats has been used effectively to characterize and differentiate between two people.
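The idea of an allele value as a repeat count can be illustrated with a short Python sketch. It is only a toy: the function name `count_tandem_repeats` is hypothetical, and real STR genotyping relies on laboratory fragment analysis and locus-specific conventions rather than a plain string scan.

```python
def count_tandem_repeats(sequence, motif):
    """Return the longest run of back-to-back copies of `motif` in `sequence`.

    Illustrative only: a plain string scan, not a genotyping method.
    """
    if not motif:
        raise ValueError("motif must be a non-empty string")
    best = 0
    step = len(motif)
    for start in range(len(sequence)):
        run = 0
        pos = start
        # Extend the run while the next block of characters equals the motif.
        while sequence.startswith(motif, pos):
            run += 1
            pos += step
        best = max(best, run)
    return best


# Eight consecutive TCTA motifs, as in the DYS391 example in the text.
allele_value = count_tandem_repeats("TCTA" * 8, "TCTA")
```

With the eight-motif fragment from the text, `allele_value` is 8.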

The Y-STR data are now actively used in genetic genealogy and anthropology studies, for example, by Hart [1], Smolenyak and Turner [2], Pomery [3], Sykes [4], Shawker [5], Fitzpatrick [6], and Fitzpatrick and Yeiser [7]. In genetic genealogy, the method is used to trace similar groups in Y-surname projects in support of traditional genealogical study. In the wider perspective of anthropological studies, the method is also used to establish groups of males, often called haplogroups, across geographical areas throughout the world. Haplogroups are studied with reference to mitochondrial DNA and Y-chromosomes [1]. As a consequence, a reputable reference, known as the modal haplotype, has been made available for defining groups of males all over the world (see http://www.isogg.org/ for the details). The modal haplotype is actually a haplotype diversity where the degree of relatedness has become spread out.

The Y-STR data have been used in clustering Y-surname and Y-haplogroup applications. Initial benchmarking results of clustering Y-STR data have been reported (see, e.g., [8–12]). Furthermore, the Y-STR data and their clustering results have also been published in the Journal of Genetic Genealogy, a journal of the genetic genealogy community [13]. A more comprehensive benchmark, involving six Y-STR dataset items and eight existing partitional algorithms, has also been reported [14]. These results indicate that the Y-STR data are quite unique compared to other categorical data, being characterized by many similar and almost similar objects. This uniqueness has caused the existing clustering algorithms to produce poor clustering results (see the detailed problems of clustering Y-STR data in [15]). As a result, we have recently proposed a new algorithm called k-Approximate Modal Haplotype (k-AMH) for clustering the six Y-STR dataset items [15]. With these Y-STR dataset items as a benchmark, the k-AMH algorithm has been shown to be an efficient clustering algorithm for partitioning Y-STR data. Tables 1 and 2 show the clustering results, comparing the k-AMH algorithm and the other eight clustering algorithms as reported in [15].

Thus, the objective of this paper is to give detailed insight into the six Y-STR dataset items used in the previous benchmarking results of clustering applications. The previously reported descriptions were limited to a summary of the six Y-STR dataset items only. No further descriptions of the methodological aspects have been reported, for example, data acquisition, filtration, distribution, and similarities. Detailed descriptions of these Y-STR data are important for future reference and for further benchmarks of any relevant applications.

2. Methodology

The Y-STR data are secondary data. They were established from the raw results of DNA genealogical testing reported in various Y-DNA projects. Most of the DNA genealogical testing results can be accessed publicly through a genealogical portal or database called WorldFamilies.net (see http://www.worldfamilies.net/). The data were retrieved from the respective websites in April 2010. The results were reported in the form of spreadsheets and grouped in accordance with surnames or haplogroups. The reported sheets were commonly arranged in several columns that began with the Kit Number, Paternal Ancestor Name, and Haplogroup, followed by the marker columns. Normally, up to 67 markers are tested, so the reported sheets provided all 67 marker columns. When fewer markers were tested, the remaining columns were left empty without allele values. In each marker column, the allele values were presented as numbers.

Most of the results, however, were not restricted to any specific number of DNA testing markers. There was therefore no uniformity in the reported results, because there is no standard for the number of markers chosen by participants. The companies that provide DNA testing services usually offer tests ranging from a minimum of 6 markers to a maximum of 67 markers. Thus, participants who wish to establish their familial relatedness more stringently may choose up to 67 markers, whereas others require only a few markers.

There are two groups of data representations: the Y-STR data for Y-haplogroup representation and the Y-STR data for Y-surname representation. Three dataset items were established to represent each group. For the purpose of clustering analysis, each datum was given a prefix attached to the original kit number. For the Y-surname data, the prefix is normally a letter denoting the surname or group. For example, if the datum belongs to a family with the Donald surname, the prefix D is attached to the kit number, giving D-15868. For the haplogroup data, the haplogroup name was used as the prefix, as in A-23456, which represents haplogroup A. These naming conventions were used in order to maintain the original references in case any questions arise in future. They were also used when analyzing the clustering accuracy results in the misclassification matrix during the experimental analysis. The misclassification matrix is a method proposed by Huang [16] for calculating clustering accuracy scores.
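As a minimal sketch of how the prefix convention can be used, the Python below splits an extended kit number into its class label and original kit number, and then tallies class labels against cluster assignments. The function names are hypothetical, the split on the first dash is an assumption about the identifier format, and the tally is only a plain contingency count, not necessarily the exact matrix used by Huang [16].

```python
from collections import Counter


def split_extended_kit(extended_kit):
    """Split an identifier such as 'D-15868' into ('D', '15868').

    Assumes the class prefix and the original kit number are separated
    by the first dash in the identifier.
    """
    label, sep, kit = extended_kit.partition("-")
    if not (label and sep and kit):
        raise ValueError(f"not an extended kit number: {extended_kit!r}")
    return label, kit


def class_by_cluster_counts(extended_kits, cluster_ids):
    """Count how many objects of each class fall into each cluster."""
    if len(extended_kits) != len(cluster_ids):
        raise ValueError("one cluster id is required per object")
    counts = Counter()
    for extended_kit, cluster in zip(extended_kits, cluster_ids):
        label, _ = split_extended_kit(extended_kit)
        counts[(label, cluster)] += 1
    return counts


# Toy example: the cluster ids 0 and 1 are made up for illustration.
counts = class_by_cluster_counts(["D-15868", "A-23456"], [0, 1])
```

Because the true class is recoverable from the identifier alone, cluster assignments can be scored without a separate label file.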

The Y-STR data are treated as categorical data rather than numerical data, even though the allele values are numeric. This is because the distance between two Y-STR objects is measured by comparing each allele (attribute) value of a Y-STR object with the corresponding value of the modal haplotype. The total number of mismatched values is the measurement of the genetic distance. In fact, an initial experimental result showed that the Y-STR data were better treated as categorical objects than as numerical objects [8, 13]. The dissimilarity measure between a Y-STR object $X = (x_1, \ldots, x_m)$ and the modal haplotype $H = (h_1, \ldots, h_m)$ can be formalized as

$$d(X, H) = \sum_{j=1}^{m} \delta(x_j, h_j) \tag{1a}$$

subject to

$$\delta(x_j, h_j) = \begin{cases} 0, & x_j = h_j, \\ 1, & x_j \neq h_j, \end{cases} \tag{1b}$$

where $m$ is the number of markers.
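The mismatch-counting dissimilarity described above can be sketched in a few lines of Python. The function name `mismatch_distance` and the three-marker example values are illustrative assumptions, not taken from the dataset.

```python
def mismatch_distance(haplotype, modal_haplotype):
    """Number of markers at which the allele values differ.

    Allele values are compared for equality only (categorical treatment);
    the size of the numeric difference is deliberately ignored.
    """
    if len(haplotype) != len(modal_haplotype):
        raise ValueError("both haplotypes must cover the same markers")
    return sum(1 for x, h in zip(haplotype, modal_haplotype) if x != h)


# Hypothetical three-marker example: the objects differ at one marker.
distance = mismatch_distance([13, 24, 14], [13, 25, 14])
```

Note that a value of 24 against 25 and a value of 20 against 25 both count as a single mismatch, which is what distinguishes the categorical treatment from a numerical distance.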

The Y-STR data were filtered to the 25 markers of the Y-DNA 25-marker test. The chosen markers were DYS393, DYS390, DYS19 (394), DYS391, DYS385a, DYS385b, DYS426, DYS388, DYS439, DYS389I, DYS392, DYS389II, DYS458, DYS459a, DYS459b, DYS455, DYS454, DYS447, DYS437, DYS448, DYS449, DYS464a, DYS464b, DYS464c, and DYS464d. The justifications for choosing 25 markers are as follows. (i) Twenty-five markers are sufficient for establishing a genetic connection between two people. According to Fitzpatrick [6], 12 markers (the Y-DNA 12 test) are already sufficient to determine who does or does not have a relationship to the core group of a family. (ii) The 25-marker test is a moderate option chosen by many participants, so 25-marker results were widely available for establishing the dataset.

Table 3 shows the detailed description of the 25 markers.

In the case of Y-surname, the data were filtered to obtain just the members of the main group of the family by comparing their allele values to the modal haplotype. The final data were therefore limited to objects with 0 to 5 mismatches only. This is because the fewer the mismatches for a given number of markers, the more likely it is that two people share a common ancestor [7]; that is, the two people are closely related to each other. Note that the original DNA genealogical testing results also included objects with more than 5 mismatches. For the haplogroup data, only the data that had been confirmed by SNP analysis were chosen. In the result sheets, the data that had been confirmed by SNP were marked in green. As a result of the filtration, the final data were much smaller than the original data.
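The surname filtration step can be sketched as follows. This is a simplified illustration under two assumptions: the modal haplotype is taken as the most frequent allele value per marker within the group being filtered (the projects may publish their own modal haplotypes), and the toy records use three markers and made-up kit numbers rather than the real 25-marker data.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Most frequent allele value per marker (column-wise mode)."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*haplotypes)]


def count_mismatches(haplotype, modal):
    """Number of markers whose allele value differs from the modal value."""
    return sum(1 for allele, mode in zip(haplotype, modal) if allele != mode)


def filter_main_group(records, max_mismatches=5):
    """Keep records within `max_mismatches` of the group's modal haplotype.

    `records` maps an extended kit number to its list of allele values.
    """
    modal = modal_haplotype(list(records.values()))
    return {
        kit: haplotype
        for kit, haplotype in records.items()
        if count_mismatches(haplotype, modal) <= max_mismatches
    }


# Hypothetical records; a threshold of 1 is used so that the toy data
# show an exclusion (the text uses a threshold of 5 on 25 markers).
toy = {
    "D-1": [13, 24, 14],
    "D-2": [13, 24, 15],
    "D-3": [13, 23, 14],
    "D-4": [12, 22, 16],
}
kept = filter_main_group(toy, max_mismatches=1)
```

Here the modal haplotype is `[13, 24, 14]`, so `D-4`, which differs at all three markers, is dropped while the other three records are kept.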

The first, second, and third dataset items represent category 1, the Y-STR data for haplogroup applications, whereas the fourth, fifth, and sixth dataset items represent category 2, the Y-STR data for Y-surname applications.

Table 4 shows the distribution of each Y-STR dataset item. The largest dataset item is Dataset Item 1, with 751 objects. The smallest are Dataset Items 5 and 6, with 112 objects each. In terms of classes, the largest number of classes is 14 and the smallest is three. The distribution of the objects is indicated by the values in the parentheses. The distributions for the haplogroup dataset items are observably unbalanced. The unbalanced distribution was caused by the filtration process discussed above. However, this can be regarded as a data reduction process that yields a much smaller volume of data while closely maintaining the integrity of the data, as suggested by Han and Kamber [17]. The unbalanced distributions can be seen in Dataset Items 1, 2, and 3. For example, in Dataset Item 1, the class R consists of 475 objects, which is 63% of the total objects. Meanwhile, the class N of Dataset Item 2 consists of 141 objects, which is 53% of the total objects. This item also contains the smallest class, Group J, with only 6 objects (about 2% of the total objects). In Dataset Item 3, the class T consists of 158 objects, which is about 60% of the total objects. In contrast, the Y-surname dataset items are much more balanced in terms of the object distribution among the classes. This is because the Y-surname data are usually represented by groups of family relatedness. See the detailed characteristics and the object distributions of each dataset item in Table 4.
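The class shares quoted above follow directly from the class sizes given for each dataset item. As a small worked check in Python (the helper name `class_shares` is hypothetical; the counts are those stated in the text for Dataset Items 1 and 3):

```python
def class_shares(counts):
    """Percentage of the total objects that falls in each class."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Class sizes as stated in the text.
item1 = {"E": 24, "G": 20, "L": 200, "J": 32, "R": 475}
item3 = {"G": 37, "N": 68, "T": 158}

item1_shares = class_shares(item1)  # class R: 475 of 751 objects
item3_shares = class_shares(item3)  # class T: 158 of 263 objects
```

Rounded to whole percentages, class R accounts for 63% of Dataset Item 1 and class T for 60% of Dataset Item 3.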

Besides the distribution of the objects, the main difference between the two categories of Y-STR data is that the haplogroup data are characterized by objects that have a lower degree of similarity (quite distant) to each other, whereas the Y-STR surname data comprise objects that have a higher degree of similarity (similar or almost similar) to each other. For further comparison, Tables 5–10 provide the minimum, maximum, average, and range of the genetic distances. The genetic distances were calculated from the mismatched values between the Y-STR objects of a particular dataset item and their modal haplotypes, as formalized in (1a) and (1b). Note that the modal haplotypes here were the modes established from their respective classes.
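A per-class summary of this kind can be sketched in Python as below. The sketch assumes that the genetic distances to each class's modal haplotype have already been computed, and that "range" means the maximum minus the minimum; both the helper name and the example distances are hypothetical, not values from Tables 5–10.

```python
from statistics import mean


def summarize_distances(distances_by_class):
    """Minimum, maximum, average, and range of distances for each class.

    `distances_by_class` maps a class label to a non-empty list of
    mismatch counts; the range is assumed to be maximum minus minimum.
    """
    summary = {}
    for label, distances in distances_by_class.items():
        lowest, highest = min(distances), max(distances)
        summary[label] = {
            "min": lowest,
            "max": highest,
            "average": mean(distances),
            "range": highest - lowest,
        }
    return summary


# Made-up distances for two hypothetical classes.
stats = summarize_distances({"A": [0, 1, 2, 5], "B": [3, 3, 4]})
```

Comparing the averages and ranges across classes gives a quick view of how tightly the objects of each class sit around their modal haplotype.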

Tables 5, 6, and 7 show the genetic distances of the Y-STR haplogroup data. The average distance of Dataset Item 1 is 7.9–18.6 as shown in Table 5. This item is considered as having a lower degree of similarity of objects among themselves.

The average distance of Dataset Item 2 is 4.4–9.5 as shown in Table 6. This item is also considered as having a lower degree of similarity of objects among themselves.

The average distance of Dataset Item 3 is 6.3–8.4 as shown in Table 7. This item is also considered as having a lower degree of similarity of objects among themselves. The low degree of similarity of the Y-STR haplogroup dataset items indicates that the objects in the datasets are considerably distant to each other.

In the case of Y-STR surname dataset items, the average distance of Dataset Item 4 is 0.9–2.1 as shown in Table 8. This item is considered as having a higher degree of similarity of objects among themselves.

In Dataset Item 5, the average distance is 0.2–1.8 as shown in Table 9. This item is also considered as having a higher degree of similarity of objects among themselves.

In Dataset Item 6, the average distance is 0.2–3.8 as shown in Table 10. This item is also considered as having a higher degree of similarity of objects among themselves. The higher degree of similarity of the Y-STR surname dataset items as compared to the haplogroup dataset items indicates that the objects in the Y-surname dataset items are considerably similar or almost similar to each other.

In addition, the range values also indicate that the Y-STR surname dataset items consist of higher degree of similarity of the Y-STR surname objects. The range value of Dataset Item 4 is 3–6 (Table 8); Dataset Item 5, 1–5 (Table 9); and Dataset Item 6, 1–9 (Table 10). These values are obviously different as compared to the range values of the Y-STR haplogroup dataset items. For example, the range value of Dataset Item 1 is 7–12 (Table 5); Dataset Item 2, 7–19 (Table 6); and Dataset Item 3, 11–17 (Table 7).

3. Dataset Description

The dataset associated with this Dataset Paper consists of 6 items which are described as follows.

Dataset Item 1 (Table). This table consists of 751 Y-STR haplogroup objects from the Ireland Y-DNA Project (http://www.familytreedna.com/public/IrelandHeritage/). After filtration, the table comprises only five haplogroups: E (24), G (20), L (200), J (32), and R (475). The values in the parentheses indicate the number of objects belonging to each group. Note that the raw data comprise approximately 3419 records divided into 29 groups. The objects in this table have a lower degree of similarity among themselves; that is, they are considerably distant from each other. In the table, the first column is the Kit Number, followed by the 25 markers. Note that the Kit Number is the extended Kit Number, which combines the haplogroup name as a prefix, a dash, and the original Kit Number.

Column 1: Kit Number
Column 2: DYS393
Column 3: DYS390
Column 4: DYS19 (394)
Column 5: DYS391
Column 6: DYS385a
Column 7: DYS385b
Column 8: DYS426
Column 9: DYS388
Column 10: DYS439
Column 11: DYS389I
Column 12: DYS392
Column 13: DYS389II
Column 14: DYS458
Column 15: DYS459a
Column 16: DYS459b
Column 17: DYS455
Column 18: DYS454
Column 19: DYS447
Column 20: DYS437
Column 21: DYS448
Column 22: DYS449
Column 23: DYS464a
Column 24: DYS464b
Column 25: DYS464c
Column 26: DYS464d

Dataset Item 2 (Table). This table consists of 267 Y-STR haplogroup objects obtained from the Finland DNA Project (http://www.familytreedna.com/public/Finland). After filtration, the table comprises only four haplogroups: L (92), J (6), N (141), and R (28). The values in the parentheses indicate the number of objects belonging to each group. Note that the raw data comprise approximately 906 records divided into 7 groups. The objects in this table have a lower degree of similarity among themselves; that is, they are considerably distant from each other. In the table, the first column is the Kit Number, followed by the 25 markers. Note that the Kit Number is the extended Kit Number, which combines the haplogroup name as a prefix, a dash, and the original Kit Number.

Column 1: Kit Number
Column 2: DYS393
Column 3: DYS390
Column 4: DYS19 (394)
Column 5: DYS391
Column 6: DYS385a
Column 7: DYS385b
Column 8: DYS426
Column 9: DYS388
Column 10: DYS439
Column 11: DYS389I
Column 12: DYS392
Column 13: DYS389II
Column 14: DYS458
Column 15: DYS459a
Column 16: DYS459b
Column 17: DYS455
Column 18: DYS454
Column 19: DYS447
Column 20: DYS437
Column 21: DYS448
Column 22: DYS449
Column 23: DYS464a
Column 24: DYS464b
Column 25: DYS464c
Column 26: DYS464d

Dataset Item 3 (Table). This table consists of 263 objects obtained from the Y-haplogroup project (http://www.worldfamilies.net/yhapprojects). After filtration, the final table comprises only three haplogroups: Group G (37), Group N (68), and Group T (158). The values in the parentheses indicate the number of objects belonging to each group. Note that the raw data comprise approximately 516 records taken from haplogroups G, N, and T. The objects in this table have a lower degree of similarity among themselves; that is, they are considerably distant from each other. In the table, the first column is the Kit Number, followed by the 25 markers. Note that the Kit Number is the extended Kit Number, which combines the haplogroup name as a prefix, a dash, and the original Kit Number.

Column 1: Kit Number
Column 2: DYS393
Column 3: DYS390
Column 4: DYS19 (394)
Column 5: DYS391
Column 6: DYS385a
Column 7: DYS385b
Column 8: DYS426
Column 9: DYS388
Column 10: DYS439
Column 11: DYS389I
Column 12: DYS392
Column 13: DYS389II
Column 14: DYS458
Column 15: DYS459a
Column 16: DYS459b
Column 17: DYS455
Column 18: DYS454
Column 19: DYS447
Column 20: DYS437
Column 21: DYS448
Column 22: DYS449
Column 23: DYS464a
Column 24: DYS464b
Column 25: DYS464c
Column 26: DYS464d

Dataset Item 4 (Table). This table consists of 236 objects combining four surnames: the Donald surname (112), the Flannery surname (64), the Mumma surname (42), and the William surname (18). The values in the parentheses indicate the number of objects belonging to each group. The Donald surname data were obtained from Clan Donald’s DNA Projects (http://dna-project.clan-donald-usa.org/); the raw data comprise approximately 896 records. The Flannery surname data were obtained from the Flannery Clan Y-DNA project (http://www.flanneryclan.ie/); the raw data comprise approximately 896 records. The Mumma surname data were obtained from the Mumma-Moomaw Project (http://www.mumma.org/); the raw data comprise approximately 78 records. The William surname data were obtained from the Williams DNA Project (http://williams.genealogy.fm/); the raw data comprise approximately 626 records taken from 94 groups. The objects in this table have a higher degree of similarity among themselves; that is, they are considerably similar or almost similar to each other. In the table, the first column is the Kit Number, followed by the 25 markers. Note that the Kit Number is the extended Kit Number, which combines the surname prefix, a dash, and the original Kit Number.

Column 1: Kit Number
Column 2: DYS393
Column 3: DYS390
Column 4: DYS19 (394)
Column 5: DYS391
Column 6: DYS385a
Column 7: DYS385b
Column 8: DYS426
Column 9: DYS388
Column 10: DYS439
Column 11: DYS389I
Column 12: DYS392
Column 13: DYS389II
Column 14: DYS458
Column 15: DYS459a
Column 16: DYS459b
Column 17: DYS455
Column 18: DYS454
Column 19: DYS447
Column 20: DYS437
Column 21: DYS448
Column 22: DYS449
Column 23: DYS464a
Column 24: DYS464b
Column 25: DYS464c
Column 26: DYS464d

Dataset Item 5 (Table). This table consists of 112 objects belonging to the Phillips DNA project (http://www.phillipsdnaproject.com/). After filtration, the final data comprise only 8 family groups: Group 2 (30), Group 4 (8), Group 5 (10), Group 8 (18), Group 10 (17), Group 16 (10), Group 17 (12), and Group 29 (7). The values in the parentheses indicate the number of objects belonging to each group. Note that the raw data comprise approximately 341 records taken from 64 groups. The objects in this table have a higher degree of similarity among themselves; that is, they are considerably similar or almost similar to each other. In the table, the first column is the Kit Number, followed by the 25 markers. Note that the Kit Number is the extended Kit Number, which combines the surname prefix, a dash, and the original Kit Number.

Column 1: Kit Number
Column 2: DYS393
Column 3: DYS390
Column 4: DYS19 (394)
Column 5: DYS391
Column 6: DYS385a
Column 7: DYS385b
Column 8: DYS426
Column 9: DYS388
Column 10: DYS439
Column 11: DYS389I
Column 12: DYS392
Column 13: DYS389II
Column 14: DYS458
Column 15: DYS459a
Column 16: DYS459b
Column 17: DYS455
Column 18: DYS454
Column 19: DYS447
Column 20: DYS437
Column 21: DYS448
Column 22: DYS449
Column 23: DYS464a
Column 24: DYS464b
Column 25: DYS464c
Column 26: DYS464d

Dataset Item 6 (Table). This table consists of 112 objects belonging to the Brown Surname project (http://brownsociety.org/). After filtration, the data comprise only 14 family groups: Group 2 (9), Group 10 (17), Group 15 (6), Group 18 (6), Group 20 (7), Group 23 (8), Group 26 (8), Group 28 (8), Group 34 (7), Group 44 (6), Group 35 (7), Group 46 (7), Group 49 (10), and Group 91 (6). The values in the parentheses indicate the number of objects belonging to each group. Note that the raw data comprise approximately 543 records taken from 126 groups. The objects in this table have a higher degree of similarity among themselves; that is, they are considerably similar or almost similar to each other. In the table, the first column is the Kit Number, followed by the 25 markers. Note that the Kit Number is the extended Kit Number, which combines the surname prefix, a dash, and the original Kit Number.

Column 1: Kit Number
Column 2: DYS393
Column 3: DYS390
Column 4: DYS19 (394)
Column 5: DYS391
Column 6: DYS385a
Column 7: DYS385b
Column 8: DYS426
Column 9: DYS388
Column 10: DYS439
Column 11: DYS389I
Column 12: DYS392
Column 13: DYS389II
Column 14: DYS458
Column 15: DYS459a
Column 16: DYS459b
Column 17: DYS455
Column 18: DYS454
Column 19: DYS447
Column 20: DYS437
Column 21: DYS448
Column 22: DYS449
Column 23: DYS464a
Column 24: DYS464b
Column 25: DYS464c
Column 26: DYS464d

4. Concluding Remarks

The Y-STR data are unusual among categorical data. They are characterized by many objects that are similar or almost similar to each other. This makes them different from other common categorical datasets such as Soybean, Zoo, and Credit. In addition, this is considered the first effort to document Y-STR datasets, so that their use is not limited to clustering applications only. The availability of the data will benefit researchers for further use in any method or application.

Dataset Availability

The dataset associated with this Dataset Paper is dedicated to the public domain using the CC0 waiver and is available at http://dx.doi.org/10.7167/2013/364725/dataset. In addition, the dataset can be accessed and downloaded freely from BioMed Central through the following links: http://www.biomedcentral.com/imedia/3073202776992603/supp1.txt, http://www.biomedcentral.com/imedia/1801488029699262/supp3.txt, http://www.biomedcentral.com/imedia/5259281766992624/supp4.txt, http://www.biomedcentral.com/imedia/1928703388699263/supp5.txt, and http://www.biomedcentral.com/imedia/7090097036992633/supp6.txt.

Disclosure

The authors declare that they have no competing interests.

Acknowledgments

The authors would like to extend their gratitude to many contributors toward the completion of this paper including Engineer Azizian Mohd Sapawi and their research assistants: Syahrul, Azhari, Kamal, Hasmarina, Nurin, Soleha, Mastura, Fadzila, Suhaida, and Shukriah.

Dataset Files

364725.item.1.xlsx

364725.item.2.xlsx

364725.item.3.xlsx

364725.item.4.xlsx

364725.item.5.xlsx
Dataset Item 5 (Table). This table consists of 112 objects belonging to the Philips DNA project (http://www.phillipsdnaproject.com/). After filtration, the final data are composed of only 8 family groups: Group 2 (30), Group 4 (8), Group 5 (10), Group 8 (18), Group 10 (17), Group 16 (10), Group 17 (12), and Group 29 (7). Note that the raw data are approximately 341 data taken from 64 groups. The values in the parentheses indicate the number of objects belonging to that particular group. This table is considered as having a higher degree of similarity of objects among themselves, which indicates that the objects in the table are considerably similar or almost similar to each other. In the table, the first column is the Kit Number followed by the 25 markers. Note that the Kit Number is actually the extended Kit Number that combined a prefix of its surname separated by the dash and followed by the original Kit Number.
- Column 1: Kit Number
- Column 2: DYS393
- Column 3: DYS390
- Column 4: DYS19 (394)
- Column 5: DYS391
- Column 6: DYS385a
- Column 7: DYS385b
- Column 8: DYS426
- Column 9: DYS388
- Column 10: DYS439
- Column 11: DYS389I
- Column 12: DYS392
- Column 13: DYS389II
- Column 14: DYS458
- Column 15: DYS459a
- Column 16: DYS459b
- Column 17: DYS455
- Column 18: DYS454
- Column 19: DYS447
- Column 20: DYS437
- Column 21: DYS448
- Column 22: DYS449
- Column 23: DYS464a
- Column 24: DYS464b
- Column 25: DYS464c
- Column 26: DYS464d
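
To make "similar or almost similar" concrete, one option is to treat each of the 25 allele values as a categorical label and count the markers on which two objects disagree. This is the simple matching dissimilarity commonly used by k-modes-type algorithms; it is shown here as an illustrative assumption, not as the measure prescribed for this table. The two profiles below are hypothetical and are not rows from the dataset.

```python
def simple_matching_distance(profile_a, profile_b):
    """Number of markers whose allele values differ (0 = identical)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same markers")
    return sum(a != b for a, b in zip(profile_a, profile_b))


# Hypothetical 25-marker profile (columns 2 to 26 of the table).
profile_a = [13, 24, 14, 11, 11, 14, 12, 12, 12, 13, 13, 29, 17,
             9, 10, 11, 11, 25, 15, 19, 29, 15, 15, 17, 17]

# An "almost similar" object: same values except one marker (column 5).
profile_b = list(profile_a)
profile_b[3] = 10

print(simple_matching_distance(profile_a, profile_a))  # identical objects
print(simple_matching_distance(profile_a, profile_b))  # differ on one marker
```

Under this measure, many pairs at distance 0 or 1 across different family groups is what makes such a table hard for a clustering algorithm to separate.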

364725.item.6.xlsx
Dataset Item 6 (Table). This table consists of 112 objects belonging to the Brown Surname project (http://brownsociety.org/). After filtration, the data comprise only 14 family groups: Group 2 (9), Group 10 (17), Group 15 (6), Group 18 (6), Group 20 (7), Group 23 (8), Group 26 (8), Group 28 (8), Group 34 (7), Group 44 (6), Group 35 (7), Group 46 (7), Group 49 (10), and Group 91 (6). The values in parentheses indicate the number of objects belonging to each group. Note that the raw data comprise approximately 543 records taken from 126 groups. This table is considered to have a higher degree of similarity among its objects; that is, many objects in the table are similar or almost similar to one another. In the table, the first column is the Kit Number, followed by the 25 markers. Note that the Kit Number is an extended Kit Number: a surname prefix, then a dash, then the original Kit Number.
- Column 1: Kit Number
- Column 2: DYS393
- Column 3: DYS390
- Column 4: DYS19 (394)
- Column 5: DYS391
- Column 6: DYS385a
- Column 7: DYS385b
- Column 8: DYS426
- Column 9: DYS388
- Column 10: DYS439
- Column 11: DYS389I
- Column 12: DYS392
- Column 13: DYS389II
- Column 14: DYS458
- Column 15: DYS459a
- Column 16: DYS459b
- Column 17: DYS455
- Column 18: DYS454
- Column 19: DYS447
- Column 20: DYS437
- Column 21: DYS448
- Column 22: DYS449
- Column 23: DYS464a
- Column 24: DYS464b
- Column 25: DYS464c
- Column 26: DYS464d
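
The group sizes quoted for Dataset Item 6 can be checked against the stated total of 112 objects. The sketch below only restates the numbers from the description; the variable names are illustrative.

```python
# Family group sizes for Dataset Item 6, as listed in the description.
brown_group_sizes = {
    2: 9, 10: 17, 15: 6, 18: 6, 20: 7, 23: 8, 26: 8,
    28: 8, 34: 7, 44: 6, 35: 7, 46: 7, 49: 10, 91: 6,
}

n_groups = len(brown_group_sizes)
n_objects = sum(brown_group_sizes.values())

# The description states 14 family groups and 112 objects in total.
print(n_groups, n_objects)
```

The same check can be repeated on the group labels in the file itself to confirm that the filtered table matches the description.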


Copyright

Copyright © 2013 Ali Seman et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
