Research Article  Open Access
En Cheng, Z. Meral Ozsoyoglu, "PathCounting Formulas for Generalized Kinship Coefficients and Condensed Identity Coefficients", Computational and Mathematical Methods in Medicine, vol. 2014, Article ID 898424, 20 pages, 2014. https://doi.org/10.1155/2014/898424
PathCounting Formulas for Generalized Kinship Coefficients and Condensed Identity Coefficients
Abstract
An important computation on pedigree data is the calculation of condensed identity coefficients, which provide a complete description of the degree of relatedness of two individuals. The applications of condensed identity coefficients range from genetic counseling to disease tracking. Condensed identity coefficients can be computed using linear combinations of generalized kinship coefficients for two, three, four individuals, and two pairs of individuals and there are recursive formulas for computing those generalized kinship coefficients (Karigl, 1981). Pathcounting formulas have been proposed for the (generalized) kinship coefficients for two (three) individuals but there have been no pathcounting formulas for the other generalized kinship coefficients. It has also been shown that the computation of the (generalized) kinship coefficients for two (three) individuals using pathcounting formulas is efficient for large pedigrees, together with path encoding schemes tailored for pedigree graphs. In this paper, we propose a framework for deriving pathcounting formulas for generalized kinship coefficients. Then, we present the pathcounting formulas for all generalized kinship coefficients for which there are recursive formulas and which are sufficient for computing condensed identity coefficients. We also perform experiments to compare the efficiency of our method with the recursive method for computing condensed identity coefficients on large pedigrees.
1. Introduction
With the rapidly expanding field of medical genetics and genetic counseling, genealogy information is becoming increasingly abundant. In January 2009, the US Department of Health and Human Services released an updated and improved version of the Surgeon General’s Web-based family health history tool [1]. This Web-based tool makes it easy for users to record their family health history. Large extended human pedigrees are very informative for linkage analysis. Pedigrees including thousands of members in 10–20 generations are available from genetically isolated populations [2, 3]. In human genetics, a pedigree is defined as “a simplified diagram of a family’s genealogy that shows family members’ relationships to each other and how a specific trait, abnormality, or disease has been inherited” [4]. Pedigrees are utilized to trace the inheritance of a specific disease, calculate genetic risk ratios, identify individuals at risk, and facilitate genetic counseling. To calculate genetic risk ratios or identify individuals at risk, we need to assess the degree of relatedness of two individuals. All measures of relatedness are based on the concept of identity by descent (IBD). Two alleles are identical by descent if one is an ancestral copy of the other or if they are both copies of the same ancestral allele. The IBD concept is primarily due to Cotterman [5] and Malecot [6] and has been successfully applied to many problems in population genetics.
The simplest measure of relationship between two individuals is their kinship coefficient. The kinship coefficient between two individuals $a$ and $b$ is the probability that an allele selected randomly from $a$ and an allele selected randomly from the same autosomal locus of $b$ are identical by descent. To better discriminate between different types of pairs of relatives, identity coefficients were introduced by Gillois [7] and Harris [8] and promulgated by Jacquard [9]. Considering the four alleles of two individuals at a fixed autosomal locus, there are 15 possible identity states. Disregarding the distinction between maternally and paternally derived alleles, we obtain 9 condensed identity states. The probabilities associated with each condensed identity state are called condensed identity coefficients, which are useful in a diverse range of fields. These include the calculation of risk ratios for qualitative disease, the analysis of quantitative traits, and genetic counseling in medicine.
A recursive algorithm for calculating condensed identity coefficients proposed by Karigl [10] has been known for some time. This method requires that one calculates a set of generalized kinship coefficients, from which one obtains condensed identity coefficients via a linear transformation. One limitation is that this recursive approach is not scalable when applied to very large pedigrees. It has been previously shown that the kinship coefficients for two individuals [11–13] and the generalized kinship coefficients for three individuals [14, 15] can be efficiently calculated using pathcounting formulas together with path encoding schemes tailored for pedigree graphs.
Motivated by the efficiency of pathcounting formulas for computing the kinship coefficient for two individuals and the generalized kinship coefficient for three individuals, we first introduce a framework for developing pathcounting formulas to compute generalized kinship coefficients concerning three individuals, four individuals, and two pairs of individuals. Then, we present pathcounting formulas for all generalized kinship coefficients which have recursive formulas proposed by Karigl [10] and are sufficient to compute condensed identity coefficients. In summary, our ultimate goal is to use pathcounting formulas for generalized kinship coefficients computation so that efficiency and scalability for condensed identity coefficients calculation can be improved.
The main contributions of our work are as follows:
(i) a framework to develop pathcounting formulas for generalized kinship coefficients;
(ii) a set of pathcounting formulas for all generalized kinship coefficients having recursive formulas [10];
(iii) experimental results demonstrating significant performance gains for calculating condensed identity coefficients based on our proposed pathcounting formulas as compared to using recursive formulas [10].
2. Materials and Methods
This section describes kinship coefficients and generalized kinship coefficients, identity coefficients, and condensed identity coefficients in more detail. Conceptual terms for the pathcounting formulas for three and four individuals are introduced in Section 2.3. In addition, an overview of pathcounting formula derivation is presented.
2.1. Kinship Coefficients and Generalized Kinship Coefficients
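The kinship coefficient between two individuals $a$ and $b$, denoted by $\Phi_{ab}$, is the probability that a randomly chosen allele at the same locus from each is identical by descent (IBD). There are two approaches to computing the kinship coefficient $\Phi_{ab}$: the recursive approach [10] and the pathcounting approach [16]. The recursive formulas [10] for $\Phi_{aa}$ and $\Phi_{ab}$ are
$$\Phi_{aa} = \frac{1}{2}\left(1 + \Phi_{fm}\right) = \frac{1}{2}\left(1 + F_a\right), \qquad \Phi_{ab} = \frac{1}{2}\left(\Phi_{fb} + \Phi_{mb}\right) \quad \text{if } a \text{ is not an ancestor of } b, \tag{1}$$
where $f$ and $m$ denote the father and the mother of $a$, respectively, and $F_a$ is the inbreeding coefficient of $a$.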
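Wright’s pathcounting formula [16] for $\Phi_{ab}$ is
$$\Phi_{ab} = \sum_{A} \sum_{\langle P_{Aa}, P_{Ab} \rangle \in PP} \left(\frac{1}{2}\right)^{L_{P_{Aa}} + L_{P_{Ab}} + 1} \left(1 + F_A\right), \tag{2}$$
where $A$ is a common ancestor of $a$ and $b$, $PP$ is a set of nonoverlapping pathpairs $\langle P_{Aa}, P_{Ab} \rangle$ from $A$ to $a$ and $b$, $L_{P_{Aa}}$ is the length of the path $P_{Aa}$, $L_{P_{Ab}}$ is the length of the path $P_{Ab}$, and $F_A$ is the inbreeding coefficient of $A$. The pathpair $\langle P_{Aa}, P_{Ab} \rangle$ is nonoverlapping if and only if the two paths share no common individuals, except $A$.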
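As an illustration of Wright’s formula, the following minimal sketch enumerates all parent-to-child paths from each common ancestor, keeps the nonoverlapping pathpairs, and sums their contributions, each equal to one half raised to the two path lengths plus one, times one plus the ancestor's inbreeding coefficient. The pedigree encoding (a dictionary mapping each individual to a `(father, mother)` tuple, with `None` for unknown parents) and the function names `paths`, `ancestors`, and `wright_kinship` are assumptions made for this example; they are not taken from the paper, and the brute-force enumeration is meant for small pedigrees only.

```python
from itertools import product


def paths(ped, anc, desc):
    """All parent-to-child paths from anc to desc, each as a tuple of individuals."""
    if desc is None:
        return []
    if anc == desc:
        return [(desc,)]
    found = []
    for parent in ped.get(desc, (None, None)):
        for p in paths(ped, anc, parent):
            found.append(p + (desc,))
    return found


def ancestors(ped, x):
    """Set of all ancestors of x, including x itself."""
    seen, stack = set(), [x]
    while stack:
        n = stack.pop()
        if n is None or n in seen:
            continue
        seen.add(n)
        stack.extend(ped.get(n, (None, None)))
    return seen


def wright_kinship(ped, a, b):
    """Kinship coefficient of a and b by summing over nonoverlapping pathpairs."""
    total = 0.0
    for A in ancestors(ped, a) & ancestors(ped, b):
        f, m = ped.get(A, (None, None))
        # Inbreeding coefficient of A = kinship of its parents (0 if a parent is unknown).
        F_A = wright_kinship(ped, f, m) if f is not None and m is not None else 0.0
        for pa, pb in product(paths(ped, A, a), paths(ped, A, b)):
            if set(pa) & set(pb) == {A}:  # nonoverlapping: only A is shared
                # A path with n individuals has n - 1 edges, so the exponent
                # L(pa) + L(pb) + 1 equals len(pa) + len(pb) - 1.
                total += 0.5 ** (len(pa) + len(pb) - 1) * (1 + F_A)
    return total


# Hypothetical pedigrees: two full siblings, and two first cousins.
sibs = {"F": (None, None), "M": (None, None), "S1": ("F", "M"), "S2": ("F", "M")}
cousins = {
    "G1": (None, None), "G2": (None, None), "X1": (None, None), "X2": (None, None),
    "P1": ("G1", "G2"), "P2": ("G1", "G2"),
    "C1": ("P1", "X1"), "C2": ("P2", "X2"),
}
```

On these two pedigrees the sketch gives a kinship of 1/4 for the full siblings and 1/16 for the first cousins, the familiar textbook values.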
Recursive formulas proposed by Karigl [10] for generalized kinship coefficients concerning three individuals, four individuals, and two pairs of individuals are listed as follows in (3), (4), and (5):
$\Phi_{abc}$ is the probability that randomly chosen alleles at the same locus from each of the three individuals (i.e., $a$, $b$, and $c$) are identical by descent (IBD). Similarly, $\Phi_{abcd}$ is the probability that randomly chosen alleles at the same locus from each of the four individuals (i.e., $a$, $b$, $c$, and $d$) are IBD. $\Phi_{ab,cd}$ is the probability that a random allele from $a$ is IBD with a random allele from $b$ and that a random allele from $c$ is IBD with a random allele from $d$ at the same locus. Note that $\Phi_{abc} = 0$ if there is no common ancestor of $a$, $b$, and $c$; $\Phi_{abcd} = 0$ if there is no common ancestor of $a$, $b$, $c$, and $d$; and $\Phi_{ab,cd} = 0$ in the absence of a common ancestor either for $a$ and $b$ or for $c$ and $d$.
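To make the recursive approach concrete, the sketch below implements the recursion for two individuals and, as an assumption, the commonly cited form of Karigl's recursion for three individuals: the general case halves the sum over the two parents of an individual that is not an ancestor of the others, the case of a repeated individual $a$ with a third individual $b$ gives half of the sum of the kinship of $a$ and $b$ and the three-individual coefficient of $a$'s parents with $b$, and the case of a single individual taken three times gives one quarter of one plus three times the kinship of its parents. The names `make_karigl`, `phi2`, `phi3`, `ped`, and `order` are hypothetical; `order` is assumed to be a strict topological rank in which parents precede their children, so the highest-ranked argument is never an ancestor of the others.

```python
from functools import lru_cache


def make_karigl(ped, order):
    """Build memoized recursive kinship functions for two and three individuals.

    ped   : individual -> (father, mother), with None for unknown parents
    order : individual -> topological rank (parents rank before children)
    """

    @lru_cache(maxsize=None)
    def phi2(a, b):
        if a is None or b is None:
            return 0.0
        if a == b:
            f, m = ped[a]
            return 0.5 * (1 + phi2(f, m))
        if order[a] < order[b]:
            a, b = b, a  # recurse on the individual that cannot be an ancestor
        f, m = ped[a]
        return 0.5 * (phi2(f, b) + phi2(m, b))

    @lru_cache(maxsize=None)
    def phi3(a, b, c):
        if a is None or b is None or c is None:
            return 0.0
        # Put the highest-ranked (youngest) individual first.
        a, b, c = sorted((a, b, c), key=order.get, reverse=True)
        f, m = ped[a]
        if a == b == c:
            return 0.25 * (1 + 3 * phi2(f, m))
        if a == b:
            return 0.5 * (phi2(a, c) + phi3(f, m, c))
        return 0.5 * (phi3(f, b, c) + phi3(m, b, c))

    return phi2, phi3


# Hypothetical pedigree: two founders and their two children.
ped = {"F": (None, None), "M": (None, None), "S1": ("F", "M"), "S2": ("F", "M")}
order = {"F": 0, "M": 1, "S1": 2, "S2": 3}
phi2, phi3 = make_karigl(ped, order)
```

Memoization with `lru_cache` avoids recomputing shared subproblems within one pedigree, but the recursion still walks back to the founders for every query, which is the scalability concern the pathcounting approach is meant to address.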
2.2. Identity Coefficients and Condensed Identity Coefficients
Given two individuals $a$ and $b$ with maternally and paternally derived alleles at a fixed autosomal locus, there are 15 possible identity states, and the probabilities associated with each identity state are called identity coefficients. Ignoring the distinction between maternally and paternally derived alleles, we categorize the 15 possible states into 9 condensed identity states, as shown in Figure 1. The states range from state 1, in which all four alleles are IBD, to state 9, in which none of the four alleles are IBD. The probabilities associated with each condensed identity state are called condensed identity coefficients, denoted by $\Delta_i$ ($1 \le i \le 9$). The condensed identity coefficients can be computed based on generalized kinship coefficients using the linear transformation shown as follows in (6):
In our work, we focus on deriving the pathcounting formulas for the generalized kinship coefficients, including $\Phi_{abc}$, $\Phi_{abcd}$, and $\Phi_{ab,cd}$.
2.3. Terms Defined for PathCounting Formulas for Three and Four Individuals
(1) TripleCommon Ancestor. Given three individuals $a$, $b$, and $c$, if $A$ is a common ancestor of the three individuals, then we call $A$ a triplecommon ancestor of $a$, $b$, and $c$.
(2) QuadCommon Ancestor. Given four individuals $a$, $b$, $c$, and $d$, if $A$ is a common ancestor of the four individuals, then we call $A$ a quadcommon ancestor of $a$, $b$, $c$, and $d$.
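(3) $P(A, a)$. It denotes the set of all possible paths from $A$ to $a$, where the paths can only traverse edges in the direction of parent to child, such that $P(A, a) \neq \emptyset$ if and only if $A$ is an ancestor of $a$. $P_{Aa}$ denotes a particular path from $A$ to $a$, where $P_{Aa} \in P(A, a)$.
(4) PathPair. It consists of two paths, denoted as $\langle P_{Aa}, P_{Ab} \rangle$, where $P_{Aa} \in P(A, a)$ and $P_{Ab} \in P(A, b)$.
(5) Nonoverlapping PathPair. Given a pathpair $\langle P_{Aa}, P_{Ab} \rangle$, it is nonoverlapping if and only if the two paths share no common individuals, except $A$.
(6) PathTriple. It consists of three paths, denoted as $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$, where $P_{Aa} \in P(A, a)$, $P_{Ab} \in P(A, b)$, and $P_{Ac} \in P(A, c)$.
(7) PathQuad. It consists of four paths, denoted as $\langle P_{Aa}, P_{Ab}, P_{Ac}, P_{Ad} \rangle$, where $P_{Aa} \in P(A, a)$, $P_{Ab} \in P(A, b)$, $P_{Ac} \in P(A, c)$, and $P_{Ad} \in P(A, d)$.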
(8) . It denotes all common individuals shared between and , except .
(9) . It denotes all common individuals shared among , , and , except .
(10) . It denotes all common individuals shared among , , , and , except .
(11) Crossover and 2Overlap Individual. If , we call a crossover individual with respect to and if the two paths pass through different parents of . On the other hand, if and pass through the same parent of , then we call a 2overlap individual with respect to and .
(12) 3Overlap Individual. If and the three paths , , and pass through the same parent of , then we call a 3overlap individual with respect to , , and .
(13) 2Overlap Path. If is a 2overlap individual with respect to and , then both and pass through the same parent of , denoted by , and the edge from to is called an overlap edge. All consecutive overlap edges constitute a path and this path is called a 2overlap path. If the 2overlap path extends all the way to the ancestor , we call it a root 2overlap path.
(14) 3Overlap Path. It consists of all 3overlap individuals in a consecutive order. If the 3overlap path extends all the way to the root , we call it a root 3overlap path.
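The distinction between crossover and 2overlap individuals in definitions (11) and (13) can be sketched in a few lines. The example assumes each path is given as a tuple of individuals starting at the common ancestor and following parent-to-child edges; the function names `classify_shared` and `root_overlap_length` and the individual labels are hypothetical and serve only this illustration.

```python
def classify_shared(path1, path2):
    """Classify the individuals shared by two paths that start at the same ancestor.

    A shared individual is a 'crossover' if the two paths reach it through
    different parents and a '2overlap' individual if they reach it through
    the same parent. The common ancestor itself is not classified.
    """
    if path1[0] != path2[0]:
        raise ValueError("both paths must start at the same ancestor")
    # Map every individual below the ancestor to its parent on each path.
    pred1 = {child: parent for parent, child in zip(path1, path1[1:])}
    pred2 = {child: parent for parent, child in zip(path2, path2[1:])}
    return {
        s: "2overlap" if pred1[s] == pred2[s] else "crossover"
        for s in pred1.keys() & pred2.keys()
    }


def root_overlap_length(path1, path2):
    """Number of consecutive overlap edges shared by both paths starting at the ancestor."""
    k = 0
    for edge1, edge2 in zip(zip(path1, path1[1:]), zip(path2, path2[1:])):
        if edge1 != edge2:
            break
        k += 1
    return k
```

For example, the paths `("A", "u", "s", "a")` and `("A", "v", "s", "b")` meet at `s` through different parents, so `s` is a crossover individual, whereas `("A", "u", "s", "a")` and `("A", "u", "t", "b")` share the edge from `A` to `u`, which forms a root 2overlap path of length 1.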
Example 1. Consider the pathpairs from to and in Figure 2, where is a common ancestor of and . For pathpair1, , and →→→ is a root 2overlap path with respect to and . For pathpair4, , where is a crossover individual; is a 2overlap individual with respect to and , and → is a root 2overlap path with respect to and .
Example 2. There are four pathquads listed in Figure 3, from to four individuals , , , and , where is a quadcommon ancestor of the four individuals. For pathquad2, considering the paths and , the path →→→ is a root 2overlap path; are 2overlap individuals with respect to and . For pathquad3, are 3overlap individuals with respect to , , and , and the path →→→ is a root 3overlap path.
Table 1 summarizes the conceptual terms used in the pathcounting formulas for two, three, and four individuals. At the level of terminology, it gives a first view of how our framework generalizes Wright’s formula to three and four individuals.

2.4. An Overview of PathCounting Formula Derivation
According to Wright’s pathcounting formula [16] (see (2)) for two individuals $a$ and $b$, the pathcounting approach requires identifying common ancestors of $a$ and $b$ and calculating the contribution of each common ancestor to $\Phi_{ab}$. More specifically, for each common ancestor, denoted as $A$, we obtain all pathpairs from $A$ to $a$ and $b$ and identify acceptable pathpairs. For $\Phi_{ab}$, an acceptable pathpair is a nonoverlapping pathpair where the two paths share no common individuals, except $A$. In Figure 2, pathpair2 is an acceptable pathpair, while pathpair1, pathpair3, and pathpair4 are not acceptable pathpairs. The contribution of each common ancestor $A$ to $\Phi_{ab}$ is computed based on the inbreeding coefficient of $A$, modified by the length of each acceptable pathpair.
To compute $\Phi_{abc}$, the pathcounting approach requires identifying all triplecommon ancestors of $a$, $b$, and $c$ and summing up all triplecommon ancestors’ contributions to $\Phi_{abc}$. For each triplecommon ancestor, denoted as $A$, we first identify all pathtriples, each of which consists of three paths from $A$ to $a$, $b$, and $c$, respectively. Some examples of pathtriples are presented in Figure 2.
For $\Phi_{ab}$, only nonoverlapping pathpairs are acceptable. A pathtriple $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$ consists of three pathpairs $\langle P_{Aa}, P_{Ab} \rangle$, $\langle P_{Aa}, P_{Ac} \rangle$, and $\langle P_{Ab}, P_{Ac} \rangle$. For $\Phi_{abc}$, a pathtriple might be acceptable even though either 2overlap individuals or crossover individuals exist between the paths of a pathpair. The main challenge we need to address is finding necessary and sufficient conditions for acceptable pathtriples.
To identify acceptable pathtriples, we first use a systematic method to generate all possible cases for a pathpair by considering different types of common individuals shared between the two paths. Then, we introduce building blocks, which are connected graphs with conditions on every edge that encapsulate a set of acceptable cases of pathpairs. In each building block, we represent paths as nodes and interactions (i.e., shared common individuals between two paths) as edges. There are at least two paths in a building block. For each building block, we obtain all acceptable cases for the pathpairs concerned. A given pathtriple can be decomposed into one or multiple building blocks. When two building blocks share a pathpair, we use the natural join operator from relational algebra to match the acceptable cases for the shared pathpair between the two building blocks. In other words, taking the acceptable cases for building blocks as inputs, we use the natural join operator to construct all acceptable cases for a pathtriple. Acceptable cases for a pathtriple are identified and then used in deriving the pathcounting formula for $\Phi_{abc}$.
The main procedures used for deriving the pathcounting formula for $\Phi_{abc}$ are summarized in the flowchart shown in Figure 4. These procedures are also applicable for deriving the pathcounting formulas for $\Phi_{abcd}$ and $\Phi_{ab,cd}$.
3. Results and Discussion
3.1. PathCounting Formulas for Three Individuals
We first introduce a systematic method to generate all possible cases for a pathpair. Then we discuss building blocks for pathtriples and identify all acceptable cases which are used in deriving the pathcounting formula for .
3.1.1. Cases for a PathPair
Given a pathpair with , where is a common ancestor of and and () consists of all common individuals shared between and , except , we introduce three patterns (i.e., crossover, 2overlap, and root 2overlap) to generate all possible cases for .(1)(): and share one or multiple crossover individuals.(2)(): and are root 2overlapping from , and the root 2overlap path can have one or multiple 2overlap individuals.(3)(): and are overlapping but not from , and the 2overlap path can have one or multiple 2overlap individuals.
Based on the three patterns, (), (), and (), we use regular expressions to generate all possible cases for the pathpair . For convenience, we drop and use , and instead of patterns (), (), and (), whenever there is no confusion. When , the eight cases shown in (7) cover all possible cases for . The completeness of eight cases shown in (7) for can be proved by induction on the total number of , , and appearing in . Using the pedigree in Figure 2, Cases 1–3 and Case 6 are illustrated in (8), (9), (10), and (11): where are 2overlap individuals and the overlap path is a root 2overlap path: where is a 2overlap individual and the overlap path is a root 2overlap path; is a crossover individual: where is a crossover individual: where is a crossover individual; is a 2overlap individual and the overlap path is a 2overlap path.
3.1.2. PathPair Level Graphical Representation of a PathTriple
Given a pathtriple , we represent each path as a node. The pathtriple can be decomposed to three pathpairs (i.e., , , and ). For each pathpair, if the two paths share at least one common individual (i.e., either 2overlap individual or crossover individual), except , then there is an edge between the two nodes representing the two paths. Therefore, we obtain four different scenarios , shown in Figure 5.
In Figure 5, the scenario has no edges, so it means that consists of three independent paths. In Figure 2, pathtriple1 is an example of . Next, we introduce a lemma which can assist with identifying the options for the edges in the scenarios .
Lemma 3. Given a pathtriple $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$, consider the three pathpairs $\langle P_{Aa}, P_{Ab} \rangle$, $\langle P_{Aa}, P_{Ac} \rangle$, and $\langle P_{Ab}, P_{Ac} \rangle$. If the regular expression representation of any of the three pathpairs contains a 2overlap edge that does not lie on a root 2overlap path (the third pattern in Section 3.1.1), then the pathtriple has no contribution to $\Phi_{abc}$.
Proof. In [17], Nadot and Vaysseix proposed, from a genetic and biological point of view, that $\Phi_{abc}$ can be evaluated by enumerating all eligible inheritance paths at the allele level, starting from a triplecommon ancestor $A$ to the three individuals $a$, $b$, and $c$.
For the pedigree in Figure 6, let us consider the pathtriple listed as follows. ; ; .
For , is a crossover individual, is an overlap individual, and is a 2overlap edge represented by in regular expression representation (see the definition for in Section 3.1.1).
For the individual , let us denote the two alleles at one fixed autosomal locus as and . At allelelevel, only one allele can be passed down from to . Since and are parents of is passed down from one parent, and is passed down from the other parent. It is infeasible to pass down both and from to . In other words, there are no corresponding inheritance paths for the pathtriple with a 2overlap edge between (i.e., Case 6: ). Therefore, such kind of pathtriples has no contribution to .
Figure 6: (a) Pedigree; (b) Inheritance paths.
Figure 6(b) shows one example of eligible inheritance paths corresponding to a pedigree graph. Each individual is represented by two allele nodes. The eligible inheritance paths in Figure 6(b) consist of red edges only.
Only Case 1, Case 2, and Case 3 do not have in the regular expression representation of a pathpair (see (7)); considering the scenarios shown in Figure 5, an edge can have three options .
3.1.3. Constructing Cases for a PathTriple
For the scenarios in Figure 5, we define two building blocks along with some rules in Figure 7 to generate acceptable cases. For , the edge can have three options . For , we cannot allow both edges to be root overlap, because if two edges are root overlap, then and must share at least one common individual, except , which contradicts the fact that and have no edge.
Next, we focus on generating all acceptable cases for the scenarios in Figure 5, where only contains more than one building block. In order to leverage the dependency among building blocks, we decompose to , shown in Figure 8. For each , we have a set of acceptable pathtriples, denoted as .
Considering the dependency among , we use the natural join operator, denoted as , operating on to generate all acceptable cases for . As a result, we obtain , where denotes the acceptable cases of the pathtriple in the scenario .
For each scenario in Figure 5, we generate all acceptable cases for . The scenario has no edges, and it shows that consists of three independent paths, while, for the other scenarios (), the edges can have two options:(1)all edges belong to crossover; or(2)one edge belongs to root 2overlap; the remaining () edges belong to crossover.
In summary, an acceptable pathtriple can have at most one root 2overlap path and any number of crossover individuals, but it cannot have any 2overlap path that is not a root 2overlap path.
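The natural join construction used in this section can be sketched as follows. The example is an illustration under simplifying assumptions rather than the construction from Figures 7 and 8: each edge carries one of two hypothetical labels, `"X"` (crossover only) or `"T"` (the edge involves a root 2overlap), a two-edge building block is modeled as a relation that forbids both of its edges being `"T"`, and the scenario with three edges is covered by three such blocks. The names `natural_join`, `two_edge_block`, and the edge keys are invented for this example.

```python
def natural_join(r1, r2):
    """Natural join of two relations, each given as a list of dicts."""
    joined = []
    for t1 in r1:
        for t2 in r2:
            shared = t1.keys() & t2.keys()
            # Keep the combined tuple only if the shared attributes agree.
            if all(t1[k] == t2[k] for k in shared):
                joined.append({**t1, **t2})
    return joined


def two_edge_block(e1, e2):
    """Acceptable label pairs for two adjacent edges: they cannot both be root overlap."""
    labels = ("X", "T")
    return [
        {e1: x, e2: y}
        for x in labels
        for y in labels
        if not (x == "T" and y == "T")
    ]


# Edges are named after the pathpair they connect (hypothetical keys).
blocks = [
    two_edge_block("ab", "bc"),
    two_edge_block("bc", "ac"),
    two_edge_block("ab", "ac"),
]
cases = blocks[0]
for block in blocks[1:]:
    cases = natural_join(cases, block)
```

Under these assumptions the join leaves four cases: all three edges are crossover, or exactly one edge is a root 2overlap and the other two are crossover, which agrees with the summary that an acceptable pathtriple has at most one root 2overlap path.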
3.1.4. Splitting Operator
Considering the existence of root 2overlap path and crossover in acceptable pathtriples, we propose a splitting operator to transform a pathtriple with crossover individuals to a noncrossover pathtriple without changing the contribution from this pathtriple to . The main purpose of using the splitting operator is to simplify the pathcounting formula derivation process. We first use an example in Figure 9 to illustrate how the splitting operator works. In Figure 9, there is a crossover individual between and in the path triple in . The splitting operator proceeds as follows:(1)split the node to two nodes, and ;(2)transform the edges and to and , respectively;(3)add two new edges, and .
Lemma 4. Given a pedigree graph having crossover individuals regarding shown in Figure 9, let denote the lowest crossover individual, where no descendant of can be a crossover individual among the three paths , , and . After using the splitting operator for the lowest crossover individual in , the number of crossover individuals in is decreased by 1.
Proof. The splitting operator only affects the edges from to and . If there is a new crossover node appearing, the only possible node is either or . Assume becomes a crossover individual; it means that is able to reach and from two separate paths. It contradicts the fact that is the lowest crossover individual between and .
Next, we introduce a canonical graph which results from applying the splitting operator for all crossover individuals. The canonical graph has zero crossover individual.
Definition 5 (Canonical Graph). Given a pedigree graph $G$ having one or more crossover individuals regarding $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$, if there exists a graph $G'$ which has no crossover individuals with regard to $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$ such that
(i) any acceptable pathtriple in $G$ has a corresponding acceptable pathtriple in $G'$ with the same contribution to $\Phi_{abc}$ as the one in $G$;
(ii) any acceptable pathtriple in $G'$ has a corresponding acceptable pathtriple in $G$ with the same contribution to $\Phi_{abc}$ as the one in $G'$,
then we call $G'$ a canonical graph of $G$ regarding $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$.
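Lemma 6. For a pedigree graph $G$ having one or more crossover individuals regarding $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$, there exists a canonical graph $G'$ for $G$.
Proof (Sketch).
The proof is by induction on the number of crossover individuals.
Induction hypothesis: assume that if $G$ has $k$ or fewer crossovers, there is a canonical graph $G'$ for $G$.
In the induction step, let $G$ be a graph with $k + 1$ crossovers, and let $s$ be the lowest crossover between two of the paths in $G$. We apply the splitting operator on $s$ in $G$ and, by Lemma 4, obtain a graph having $k$ crossovers. By the induction hypothesis, this graph has a canonical graph; because the splitting operator does not change the contribution of a pathtriple to $\Phi_{abc}$, that graph is also a canonical graph for $G$.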
3.1.5. PathCounting Formula for
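Now, we present the pathcounting formula for $\Phi_{abc}$:
$$\Phi_{abc} = \sum_{A} \left( \sum_{\text{Type 1}} \left(\frac{1}{2}\right)^{L_{\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle}} \Phi_{AAA} + \sum_{\text{Type 2}} \left(\frac{1}{2}\right)^{L_{\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle} + 1} \Phi_{AA} \right), \tag{12}$$
where $\Phi_{AA} = \frac{1}{2}(1 + F_A)$ and $\Phi_{AAA} = \frac{1}{4}(1 + 3F_A)$; $F_A$: the inbreeding coefficient of $A$; $A$: a triplecommon ancestor of $a$, $b$, and $c$; Type 1: $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$ has zero root 2overlap; Type 2: $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$ has one root 2overlap path $P_{As}$ ending at the individual $s$; $L_{\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle} = L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}}$ for Type 1 and $L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} - L_{P_{As}}$ for Type 2; and $L_{P_{Aa}}$: the length of the path $P_{Aa}$ (also applicable for $L_{P_{Ab}}$, $L_{P_{Ac}}$, and $L_{P_{As}}$).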
For completeness, the pathcounting formula for is given in Appendix A; and the correctness proof of the pathcounting formula is given in Appendix B.
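As a worked illustration of how a single acceptable pathtriple contributes, the sketch below evaluates one term under an assumed closed form: a Type 1 pathtriple contributes one half raised to the sum of the three path lengths, times $\frac{1}{4}(1 + 3F_A)$, and a Type 2 pathtriple contributes one half raised to that sum minus the length of the root 2overlap path plus one, times $\frac{1}{2}(1 + F_A)$. This form is a reconstruction that agrees with the recursive formulas on the small pedigrees mentioned afterwards; it should be checked against the published equation before being relied on. The function name and parameters are hypothetical.

```python
def pathtriple_contribution(lengths, F_A=0.0, root_overlap_len=0):
    """Contribution of one acceptable pathtriple from a triplecommon ancestor A.

    lengths          : lengths (edge counts) of the three paths from A
    F_A              : inbreeding coefficient of A
    root_overlap_len : length of the single root 2overlap path (0 for Type 1)
    """
    total_len = sum(lengths) - root_overlap_len
    if root_overlap_len == 0:
        # Type 1 (assumed form): three distinct edges leave A.
        return 0.5 ** total_len * 0.25 * (1 + 3 * F_A)
    # Type 2 (assumed form): two of the paths leave A together along the overlap.
    return 0.5 ** (total_len + 1) * 0.5 * (1 + F_A)
```

For a noninbred father and his two children, the only pathtriple has lengths (1, 1, 0) and no overlap, giving 1/16. For a noninbred grandfather and two grandchildren through the same parent, the pathtriple has lengths (2, 2, 0) with a root 2overlap path of length 1, giving 1/32. Both values match what the recursive formulas yield for the same pedigrees.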
3.2. PathCounting Formulas for Four Individuals
3.2.1. PathPair Level Graphical Representation of
Given a pathquad and , the pathquad can have 11 scenarios shown in Figure 10 where all four paths are considered symmetrically.
In Figure 11, we introduce three building blocks . For and , the rules presented in Figure 7 are also applicable for Figure 11. For , we only consider root overlap, because the crossover individuals can be eliminated by using the splitting operator introduced in Section 3.1.4. Note that for , if , then it is equivalent to the scenario in Figure 8 Therefore, we only need to consider when .
3.2.2. Building BlockBased Cases Construction for
For a scenario in Figure 11, we first decompose to one or multiple building blocks. For a scenario , it has only one building block, and all acceptable cases can be obtained directly. For , there is no need to consider the conflict between the edges in and because and are disconnected. Let denote all acceptable cases of the pathpairs in , and let denote all acceptable cases for . Therefore, we obtain where denotes the Cartesian product operator from relational algebra.
For , we obtain . For , we define the largest subgraph of based on which we construct .
Definition 7 (Largest Subgraph). Given a scenario and , the largest subgraph of , denoted as , is defined as follows:(1) is a proper subgraph of ;(2)if contains , then must also contain ;(3)no such exists that is a proper subgraph of while is also a proper subgraph of .
For each scenario and , we list the largest subgraph of , denoted as , in Table 2.
For a scenario and , let denote the set of building blocks in but not in , where is the largest subgraph of . Let and denote the number of edges in and , respectively. According to Table 2, we can conclude that . In order to leverage the dependency among building blocks, we consider only in . For example, . Let denote all acceptable cases for . And let denote the set of acceptable cases for . Then, we can use and Diff to construct all acceptable cases for . Then, we apply this idea for constructing all acceptable cases for each in Table 2.
Given a pathquad $\langle P_{Aa}, P_{Ab}, P_{Ac}, P_{Ad} \rangle$, an acceptable case has the following properties:
(1) if there is one root 3overlap path, there can be at most one root 2overlap path;
(2) otherwise, there can be at most two root 2overlap paths.
3.2.3. PathCounting Formula for
Now, we present the pathcounting formula for as follows: where , , : the inbreeding coefficient of , : a quadcommon ancestor of , , , and , Type 1: zero root 2overlap and zero root 3overlap path, Type 2: one root 2overlap path ending at and : the length of the path (also applicable for , etc.).
For completeness, the pathcounting formulas for and are presented in Appendix A. The correctness of the pathcounting formula for four individuals is proven in Appendix C.
3.3. PathCounting Formulas for Two Pairs of Individuals
3.3.1. Terminology and Definitions
(1) 2PairPathPair. It consists of two pairs of pathpairs denoted as , where , is a common ancestor of and , and is a common ancestor of and . If , then is a quadcommon ancestor of , , , and .
(2) HomoOverlap and HeterOverlap Individual. Given two pairs of individuals , if (or , we call a homooverlap individual when and (or and ) pass through the same parent of . If , where and , we call a heteroverlap individual when and pass through the same parent of .
(3) Root HomoOverlap and HeterOverlap Path. Given a 2pairpathpair , if is a homooverlap individual and the homooverlap path extends all the way to the quadcommon ancestor , then we call it a root homooverlap path. If is a heteroverlap individual and the heteroverlap path extends all the way to the quadcommon ancestor , then we call it a root heteroverlap path.
Example 8. is quadcommon ancestor for , , , and in Figure 12. For (a), is a homooverlap individual between and .
is a homooverlap individual between and . And, and are root homooverlap paths. For (b), is a heteroverlap individual between and . is a heteroverlap individual between and . And and are root heteroverlap paths.
3.3.2. PathCounting Formula for
Now, we present a pathpair level graphical representation for shown in Figure 13. The options for an edge can be . (Refer to Section 3.1.1 for definitions of , and ). Based on the different types of presented in (14), all cases for are summarized in Table 3, where is the last individual of a root homooverlap path (i.e., the path ending at ) and and are the last individuals of root heteroverlap paths and , respectively.

Given a pedigree graph having one or multiple progenitors, we define the generation of a progenitor $p$ to be 0, denoted as $gen(p) = 0$. If an individual $a$ has only one parent $p$, then we define $gen(a) = gen(p) + 1$. If an individual $a$ has two parents $f$ and $m$, we define $gen(a) = \max\{gen(f), gen(m)\} + 1$.
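The generation of every individual can be computed with a short memoized recursion. The sketch assumes that an individual with two parents is placed one generation below the deeper of the two, which is a reading of the definition rather than a quotation of it; the pedigree encoding (a dictionary from each individual to a tuple of its known parents) and the names `make_generation` and `generation` are likewise hypothetical.

```python
from functools import lru_cache


def make_generation(parents):
    """Return a memoized function giving the generation of an individual.

    parents maps each individual to a tuple of its known parents;
    progenitors map to an empty tuple (or are absent from the mapping).
    """

    @lru_cache(maxsize=None)
    def generation(x):
        known = [p for p in parents.get(x, ()) if p is not None]
        if not known:
            return 0  # progenitor
        # Assumption: one generation below the deepest known parent.
        return max(generation(p) for p in known) + 1

    return generation


# Hypothetical pedigree: "g" has a first-generation parent "c" and a progenitor parent "q".
parents = {"p": (), "q": (), "c": ("p", "q"), "h": ("p",), "g": ("c", "q")}
generation = make_generation(parents)
```

Here the progenitors `p` and `q` are in generation 0, `c` and `h` are in generation 1, and `g` is in generation 2.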
The pathcounting formula for is as follows: where : a quadcommon ancestor of , , , and , : a common ancestor of and , and : a common ancestor of and . For , there are four types (i.e., Type 1 to Type 4). Type 1: zero root homooverlap and zero root heteroverlap. Type 2: zero root homooverlap and one root heteroverlap ending at , Type 4: one root homooverlap ending at and two root heteroverlap ending at and, and. For , there is one type (i.e., Type 5). Type 5: has zero overlap individual, has zero overlap individual.
At most one pathpair can have crossover individuals.
Between a path from and a path from , there are no overlap individuals, but there can be crossover individuals, , where and :
Note that if and have zero quadcommon ancestors, we have the following formula for : Type 6: is a nonoverlapping pathpair and is a nonoverlapping pathpair. Between a path from and a path from , there are no overlap individuals, but there can be crossover individuals.
and are defined as in Type 5.
The correctness of the pathcounting formula for is proven in Appendix C. For completeness, please refer to [18] for the pathcounting formulas for , , , and .
3.4. Experimental Results
In this section, we show the efficiency of our pathcounting method using NodeCodes for computing condensed identity coefficients by comparing it with the recursive method used in [10]. We implemented two methods: (1) using recursive formulas to compute each required kinship coefficient and generalized kinship coefficient; (2) using the pathcounting method coupled with NodeCodes to compute each required kinship coefficient and generalized kinship coefficient independently. We refer to the first method as Recursive and to the second method as NodeCodes. For completeness, please refer to [18] for the details of the NodeCodes-based method.
The NodeCodes of a node are a set of labels, each representing a path to the node from its ancestors. Given a pedigree graph, let $r$ be the progenitor (i.e., the node with 0 indegree). (For simplicity, we assume there is one progenitor, $r$, as the ancestor of all individuals in the pedigree. Otherwise, a virtual node $r$ can be added to the pedigree graph and all progenitors can be made children of $r$.) For each node $u$ in the graph, the set of NodeCodes of $u$, denoted as $NC(u)$, is assigned using a breadth-first-search traversal starting from $r$ as follows.
(1) If $u$ is $r$, then $NC(r)$ contains only one element: the empty string.
(2) Otherwise, let $u$ be a node with $NC(u)$, and let $v_0, v_1, \ldots, v_k$ be $u$’s children in sibling order; then for each $x$ in $NC(u)$, a code $x\,i\,{*}$ is added to $NC(v_i)$, where $0 \le i \le k$, and $*$ indicates the gender of the individual represented by node $v_i$.
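The assignment of NodeCodes can be sketched as a level-order traversal of the pedigree graph. Two details in this example are assumptions rather than part of the scheme as described: each code step is written as the sibling index followed by a gender letter and a dot (for instance `0M.`), and a child is expanded only after all of its parents have been processed (Kahn's topological ordering), so that the codes of both parents are complete before they are extended. The function name `assign_nodecodes` and the input format are hypothetical.

```python
from collections import deque


def assign_nodecodes(children, gender, root):
    """Assign to every node the set of codes of all paths from root to that node.

    children : node -> list of its children in sibling order (leaves map to [])
    gender   : node -> single-letter gender marker (unused for the root)
    root     : the (possibly virtual) progenitor
    """
    indegree = {n: 0 for n in children}
    for kids in children.values():
        for k in kids:
            indegree[k] += 1

    codes = {n: set() for n in children}
    codes[root].add("")  # the root carries only the empty string
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for i, v in enumerate(children[u]):
            for x in codes[u]:
                # Extend each code of the parent by sibling index and gender.
                codes[v].add(f"{x}{i}{gender[v]}.")
            indegree[v] -= 1
            if indegree[v] == 0:  # all parents of v are done
                queue.append(v)
    return codes


# Hypothetical pedigree: a virtual root over two founders with two children.
children = {"r": ["F", "M"], "F": ["S1", "S2"], "M": ["S1", "S2"], "S1": [], "S2": []}
gender = {"F": "M", "M": "F", "S1": "M", "S2": "F"}
codes = assign_nodecodes(children, gender, "r")
```

In this example each child receives one code per path from the root, so `S1` gets `0M.0M.` through its father and `1F.0M.` through its mother. Because a descendant's code extends the code of the ancestor it passes through, ancestor relationships can be tested by prefix comparison.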
The computations of kinship coefficients for two individuals and generalized kinship coefficients for three individuals presented in [11, 12, 14, 15] use NodeCodes. The NodeCodes-based computation schemes can also be applied to the generalized kinship coefficients for four individuals and two pairs of individuals. For completeness, please refer to [18] for the details of using NodeCodes to compute the generalized kinship coefficients for four individuals and two pairs of individuals based on our proposed pathcounting formulas in Sections 3.2 and 3.3.
In order to test the scalability of our approach for calculating condensed identity coefficients on large pedigrees, we used a population simulator implemented in [11] to generate arbitrarily large pedigrees. The population simulator is based on the algorithm for generating populations with overlapping generations in Chapter 4 of [19] along with the parameters given in Appendix of [20] to model the relatively isolated Finnish Kainuu subpopulation and its growth during the years 1500–2000. An overview of the generation algorithm was presented in [11, 12, 14]. The parameters include starting/ending year, initial population size, initial age distribution, marriage probability, maximum age at pregnancy, expected number of children by time period, immigration rate, and probability of death by time period and age group.
We examine the performance of computing condensed identity coefficients using twelve synthetic pedigrees, which range from 75 individuals to 195,197 individuals. The smallest pedigree spans 3 generations, and the largest pedigree spans 19 generations. We analyzed the effects of pedigree size and the depth of individuals in the pedigree (the longest path between the individual and a progenitor) on the computation efficiency improvement.
In the first experiment, 300 random pairs were selected from each of our 12 synthetic pedigrees. Figure 14 shows the computation efficiency improvement for each pedigree. As can be seen, the improvement of NodeCodes over Recursive grew as the pedigree size increased, from 26.83% on the smallest pedigree to 94.75% on the largest pedigree. It also shows that the pathcounting method coupled with NodeCodes scales well to large pedigrees in terms of computing condensed identity coefficients.
In our next experiment, we examined the effect of the depth of the individual in the pedigree on the query time. For each depth, we generated 300 random pairs from the largest synthetic pedigree.
Figure 15 shows the effect of depth on the computation efficiency improvement. We can see the improvement of NodeCodes over Recursive, ranging from 86.48% to 91.30%.
4. Conclusion
We have introduced a framework for generalizing Wright’s pathcounting formula to more than two individuals. Aiming at efficiently computing condensed identity coefficients, we proposed pathcounting formulas (PCF) for all generalized kinship coefficients that are sufficient for expressing condensed identity coefficients as a linear combination. We also performed experiments to compare the efficiency of our method with the recursive method for computing condensed identity coefficients on large pedigrees. Our future work includes (i) further improvements on condensed identity coefficients computation by collectively calculating the set of generalized kinship coefficients to avoid redundant computations, and (ii) experimental results for using PCF in conjunction with encoding schemes (e.g., compact pathencoding schemes [13]) for computing condensed identity coefficients on very large pedigrees.
Appendices
A. PathCounting Formulas of Special Cases
A.1. PathCounting Formula for
For , we introduce a special case, where and are mergeable.
Definition A.1 (Mergeable PathPair). A pathpair is mergeable if and only if the two paths and are completely identical.
Next, we present a graphical representation of in Figure 16.
Lemma A.2. For and in Figure 16, cannot be a mergeable pathpair.
Proof. For and , if is mergeable, then any common individual between and is also a shared individual between and . It means which contradicts the fact that .
Considering all three scenarios in Figure 16, only can have a mergeable pathpair by Lemma A.2. Now, we present our pathcounting formula for where is not an ancestor of :
where : a common ancestor of and .
When is not mergeable, Type 1: has no root 2overlap. Type 2: has one root 2overlap path ending at the individual .
When is mergeable,
Type 3: is a nonoverlapping pathpair
For the sake of completeness, if is an ancestor of , there is no recursive formula for in [10], but we can use either the recursive formula for or the pathcounting formula for to compute .
A.2. PathCounting Formula for
Given a pathquad , if is not mergeable, then we process the pathquad as equivalent to . If is mergeable, the pathquad