Path-Counting Formulas for Generalized Kinship Coefficients and Condensed Identity Coefficients

Cheng, En; Ozsoyoglu, Z. Meral

doi:https://doi.org/10.1155/2014/898424

Computational and Mathematical Methods in Medicine


Special Issue

Advances in Statistical Medicine


Research Article | Open Access

Volume 2014 | Article ID 898424 | https://doi.org/10.1155/2014/898424


En Cheng¹ and Z. Meral Ozsoyoglu²

Academic Editor: Zhenyu Jia

Received 14 Jan 2014

Accepted 08 May 2014

Published 21 Jul 2014

Abstract

An important computation on pedigree data is the calculation of condensed identity coefficients, which provide a complete description of the degree of relatedness of two individuals. The applications of condensed identity coefficients range from genetic counseling to disease tracking. Condensed identity coefficients can be computed using linear combinations of generalized kinship coefficients for two, three, and four individuals and for two pairs of individuals, and there are recursive formulas for computing those generalized kinship coefficients (Karigl, 1981). Path-counting formulas have been proposed for the (generalized) kinship coefficients for two (three) individuals, but there have been no path-counting formulas for the other generalized kinship coefficients. It has also been shown that computing the (generalized) kinship coefficients for two (three) individuals using path-counting formulas, together with path encoding schemes tailored for pedigree graphs, is efficient for large pedigrees. In this paper, we propose a framework for deriving path-counting formulas for generalized kinship coefficients. Then, we present the path-counting formulas for all generalized kinship coefficients for which there are recursive formulas and which are sufficient for computing condensed identity coefficients. We also perform experiments to compare the efficiency of our method with the recursive method for computing condensed identity coefficients on large pedigrees.

1. Introduction

With the rapidly expanding field of medical genetics and genetic counseling, genealogy information is becoming increasingly abundant. In January 2009, the US Department of Health and Human Services released an updated and improved version of the Surgeon General’s Web-based family health history tool [1]. This Web-based tool makes it easy for users to record their family health history. Large extended human pedigrees are very informative for linkage analysis. Pedigrees including thousands of members in 10–20 generations are available from genetically isolated populations [2, 3]. In human genetics, a pedigree is defined as “a simplified diagram of a family’s genealogy that shows family members’ relationships to each other and how a specific trait, abnormality, or disease has been inherited” [4]. Pedigrees are utilized to trace the inheritance of a specific disease, calculate genetic risk ratios, identify individuals at risk, and facilitate genetic counseling. To calculate genetic risk ratios or identify individuals at risk, we need to assess the degree of relatedness of two individuals. As a matter of fact, all measures of relatedness are based on the concept of identical by descent (IBD). Two alleles are identical by descent if one is an ancestral copy of the other or if they are both copies of the same ancestral allele. The IBD concept is primarily due to Cotterman [5] and Malecot [6] and has been successfully applied to many problems in population genetics.

The simplest measure of relationship between two individuals is their kinship coefficient. The kinship coefficient between two individuals $a$ and $b$ is the probability that an allele selected randomly from $a$ and an allele selected randomly from the same autosomal locus of $b$ are identical by descent. To better discriminate between different types of pairs of relatives, identity coefficients were introduced by Gillois [7] and Harris [8] and promulgated by Jacquard [9]. Considering the four alleles of two individuals at a fixed autosomal locus, there are 15 possible identity states. Disregarding the distinction between maternally and paternally derived alleles, we obtain 9 condensed identity states. The probabilities associated with each condensed identity state are called condensed identity coefficients, which are useful in a diverse range of fields. This includes the calculation of risk ratios for qualitative disease, the analysis of quantitative traits, and genetic counseling in medicine.

A recursive algorithm for calculating condensed identity coefficients proposed by Karigl [10] has been known for some time. This method requires that one calculates a set of generalized kinship coefficients, from which one obtains condensed identity coefficients via a linear transformation. One limitation is that this recursive approach is not scalable when applied to very large pedigrees. It has been previously shown that the kinship coefficients for two individuals [11–13] and the generalized kinship coefficients for three individuals [14, 15] can be efficiently calculated using path-counting formulas together with path encoding schemes tailored for pedigree graphs.

Motivated by the efficiency of path-counting formulas for computing the kinship coefficient for two individuals and the generalized kinship coefficient for three individuals, we first introduce a framework for developing path-counting formulas to compute generalized kinship coefficients concerning three individuals, four individuals, and two pairs of individuals. Then, we present path-counting formulas for all generalized kinship coefficients which have recursive formulas proposed by Karigl [10] and are sufficient to compute condensed identity coefficients. In summary, our ultimate goal is to use path-counting formulas for generalized kinship coefficients computation so that efficiency and scalability for condensed identity coefficients calculation can be improved.

The main contributions of our work are as follows:
(i) a framework to develop path-counting formulas for generalized kinship coefficients;
(ii) a set of path-counting formulas for all generalized kinship coefficients having recursive formulas [10];
(iii) experimental results demonstrating significant performance gains for calculating condensed identity coefficients based on our proposed path-counting formulas as compared to using recursive formulas [10].

2. Materials and Methods

This section describes kinship coefficients and generalized kinship coefficients, identity coefficients, and condensed identity coefficients in more detail. Conceptual terms for the path-counting formulas for three and four individuals are introduced in Section 2.3. In addition, an overview of path-counting formula derivation is presented.

2.1. Kinship Coefficients and Generalized Kinship Coefficients

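The kinship coefficient between two individuals $a$ and $b$ is the probability that a randomly chosen allele at the same locus from each is identical by descent (IBD). There are two approaches to computing the kinship coefficient $\Phi_{ab}$: the recursive approach [10] and the path-counting approach [16]. The recursive formulas [10] for $\Phi_{ab}$ and $\Phi_{aa}$ are

$$
\begin{aligned}
\Phi_{ab} &= \tfrac{1}{2}\left(\Phi_{fb} + \Phi_{mb}\right) \quad \text{if } a \text{ is not an ancestor of } b, \\
\Phi_{aa} &= \tfrac{1}{2}\left(1 + \Phi_{fm}\right) = \tfrac{1}{2}\left(1 + F_a\right),
\end{aligned}
\tag{1}
$$

where $f$ and $m$ denote the father and the mother of $a$, respectively, and $F_a$ is the inbreeding coefficient of $a$.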

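Wright’s path-counting formula [16] for $\Phi_{ab}$ is

$$
\Phi_{ab} = \sum_{A} \; \sum_{\langle P_{Aa}, P_{Ab} \rangle \in PP} \left(\tfrac{1}{2}\right)^{L_{P_{Aa}} + L_{P_{Ab}} + 1} \left(1 + F_A\right),
\tag{2}
$$

where $A$ is a common ancestor of $a$ and $b$, $PP$ is the set of nonoverlapping path-pairs $\langle P_{Aa}, P_{Ab} \rangle$ from $A$ to $a$ and $b$, $L_{P_{Aa}}$ is the length of the path $P_{Aa}$, $L_{P_{Ab}}$ is the length of the path $P_{Ab}$, and $F_A$ is the inbreeding coefficient of $A$. The path-pair $\langle P_{Aa}, P_{Ab} \rangle$ is nonoverlapping if and only if the two paths share no common individuals, except $A$.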
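Formula (2) can be checked against the recursive formulas (1) on a small example. The following is a minimal sketch, not the authors' implementation: the toy pedigree `PED` and all function names are hypothetical, and paths are enumerated by brute force rather than with any path encoding scheme.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product

# Hypothetical toy pedigree: individual -> (father, mother); founders have (None, None).
PED = {
    "P": (None, None), "Q": (None, None), "R1": (None, None), "R2": (None, None),
    "s1": ("P", "Q"), "s2": ("P", "Q"),    # full sibs
    "z": ("s1", "s2"),                      # inbred child of a sib mating
    "y1": ("z", "R1"), "y2": ("z", "R2"),   # half sibs through z
}

def parents(x):
    return [p for p in PED[x] if p is not None]

@lru_cache(maxsize=None)
def ancestors(x):
    """x together with all of its ancestors."""
    out = {x}
    for p in parents(x):
        out |= ancestors(p)
    return frozenset(out)

def paths(anc, x):
    """All parent-to-child paths from anc down to x, as tuples of individuals."""
    if anc == x:
        return [(x,)]
    return [p + (x,) for par in parents(x) for p in paths(anc, par)]

def kinship_recursive(a, b):
    """Recursive formulas (1); None stands for the unknown parent of a founder."""
    if a is None or b is None:
        return Fraction(0)
    if a == b:
        return (1 + kinship_recursive(*PED[a])) / 2
    if a in ancestors(b):      # always recurse on an individual that is not an ancestor of the other
        a, b = b, a
    f, m = PED[a]
    return (kinship_recursive(f, b) + kinship_recursive(m, b)) / 2

def kinship_paths(a, b):
    """Wright's formula (2): sum over common ancestors and nonoverlapping path-pairs."""
    total = Fraction(0)
    for anc in ancestors(a) & ancestors(b):
        # Inbreeding coefficient of the common ancestor = kinship of its parents.
        f_anc = kinship_paths(*PED[anc]) if len(parents(anc)) == 2 else Fraction(0)
        for pa, pb in product(paths(anc, a), paths(anc, b)):
            if set(pa) & set(pb) == {anc}:              # the two paths share only the ancestor
                edges = (len(pa) - 1) + (len(pb) - 1)   # L(P_Aa) + L(P_Ab)
                total += Fraction(1, 2 ** (edges + 1)) * (1 + f_anc)
    return total
```

On this toy pedigree both functions agree for every pair of individuals; for example, the half sibs `y1` and `y2` share the inbred ancestor `z` ($F_z = 1/4$), so the single acceptable path-pair contributes $(1/2)^3 \cdot (1 + 1/4) = 5/32$.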

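Recursive formulas proposed by Karigl [10] for generalized kinship coefficients concerning three individuals, four individuals, and two pairs of individuals are listed as follows in (3), (4), and (5), where $f$ and $m$ denote the father and the mother of $a$:

$$
\begin{aligned}
\Phi_{abc} &= \tfrac{1}{2}\left(\Phi_{fbc} + \Phi_{mbc}\right) && \text{if } a \text{ is not an ancestor of } b \text{ or } c, \\
\Phi_{aab} &= \tfrac{1}{2}\left(\Phi_{ab} + \Phi_{fmb}\right) && \text{if } a \text{ is not an ancestor of } b, \\
\Phi_{aaa} &= \tfrac{1}{4}\left(1 + 3\Phi_{fm}\right),
\end{aligned}
\tag{3}
$$

$$
\begin{aligned}
\Phi_{abcd} &= \tfrac{1}{2}\left(\Phi_{fbcd} + \Phi_{mbcd}\right) && \text{if } a \text{ is not an ancestor of } b, c, \text{ or } d, \\
\Phi_{aabc} &= \tfrac{1}{2}\left(\Phi_{abc} + \Phi_{fmbc}\right) && \text{if } a \text{ is not an ancestor of } b \text{ or } c, \\
\Phi_{aaab} &= \tfrac{1}{4}\left(\Phi_{ab} + 3\Phi_{fmb}\right) && \text{if } a \text{ is not an ancestor of } b, \\
\Phi_{aaaa} &= \tfrac{1}{8}\left(1 + 7\Phi_{fm}\right),
\end{aligned}
\tag{4}
$$

$$
\begin{aligned}
\Phi_{ab,cd} &= \tfrac{1}{2}\left(\Phi_{fb,cd} + \Phi_{mb,cd}\right) && \text{if } a \text{ is not an ancestor of } b, c, \text{ or } d, \\
\Phi_{aa,bc} &= \tfrac{1}{2}\left(\Phi_{bc} + \Phi_{fm,bc}\right) && \text{if } a \text{ is not an ancestor of } b \text{ or } c, \\
\Phi_{ab,ac} &= \tfrac{1}{4}\left(2\Phi_{abc} + \Phi_{fb,mc} + \Phi_{mb,fc}\right) && \text{if } a \text{ is not an ancestor of } b \text{ or } c, \\
\Phi_{aa,ab} &= \tfrac{1}{2}\left(\Phi_{ab} + \Phi_{fmb}\right) && \text{if } a \text{ is not an ancestor of } b, \\
\Phi_{aa,aa} &= \tfrac{1}{4}\left(1 + 3\Phi_{fm}\right).
\end{aligned}
\tag{5}
$$

$\Phi_{abc}$ is the probability that randomly chosen alleles at the same locus from each of the three individuals (i.e., $a$, $b$, and $c$) are identical by descent (IBD). Similarly, $\Phi_{abcd}$ is the probability that randomly chosen alleles at the same locus from each of the four individuals (i.e., $a$, $b$, $c$, and $d$) are IBD. $\Phi_{ab,cd}$ is the probability that a random allele from $a$ is IBD with a random allele from $b$ and that a random allele from $c$ is IBD with a random allele from $d$ at the same locus. Note that $\Phi_{abc} = 0$ if there is no common ancestor of $a$, $b$, and $c$; $\Phi_{abcd} = 0$ if there is no common ancestor of $a$, $b$, $c$, and $d$; and $\Phi_{ab,cd} = 0$ in the absence of a common ancestor either for $a$ and $b$ or for $c$ and $d$.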
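The recursions for a single group of individuals (two, three, or four individuals, possibly repeated) follow one pattern: if the youngest individual $a$ occurs $k$ times, the $k$ draws all hit the same allele of $a$ with probability $(1/2)^{k-1}$; otherwise both alleles of $a$ are involved and the father and mother replace $a$. The sketch below implements that pattern under stated assumptions: the toy pedigree `PED`, the helper `depth`, and the function `phi` are hypothetical names, and the two-pairs coefficients are not covered.

```python
from fractions import Fraction
from functools import lru_cache

# Hypothetical toy pedigree: individual -> (father, mother); founders have (None, None).
PED = {
    "A": (None, None), "x1": (None, None), "x2": (None, None),
    "x3": (None, None), "x4": (None, None),
    "s": ("A", "x1"), "c": ("A", "x4"),
    "a": ("s", "x2"), "b": ("s", "x3"),
}

@lru_cache(maxsize=None)
def depth(x):
    """Longest path from a founder to x; a deepest individual is no one's ancestor."""
    known = [p for p in PED[x] if p is not None]
    return 0 if not known else 1 + max(depth(p) for p in known)

def phi(*inds):
    """Probability that one random allele drawn per listed individual are all IBD."""
    if any(i is None for i in inds):
        return Fraction(0)            # the unknown parent of a founder is unrelated to everyone
    return _phi(tuple(sorted(inds)))

@lru_cache(maxsize=None)
def _phi(inds):
    a = max(inds, key=depth)          # recurse on an individual that is not an ancestor of the rest
    k = inds.count(a)
    rest = tuple(i for i in inds if i != a)
    f, m = PED[a]
    same = Fraction(1, 2 ** (k - 1))  # all k draws from a hit the same allele
    # One effective draw from a: pass it to the father or the mother with probability 1/2 each.
    one_draw = (phi(f, *rest) + phi(m, *rest)) / 2 if rest else Fraction(1)
    # Otherwise both alleles of a are involved: one comes from f, the other from m.
    return same * one_draw + (1 - same) * phi(f, m, *rest)
```

With $k = 1, 2, 3, 4$ this reproduces the coefficients $1/2$, $1/4$, and $1/8$ that appear in the recursions for two, three, and four individuals. On the toy pedigree, `phi("a", "b", "c")` evaluates to $1/64$.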

2.2. Identity Coefficients and Condensed Identity Coefficients

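Given two individuals $a$ and $b$ with maternally and paternally derived alleles at a fixed autosomal locus, there are 15 possible identity states, and the probabilities associated with each identity state are called identity coefficients. Ignoring the distinction between maternally and paternally derived alleles, we categorize the 15 possible states to 9 condensed identity states, as shown in Figure 1. The states range from state 1, in which all four alleles are IBD, to state 9, in which none of the four alleles are IBD. The probabilities associated with each condensed identity state are called condensed identity coefficients, denoted by $\{\Delta_i \mid 1 \le i \le 9\}$. The condensed identity coefficients can be computed based on generalized kinship coefficients using the linear transformation shown as follows in (6):

$$
\begin{pmatrix}
1 \\ 2\Phi_{aa} \\ 2\Phi_{bb} \\ 4\Phi_{ab} \\ 8\Phi_{aab} \\ 8\Phi_{abb} \\ 16\Phi_{aabb} \\ 4\Phi_{aa,bb} \\ 16\Phi_{ab,ab}
\end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
2 & 2 & 2 & 2 & 1 & 1 & 1 & 1 & 1 \\
2 & 2 & 1 & 1 & 2 & 2 & 1 & 1 & 1 \\
4 & 0 & 2 & 0 & 2 & 0 & 2 & 1 & 0 \\
8 & 0 & 4 & 0 & 2 & 0 & 2 & 1 & 0 \\
8 & 0 & 2 & 0 & 4 & 0 & 2 & 1 & 0 \\
16 & 0 & 4 & 0 & 4 & 0 & 2 & 1 & 0 \\
4 & 4 & 2 & 2 & 2 & 2 & 1 & 1 & 1 \\
16 & 0 & 4 & 0 & 4 & 0 & 4 & 1 & 0
\end{pmatrix}
\begin{pmatrix}
\Delta_1 \\ \Delta_2 \\ \Delta_3 \\ \Delta_4 \\ \Delta_5 \\ \Delta_6 \\ \Delta_7 \\ \Delta_8 \\ \Delta_9
\end{pmatrix}
\tag{6}
$$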

In our work, we focus on deriving the path-counting formulas for the generalized kinship coefficients, including $\Phi_{abc}$, $\Phi_{abcd}$, and $\Phi_{ab,cd}$.
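To illustrate how condensed identity coefficients are obtained once the generalized kinship coefficients are known, the sketch below solves the nine linear equations exactly. It assumes the standard form of Karigl's relation, in which the vector $(1, 2\Phi_{aa}, 2\Phi_{bb}, 4\Phi_{ab}, 8\Phi_{aab}, 8\Phi_{abb}, 16\Phi_{aabb}, 4\Phi_{aa,bb}, 16\Phi_{ab,ab})$ equals a fixed coefficient matrix times $(\Delta_1, \ldots, \Delta_9)$; the names `KARIGL_MATRIX` and `condensed_identity` are hypothetical.

```python
from fractions import Fraction

# Assumed coefficient matrix; row i gives the weights of Delta_1..Delta_9 in the
# i-th entry of (1, 2Φaa, 2Φbb, 4Φab, 8Φaab, 8Φabb, 16Φaabb, 4Φaa,bb, 16Φab,ab).
KARIGL_MATRIX = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [2, 2, 2, 2, 1, 1, 1, 1, 1],
    [2, 2, 1, 1, 2, 2, 1, 1, 1],
    [4, 0, 2, 0, 2, 0, 2, 1, 0],
    [8, 0, 4, 0, 2, 0, 2, 1, 0],
    [8, 0, 2, 0, 4, 0, 2, 1, 0],
    [16, 0, 4, 0, 4, 0, 2, 1, 0],
    [4, 4, 2, 2, 2, 2, 1, 1, 1],
    [16, 0, 4, 0, 4, 0, 4, 1, 0],
]

def condensed_identity(scaled_kinship):
    """Solve KARIGL_MATRIX * delta = scaled_kinship by exact Gauss-Jordan elimination."""
    n = len(scaled_kinship)
    aug = [[Fraction(v) for v in row] + [Fraction(r)]
           for row, r in zip(KARIGL_MATRIX, scaled_kinship)]
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

# Full sibs with unrelated, non-inbred parents: Φaa = Φbb = 1/2, Φab = 1/4,
# Φaab = Φabb = 1/8, Φaabb = 1/16, Φaa,bb = 1/4, Φab,ab = 3/32.
full_sibs = [1, 1, 1, 1, 1, 1, 1, 1, Fraction(3, 2)]
delta = condensed_identity(full_sibs)
```

For the full-sib input, `delta` has $\Delta_7 = 1/4$, $\Delta_8 = 1/2$, $\Delta_9 = 1/4$ and zeros elsewhere, the familiar values for non-inbred full sibs.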

2.3. Terms Defined for Path-Counting Formulas for Three and Four Individuals

(1) Triple-Common Ancestor. Given three individuals $a$, $b$, and $c$, if $A$ is a common ancestor of the three individuals, then we call $A$ a triple-common ancestor of $a$, $b$, and $c$.

(2) Quad-Common Ancestor. Given four individuals $a$, $b$, $c$, and $d$, if $A$ is a common ancestor of the four individuals, then we call $A$ a quad-common ancestor of $a$, $b$, $c$, and $d$.

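(3) $P(A, a)$. It denotes the set of all possible paths from $A$ to $a$, where the paths can only traverse edges in the direction of parent to child such that $P(A, a) \neq \emptyset$ if and only if $A$ is an ancestor of $a$. $P_{Aa}$ denotes a particular path from $A$ to $a$, where $P_{Aa} \in P(A, a)$.

(4) Path-Pair. It consists of two paths, denoted as $\langle P_{Aa}, P_{Ab} \rangle$, where $P_{Aa} \in P(A, a)$ and $P_{Ab} \in P(A, b)$.

(5) Nonoverlapping Path-Pair. Given a path-pair $\langle P_{Aa}, P_{Ab} \rangle$, it is nonoverlapping if and only if the two paths share no common individuals, except $A$.

(6) Path-Triple. It consists of three paths, denoted as $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$, where $P_{Aa} \in P(A, a)$, $P_{Ab} \in P(A, b)$, and $P_{Ac} \in P(A, c)$.

(7) Path-Quad. It consists of four paths, denoted as $\langle P_{Aa}, P_{Ab}, P_{Ac}, P_{Ad} \rangle$, where $P_{Aa} \in P(A, a)$, $P_{Ab} \in P(A, b)$, $P_{Ac} \in P(A, c)$, and $P_{Ad} \in P(A, d)$.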

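(8) $\mathrm{Bi\_C}(P_{Aa}, P_{Ab})$. It denotes all common individuals shared between $P_{Aa}$ and $P_{Ab}$, except $A$.

(9) $\mathrm{Tri\_C}(P_{Aa}, P_{Ab}, P_{Ac})$. It denotes all common individuals shared among $P_{Aa}$, $P_{Ab}$, and $P_{Ac}$, except $A$.

(10) $\mathrm{Quad\_C}(P_{Aa}, P_{Ab}, P_{Ac}, P_{Ad})$. It denotes all common individuals shared among $P_{Aa}$, $P_{Ab}$, $P_{Ac}$, and $P_{Ad}$, except $A$.

(11) Crossover and 2-Overlap Individual. If $s \in \mathrm{Bi\_C}(P_{Aa}, P_{Ab})$, we call $s$ a crossover individual with respect to $P_{Aa}$ and $P_{Ab}$ if the two paths pass through different parents of $s$. On the other hand, if $P_{Aa}$ and $P_{Ab}$ pass through the same parent of $s$, then we call $s$ a 2-overlap individual with respect to $P_{Aa}$ and $P_{Ab}$.

(12) 3-Overlap Individual. If $s \in \mathrm{Tri\_C}(P_{Aa}, P_{Ab}, P_{Ac})$ and the three paths $P_{Aa}$, $P_{Ab}$, and $P_{Ac}$ pass through the same parent of $s$, then we call $s$ a 3-overlap individual with respect to $P_{Aa}$, $P_{Ab}$, and $P_{Ac}$.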

(13) 2-Overlap Path. If $s$ is a 2-overlap individual with respect to $P_{Aa}$ and $P_{Ab}$, then both $P_{Aa}$ and $P_{Ab}$ pass through the same parent of $s$, denoted by $p$, and the edge from $p$ to $s$ is called an overlap edge. All consecutive overlap edges constitute a path and this path is called a 2-overlap path. If the 2-overlap path extends all the way to the ancestor $A$, we call it a root 2-overlap path.

(14) 3-Overlap Path. It consists of all 3-overlap individuals in a consecutive order. If the 3-overlap path extends all the way to the root $A$, we call it a root 3-overlap path.
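The distinction between crossover and 2-overlap individuals, and the notion of a root 2-overlap path, can be made concrete with a short sketch. It assumes each path is given as a list of individuals from the common ancestor down to the target; the function names and the example paths are hypothetical and are not taken from Figure 2.

```python
def classify_shared(path1, path2):
    """Label every individual shared by two paths (except the common ancestor)."""
    if path1[0] != path2[0]:
        raise ValueError("both paths must start at the same ancestor")
    # For each individual on a path, record the parent through which the path enters it.
    parent1 = {child: par for par, child in zip(path1, path1[1:])}
    parent2 = {child: par for par, child in zip(path2, path2[1:])}
    return {
        s: "2-overlap" if parent1[s] == parent2[s] else "crossover"
        for s in path1[1:] if s in parent2
    }

def root_overlap_path(path1, path2):
    """Consecutive overlap edges starting at the ancestor, or [] if there are none."""
    prefix = [path1[0]]
    for u, v in zip(path1[1:], path2[1:]):
        if u != v:
            break
        prefix.append(u)
    return prefix if len(prefix) > 1 else []

# Hypothetical path-pair: both paths leave A through s, split to p and q, and meet again at t.
p_Aa = ["A", "s", "p", "t", "a"]
p_Ab = ["A", "s", "q", "t", "b"]
labels = classify_shared(p_Aa, p_Ab)
root = root_overlap_path(p_Aa, p_Ab)
```

Here `s` is entered from `A` on both paths, so it is a 2-overlap individual and `A → s` is a root 2-overlap path, while `t` is entered from `p` on one path and from `q` on the other, which makes it a crossover individual.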

Example 1. Consider the path-pairs from to and in Figure 2, where is a common ancestor of and . For path-pair1, , and →→→ is a root 2-overlap path with respect to and . For path-pair4, , where is a crossover individual; is a 2-overlap individual with respect to and , and → is a root 2-overlap path with respect to and .

Example 2. There are four path-quads listed in Figure 3, from to four individuals , , , and , where is a quad-common ancestor of the four individuals. For path-quad2, considering the paths and , the path →→→ is a root 2-overlap path; are 2-overlap individuals with respect to and . For path-quad3, are 3-overlap individuals with respect to , , and , and the path →→→ is a root 3-overlap path.

We summarize all the conceptual terms used in the path-counting formulas for two, three, and four individuals in Table 1, which gives a first view, at the level of terminology, of our framework for generalizing Wright’s formula to three and four individuals.

2.4. An Overview of Path-Counting Formula Derivation

According to Wright’s path-counting formula [16] (see (2)) for two individuals $a$ and $b$, the path-counting approach requires identifying common ancestors of $a$ and $b$ and calculating the contribution of each common ancestor to $\Phi_{ab}$. More specifically, for each common ancestor, denoted as $A$, we obtain all path-pairs from $A$ to $a$ and $b$ and identify acceptable path-pairs. For $\Phi_{ab}$, an acceptable path-pair $\langle P_{Aa}, P_{Ab} \rangle$ is a nonoverlapping path-pair where the two paths share no common individuals, except $A$. In Figure 2, path-pair2 is an acceptable path-pair, while path-pair1, path-pair3, and path-pair4 are not acceptable path-pairs. The contribution of each common ancestor $A$ to $\Phi_{ab}$ is computed based on the inbreeding coefficient of $A$, modified by the length of each acceptable path-pair.

To compute $\Phi_{abc}$, the path-counting approach requires identifying all triple-common ancestors of $a$, $b$, and $c$ and summing up all triple-common ancestors’ contributions to $\Phi_{abc}$. For each triple-common ancestor, denoted as $A$, we first identify all path-triples each of which consists of three paths from $A$ to $a$, $b$, and $c$, respectively. Some examples of path-triples are presented in Figure 2.

For $\Phi_{ab}$, only nonoverlapping path-pairs are acceptable. A path-triple $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$ consists of three path-pairs $\langle P_{Aa}, P_{Ab} \rangle$, $\langle P_{Aa}, P_{Ac} \rangle$, and $\langle P_{Ab}, P_{Ac} \rangle$. For $\Phi_{abc}$, a path-triple might be acceptable even though either 2-overlap individuals or crossover individuals exist between a path-pair. The main challenge we need to address is finding necessary and sufficient conditions for acceptable path-triples.

To identify acceptable path-triples, we first use a systematic method to generate all possible cases for a path-pair by considering the different types of common individuals shared between the two paths. Then, we introduce building blocks, which are connected graphs with conditions on every edge that encapsulate a set of acceptable cases of path-pairs. In each building block, we represent paths as nodes and interactions (i.e., shared common individuals between two paths) as edges. There are at least two paths in a building block. For each building block, we obtain all acceptable cases for the path-pairs concerned. A given path-triple can be decomposed into one or multiple building blocks. When two building blocks share a path-pair, we use the natural join operator from relational algebra to match the acceptable cases for the shared path-pair between the two building blocks. In other words, taking the acceptable cases for building blocks as inputs, we use the natural join operator to construct all acceptable cases for a path-triple. Acceptable cases for a path-triple are identified and then used in deriving the path-counting formula for $\Phi_{abc}$.
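The natural-join step can be sketched with small in-memory relations. Everything concrete here is an assumption made for illustration: each row maps a path-pair name (`"ab"`, `"bc"`, `"ac"`) to an edge label, the labels `"T"`, `"TX"`, and `"X"` stand for root 2-overlap, root 2-overlap followed by crossover, and crossover, and the only rule encoded in a building block is that two edges meeting at a path cannot both be root overlap. The decomposition into three two-edge blocks is also hypothetical.

```python
from itertools import product

OPTIONS = ("T", "TX", "X")    # assumed edge labels
ROOT = {"T", "TX"}            # labels that involve a root 2-overlap path

def two_edge_block(edge1, edge2):
    """Acceptable cases of a two-edge building block: not both edges root overlap."""
    return [
        {edge1: l1, edge2: l2}
        for l1, l2 in product(OPTIONS, repeat=2)
        if not (l1 in ROOT and l2 in ROOT)
    ]

def natural_join(r1, r2):
    """Combine rows that agree on every attribute (path-pair) they have in common."""
    joined = []
    for x in r1:
        for y in r2:
            if all(x[k] == y[k] for k in x.keys() & y.keys()):
                joined.append({**x, **y})
    return joined

# A scenario with three edges, decomposed into three overlapping two-edge blocks.
cases = natural_join(
    natural_join(two_edge_block("ab", "bc"), two_edge_block("bc", "ac")),
    two_edge_block("ab", "ac"),
)
```

Under these assumptions the join leaves seven cases, each with at most one root-overlap edge and crossover on the remaining edges.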

Then, we summarize all the main procedures used for deriving the path-counting formula for $\Phi_{abc}$ in a flowchart shown in Figure 4. The main procedures are also applicable for deriving the path-counting formulas for $\Phi_{abcd}$ and $\Phi_{ab,cd}$.

3. Results and Discussion

3.1. Path-Counting Formulas for Three Individuals

We first introduce a systematic method to generate all possible cases for a path-pair. Then we discuss building blocks for path-triples and identify all acceptable cases which are used in deriving the path-counting formula for $\Phi_{abc}$.

3.1.1. Cases for a Path-Pair

Given a path-pair $\langle P_{Aa}, P_{Ab} \rangle$, where $A$ is a common ancestor of $a$ and $b$ and $\mathrm{Bi\_C}(P_{Aa}, P_{Ab})$ consists of all common individuals shared between $P_{Aa}$ and $P_{Ab}$, except $A$, we introduce three patterns (i.e., crossover, 2-overlap, and root 2-overlap) to generate all possible cases for $\mathrm{Bi\_C}(P_{Aa}, P_{Ab})$.
(1) $X(P_{Aa}, P_{Ab})$: $P_{Aa}$ and $P_{Ab}$ share one or multiple crossover individuals.
(2) $T(P_{Aa}, P_{Ab})$: $P_{Aa}$ and $P_{Ab}$ are root 2-overlapping from $A$, and the root 2-overlap path can have one or multiple 2-overlap individuals.
(3) $Y(P_{Aa}, P_{Ab})$: $P_{Aa}$ and $P_{Ab}$ are overlapping but not from $A$, and the 2-overlap path can have one or multiple 2-overlap individuals.

Based on the three patterns, crossover ($X$), root 2-overlap ($T$), and 2-overlap not starting from $A$ ($Y$), we use regular expressions to generate all possible cases for the path-pair $\langle P_{Aa}, P_{Ab} \rangle$. For convenience, we drop the arguments $(P_{Aa}, P_{Ab})$ and write $X$, $T$, and $Y$ for the three patterns whenever there is no confusion. When the two paths share at least one individual other than $A$, the eight cases shown in (7) cover all possible cases for $\langle P_{Aa}, P_{Ab} \rangle$. The completeness of the eight cases shown in (7) can be proved by induction on the total number of $X$, $T$, and $Y$ appearing in the regular expression. Using the pedigree in Figure 2, Cases 1–3 and Case 6 are illustrated in (8), (9), (10), and (11). In (8) (Case 1), the shared individuals are 2-overlap individuals and the overlap path is a root 2-overlap path. In (9) (Case 2), one shared individual is a 2-overlap individual whose overlap path is a root 2-overlap path, and another is a crossover individual. In (10) (Case 3), the shared individual is a crossover individual. In (11) (Case 6), one shared individual is a crossover individual and another is a 2-overlap individual whose overlap path is a 2-overlap path.

3.1.2. Path-Pair Level Graphical Representation of a Path-Triple

Given a path-triple $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$, we represent each path as a node. The path-triple can be decomposed to three path-pairs (i.e., $\langle P_{Aa}, P_{Ab} \rangle$, $\langle P_{Aa}, P_{Ac} \rangle$, and $\langle P_{Ab}, P_{Ac} \rangle$). For each path-pair, if the two paths share at least one common individual (i.e., either 2-overlap individual or crossover individual), except $A$, then there is an edge between the two nodes representing the two paths. Therefore, we obtain four different scenarios, shown in Figure 5.

In Figure 5, the scenario that has no edges means that the path-triple consists of three independent paths. In Figure 2, path-triple1 is an example of this scenario. Next, we introduce a lemma which helps identify the options for the edges in the scenarios.

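Lemma 3. Given a path-triple $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$, consider the three path-pairs $\langle P_{Aa}, P_{Ab} \rangle$, $\langle P_{Aa}, P_{Ac} \rangle$, and $\langle P_{Ab}, P_{Ac} \rangle$. If there is a 2-overlap edge that does not belong to a root 2-overlap path (pattern $Y$ in the regular expression representation) in any of the three path-pairs, then the path-triple has no contribution to $\Phi_{abc}$.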

Proof. In [17], Nadot and Vaysseix proposed, from a genetic and biological point of view, that $\Phi_{abc}$ can be evaluated by enumerating all eligible inheritance paths at allele-level starting from a triple-common ancestor $A$ to the three individuals $a$, $b$, and $c$.
For the pedigree in Figure 6, let us consider the path-triple listed as follows. ; ; .
For , is a crossover individual, is an overlap individual, and is a 2-overlap edge represented by in regular expression representation (see the definition for in Section 3.1.1).
For the individual , let us denote the two alleles at one fixed autosomal locus as and . At allele-level, only one allele can be passed down from to . Since and are parents of is passed down from one parent, and is passed down from the other parent. It is infeasible to pass down both and from to . In other words, there are no corresponding inheritance paths for the path-triple with a 2-overlap edge between (i.e., Case 6: ). Therefore, such kind of path-triples has no contribution to .

Figure 6: (a) Pedigree; (b) Inheritance paths.

Figure 6(b) shows one example of eligible inheritance paths corresponding to a pedigree graph. Each individual is represented by two allele nodes. The eligible inheritance paths in Figure 6(b) consist of red edges only.

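Only Case 1, Case 2, and Case 3 do not contain a non-root 2-overlap in the regular expression representation of a path-pair (see (7)). Considering the scenarios shown in Figure 5, an edge can therefore have three options: root 2-overlap ($T$), root 2-overlap followed by crossover ($TX$), or crossover ($X$).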

3.1.3. Constructing Cases for a Path-Triple

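For the scenarios in Figure 5, we define two building blocks along with some rules in Figure 7 to generate acceptable cases. For the building block with a single edge, the edge can have three options: root 2-overlap, root 2-overlap followed by crossover, or crossover. For the building block with two edges, we cannot allow both edges to be root overlap, because if two edges are root overlap, then the two paths that are not joined by an edge must share at least one common individual, except $A$, which contradicts the fact that these two paths have no edge.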

Next, we focus on generating all acceptable cases for the scenarios in Figure 5, where only contains more than one building block. In order to leverage the dependency among building blocks, we decompose to , shown in Figure 8. For each , we have a set of acceptable path-triples, denoted as .

Considering the dependency among , we use the natural join operator, denoted as , operating on to generate all acceptable cases for . As a result, we obtain , where denotes the acceptable cases of the path-triple in the scenario .

For each scenario in Figure 5, we generate all acceptable cases for the path-triple. The scenario that has no edges consists of three independent paths, while, for the other scenarios, the edges can have two options:
(1) all edges belong to crossover; or
(2) one edge belongs to root 2-overlap; the remaining edges belong to crossover.

In summary, acceptable path-triples can have at most one root 2-overlap path and any number of crossover individuals, but no 2-overlap path other than a root 2-overlap path.

3.1.4. Splitting Operator

Considering the existence of root 2-overlap path and crossover in acceptable path-triples, we propose a splitting operator to transform a path-triple with crossover individuals to a noncrossover path-triple without changing the contribution from this path-triple to . The main purpose of using the splitting operator is to simplify the path-counting formula derivation process. We first use an example in Figure 9 to illustrate how the splitting operator works. In Figure 9, there is a crossover individual between and in the path triple in . The splitting operator proceeds as follows:(1)split the node to two nodes, and ;(2)transform the edges and to and , respectively;(3)add two new edges, and .

Lemma 4. Given a pedigree graph $G$ having crossover individuals regarding $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$ shown in Figure 9, let $s$ denote the lowest crossover individual, where no descendant of $s$ can be a crossover individual among the three paths $P_{Aa}$, $P_{Ab}$, and $P_{Ac}$. After using the splitting operator for the lowest crossover individual $s$ in $G$, the number of crossover individuals in $G$ is decreased by 1.

Proof. The splitting operator only affects the edges from to and . If there is a new crossover node appearing, the only possible node is either or . Assume becomes a crossover individual; it means that is able to reach and from two separate paths. It contradicts the fact that is the lowest crossover individual between and .

Next, we introduce a canonical graph which results from applying the splitting operator for all crossover individuals. The canonical graph has zero crossover individual.

Definition 5 (Canonical Graph). Given a pedigree graph $G$ having one or more crossover individuals regarding $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$, if there exists a graph $G'$ which has no crossover individuals with regard to $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$ such that
(i) any acceptable path-triple in $G$ has an acceptable path-triple in $G'$ which has the same contribution to $\Phi_{abc}$ as the one in $G$;
(ii) any acceptable path-triple in $G'$ has an acceptable path-triple in $G$ which has the same contribution to $\Phi_{abc}$ as the one in $G'$,
then we call $G'$ a canonical graph of $G$ regarding $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$.

Lemma 6. For a pedigree graph $G$ having one or more crossover individuals regarding $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$, there exists a canonical graph $G'$ for $G$.

Proof (Sketch). The proof is by induction on the number of crossover individuals.
Induction hypothesis: assume that if $G$ has $k$ or fewer crossovers, there is a canonical graph $G'$ for $G$.
In the induction step, let $G$ be a graph with $k + 1$ crossovers; let $s$ be the lowest crossover between two of the paths in $G$. We apply the splitting operator on $s$ in $G$ and obtain a graph having $k$ crossovers by Lemma 4, to which the induction hypothesis applies.

3.1.5. Path-Counting Formula for

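Now, we present the path-counting formula for $\Phi_{abc}$:

$$
\Phi_{abc} = \sum_{A} \left( \sum_{\text{Type 1}} \left(\tfrac{1}{2}\right)^{L_{\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle}} \Phi_{AAA} \;+\; \sum_{\text{Type 2}} \left(\tfrac{1}{2}\right)^{L_{\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle} + 1} \Phi_{AA} \right),
\tag{12}
$$

where $\Phi_{AAA} = \tfrac{1}{4}(1 + 3F_A)$ and $\Phi_{AA} = \tfrac{1}{2}(1 + F_A)$; $F_A$: the inbreeding coefficient of $A$; $A$: a triple-common ancestor of $a$, $b$, and $c$; Type 1: $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$ has zero root 2-overlap path; Type 2: $\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle$ has one root 2-overlap path $P_{As}$ ending at the individual $s$; and

$$
L_{\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle} =
\begin{cases}
L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} & \text{for Type 1}, \\
L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} - L_{P_{As}} & \text{for Type 2},
\end{cases}
$$

with $L_{P_{Aa}}$: the length of the path $P_{Aa}$ (also applicable for $L_{P_{Ab}}$, $L_{P_{Ac}}$, and $L_{P_{As}}$).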
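Two small hand computations illustrate the two types of path-triples. They are a sketch on hypothetical pedigrees (not taken from the figures), assuming that a Type 1 path-triple contributes $(1/2)^{L}\,\Phi_{AAA}$ with $L$ the sum of the three path lengths, and that a Type 2 path-triple contributes $(1/2)^{L+1}\,\Phi_{AA}$ with $L$ reduced by the length of the root 2-overlap path. In both pedigrees $A$ is a non-inbred founder and every other parent is an unrelated founder, so $A$ is the only triple-common ancestor.

```latex
% Pedigree 1 (Type 1): A has three children a, b, c by three unrelated mates.
% The only path-triple is <A->a, A->b, A->c>; the paths share no individual except A.
\Phi_{abc}
  = \left(\tfrac12\right)^{1+1+1} \Phi_{AAA}
  = \tfrac18 \cdot \tfrac14 (1 + 3 \cdot 0)
  = \tfrac{1}{32}.

% Pedigree 2 (Type 2): A has children s and c; s has children a and b.
% The only path-triple is <A->s->a, A->s->b, A->c>; A->s is a root 2-overlap
% path ending at s, with L_{P_{As}} = 1.
\Phi_{abc}
  = \left(\tfrac12\right)^{(2+2+1-1)+1} \Phi_{AA}
  = \tfrac{1}{32} \cdot \tfrac12 (1 + 0)
  = \tfrac{1}{64}.
```

Both values agree with what the recursive formulas for three individuals give on the same pedigrees: in the first, $\Phi_{abc} = \tfrac12 \Phi_{Abc} = \tfrac14 \Phi_{AAc} = \tfrac18 \Phi_{AAA}$; in the second, $\Phi_{abc} = \tfrac12 \Phi_{sbc} = \tfrac14 \Phi_{ssc} = \tfrac18 \Phi_{sc} = \tfrac{1}{64}$.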

For completeness, the path-counting formula for $\Phi_{aab}$ is given in Appendix A, and the correctness proof of the path-counting formula is given in Appendix B.

3.2. Path-Counting Formulas for Four Individuals

3.2.1. Path-Pair Level Graphical Representation of

Given a path-quad and , the path-quad can have 11 scenarios shown in Figure 10 where all four paths are considered symmetrically.

In Figure 11, we introduce three building blocks . For and , the rules presented in Figure 7 are also applicable for Figure 11. For , we only consider root overlap, because the crossover individuals can be eliminated by using the splitting operator introduced in Section 3.1.4. Note that for , if , then it is equivalent to the scenario in Figure 8 Therefore, we only need to consider when .

3.2.2. Building Block-Based Cases Construction for

For a scenario in Figure 11, we first decompose to one or multiple building blocks. For a scenario , it has only one building block, and all acceptable cases can be obtained directly. For , there is no need to consider the conflict between the edges in and because and are disconnected. Let denote all acceptable cases of the path-pairs in , and let denote all acceptable cases for . Therefore, we obtain where denotes the Cartesian product operator from relational algebra.

For , we obtain . For , we define the largest subgraph of based on which we construct .

Definition 7 (Largest Subgraph). Given a scenario and , the largest subgraph of , denoted as , is defined as follows:(1) is a proper subgraph of ;(2)if contains , then must also contain ;(3)no such exists that is a proper subgraph of while is also a proper subgraph of .

For each scenario and , we list the largest subgraph of , denoted as , in Table 2.

For a scenario and , let denote the set of building blocks in but not in , where is the largest subgraph of . Let and denote the number of edges in and , respectively. According to Table 2, we can conclude that . In order to leverage the dependency among building blocks, we consider only in . For example, . Let denote all acceptable cases for . And let denote the set of acceptable cases for . Then, we can use and Diff to construct all acceptable cases for . Then, we apply this idea for constructing all acceptable cases for each in Table 2.

Given a path-quad $\langle P_{Aa}, P_{Ab}, P_{Ac}, P_{Ad} \rangle$, an acceptable case has the following properties:
(1) if there is one root 3-overlap path, there can be at most one root 2-overlap path;
(2) otherwise, there can be at most two root 2-overlap paths.

3.2.3. Path-Counting Formula for

Now, we present the path-counting formula for as follows: where , , : the inbreeding coefficient of , : a quad-common ancestor of , , , and , Type 1: zero root 2-overlap and zero root 3-overlap path, Type 2: one root 2-overlap path ending at and : the length of the path (also applicable for , etc.).

For completeness, the path-counting formulas for $\Phi_{aabc}$ and $\Phi_{aaab}$ are presented in Appendix A. The correctness of the path-counting formula for four individuals is proven in Appendix C.

3.3. Path-Counting Formulas for Two Pairs of Individuals

3.3.1. Terminology and Definitions

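(1) 2-Pair-Path-Pair. It consists of two pairs of path-pairs denoted as $\langle (P_{Aa}, P_{Ab}), (P_{Bc}, P_{Bd}) \rangle$, where $P_{Aa} \in P(A, a)$, $P_{Ab} \in P(A, b)$, $P_{Bc} \in P(B, c)$, and $P_{Bd} \in P(B, d)$, $A$ is a common ancestor of $a$ and $b$, and $B$ is a common ancestor of $c$ and $d$. If $A = B$, then $A$ is a quad-common ancestor of $a$, $b$, $c$, and $d$.

(2) Homo-Overlap and Heter-Overlap Individual. Given two pairs of individuals $(a, b)$ and $(c, d)$, if $s$ is shared by $P_{Aa}$ and $P_{Ab}$ (or by $P_{Bc}$ and $P_{Bd}$), we call $s$ a homo-overlap individual when $P_{Aa}$ and $P_{Ab}$ (or $P_{Bc}$ and $P_{Bd}$) pass through the same parent of $s$. If $s$ is shared by two paths $P_1$ and $P_2$, where $P_1 \in \{P_{Aa}, P_{Ab}\}$ and $P_2 \in \{P_{Bc}, P_{Bd}\}$, we call $s$ a heter-overlap individual when $P_1$ and $P_2$ pass through the same parent of $s$.

(3) Root Homo-Overlap and Heter-Overlap Path. Given a 2-pair-path-pair with a quad-common ancestor $A$, if $s$ is a homo-overlap individual and the homo-overlap path extends all the way to the quad-common ancestor $A$, then we call it a root homo-overlap path. If $s$ is a heter-overlap individual and the heter-overlap path extends all the way to the quad-common ancestor $A$, then we call it a root heter-overlap path.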

Example 8. is quad-common ancestor for , , , and in Figure 12. For (a), is a homo-overlap individual between and .
is a homo-overlap individual between and . And, and are root homo-overlap paths. For (b), is a heter-overlap individual between and . is a heter-overlap individual between and . And and are root heter-overlap paths.


3.3.2. Path-Counting Formula for

Now, we present a path-pair level graphical representation for shown in Figure 13. The options for an edge can be . (Refer to Section 3.1.1 for definitions of , and ). Based on the different types of presented in (14), all cases for are summarized in Table 3, where is the last individual of a root homo-overlap path (i.e., the path ending at ) and and are the last individuals of root heter-overlap paths and , respectively.

Given a pedigree graph having one or multiple progenitors , we define that the generation of a progenitor is 0, denoted as . If an individual has only one parent , then we define . If an individual has two parents and , we define .
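The generation of every individual can be computed in one pass over the pedigree. The sketch below is only an illustration under an assumed convention, since the defining expressions are not reproduced above: an individual with one known parent is placed one generation below that parent, and an individual with two parents is placed one generation below the deeper of the two (a longest-path convention). The pedigree and the function name are hypothetical.

```python
def generations(ped):
    """Map each individual to its generation; progenitors are generation 0.

    ped maps an individual to a (father, mother) tuple, with None for an unknown parent.
    Assumed rule: generation = 1 + the largest generation among the known parents.
    """
    gen = {}

    def visit(x):
        if x not in gen:
            known = [p for p in ped[x] if p is not None]
            gen[x] = 0 if not known else 1 + max(visit(p) for p in known)
        return gen[x]

    for x in ped:
        visit(x)
    return gen

# Hypothetical pedigree: z's parents are in generation 1, h has one known parent,
# and w has parents from generations 0 and 2.
toy = {
    "P": (None, None), "Q": (None, None),
    "s1": ("P", "Q"), "s2": ("P", "Q"),
    "z": ("s1", "s2"),
    "h": ("z", None),
    "w": ("P", "z"),
}
gen = generations(toy)
```

With this rule `w`, whose parents are a progenitor and an individual of generation 2, is assigned generation 3.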

The path-counting formula for is as follows: where : a quad-common ancestor of , , , and , : a common ancestor of and , and : a common ancestor of and . For , there are four types (i.e., Type 1 to Type 4). Type 1: zero root homo-overlap and zero root heter-overlap. Type 2: zero root homo-overlap and one root heter-overlap ending at , Type 4: one root homo-overlap ending at and two root heter-overlap ending at and, and. For , there is one type (i.e., Type 5). Type 5: has zero overlap individual, has zero overlap individual.

At most one path-pair can have crossover individuals.

Between a path from and a path from , there are no overlap individuals, but there can be crossover individuals, , where and :

Note that if and have zero quad-common ancestors, we have the following formula for : Type 6: is a nonoverlapping path-pair and is a nonoverlapping path-pair. Between a path from and a path from , there are no overlap individuals, but there can be crossover individuals.

and are defined as in Type 5.

The correctness of the path-counting formula for $\Phi_{ab,cd}$ is proven in Appendix C. For completeness, please refer to [18] for the path-counting formulas for the remaining generalized kinship coefficients for two pairs of individuals.

3.4. Experimental Results

In this section, we show the efficiency of our path-counting method using NodeCodes for computing condensed identity coefficients by comparing it with the recursive method of [10]. We implemented two methods: (1) using recursive formulas to compute each required kinship coefficient and generalized kinship coefficient; (2) using the path-counting method coupled with NodeCodes to compute each required kinship coefficient and generalized kinship coefficient independently. We refer to the first method as Recursive and to the second method as NodeCodes. For completeness, please refer to [18] for the details of the NodeCodes-based method.

The NodeCodes of a node are a set of labels, each representing a path to the node from its ancestors. Given a pedigree graph, let $r$ be the progenitor (i.e., the node with 0 in-degree). (For simplicity, we assume there is one progenitor, $r$, as the ancestor of all individuals in the pedigree. Otherwise, a virtual node $r$ can be added to the pedigree graph and all progenitors can be made children of $r$.) For each node $u$ in the graph, the set of NodeCodes of $u$, denoted as $NC(u)$, is assigned using a breadth-first-search traversal starting from $r$ as follows.
(1) If $u$ is $r$, then $NC(r)$ contains only one element: the empty string.
(2) Otherwise, let $u$ be a node with $NC(u)$, and let $v_0, v_1, \ldots, v_k$ be $u$’s children in sibling order; then for each $x$ in $NC(u)$, a code $x\,i\,{*}$ is added to $NC(v_i)$, where $0 \le i \le k$, and ${*}$ indicates the gender of the individual represented by node $v_i$.
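The labeling can be sketched in a few lines. This is an illustration under stated assumptions rather than the scheme of [18]: the concrete text format of a code (sibling index, gender letter, and a trailing dot as separator) is invented here, the function names are hypothetical, and nodes are processed only after all of their parents so that each parent's code set is complete before it is extended.

```python
from collections import deque

def assign_nodecodes(children, gender, root):
    """Return {node: set of codes}; each code spells one path from the root.

    children maps a node to its children in sibling order; gender maps a node to a letter.
    """
    parents_left = {}
    for kids in children.values():
        for v in kids:
            parents_left[v] = parents_left.get(v, 0) + 1
    codes = {root: {""}}                      # the root carries only the empty string
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for i, v in enumerate(children.get(u, [])):
            # Extend every code of u with the sibling index and gender of v.
            codes.setdefault(v, set()).update(f"{x}{i}{gender[v]}." for x in codes[u])
            parents_left[v] -= 1
            if parents_left[v] == 0:          # all parents done: v's code set is complete
                queue.append(v)
    return codes

def is_ancestor(codes, u, v):
    """u is a proper ancestor of v iff some code of u is a prefix of some code of v."""
    return u != v and any(cv.startswith(cu) for cu in codes[u] for cv in codes[v])

# Hypothetical pedigree with a virtual root r above the two progenitors P and Q.
children = {"r": ["P", "Q"], "P": ["s1", "s2"], "Q": ["s1", "s2"], "s1": ["z"], "s2": ["z"]}
gender = {"P": "M", "Q": "F", "s1": "M", "s2": "F", "z": "F"}
codes = assign_nodecodes(children, gender, "r")
```

In this example `z` receives four codes, one for each path from the virtual root, and ancestor queries reduce to prefix tests on the code sets.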

The computations of kinship coefficients for two individuals and generalized kinship coefficients for three individuals presented in [11, 12, 14, 15] use NodeCodes. The NodeCodes-based computation schemes can also be applied to the generalized kinship coefficients for four individuals and two pairs of individuals. For completeness, please refer to [18] for the details of using NodeCodes to compute the generalized kinship coefficients for four individuals and two pairs of individuals based on our proposed path-counting formulas in Sections 3.2 and 3.3.

In order to test the scalability of our approach for calculating condensed identity coefficients on large pedigrees, we used a population simulator implemented in [11] to generate arbitrarily large pedigrees. The population simulator is based on the algorithm for generating populations with overlapping generations in Chapter 4 of [19] along with the parameters given in Appendix of [20] to model the relatively isolated Finnish Kainuu subpopulation and its growth during the years 1500–2000. An overview of the generation algorithm was presented in [11, 12, 14]. The parameters include starting/ending year, initial population size, initial age distribution, marriage probability, maximum age at pregnancy, expected number of children by time period, immigration rate, and probability of death by time period and age group.

We examine the performance of computing condensed identity coefficients using twelve synthetic pedigrees, which range from 75 individuals to 195,197 individuals. The smallest pedigree spans 3 generations, and the largest pedigree spans 19 generations. We analyzed the effects of pedigree size and of the depth of individuals in the pedigree (the longest path between the individual and a progenitor) on the improvement in computation efficiency.
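The depth measure used here, the longest path between an individual and a progenitor, can be computed with a short memoized recursion. The sketch below is illustrative only; the helper name `make_depth` and the `parents` dictionary layout are assumptions, not part of the experimental code.

```python
from functools import lru_cache

def make_depth(parents):
    """Return a function that gives the depth of an individual.

    parents: dict mapping an individual to a tuple of its known parents;
             progenitors are either absent or mapped to an empty tuple.
    """
    @lru_cache(maxsize=None)
    def depth(individual):
        known = parents.get(individual, ())
        if not known:
            return 0                                  # a progenitor has depth 0
        return 1 + max(depth(p) for p in known)       # longest path upward
    return depth

# Hypothetical pedigree: c is a child of founders a and b; d is a child of c and a.
example_parents = {"c": ("a", "b"), "d": ("c", "a")}
depth = make_depth(example_parents)
print(depth("a"), depth("c"), depth("d"))
```

Note that d has a parent at depth 0 and a parent at depth 1; the longest path determines its depth.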

In the first experiment, 300 random pairs were selected from each of our 12 synthetic pedigrees. Figure 14 shows the computation efficiency improvement for each pedigree. As can be seen, the improvement of NodeCodes over Recursive grew as the pedigree size increased, from 26.83% on the smallest pedigree to 94.75% on the largest pedigree. It also shows that the path-counting method coupled with NodeCodes scales well to large pedigrees when computing condensed identity coefficients.

In our next experiment, we examined the effect of the depth of the individual in the pedigree on the query time. For each depth, we generated 300 random pairs from the largest synthetic pedigree.

Figure 15 shows the effect of depth on the computation efficiency improvement. We can see the improvement of NodeCodes over Recursive, ranging from 86.48% to 91.30%.

4. Conclusion

We have introduced a framework for generalizing Wright’s path-counting formula to more than two individuals. Aiming at efficiently computing condensed identity coefficients, we proposed path-counting formulas (PCF) for all generalized kinship coefficients that are sufficient for expressing condensed identity coefficients as a linear combination. We also performed experiments to compare the efficiency of our method with that of the recursive method for computing condensed identity coefficients on large pedigrees. Our future work includes (i) further improving the computation of condensed identity coefficients by collectively calculating the set of generalized kinship coefficients to avoid redundant computations, and (ii) experiments using PCF in conjunction with encoding schemes (e.g., compact path-encoding schemes [13]) for computing condensed identity coefficients on very large pedigrees.

Appendices

A. Path-Counting Formulas of Special Cases

A.1. Path-Counting Formula for

For , we introduce a special case, where and are mergeable.

Definition A.1 (Mergeable Path-Pair). A path-pair is mergeable if and only if the two paths and are completely identical.

Next, we present a graphical representation of in Figure 16.

Lemma A.2. For and in Figure 16, cannot be a mergeable path-pair.

Proof. For and , if is mergeable, then any common individual between and is also a shared individual between and . It means which contradicts the fact that .
Considering all three scenarios in Figure 16, only can have a mergeable path-pair by Lemma A.2. Now, we present our path-counting formula for where is not an ancestor of : where : a common ancestor of and .
When is not mergeable:
Type 1: has no root 2-overlap.
Type 2: has one root 2-overlap path ending at the individual .
When is mergeable:
Type 3: is a nonoverlapping path-pair.
For the sake of completeness, if is an ancestor of , there is no recursive formula for in [10], but we can use either the recursive formula for or the path-counting formula for to compute .

A.2. Path-Counting Formula for

Given a path-quad , if is not mergeable, then we process the path-quad as equivalent to . If is mergeable, the path-quad can be condensed to scenarios for .

Now, we present a path-counting formula for where is not an ancestor of and as follows: where : a quad-common ancestor of , , , and .

When is not mergeable:
Type 1: zero root 2-overlap and zero root 3-overlap path;
Type 2: one root 2-overlap path ending at
When is mergeable:
Type 4: has zero root 2-overlap path;
Type 5: has one root 2-overlap path ending at

Note that if is an ancestor of either or , or both of them, then the path-counting formula of is applicable to compute .

A.3. Path-Counting Formula for

A special case of for is introduced when is mergeable. With the existence of a mergeable path-triple, can be condensed to .

Definition A.3 (Mergeable Path-Triple). Given three paths , , and , they are mergeable if and only if they are completely identical.

Lemma A.4. Given a path-quad , there must be at least one mergeable path-pair among , , .

Proof. For an individual with two parents and , the paternal allele of the individual is transmitted from and the maternal allele is transmitted from . At allele level, only two descent paths starting from an ancestor are allowed. For a path-quad , there must be at least one mergeable path-pair among , , and .

For simplicity, we treat as a default mergeable path-pair.

Now, we present the path-counting formula for where is not an ancestor of as follows: where : a common ancestor of and .

When there is only one mergeable path-pair (let us consider as the mergeable path-pair), Type 1: has zero root 2-overlap path, Type 2: has one root 2-overlap path ending at .

When is mergeable, Type 3: is nonoverlapping

Note that if is an ancestor of , we treat . Then, we apply the path-counting formula for to compute .

B. Proof for Path-Counting Formulas of Three Individuals

We first demonstrate that, for one triple-common ancestor , the path-counting computation of is equivalent to the computation using recursive formulas. Then, we prove the correctness of the path-counting computation for multiple triple-common ancestors.

B.1. One Triple-Common Ancestor

Considering the different types of path-triples starting from a triple-common ancestor in a pedigree graph contributing to and , can have 5 different cases:

Based on the 5 cases from Case 2.1 to Case 3.2, we first construct a dependency graph shown in Figure 17, consistent with the recursive formulas (3), (4), and (5) for the generalized kinship coefficients for three individuals.
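For readers who want to experiment with the recursive side of this equivalence, the following Python sketch implements the recursions for two and three individuals as they are commonly stated from Karigl [10]. That these match formulas (3), (4), and (5) term by term is an assumption, as are the function names, the `parents` dictionary layout, and the toy pedigree. The recursion always expands the deepest individual, which guarantees that it is not an ancestor of the others.

```python
from fractions import Fraction
from functools import lru_cache

def make_coefficients(parents):
    """parents: dict child -> (father, mother); founders are absent from the dict."""
    half = Fraction(1, 2)

    @lru_cache(maxsize=None)
    def depth(x):
        return 0 if x not in parents else 1 + max(depth(p) for p in parents[x])

    @lru_cache(maxsize=None)
    def phi2(a, b):
        if a == b:
            if a not in parents:
                return half                       # non-inbred founder
            f, m = parents[a]
            return half * (1 + phi2(f, m))
        if depth(a) < depth(b):
            a, b = b, a                           # expand the deeper individual
        if a not in parents:
            return Fraction(0)                    # two distinct founders
        f, m = parents[a]
        return half * (phi2(f, b) + phi2(m, b))

    @lru_cache(maxsize=None)
    def phi3(a, b, c):
        # Deepest individual first; equal individuals end up adjacent.
        # Names must be mutually comparable (for example, all strings).
        a, b, c = sorted((a, b, c), key=lambda x: (depth(x), x), reverse=True)
        if a not in parents:                      # then all three are founders
            return Fraction(1, 4) if a == b == c else Fraction(0)
        f, m = parents[a]
        if a == b == c:                           # same individual three times
            return Fraction(1, 4) * (1 + 3 * phi2(f, m))
        if a == b:                                # deepest individual twice
            return half * (phi2(a, c) + phi3(f, m, c))
        return half * (phi3(f, b, c) + phi3(m, b, c))

    return phi2, phi3

# Hypothetical pedigree: A, B, D are full siblings; C is a child of A and B.
PEDIGREE = {"A": ("F", "M"), "B": ("F", "M"), "D": ("F", "M"), "C": ("A", "B")}
phi2, phi3 = make_coefficients(PEDIGREE)
print(phi2("A", "B"), phi3("A", "A", "B"), phi3("A", "B", "D"))
```

Exact fractions are used so that results from a path-counting computation can be compared with these values without rounding error.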

Then, we take the following steps to prove the correctness of the path-counting formulas (12) and (A.1):
(i) for the kinship coefficient of two individuals, the correctness of the path-counting formula (i.e., Wright’s formula) is proven in [21]. For Case 2.1 and Case 2.2, the correctness is proven based on the correctness of Cases 3.1 and 3.2;
(ii) Case 2.3 has no cycle and depends only on the kinship coefficient of two individuals. Thus, we prove the correctness of Case 2.3 by transforming the case to that coefficient;
(iii) for Cases 3.1 and 3.2, the correctness is proven by induction on the number of edges in the pedigree graph.

B.1.1. Correctness Proof for Case 3.1

Case 3.1. For , does not have any path triples with root overlap.

Proof (Basis). There are two basic scenarios: (i) one individual is a parent of another; (ii) no individual is a parent of another, among , , and .
Using the recursive formula (3) to compute , for Figure 18(a), ; for Figure 18(b), .
Using the path-counting formula (12), if a path-triple has no root overlap (i.e., Type 1), then the contribution of to can be computed as follows: , where .
For Figure 18(a), is the only triple-common ancestor and we obtain ; for Figure 18(b), we obtain .
Induction Step. Let denote the number of edges in . Assume true for , where . Then, we show it is true for .
For Figures 19(a) and 19(b), among , and , let be the individual having the longest path starting from their triple-common ancestor in the pedigree graph with () edges. If we remove the node and cut the edge from , then the new graph has edges. In terms of computing , satisfies the condition for induction hypothesis.
For Figure 19(a), . Based on the recursive formula (3), where and are parents of . In , only has one parent ; thus, it indicates . Then, we can plug-in the path-counting formula for to obtain Similarly, for Figure 19(b), we obtain .
Thus, it is true for .

Figure 18: panels (a) and (b).

Figure 19: panels (a) and (b).

B.1.2. Correctness Proof for Case 3.2

Case 3.2. For , has path triples with root overlap.

Proof (Basis). There are three basic scenarios: (i) there are two individuals who are parents of another; (ii) there is only one individual who is parent of another; (iii) there is no individual who is a parent of another, among , , and .
Using the recursive formula (3) to compute : in Figure 20, for Figure 20(a), ; for Figure 20(b),; for Figure 20(c), .
Using the path-counting formula (12), if a path-triple has root overlap (i.e., Type 2), then the contribution of to can be computed as follows:, where and is the last individual of the root overlap path .
For Figure 20(a), is the only triple-common ancestor and we obtain . Similarly, for Figures 20(b) and 20(c), we obtain and , respectively.
Induction Step. Let denote the number of edges in . Assume true for , where . Show that it is true for .
For Figures 21(a), 21(b), and 21(c), among , and , let be the individual who has the longest path and let be a parent of . Then, we cut the edge from and obtain a new graph which satisfies the condition of induction hypothesis. For Figure 21(a), we use the path-counting formula for in .
In is the only parent of , according to the recursive formula (3), we have . Then, we can plug-in the and obtain For Figures 21(b) and 21(c), we take the same steps as we calculate for Figure 21(a).
In summary, it is true for .

Figure 20: panels (a), (b), and (c).

Figure 21: panels (a), (b), and (c).

B.1.3. Correctness Proof for Case 2.3

Case 2.3. For , the path-triples in the pedigree graph have mergeable path-pair.

Proof. Considering the relationship between and , has two scenarios: (i) is not an ancestor of ; (ii) is an ancestor of . Using the path-counting formula (A.1), if a path-triple Type 3, which means that it has a mergeable path-pair, then the contribution of to can be computed as follows: , where .
Using the recursive formula (4), we obtain .
For Figure 22(a), is a common ancestor of and .
For , we use Wright’s formula and obtain where denotes all nonoverlapping path-pairs .
Then, we have .
For Figure 22(b), we can also transform the computation of to .
In summary, it shows that the path-counting formula (A.1) is true for Case 2.3.
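Wright’s formula for two individuals, on which this case relies, can be checked numerically against the recursive definition. The sketch below assumes the standard statement of the formula: the kinship coefficient of two distinct individuals A and B is the sum, over every common ancestor S and every pair of descent paths from S to A and from S to B that share no individual other than S, of (1/2) raised to the total number of edges, multiplied by the kinship coefficient of S with itself. The pedigree, the function names, and the brute-force path enumeration are illustrative assumptions; they are not the NodeCodes-based scheme.

```python
from fractions import Fraction
from functools import lru_cache

# Hypothetical pedigree: child -> (father, mother); founders are absent.
PARENTS = {"A": ("F", "M"), "B": ("F", "M"), "C": ("A", "B")}

def make_kinship(parents):
    """Recursive kinship coefficient, expanding the deeper individual."""
    half = Fraction(1, 2)

    @lru_cache(maxsize=None)
    def depth(x):
        return 0 if x not in parents else 1 + max(depth(p) for p in parents[x])

    @lru_cache(maxsize=None)
    def phi(a, b):
        if a == b:
            if a not in parents:
                return half
            f, m = parents[a]
            return half * (1 + phi(f, m))
        if depth(a) < depth(b):
            a, b = b, a
        if a not in parents:
            return Fraction(0)
        f, m = parents[a]
        return half * (phi(f, b) + phi(m, b))

    return phi

def descent_paths(kids, src, dst):
    """All downward paths from src to dst as tuples of individuals."""
    if src == dst:
        return [(src,)]
    return [(src,) + tail
            for kid in kids.get(src, [])
            for tail in descent_paths(kids, kid, dst)]

def kinship_by_path_counting(parents, a, b):
    """Wright-style sum over nonoverlapping path-pairs; requires a != b."""
    kids = {}
    for child, (f, m) in parents.items():
        kids.setdefault(f, []).append(child)
        kids.setdefault(m, []).append(child)
    phi = make_kinship(parents)          # used only for each ancestor with itself
    people = set(parents) | {p for pair in parents.values() for p in pair}
    total = Fraction(0)
    for s in people:
        for path_a in descent_paths(kids, s, a):
            for path_b in descent_paths(kids, s, b):
                if set(path_a[1:]) & set(path_b[1:]):
                    continue             # the paths overlap below the root
                edges = (len(path_a) - 1) + (len(path_b) - 1)
                total += Fraction(1, 2) ** edges * phi(s, s)
    return total

phi = make_kinship(PARENTS)
for pair in [("A", "B"), ("A", "C"), ("F", "C")]:
    print(pair, phi(*pair), kinship_by_path_counting(PARENTS, *pair))
```

On this small pedigree the two computations agree for every pair, including the parent and child pair (A, C), where the path from A to itself has length zero.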

Figure 22: panels (a) and (b).

B.1.4. Correctness Proof for Cases 2.1 and 2.2

For , when there is no path-triple having a mergeable path-pair (i.e., the path-triple belongs to either Case 2.1 or Case 2.2), can be transformed to , which is equivalent to the computation of for Cases 3.1 and 3.2. The correctness of our path-counting formula for Cases 3.1 and 3.2 has been proven above. Thus, we obtain the correctness for when the path-triple belongs to either Case 2.1 or Case 2.2.

B.2. Multiple Triple-Common Ancestors

Now, we provide the correctness proof for multiple triple-common ancestors regarding the path-counting formulas (12) and (A.1).

Lemma A. Given a pedigree graph and three individuals , , having at least one triple-common ancestor, is correctly computed using the path-counting formulas (12) and (A.1).

Proof. The proof is by induction on the number of triple-common ancestors.
Basis. has only one triple-common ancestor of , , and .
The correctness of (12) and (A.1) for with only one triple-common ancestor of , , and is proven in the previous section.
Induction Hypothesis. Assume that if has or fewer triple-common ancestors of , , and , (12) and (A.1) are correct for .
Induction Step. Now, we show that it is true for with triple-common ancestors of , , and .
Let denote all triple-common ancestors of , and in , where . Let be the topmost triple-common ancestor such that there is no individual among the remaining ancestors who is an ancestor of . Let denote the contribution from to .
Because is the topmost triple-common ancestor, there is no path-triple from to , , and which passes through . Then, we can remove from and delete all outgoing edges from and obtain a new graph which has triple-common ancestors of , , and . It means .
For the new graph , we can apply our induction hypothesis and obtain .
For the topmost triple-common ancestor , there are two different cases considering its relationship with the other triple-common ancestors:
(1) there is no individual among who is a descendant of ;
(2) there is at least one individual among who is a descendant of .
For (1), since no individual among is a descendant of , the set of path-triples from to , and is independent of the set of path-triples from to , , and . It also means that the contribution from to is independent of the contribution from the other triple-common ancestors.
Summing up all contributions, we can obtain .
For (2), let be one descendant of . Now both and can reach , , and .
, a path-triple from to , , and .
If , , and all pass through , then the path-triple is not an eligible path-triple for . When we compute the contribution from to , we exclude all such path-triples where , , and all pass through a lower triple-common ancestor. In other words, an eligible path-triple from regarding cannot have three paths all passing through a lower triple-common ancestor. Therefore, we know that the contribution from to is independent of the contribution from the other triple-common ancestors. Summing up all contributions, we obtain .

C. Proof for Four Individuals and Two Pairs of Individuals

Here, we give a proof sketch for the correctness of the path-counting formulas for four individuals. First, for four individuals in a pedigree graph , we present all the different cases, based on which we construct a dependency graph. The correctness of the path-counting formulas for two pairs of individuals can be proved similarly.

C.1. Proof for Four Individuals

Consider the existence of different types of path-quads regarding , , and ; there are 15 cases for a pedigree graph :

Then, we construct a dependency graph shown in Figure 23 for all cases for four individuals.

According to the dependency graph in Figure 23, the intermediate steps including Cases 3.4 and 3.5 are already proved for the computation of . The correctness of the transformation from Case 4.2 to Case 3.4 can be proved based on the recursive formula for and . Similarly, we can obtain the transformation from Case 4.3.1 to Case 3.5.

C.2. Proof for Two Pairs of Individuals

Consider the existence of different types of 2-pair-path-pair regarding ; there are 9 cases which are listed as follows.

Case 4.1. has with zero root homo-overlap and zero root heter-overlap.

Case 4.2. has with zero root homo-overlap and one root heter-overlap.

Case 4.3.1. has with zero root homo-overlap and two root heter-overlap.

Case 4.3.2. has with one root homo-overlap and two root heter-overlap.

Case 4.4. has with one root homo-overlap and zero root heter-overlap.

Case 4.5. has with two root homo-overlap and zero root heter-overlap.

Case 4.6. has path-triples with zero root overlap.

Case 4.7. has path-triples with one root overlap.

Case 4.8. has path-pairs with zero root overlap.

Then, we construct a dependency graph for the cases relating to in Figure 24.

According to the dependency graph in Figure 24, Cases 4.6, 4.7, and 4.8 are the intermediate steps, which are already proved for the computation of . The correctness of the transformation from Case 4.2 to Case 4.6 can be proved based on the recursive formula for and . Similarly, we can obtain the transformation from Cases 4.3.1 and 4.3.2 to Case 4.7 as well as from Case 4.4 to Case 4.8.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank Professor Robert C. Elston, Case School of Medicine, for introducing to them the identity coefficients and referring them to the related literature [7, 10, 17]. This work is partially supported by the National Science Foundation Grants DBI 0743705, DBI 0849956, and CRI 0551603 and by the National Institute of Health Grant GM088823.

References

[1] Surgeon General’s New Family Health History Tool Is Released, Ready for “21st Century Medicine”, http://compmed.com/category/people-helping-people/page/7/.
[2] M. Falchi, P. Forabosco, E. Mocci et al., “A genomewide search using an original pairwise sampling approach for large genealogies identifies a new locus for total and low-density lipoprotein cholesterol in two genetically differentiated isolates of Sardinia,” The American Journal of Human Genetics, vol. 75, no. 6, pp. 1015–1031, 2004.
[3] M. Ciullo, C. Bellenguez, V. Colonna et al., “New susceptibility locus for hypertension on chromosome 8q by efficient pedigree-breaking in an Italian isolate,” Human Molecular Genetics, vol. 15, no. 10, pp. 1735–1743, 2006.
[4] Glossary of Genetic Terms, National Human Genome Research Institute, http://www.genome.gov/glossary/?id=148.
[5] C. W. Cotterman, A calculus for statistico-genetics [Ph.D. thesis], Columbus, Ohio, USA, Ohio State University, 1940, Reprinted in P. Ballonoff, Ed., Genetics and Social Structure, Dowden, Hutchinson & Ross, Stroudsburg, Pa, USA, 1974.
[6] G. Malecot, Les mathématiques de l'hérédité, Masson, Paris, France, 1948, Translated edition: The Mathematics of Heredity, Freeman, San Francisco, Calif, USA, 1969.
[7] M. Gillois, “La relation d'identité en génétique,” Annales de l'Institut Henri Poincaré B, vol. 2, pp. 1–94, 1964.
[8] D. L. Harris, “Genotypic covariances between inbred relatives,” Genetics, vol. 50, pp. 1319–1348, 1964.
[9] A. Jacquard, “Logique du calcul des coefficients d’identité entre deux individus,” Population, vol. 21, pp. 751–776, 1966.
[10] G. Karigl, “A recursive algorithm for the calculation of identity coefficients,” Annals of Human Genetics, vol. 45, no. 3, pp. 299–305, 1981.
[11] B. Elliott, S. F. Akgul, S. Mayes, and Z. M. Ozsoyoglu, “Efficient evaluation of inbreeding queries on pedigree data,” in Proceedings of the 19th International Conference on Scientific and Statistical Database Management (SSDBM '07), July 2007.
[12] B. Elliott, E. Cheng, S. Mayes, and Z. M. Ozsoyoglu, “Efficiently calculating inbreeding on large pedigrees databases,” Information Systems, vol. 34, no. 6, pp. 469–492, 2009.
[13] L. Yang, E. Cheng, and Z. M. Özsoyoǧlu, “Using compact encodings for path-based computations on pedigree graphs,” in Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine (ACM-BCB '11), pp. 235–244, August 2011.
[14] E. Cheng, B. Elliott, and Z. M. Ozsoyoglu, “Scalable computation of kinship and identity coefficients on large pedigrees,” in Proceedings of the 7th Annual International Conference on Computational Systems Bioinformatics (CSB '08), pp. 27–36, 2008.
[15] E. Cheng, B. Elliott, and Z. M. Özsoyoĝlu, “Efficient computation of kinship and identity coefficients on large pedigrees,” Journal of Bioinformatics and Computational Biology (JBCB), vol. 7, no. 3, pp. 429–453, 2009.
[16] S. Wright, “Coefficients of inbreeding and relationship,” The American Naturalist, vol. 56, no. 645, 1922.
[17] R. Nadot and G. Vaysseix, “Kinship and identity algorithm of coefficients of identity,” Biometrics, vol. 29, no. 2, pp. 347–359, 1973.
[18] E. Cheng, Scalable path-based computations on pedigree data [Ph.D. thesis], Case Western Reserve University, Cleveland, Ohio, USA, 2012.
[19] V. Ollikainen, Simulation Techniques for Disease Gene Localization in Isolated Populations [Ph.D. thesis], University of Helsinki, Helsinki, Finland, 2002.
[20] H. T. T. Toivonen, P. Onkamo, K. Vasko et al., “Data mining applied to linkage disequilibrium mapping,” The American Journal of Human Genetics, vol. 67, no. 1, pp. 133–145, 2000.
[21] W. Boucher, “Calculation of the inbreeding coefficient,” Journal of Mathematical Biology, vol. 26, no. 1, pp. 57–64, 1988.

Copyright

Copyright © 2014 En Cheng and Z. Meral Ozsoyoglu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
