Abstract
Ancestral sequence reconstruction is a wellknown problem in molecular evolution. The problem presented in this study is inspired by sequence reconstruction, but instead of leafassociated sequences we consider only their lengths. We call this problem ancestral gene length reconstruction. It is a problem of finding an optimal labeling which minimizes the total length’s sum of the edges, where both a tree and nonnegative integers associated with corresponding leaves of the tree are the input. In this paper we give a linear algorithm to solve the problem on binary trees for the Manhattan cost function .
1. Introduction
Ancestral sequence reconstruction (ASR) is a wellrecognized problem in molecular evolution [1]. Let be a (phylogenetic) tree with leaf nodes, and strings over one alphabet (gene sequences) assigned to leaves . ASR may be defined in the following way: assignment of strings to inner nodes “in the best possible way.” There are two main paradigms for ASR: maximum parsimony (MP) and probabilisticbased reconstruction. The latter includes maximum likelihood (ML) and Bayesian reconstructions. MP reconstruction has a time complexity linear in the number of sequences analyzed. The problem of the parsimonious reconstruction of ancestral states for the given tree with the given states of its leaves (the most parsimonious assignment of the labels of internal nodes for a fixed tree topology) is a wellstudied problem [2–4]. Efficient algorithms have also been developed for different types of MLbased reconstructions (reviewed in [5]). ASR methods require as input both a phylogenetic tree and a set of gene sequences associated with corresponding leaves of the tree [6].
ASR is related to gene sequence evolution while the problem presented in this paper, being inspired by ASR, deals with gene length variation. Instead of considering leafassociated sequences we take into account only their lengths. Instead of the reconstruction of ancestral sequences, we search for the optimal reconstruction of ancestral gene lengths. The problem may be called ancestral gene length reconstruction (AGLR). AGLR is actually a problem of finding an optimal labeling which minimizes the total “length” sum of the edges, the minimum sum problem where both a tree and nonnegative integers associated with corresponding leaves of the tree are the input.
In the graph theory vertex labeling related problems were intensively studied [8]. Typically, the problems can be described as follows: for a given graph, find the optimal way of labeling the vertices with distinct integers. The problems and their solutions were described in [9–12]. In [13] we presented the algorithms to solve the minimum sum problem where both a tree and positive integers associated with all leaves of the tree are the input (finding the optimal way of labeling the vertices with positive integers). Here we would like to formulate the minimum sum problem where both a tree and positive integers associated with some of the leaves of the tree are the input (finding the optimal way of labeling the vertices with nonnegative integers). This problem reflects a situation in which the genome tree is constructed by one or another method for a set of genomes, the leaves of the tree are linked with the corresponding genomes of the set, and the leaves are labeled by integers designating lengths of genes of a chosen gene family. Some leaves would be labeled zero because corresponding genomes have no genes of the chosen gene family. Alternatively, it may be a case of a missing value but in this study we do not consider this case: in the problem definition that we bring here zero means “no value.”
In this paper we provide a linear algorithm to solve max sum problem on binary trees for the Manhattan cost function . The algorithm uses dynamic programming technique and the properties of the Manhattan distance.
2. Preliminaries
Let be a tree with leaf nodes, vertex set , and edge set . . Let us number the leaf nodes of . Let us number the root of . An integer labeling of is a mapping from to a set of nonnegative integers, where label is an outofthe ordinary label meaning “absent value.” Let us denote integer labeling of the leaf nodes of . Let us denote by and minimum and maximum positive integers labeling leaf nodes: ; ; .
Let us introduce a cost function of the edge : where the nonnegative cost function has the following distance properties:(i)(ii)(iii)(iv). is a gain penalty, is a loss penalty, and is a length change penalty function. Since the likelihoods of loss and gain events are likely to differ, we may need to weight them differently. This is achieved by introducing different penalties ; the loss penalty is normally assigned a value close to , whereas the gain penalty should be larger due to biological considerations. They suggest that, on average, gene loss might be a more likely event than gene gain. Therefore, different gain penalties were used in our study similarly to as it was done in [14].
An example of a function is . In case of we obtain an absolute value of the difference between labelings and : . In case of we obtain a square of the difference between labelings and : .
2.1. An Arbitrary Tree and an Arbitrary Cost Function
Given a tree , an integer labeling of the leaves of , the gain penalty , the loss penalty , and a cost function ((1)–(5)), the minimum sum problem is to find a labeling which minimizes the total cost:
2.2. A Binary Tree Problem
Given a binary tree , an integer labeling of the leaves of , the “gain” penalty , and the “loss” penalty , the Manhattan minimum sum problem is to find the labelings which minimize the sum over all where is a number of edges of type , and is a number of edges of type .
3. Problem Solutions
3.1. DP Algorithm (for the Problem (1))
Due to the properties ((2)–(5)) of the cost function all labels of the optimal labeling must be either equal to 0 or in the interval . As a consequence of this, the dynamic programming (DP) method is applicable for the problem. It will be easier to explain the DP method on a binary tree using notation. The quantity will be interpreted as the minimal cost, given that node is assigned integer , to the subtree with the node as a root of the subtree.
3.1.1. DP Algorithm for a Binary Tree
Up Phase. A procedure called DP_up calculates the costs of all nodes of the tree , given a cost function φ.
When we compute for the root node (the index of the root is ), then we simply choose the minimum of these values:
Initiation. Given labeling of the leaf nodes of at the tips of the tree the are easy to compute. The cost is 0 if the observed integer is integer , and infinite otherwise.
Iteration. For the immediate common ancestor of the nodes and , node , we have
The interpretation of this equation is immediate. The smallest possible cost given that node is assigned zero is either the cost or the “gain” penalty plus the minimum of , the least of the two plus the minima of corresponding values associated with the right descendant tree. The smallest possible cost given that node is assigned is a sum of two values: the first one is either the cost of the edge from node to node , plus the cost of the left descendant subtree given that node is in state , or the “loss” penalty plus ; the second one is the cost of the edge from the node to the node , plus the cost of the right descendant subtree given that node is in state . We select those values of and which minimize that sum. Equation (10) is applied successively to each inner node in the tree, doing a postorder tree traversal. Finally it computes all the , and then (8) is used to find the minimum cost for the whole tree. The complexity of the Up_phase of the algorithm is .
Traceback. The procedure calculates the labels of all nodes of the tree .
Choose any integer which provides the minimum of the —it is the root label. It may be either zero or a positive . Doing a preorder tree traversal, successively label each inner node in the tree: for any inner node , and given that a parent label was reconstructed, the label is easily reconstructed as well.
3.1.2. DP Algorithm for an Arbitrary Tree
UpPhase. A procedure DP_up calculates the costs of all nodes of the tree.
Suppose that the descendant nodes of the node are called . The following equation will therefore be similar to (10) replacing the sum of and by the total sum of , while traverses all values of :
This equation is applied successively to each node in the tree, doing a postorder tree traversal. Finally it computes all the , and then (8) is used to find the minimum cost for the whole tree.
Down Phase. As Traceback above: Consider the following.
3.2. DP Algorithm for a Manhattan Sum for a Binary Tree (Problem (2))
Manhattan distance is an absolute value of the difference between labelings and . This distance measure has the following property: if siblings have positive labels, then all integers that lie between these values may equally serve as optimal labels of a parent.(i)If , then for all the score .(ii)If , then for all the score .(iii)If , then for all the score .
So, as it would be proven below, at the bottomup stage of the DP algorithm it would be sufficient to assign to each node in the tree four values: left, right, , and . The meanings of the values are as follows: left and right are bounds of an interval associated with the node , is a cost value , and is a cost for any integer from the interval: left ≤ ≤ right.
Initiation. Given labeling of the leaf nodes of these four values are easy to compute for the leaf nodes: for if () ; left; right; else ; left; right.
3.2.1. Examples
Let us consider the simplest trees with two, three, and four labeled leaves. The simplest tree configuration is presented in Figure 1. There is only one node to label—the root node.(i)Figure 1(a): no genes are assigned to the leaves → no gene is assigned to the root.(ii)Figure 1(b): the left leaf has no gene, and the right leaf has a gene with the length equal to 136 → the root is labeled by 136; the score is equal to the loss penalty .(iii)Figure 1(c): any label is good to label the root; the score is equal to .
(a)
(b)
(c)
The next simplest tree topology—threeleaf trees—is presented in Figure 2. There are two nodes to label, the inner node and the root.(i)Figure 2(a): the inner node is labeled analogically to the root in Figure 1(c): any is equally good to label the inner node; the root node is labeled analogically to the root in Figure 1(b): → the root is labeled by any , that is, by 125.(ii)Figure 2(b): labeling is similar to that of Figure 1(a).(iii)Figure 2(c): the inner node is labeled analogically to Figure 2(a): any label is good to label it; the score is equal to . The root should be labeled by 136 because .
(a)
(b)
(c)
Determination of the optimal labeling of the fourleaf trees is very similar to the examples described above. Figure 3 illustrates labeling of the tree where all four leaves have nonzero labels: ((125, 141), (136, 150)). Labeling of the inner nodes is as above (Figure 2(c)): and . All integers of the intersection between these two close intervals are optimal values to label the root: . In Figure 3 we present the value 136 as a chosen suitable label.
Examples of the trees with very distinct subtrees are presented in Figures 4 and 5. In Figure 4 we present a tree obtained by merging two very different subtrees. The left 4leaf subtree has very obvious intuitive labeling of internal nodes: all nodes should be labeled by zero. The right subtree is identical to the tree presented in Figure 2(c). Merging of these two subtrees produces bottomup stage values (left, right, , and ) to the new root equal to . In spite of assignins the interval to the root only the value 136 provides the optimal solution. (We would like to express our gratitude to the anonymous reviewer for bringing our attention to this situation.) We formulate this rule below describing traceback stage of the algorithm. Figure 4 is chosen to illustrate labeling of nodes similar to the root of the tree.
After considering these few simple examples, we describe the algorithm.
3.2.2. BottomUp Stage
Initiation. Given labeling of the leaf nodes of at the tips of the tree the are easy to compute. The cost is 0 if the observed integer is integer , and
Iteration. Doing a postorder tree traversal assign successively to each node in the tree the abovementioned four values left, right, , and . An interval [left, right] is assigned according to the following rule: if anyone of two children intervals is not defined, then assign the interval of the other child; otherwise, a parent interval is either an intersection of the intervals of its children or an interval that lies between these intervals if their intersection is empty. is a cost value , where for the Manhattan distance we can rewrite (10) as
3.2.3. Pseudocode
For more details see Pseudocode 1.

3.2.4. Traceback Stage
Interval Correction Rule. Following the bottomup stage four values left, right, , and are assigned to every internal node of the tree. An interval (left, right) should be diminished if one of the edges connecting the node with its son becomes of type , . Let us denote sons of the node by and . Correction condition would be formulated as If is TRUE, then the bounds of the corrected interval would be obtained by intersection of the interval associated with the node with the corrected interval associated with the corresponding son: Otherwise, the bounds of the corrected interval would not be changed from the original ones:
Initiation. Labeling of the root: if , then correct the root interval according to (15)–(17), and then choose an integer from the corrected interval assigned to the root node otherwise choose 0—it is the root label .
Iteration. Doing a preorder tree traversal, successively label each node in the tree either by an integer from the corrected interval assigned to this node which is the nearest to its parent label (it may be either the value equal to the parent label or the boundary value of the interval assigned with the node) or by 0.
The proof of the correctness of a simpler algorithm (without zerolabeled leaves) is published in [15]. In the Appendix there are several lemmas, from which the correctness of the algorithm presented here results.
3.2.5. Example
In Figure 5 of [7] the consensus trees obtained from 100 genome trees were presented. The trees were produced on the basis of 80% randomly chosen COGs, and the right tree was produced on the basis of 15%jackknifing (the explanations in the text of [7]). This tree possesses phylogenetic reasonableness.(a)The representatives of both prokaryotic Kingdoms: Eubacteria and Archaea are clustered separately. In other words, Archaeal organisms (genomes 0, 1, 8, 29–32, and 35–50) form a monophyletic group.(b)Euryarchaeota and Crenarchaeota form monophyletic groups.
A part of this tree was selected to illustrate the algorithm. We took the upper part of the tree related exclusively to Archaea (see A/B marked arrow in Figure from [7]) and placed the root at the point dividing all Archaeal genomes into Euryarchaeota and Crenarchaeota (see E/C marked arrow in Figure from [7]). Thus, Figure 5 is a part of Figure from [7] labeled according to COG0835. This COG was randomly selected as suitable for purposes of illustration. Table 1 presents a list of Archaeal genomes from the whole set of genomes that were used for a genome tree construction (Figure from [7]). Table 2 presents the lengths of the Archaeal proteins of this COG.
To assign labels to the leaves of the tree of Figure 5 two preprocessing steps were done: (1) taking off outliers, the lengths 328 of the . marismortui protein and 344 of the M. hungatei protein are obvious outliers; (2) taking the median value of paralog’s lengths of the genomes 30, 31, 46, 48, and 50. Figure 5 presents results of application of the bottomup and traceback stages of the algorithm to this tree: a quartet that was assigned to a node at the bottom upstage is shown under the edge linking the node and its parent node, a label that was assigned to the node at the traceback stage, is shown over the same edge.
As we can see the root is labeled by zero. There are two genebirth events and one genedeath event. One gene was born with the length of 155 and another gene birth is labeled by 146. Genome number 32 (Haloquadratum walsbyi) has no protein from COG0835, while other Haloarchaea (genomes 29–31) do have. Thus, the edge connecting with leaf labeled by 32 is marked with a geneloss symbol.
4. Discussion
In [15] the algorithms to find the optimal labeling of the vertices of the tree under Wagner parsimony were presented. A simple extension of the problem could be finding the optimal labeling of the vertices of the tree with nonnegative integers. This more realistic approach requests special consideration of zero labeling. Wedges of type , , should be scored differently from wedges of type , , because the notes gene loss, while notes gene gain. These events should be scored differently. Interestingly, this differentiated scoring in addition to tree labeling resulted in reconstruction of “parsimonious” evolutionary scenario. Reconstruction of a gene evolution along a species tree is an interesting and principal problem. Lyubetsky and his coworkers contributed a lot to formulation and solving this problem. In their studies [16–22] the authors tackled mainly two important and sophisticated phylogenetic problems. The obtained results are partially reviewed in the first section of [22] which also provides an extended biological background and relevant references. Reconstruction of a gene evolution along a species tree (to build the evolutionary scenario), following the approach of Lyubetsky et al., is to find an optimal mapping of a gene tree into a species tree. (An example of a different approach was presented in [14].) The second problem is to construct a supertree from the given set of gene trees.
As it was mentioned in [22], the first problem, stated as a treeintotree mapping, is solved in polynomial (often linear, and at maximum cubic) time even for the case of time slices and horizontal gene transfers. The algorithms presented in our study are polynomial as well.
Choosing (a gain penalty), (a loss penalty), and (a label change penalty) is crucial for reconstruction of trustworthy evolutionary scenario. However, it is very difficult task and we cannot claim categorically that choosing “correct” parameters of the model will result in truly reliable reconstruction. We do plan to make a comparison between results obtained by abovementioned methods of Lyubetsky and ours (work in progress).
To prepare input for the algorithm, as it was done above for 3.2.5, the original data is to be transformed to the following format: to each (genome, COG) pair one standardized protein length should be assigned (as we described in [7]). For a given COG, each organism is represented by a calculated length—a median length of all paralogous proteins. A natural extension would be to formulate the labeling problem taking into account existence of paralogs.
We may define a tuple integer labeling of as a mapping from to a set of tuples composed of integers , where for all . The simplest extension would be to introduce the case with identical sizes of tuples composed of nonnegative integers. A uniform tuple integer labeling of is characterized by a constant for all . The stretch of the edge in a is a simple sum is defined as in (1). Given a uniform tuple integer labeling of the leaves of G the minimum sum problem is to find a labeling which minimizes the total sum of the stretches of the edges. Some . The minimum sum problem is that of minimizing over all for given . By some modifications of the algorithms presented in this paper the minimizing tuple labeling can be found. This model again is a gainloss model. More sophisticated extension must provide more realistic definition of distance between two tuples composed of positive integers by introducing duplication events.
Appendix
Lemma A.1 (root optimal label). Suppose that the root node is called and suppose that its children are called and . The claim is(1)if (), then else(2)if (), then else(3)if (), then else(4)if (), then else(5).
Proof. If we consider a subtree with the node as a root then designates the minimal cost, given that node has a label : Proof of the subclaims (1)–(3) is trivial. Case (4) is Let us introduce a new numbering by changing only the root label: . It is easy to see that . It means that for optimal integer labeling the following is correct: . Likewise, we prove that for optimal integer labeling . Let us denote :
Lemma A.2 (leaf parent optimal interval). Every node of the optimal integer labeling that all its children are leaf nodes has either a zero label or a label between labels of its children.
Suppose that the root node is called a and suppose that its children are called and . The claim is(1)if , then else(2)if , then else(3)if , then else(4)if , then else(5).
Proof. Figure 1 illustrates this lemma. Proof of the subclaims (1)–(3) is trivial. In case of condition (4) let us prove that for optimal integer labeling . Suppose . Let us denote ; . Let us introduce a new numbering by changing only the label of the node : . It is to show that . Indeed,
Likewise, we prove that for optimal integer labeling . Q.e.d. .
Lemma A.3 (parent optimal interval—(I)). An optimal label of a parent either is equal to zero or lies between extreme values of optimal labels of its children. If an optimal integer labeling provides the labels of two siblings and satisfying the conditions & , then the label of their parent satisfies . Proof is as for Lemma A.1.
Lemma A.4 (parent optimal interval—(II)). An optimal label of a parent in case of the empty intersection of the optimal intervals of its children lies between these intervals.
If an optimal integer labeling provides the labels of two siblings and satisfying the conditions & then if then else if () then .
Proof. (1) . Let us assume ; . Then we introduce a new labeling by changing labels for three nodes: ; ; :
From assumption follows that is not an optimal labeling. Similarly, we can prove that from assumption follows that is not an optimal labeling.
(2) . Similarly to (1) let us assume ; . Then we introduce a new labeling by changing labels for three nodes: ; ; . , so we have a contradiction with the statement that is an optimal labeling.
Lemma A.5 (parent optimal interval—(III)). An optimal interval of a parent is either an intersection of the optimal intervals of its children or an interval that lies between these intervals in case that their intersection is empty.
If an optimal integer labeling provides the labels of two siblings and satisfying the conditions & , then the label of their parent satisfies the following condition: if then else if then else .For example, if , then Lemma A.5 states that satisfies the following condition . Proof is similar to proof in Lemma A.4.