Abstract

Several polynomial time computable metrics on the class of semibinary tree-sibling time consistent phylogenetic networks are available in the literature; in particular, the problem of deciding if two networks of this kind are isomorphic is in P. In this paper, we show that if we remove the semibinarity condition, then the problem becomes much harder. More precisely, we prove that the isomorphism problem for generic tree-sibling time consistent phylogenetic networks is polynomially equivalent to the graph isomorphism problem. Since the latter is believed not to belong to P, the chances are that it is impossible to define a metric on the class of all tree-sibling time consistent phylogenetic networks that can be computed in polynomial time.

1. Introduction

After the realization that reticulation processes, like hybridizations, recombinations, or lateral gene transfers, have been more relevant in the evolution of life on Earth than previously thought [1], there has been a growing interest in the development of algorithms for the reconstruction of phylogenetic networks: graphical models of evolutionary histories that go beyond phylogenetic trees by including hybrid nodes of in-degree greater than one representing reticulation events. As the number of such algorithms grows, so does the need for methods to compare phylogenetic networks, since these methods are used, for instance, to assess the reliability and robustness of the reconstruction algorithms [2, 3].

One of the types of phylogenetic networks for which there exist reconstruction methods [4, 5] is the class of tree-sibling time consistent networks, TSTC networks for short (see [6] for a formal definition). Two metrics on the class of all semibinary TSTC networks, where all hybrid nodes have in-degree two, have been proposed in recent years. Both metrics are based on encodings of phylogenetic networks that turn out to single out any TSTC network among all such networks: the $\mu$-vectors (where each node in the network is represented by the vector of numbers of paths from it to each leaf) [7] and the nested labels (where each node in the network is represented by a certain Newick-like representation of the subnetwork rooted at it) [8, 9]. Actually, this last metric turns out to be sound also for the class of all semibinary tree-sibling networks, without the time-consistency restriction [6].

Although there have been several attempts to define a metric on the class of all TSTC networks on a given set of taxa [10], none of the polynomial-time computable metrics for phylogenetic networks proposed so far satisfies the separation axiom (distance 0 means isomorphism) for TSTC networks [8, 11, 12]. In this paper we show why this should come as no surprise: such a metric would solve the graph isomorphism problem in polynomial time.

The graph isomorphism problem is one of the most important decision problems whose computational complexity is not yet known [13, 14]. It is believed to be neither in P nor NP-complete, and subexponential time solutions for it are known. A problem is said to be graph isomorphism-complete when it is polynomially equivalent to the graph isomorphism problem [15]. In this paper we show that, for every set $S$ with more than two elements, the isomorphism problem for TSTC phylogenetic networks with taxa bijectively labeled in $S$ is graph isomorphism-complete.

2. Preliminaries

Let $N = (V, E)$ be a nonempty rooted directed acyclic graph (a rDAG, for short). A node $v \in V$ is a leaf if it has out-degree $0$, internal if its out-degree is $\geq 1$, of tree type if its in-degree is $\leq 1$, of hybrid type if its in-degree is $\geq 2$, and elementary if it is a tree node of out-degree 1. A node $v$ is a child of another node $u$ (and, hence, $u$ is a parent of $v$) if $(u, v) \in E$. Two nodes $u$ and $v$ are siblings of each other if they share a parent. An arc $(u, v)$ in a rDAG is a tree arc when $v$ is a tree node and a hybridization arc when $v$ is a hybrid node. The height of a node $v$ is the longest length of a directed path from $v$ to a leaf, and the depth of $v$ is the longest length of a directed path from the root to $v$.
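These definitions translate directly into degree tests. The following Python fragment is only an illustrative sketch: it assumes that a rDAG is stored as a dictionary mapping every node to the list of its children, and that leaves have out-degree 0, tree nodes in-degree at most 1, and hybrid nodes in-degree at least 2. The function names and the toy graph are hypothetical.

```python
def in_degrees(children):
    """Return the in-degree of every node of a digraph given as node -> list of children."""
    indeg = {v: 0 for v in children}
    for kids in children.values():
        for v in kids:
            indeg[v] += 1
    return indeg


def classify(children):
    """Tag every node as leaf/internal, tree/hybrid and, when it applies, elementary."""
    indeg = in_degrees(children)
    kinds = {}
    for v, kids in children.items():
        tags = {"leaf" if not kids else "internal"}
        tags.add("hybrid" if indeg[v] >= 2 else "tree")
        if indeg[v] <= 1 and len(kids) == 1:
            tags.add("elementary")  # tree node of out-degree 1
        kinds[v] = tags
    return kinds


def height(children, v):
    """Length of a longest directed path from v to a leaf."""
    kids = children[v]
    return 0 if not kids else 1 + max(height(children, c) for c in kids)


# Hypothetical toy rDAG: root r, hybrid node h with parents a and b, single leaf x.
toy = {"r": ["a", "b"], "a": ["h"], "b": ["h"], "h": ["x"], "x": []}
kinds = classify(toy)
```

In this toy graph the nodes a and b are elementary, so it is a rDAG but not a phylogenetic network in the sense defined below.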

Given a finite set $S$ of labels, an $S$-rDAG is a rDAG with its leaves injectively labelled by $S$. By an isomorphism of $S$-rDAGs we understand an isomorphism of directed graphs that preserves and reflects the labelling, that is, that matches each leaf in one network with the leaf with the same label in the other network. In an $S$-rDAG, we will always identify without any further reference every leaf with its label.

A phylogenetic network on a set $S$ of taxa is an $S$-rDAG such that
(i) no tree node is elementary;
(ii) every hybrid node has out-degree $1$, and its single child is a tree node.

A phylogenetic tree is a phylogenetic network without hybrid nodes.

We will say that a phylogenetic network is tree-sibling if every hybrid node has at least one sibling that is a tree node.
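As an illustration, the two conditions defining a phylogenetic network and the tree-sibling condition can be tested with a few degree computations. The sketch below makes the same assumptions as before (a rDAG stored as a dictionary from nodes to lists of children, tree nodes of in-degree at most 1, hybrid nodes of in-degree at least 2); it does not check rootedness, acyclicity, or the labelling, and all names are hypothetical.

```python
def in_degrees(children):
    """Return the in-degree of every node of a digraph given as node -> list of children."""
    indeg = {v: 0 for v in children}
    for kids in children.values():
        for v in kids:
            indeg[v] += 1
    return indeg


def is_phylogenetic_network(children):
    """Check conditions (i) and (ii) only."""
    indeg = in_degrees(children)
    for v, kids in children.items():
        if indeg[v] <= 1 and len(kids) == 1:
            return False  # (i) fails: v is an elementary tree node
        if indeg[v] >= 2 and (len(kids) != 1 or indeg[kids[0]] >= 2):
            return False  # (ii) fails: hybrid node without a single tree child
    return True


def is_tree_sibling(children):
    """Every hybrid node must share some parent with a tree node."""
    indeg = in_degrees(children)
    ok = {v: False for v in children if indeg[v] >= 2}
    for kids in children.values():
        if any(indeg[c] <= 1 for c in kids):
            for c in kids:
                if indeg[c] >= 2:
                    ok[c] = True
    return all(ok.values())


# Hypothetical examples: in `good`, the hybrid node h has the tree siblings "1" and "2";
# in `bad`, the hybrid nodes h1 and h2 are siblings only of each other.
good = {"r": ["a", "b"], "a": ["1", "h"], "b": ["h", "2"], "h": ["3"],
        "1": [], "2": [], "3": []}
bad = {"r": ["a", "b"], "a": ["h1", "h2"], "b": ["h1", "h2"],
       "h1": ["1"], "h2": ["2"], "1": [], "2": []}
```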

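A temporal assignment [16] on a network $N = (V, E)$ is a mapping $\tau : V \to \mathbb{N}$ such that
(a) if $v$ is a hybrid node and $(u, v) \in E$, then $\tau(u) = \tau(v)$;
(b) if $v$ is a tree node and $(u, v) \in E$, then $\tau(u) < \tau(v)$.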
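Checking whether a given mapping is a temporal assignment amounts to one comparison per arc. The following sketch assumes the reading of the two conditions suggested by their biological interpretation (equal times along hybridization arcs, strictly increasing times along tree arcs) and the same dictionary representation of a network; the names and the example are hypothetical.

```python
def is_temporal_assignment(children, tau):
    """Check equal times on hybridization arcs and increasing times on tree arcs."""
    indeg = {v: 0 for v in children}
    for kids in children.values():
        for v in kids:
            indeg[v] += 1
    for u, kids in children.items():
        for v in kids:
            if indeg[v] >= 2:          # (u, v) is a hybridization arc
                if tau[u] != tau[v]:
                    return False
            elif not tau[u] < tau[v]:  # (u, v) is a tree arc
                return False
    return True


# Hypothetical network: hybrid node h with parents a and b, leaves "1", "2", "3".
net = {"r": ["a", "b"], "a": ["1", "h"], "b": ["h", "2"], "h": ["3"],
       "1": [], "2": [], "3": []}
tau_ok = {"r": 0, "a": 1, "b": 1, "h": 1, "1": 2, "2": 2, "3": 2}
tau_bad = dict(tau_ok, b=2)  # b no longer coexists with its hybrid child h
```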

We will say that a phylogenetic network is time-consistent if it admits a temporal assignment. The following alternative characterization of time consistency will be used later. For a proof, see [16, 17].

Proposition 1. Let $N = (V, E)$ be a phylogenetic network, let $E_H$ be its set of hybridization arcs, and let $N^*$ be the directed graph with the same set of nodes $V$ as $N$ and set of arcs $E \cup E_H^{-1}$, where $E_H^{-1} = \{(v, u) \mid (u, v) \in E_H\}$. Then, $N$ is time consistent if, and only if, $N^*$ does not have any cycle containing some tree arc of $N$.
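This characterization yields a simple polynomial-time test. The sketch below assumes that the auxiliary graph is obtained by adding the inverse of every hybridization arc, and uses the fact that an arc from u to v lies on a cycle exactly when u is reachable from v. The representation (dictionary from nodes to lists of children) and all names are hypothetical, and no attempt is made at efficiency.

```python
def is_time_consistent(children):
    """Test of Proposition 1: no tree arc may lie on a cycle of the auxiliary graph."""
    indeg = {v: 0 for v in children}
    for kids in children.values():
        for v in kids:
            indeg[v] += 1

    # Auxiliary graph: all arcs, plus the inverse of every hybridization arc.
    star = {v: set(kids) for v, kids in children.items()}
    for u, kids in children.items():
        for v in kids:
            if indeg[v] >= 2:
                star[v].add(u)

    def reachable(src):
        seen, stack = {src}, [src]
        while stack:
            x = stack.pop()
            for y in star[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen

    for u, kids in children.items():
        for v in kids:
            # A tree arc (u, v) is on a cycle iff u can be reached back from v.
            if indeg[v] <= 1 and u in reachable(v):
                return False
    return True


# Hypothetical examples. In `inconsistent`, the hybrid node h has as parents
# a node a and its tree child c, which cannot coexist in time.
consistent = {"r": ["a", "b"], "a": ["1", "h"], "b": ["h", "2"], "h": ["3"],
              "1": [], "2": [], "3": []}
inconsistent = {"r": ["a", "3"], "a": ["c", "h"], "c": ["h", "1"], "h": ["2"],
                "1": [], "2": [], "3": []}
```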

For short, we will refer henceforth to tree-sibling time consistent phylogenetic networks simply as TSTC networks.

The underlying biological motivation for the definitions on phylogenetic networks introduced so far is the following. In a phylogenetic network, tree nodes model species (either extant, the leaves, or nonextant, the internal tree nodes), while hybrid nodes model reticulation events, where different species interact to create new species. The parents of a hybrid node represent the species involved in this event, and its single child represents the resulting species. The tree children of a tree node represent direct descendants through mutation. The first condition in the definition of phylogenetic network says that every nonextant species is assumed to have at least two different direct descendants. This is a very common restriction in any definition of phylogeny (be it a tree or a network), since species with only one child cannot be reconstructed from biological data.

The tree-sibling condition says then that, for every reticulation event, at least one of the species involved in it must have some descendant through mutation. This condition was introduced with the name class I in L. Nakhleh’s Ph.D. thesis [10], and it has reappeared in several phylogenetic network reconstruction methods [4, 5]. As far as the time consistency goes, we understand that the time assigned to a node represents the time when the corresponding species existed, or when the reticulation event took place. The first condition in time consistency means then that the species involved in a reticulation event must coexist in time in order to interact, while the second condition means that speciation takes some amount of time to take place.

3. Main Results

It is well known [13, 18] that the isomorphism problem for rDAGs is graph isomorphism-complete. It turns out that the isomorphism problem for rDAGs with their leaves injectively labeled in any given set of labels is also graph isomorphism-complete; since we have not been able to find a proof of this easy result in the literature, we provide one here.

Proposition 2. For every nonempty set $S$ of labels, the isomorphism problem for $S$-rDAGs is graph isomorphism-complete.

Proof. Without any loss of generality, we assume that $S = \{1, \ldots, n\}$.
Let us prove first that the isomorphism of $S$-rDAGs reduces to the isomorphism of rDAGs. For every $S$-rDAG $N$, let $\widetilde{N}$ be the rDAG obtained from $N$ by unlabelling its leaves and then, for each $i \in S$, if $N$ contained a leaf labeled with $i$, adding to this leaf $i$ tree-children leaves; see Figure 1. The construction of $\widetilde{N}$ from $N$ adds at most $n(n+1)/2$ nodes and arcs, and therefore it is polynomial in the size of $N$. And $N$ can be reconstructed from $\widetilde{N}$ by simply replacing, for each $i$, the node of height 1 with $i$ leaves by a leaf labeled with $i$. Then, it is straightforward to check that, for every pair of $S$-rDAGs $N$ and $N'$, $N \cong N'$ as $S$-rDAGs if, and only if, $\widetilde{N} \cong \widetilde{N'}$ as rDAGs.
Let us prove now that the isomorphism of rDAGs reduces to the isomorphism of $S$-rDAGs. For every rDAG $N$, let $\widehat{N}$ be the $S$-rDAG obtained from $N$ by adding a new node $\ell$, arcs from each leaf of $N$ to $\ell$, and finally labeling the new node with $1$; see Figure 2. The construction of $\widehat{N}$ from $N$ adds 1 node and as many arcs as $N$ has leaves, and therefore it is polynomial. And $N$ can be reconstructed from $\widehat{N}$ by simply removing its leaf $1$ and all arcs pointing to it. It is straightforward to check that, for every pair of rDAGs $N$ and $N'$, $N \cong N'$ if, and only if, $\widehat{N} \cong \widehat{N'}$ as $S$-rDAGs.
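Both reductions are elementary graph transformations. The following sketch illustrates them under stated assumptions: labels are the integers 1, 2, and so on, the leaf labeled i receives i new leaf children in the first reduction, and the new leaf added in the second reduction receives the label 1. Graphs are dictionaries from nodes to lists of children, and all function and node names are hypothetical.

```python
def encode_labels_as_fans(children, label):
    """First reduction (sketch): forget labels, give the leaf labeled i exactly i new leaf children."""
    out = {v: list(kids) for v, kids in children.items()}
    for leaf, i in label.items():
        for j in range(i):
            new = ("fan", leaf, j)  # hypothetical fresh node name
            out[leaf].append(new)
            out[new] = []
    return out


def add_labeled_sink(children, sink="sink", sink_label=1):
    """Second reduction (sketch): add one new leaf below all old leaves and label it."""
    old_leaves = [v for v, kids in children.items() if not kids]
    out = {v: list(kids) for v, kids in children.items()}
    for v in old_leaves:
        out[v].append(sink)
    out[sink] = []
    return out, {sink: sink_label}


# Hypothetical toy example: a root with two leaves x and y, labeled 1 and 2.
toy = {"r": ["x", "y"], "x": [], "y": []}
fans = encode_labels_as_fans(toy, {"x": 1, "y": 2})
sunk, sunk_label = add_labeled_sink(toy)
```

Neither function modifies its input, so the original graph remains available for the inverse step described in the proof.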

Let us see now that the isomorphism problem for $S$-rDAGs reduces to the isomorphism problem for TSTC networks on a new set of labels consisting of $S$ and two extra labels. This entails that the isomorphism of TSTC networks on sets with at least three labels is graph isomorphism-complete.

Theorem 3. For every set $S$ with $|S| \geq 3$, the isomorphism of TSTC networks on $S$ is graph isomorphism-complete.

Proof. Without any loss of generality, we assume that $S = \{1, \ldots, n+2\}$ with $n \geq 1$, and we write $S_0 = \{1, \ldots, n\}$.
The isomorphism of TSTC networks on $S$ clearly reduces to the isomorphism of $S$-rDAGs, since the former is a special case of the latter. Let us prove now the converse reduction. Since the isomorphism of $S_0$-rDAGs is graph isomorphism-complete by Proposition 2, it is enough to reduce it to the isomorphism of TSTC networks on $S$.
We will associate to each $S_0$-rDAG $N$ a TSTC network $\overline{N}$ on $S$. If $N$ is a phylogenetic tree, then it is already a TSTC network, and in this case we take $\overline{N} = N$. Consider now the case when $N$ has some hybrid node or some elementary node, and let $m$ be the largest label actually appearing in $N$. In this case, we define the TSTC network $\overline{N}$ as follows.
(1) For every hybrid node $h$ in $N$, remove all arcs from $h$ to its children, and then add a new (tree) node $h'$, an arc from $h$ to $h'$, and new arcs from $h'$ to the children of $h$ in $N$. If $h$ was a leaf, say with label $i$, then $h'$ becomes the new leaf labeled with $i$.
(2) For every hybridization arc $(u, h)$ in the resulting $S_0$-rDAG, split it into arcs $(u, w)$ and $(w, h)$, with $w$ a new (tree and, for the moment, elementary) node. Let $N_1$ denote the resulting $S_0$-rDAG after these two first steps.
(3) For every elementary node $v$ in $N_1$, add a new (tree) node $v'$ and an arc $(v, v')$.
(4) Split the arc $(p, m)$ in $N_1$ pointing to the leaf $m$ into two arcs $(p, x)$ and $(x, m)$.
(5) Add two new nodes $h_1$ and $h_2$, and, for every node $v'$ added in step (3), add arcs $(v', h_1)$ and $(v', h_2)$. Add also arcs $(x, h_1)$ and $(x, h_2)$. Notice that the nodes $h_1$ and $h_2$ will be hybrid.
(6) Add a tree leaf child labelled $n+1$ to $h_1$ and another one labelled $n+2$ to $h_2$.
An example of this construction is displayed in Figure 3.
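The six steps can be sketched in Python as follows. This is only one reading of the construction, under loudly stated assumptions: the input labels are integers in 1..n, the two extra labels are n+1 and n+2, the graph is simple and stored as a dictionary from nodes to lists of children, and the new node names (tuples starting with "t", "w", "e", "x", "H", "leaf") are hypothetical and assumed not to clash with existing nodes.

```python
def to_tstc(children, label, n):
    """Sketch of steps (1)-(6); label maps each leaf to an integer in 1..n."""
    g = {v: list(kids) for v, kids in children.items()}
    lab = dict(label)

    def indeg():
        d = {v: 0 for v in g}
        for kids in g.values():
            for v in kids:
                d[v] += 1
        return d

    d = indeg()
    hybrids = {v for v in g if d[v] >= 2}
    if not hybrids and not any(d[v] <= 1 and len(g[v]) == 1 for v in g):
        return g, lab  # already a phylogenetic tree: nothing to do
    m = max(lab.values())
    for h in hybrids:  # step (1): new tree child inherits children (and label)
        t = ("t", h)
        g[t], g[h] = g[h], [t]
        if h in lab:
            lab[t] = lab.pop(h)
    for u in list(g):  # step (2): split every hybridization arc
        for i, v in enumerate(g[u]):
            if v in hybrids:
                w = ("w", u, v)
                g[w] = [v]
                g[u][i] = w
    d = indeg()
    extra = []
    for v in list(g):  # step (3): extra leaf child for every elementary node
        if d[v] <= 1 and len(g[v]) == 1:
            e = ("e", v)
            g[v].append(e)
            g[e] = []
            extra.append(e)
    leaf_m = next(v for v, i in lab.items() if i == m)
    p = next(u for u in g if leaf_m in g[u])
    x = ("x",)  # step (4): split the arc pointing to the leaf labeled m
    g[p][g[p].index(leaf_m)] = x
    g[x] = [leaf_m]
    for k, h in enumerate((("H", 1), ("H", 2)), start=1):  # steps (5) and (6)
        leaf = ("leaf", n + k)
        g[h], g[leaf], lab[leaf] = [leaf], [], n + k
        for v in extra + [x]:
            g[v].append(h)
    return g, lab


# Hypothetical toy input: a hybrid leaf h labeled 1 with two elementary parents.
toy = {"r": ["a", "b"], "a": ["h"], "b": ["h"], "h": []}
big, big_label = to_tstc(toy, {"h": 1}, 1)
```

On this toy input the result has 16 nodes and leaf labels 1, 2, and 3; a full check that the output is a TSTC network would additionally need the tree-sibling and time consistency tests.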
Let us prove now that $\overline{N}$ is a tree-sibling time consistent phylogenetic network.
(i) It is rooted (with the same root as $N$) and acyclic, because all new arcs are either used to split arcs in $N$ into pairs of consecutive arcs, or to define paths that end in the new leaves $n+1$ or $n+2$ without forming cycles.
(ii) It has no elementary node. Indeed, any elementary node in $N$ gets an extra child in step (3), and the tree nodes that are added to $N$ either get an extra child in step (3) or they get two children in step (5).
(iii) Its hybrid nodes have only one child, and it is a tree node; this is ensured for the hybrid nodes in $N$ in step (1), and for the new hybrid nodes $h_1$ and $h_2$ by construction.
(iv) It is tree-sibling. All hybrid nodes in $N$ get a tree sibling in steps (2) and (3) (for every hybrid node $h$ in $N$, if $(u, h)$ is any arc pointing to $h$, then the tree child $w'$ of the new node $w$ added in the middle of $(u, h)$ is such a tree sibling of $h$), and the hybrid nodes $h_1$ and $h_2$ have the tree sibling $m$.
(v) It is time consistent. To check this, we use Proposition 1 (and the notations introduced therein). Since we already know that $\overline{N}$ is acyclic, any cycle in $\overline{N}^*$ must contain some inverse of a hybridization arc. There are two possibilities for this inverse. If it has the form $(h_i, v)$, with $h_i$ one of the new hybrid nodes $h_1$ or $h_2$ introduced in step (5) and $v$ one of the tree nodes introduced in step (3) or the tree node introduced in step (4), then the only tree arcs that can be reached from $v$ in $\overline{N}^*$ are those pointing to the leaves $m$, $n+1$, or $n+2$, and therefore no cycle in $\overline{N}^*$ contains this arc together with a tree arc. And if this inverse is of the form $(h, w)$, with $h$ a hybrid node in $N$ and $w$ one of the tree nodes introduced in step (2), then it must be followed in the cycle by the arc $(w, w')$ added in step (3), and, as we have just said, the only tree arcs that can be reached from $w'$ point to a leaf, and hence no cycle in $\overline{N}^*$ contains this arc and a tree arc, either.
It is clear that the construction of $\overline{N}$ from $N = (V, E)$ adds $O(|V| + |E|)$ nodes and arcs to $N$, and thus it is polynomial in the size of $N$. Notice also that in this case $\overline{N}$ always contains hybrid nodes, and in particular that it is never a phylogenetic tree. Moreover, in this nontree case, the $S_0$-rDAG $N$ can be easily reproduced from $\overline{N}$ by simply undoing its construction as follows.
(1) Remove the leaves $n+1$ and $n+2$ and their hybrid parents $h_1$ and $h_2$, together with all arcs pointing to them.
(2) Remove the elementary parent of the leaf $m$ (which will be the remaining leaf with the largest label) and replace it by an arc from the parent of the removed node to $m$.
(3) Remove all nonlabeled leaves of the resulting rDAG together with the arcs pointing to them.
(4) Remove each parent $w$ of every hybrid node, and replace it by an arc from the parent of $w$ to the hybrid child of $w$.
(5) Remove the only tree child of each hybrid node, and replace it by an arc from the hybrid node to each one of the children of the removed node.
(6) The resulting $S_0$-rDAG is $N$.
It is straightforward to check now that, for every pair of $S_0$-rDAGs $N$ and $N'$, $N \cong N'$ if, and only if, $\overline{N} \cong \overline{N'}$ as phylogenetic networks over $S$.

We cannot remove the condition $|S| \geq 3$ in the previous result because there are only two TSTC networks with fewer than 3 leaves (up to the actual names of the labels). In particular, this implies that, in the proof of the previous result, we cannot add fewer than 2 new leaves in the construction of the TSTC network from the given rDAG.

Proposition 4. There is only one TSTC network with one leaf, and only one TSTC network with two leaves (up to relabeling), and in both cases they are trees.

Proof. The $\{1\}$-rDAG consisting of a single node, labeled 1, and the $\{1, 2\}$-rDAG consisting of the phylogenetic tree with Newick code (1,2); are clearly TSTC networks. Let us check now that any other (up to relabeling) TSTC network has at least 3 leaves.
Let $N$ be a TSTC network other than those described in the last paragraph, let $\tau$ be a temporal assignment on it, and let $v$ be an internal node with largest $\tau$-value and, among those with this largest time assignment, of largest depth.
If $v$ is a tree node, then all its children are either leaves or hybrid nodes with leaf children (because any tree descendant node of $v$ has time assignment larger than $\tau(v)$). And $v$’s hybrid children would have the same time assignment as $v$ but depth larger than $v$’s depth, against the assumption. Therefore all children of $v$ are leaves, and it has at least 2 children, because it cannot be elementary. Now, if $v$ has more than 2 children, we are done, while if it has only two children, say the leaves 1 and 2, then $v$ will have a parent in $N$ (because $N$ is not the tree (1,2);). If the parent of $v$ is a tree node, let $u$ be this node, and let $w$ be another child of $u$. Since $N$ does not contain cycles, and any path from $w$ to 1 or 2 would have to contain $v$ and hence its only parent $u$, we deduce that any descendant leaf of $w$ must be different from 1 and 2; this gives at least 3 leaves. If, on the contrary, the parent of $v$ is a hybrid node $h$, let $u$ be a parent of $h$ that has a tree child, say $w$. The time consistency prevents $h$ from being a descendant of $w$ (because $\tau(h) = \tau(u) < \tau(w)$) and, therefore, since any path leading to 1 or 2 must contain $h$, any leaf that is a descendant of $w$ will be different from 1 and 2; this gives again at least 3 leaves.
If $v$ is a hybrid node, then its child is a leaf, say 1. Let $u$ be a parent of $v$ that has a tree child. Since $\tau(u) = \tau(v)$ is the largest value of an internal node of $N$, this tree child must be a leaf, say 2. Now let $u'$ be another parent of $v$. Since it is a tree node, it must have another child other than $v$, say $w$. If $w$ is a tree node, it is a leaf, as we have just seen. If $w$ is hybrid, then since $\tau(w) = \tau(u') = \tau(v)$, the tree child of $w$ must be a leaf. In both cases, we obtain a leaf that is different from 1 and 2; that is, $N$ contains at least 3 leaves.

It is usual in the literature to define a phylogenetic network on a set $S$ of taxa as an rDAG with its leaves bijectively labeled in $S$. Theorem 3 also holds in this case.

Corollary 5. For every set $S$ with $|S| \geq 3$, the isomorphism of TSTC networks with leaves bijectively labeled on $S$ is graph isomorphism-complete.

Proof. The isomorphism of TSTC networks with leaves bijectively labeled on $S$ clearly reduces to the isomorphism of TSTC networks with leaves injectively labeled on $S$, since the former is a special case of the latter. For the converse reduction, let $N$ and $N'$ be two TSTC networks with leaves injectively labeled on $S$, let $S_1 \subseteq S$ be the set of leaf labels of $N$, and let $S_1' \subseteq S$ be the set of leaf labels of $N'$. If $S_1 \neq S_1'$, then $N$ and $N'$ are not isomorphic. If $S_1 = S_1'$, let $N_S$ and $N_S'$ be the TSTC networks obtained by adding to the roots of $N$ and $N'$, respectively, leaf children bijectively labeled on $S \setminus S_1$. These TSTC networks $N_S$ and $N_S'$ have their leaves bijectively labeled on $S$, their construction from $N$ and $N'$ is polynomial in the size of $N$, $N'$, and $S$, and it is clear that $N \cong N'$ if, and only if, $N_S \cong N_S'$.
This shows that the isomorphism problem for TSTC networks with leaves bijectively labeled on $S$ is polynomially equivalent to the isomorphism problem for TSTC networks with leaves injectively labeled on $S$, which is graph isomorphism-complete by Theorem 3.
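The padding step used in this proof can be sketched as follows. The fragment assumes a network stored as a dictionary from nodes to lists of children together with a dictionary from leaves to labels; the function name, the node names of the form ("pad", s), and the example are hypothetical. Before padding two networks, their sets of leaf labels should be compared, since networks with different label sets are not isomorphic.

```python
def pad_to_bijective(children, label, root, taxa):
    """Add to the root one new leaf child for every taxon that labels no leaf."""
    missing = sorted(set(taxa) - set(label.values()))
    out = {v: list(kids) for v, kids in children.items()}
    new_label = dict(label)
    for s in missing:
        leaf = ("pad", s)  # hypothetical fresh node name
        out[root].append(leaf)
        out[leaf] = []
        new_label[leaf] = s
    return out, new_label


# Hypothetical example: a tree with leaves labeled 1 and 2, padded to taxa {1, 2, 3, 4}.
tree = {"r": ["x", "y"], "x": [], "y": []}
padded, padded_label = pad_to_bijective(tree, {"x": 1, "y": 2}, "r", {1, 2, 3, 4})
```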

4. Conclusion

We have proved that, unless the graph isomorphism problem belongs to P, there is no hope of defining a polynomially computable metric on the class of all TSTC networks on a set of at least 3 taxa. The problem of defining polynomially computable metrics on the class of all TSTC networks on a given set with all their hybrid nodes of in-degree bounded by some $k$ remains open. When $k = 2$, the $\mu$-distance [7] and Nakhleh’s metric [8, 9] are such metrics, but they are no longer metrics for $k \geq 3$ (Figure 4 in [8]). Actually, we do not even know whether the isomorphism problem for TSTC networks on a given set of taxa with globally bounded in-degree hybrid nodes (but without bounding the out-degree of the tree nodes; otherwise, Luks’ theorem [19] would apply) is always in P, but we conjecture that this is the case.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank Antoni Lozano for his comments on an early version of this paper. The research reported in this paper has been partially supported by the Spanish government and the EU FEDER Program, through Project no. MTM2009-07165.