Abstract
Phylogenetic networks are a generalization of phylogenetic trees that allow for the representation of evolutionary events acting at the population level, such as recombination between genes, hybridization between lineages, and horizontal gene transfer. Researchers have designed several measures for computing the dissimilarity between two phylogenetic networks, and each measure has been proven to be a metric on a special class of phylogenetic networks. However, none of the existing measures is a metric on the space of partly reduced phylogenetic networks. In this paper, we provide a polynomial-time computable metric on the space of partly reduced phylogenetic networks.
1. Introduction
Phylogenies reveal the history of evolutionary events of a group of species, and they are central to comparative analysis methods for testing hypotheses in evolutionary biology [1]. Computing the distance between a pair of phylogenies is very important for understanding the evolutionary history of species.
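A metric $d$ on a space $S$ satisfies four properties for all $a, b, c \in S$: (I) $d(a, b) \geq 0$ (nonnegative property); (II) $d(a, b) = 0$ if and only if $a = b$ (separation property); (III) $d(a, b) = d(b, a)$ (symmetry property); (IV) $d(a, c) \leq d(a, b) + d(b, c)$ (triangle inequality).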
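As a small illustration of properties (I)-(IV), the following sketch checks them by brute force on a finite sample of points. The function name `is_metric` and the two toy distance functions are hypothetical and chosen only for this example; passing the check on a sample does not prove that a function is a metric on the whole space.

```python
from itertools import product

def is_metric(points, d, tol=1e-12):
    """Brute-force check of properties (I)-(IV) on a finite sample of points."""
    for a, b in product(points, repeat=2):
        if d(a, b) < -tol:                        # (I) nonnegative property
            return False
        if (abs(d(a, b)) <= tol) != (a == b):     # (II) separation property
            return False
        if abs(d(a, b) - d(b, a)) > tol:          # (III) symmetry property
            return False
    for a, b, c in product(points, repeat=3):
        if d(a, c) > d(a, b) + d(b, c) + tol:     # (IV) triangle inequality
            return False
    return True

def abs_diff(a, b):
    """Absolute difference: a metric on the real numbers."""
    return abs(a - b)

def squared_diff(a, b):
    """Squared difference: violates the triangle inequality."""
    return (a - b) ** 2

print(is_metric([0, 1, 2, 3], abs_diff))
print(is_metric([0, 1, 2], squared_diff))
```

The squared difference fails property (IV) because the value for the pair (0, 2) is 4, while the values for (0, 1) and (1, 2) sum to 2.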
Phylogenetic networks can represent reticulate evolutionary events, such as recombination between genes, hybridization between lineages, and horizontal gene transfer [2–5]. For the comparison of phylogenetic networks, there are many metrics on restricted subclasses of networks, including the tripartition metric on the space of tree-child phylogenetic networks [6–9], the distance on the space of tree-sibling phylogenetic networks [10], and the distance on the space of reduced phylogenetic networks [11]. Later the distance was also proved to be a metric on the space of tree-child phylogenetic networks, semibinary tree-sibling time-consistent phylogenetic networks, and multilabeled phylogenetic trees [12–15].
For any rooted phylogenetic network, we can obtain its reduced version by removing all nodes in maximal convergent sets (discussed below) and all nodes with indegree 1 and outdegree 1. The reduced versions of all rooted phylogenetic networks form the space of reduced phylogenetic networks (the distance defined by Nakhleh is on this space). In this paper, we discuss the partly reduced version of a phylogenetic network, obtained by removing the nodes in a part of the convergent sets and all nodes with indegree 1 and outdegree 1. The partly reduced versions of all rooted phylogenetic networks form the space of partly reduced phylogenetic networks. We then introduce a novel metric on this space. The space is not the space of all rooted phylogenetic networks, but it is the largest space on which a polynomial-time computable metric has been defined so far. The papers [16, 17] have proved that the isomorphism problem for rooted phylogenetic networks is graph isomorphism-complete. Unless the graph isomorphism problem belongs to P, there is no hope of defining a polynomial-time computable metric on the space of all rooted phylogenetic networks. Our aim is therefore to find a larger space on which a polynomial-time computable metric can be defined, so that the space is closer to the space of all rooted phylogenetic networks.
2. Preliminaries
Let $N = (V, E)$ be a directed acyclic graph, or DAG for short. We denote the indegree of a node $u$ as $\mathrm{indeg}(u)$ and the outdegree of $u$ as $\mathrm{outdeg}(u)$. We will say that a node $u$ is a tree node if $\mathrm{indeg}(u) \leq 1$. In particular, $u$ is a root of $N$ if $\mathrm{indeg}(u) = 0$. If a single root exists, we will say that the DAG is rooted. We will say that a node $u$ is a reticulate node if $\mathrm{indeg}(u) \geq 2$. A tree node $u$ is a leaf if $\mathrm{outdeg}(u) = 0$. A node $u$ is called an internal node if $\mathrm{outdeg}(u) \geq 1$. For a DAG $N = (V, E)$, we will say that $v$ is a child of $u$ if $(u, v) \in E$; in this case, we will also say that $u$ is a parent of $v$. Note that any tree node has a single parent, except for the root of the graph. Whenever there is a directed path from a node $u$ to $v$, we will say that $v$ is a descendant of $u$ or $u$ is an ancestor of $v$.
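The node types introduced above can be read off directly from the degrees. The sketch below assumes the usual thresholds (tree node: indegree at most 1; reticulate node: indegree at least 2; root: indegree 0; leaf: tree node with outdegree 0; internal node: outdegree at least 1). The function name `classify_nodes` and the edge-list representation are hypothetical choices for this example.

```python
from collections import defaultdict

def classify_nodes(edges):
    """Label every node of a DAG given as (parent, child) pairs."""
    indeg, outdeg, nodes = defaultdict(int), defaultdict(int), set()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
        nodes.update((u, v))
    kinds = {}
    for n in nodes:
        # Assumed thresholds: indegree <= 1 is a tree node, otherwise reticulate.
        tags = {"tree" if indeg[n] <= 1 else "reticulate"}
        if indeg[n] == 0:
            tags.add("root")
        if outdeg[n] == 0 and indeg[n] <= 1:
            tags.add("leaf")
        if outdeg[n] >= 1:
            tags.add("internal")
        kinds[n] = tags
    return kinds

# Toy network: root r, tree nodes a and b, reticulate node h, leaves 1, 2, 3.
toy_edges = [("r", "a"), ("r", "b"), ("a", "h"), ("b", "h"),
             ("a", "1"), ("b", "3"), ("h", "2")]
for node, tags in sorted(classify_nodes(toy_edges).items()):
    print(node, sorted(tags))
```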
The height of a node is the length of a longest path starting at the node and ending in a leaf. The absence of cycles implies that the nodes of a DAG can be stratified by means of their heights: the nodes of height 0 are the leaves; if a node has height $m > 0$, then all its children have heights smaller than $m$ and at least one of them has height exactly $m - 1$.
The depth of a node is the length of a longest path starting at the root and ending in the node. Similarly, the absence of cycles implies that the nodes of a DAG can also be stratified according to their depths: the node of depth 0 is the root; if a node has depth $m > 0$, then all its parents have depths smaller than $m$ and at least one of them has depth exactly $m - 1$.
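Height and depth, as defined above, are both longest-path lengths and can be computed with memoized recursion over children and parents, respectively. This is a minimal sketch; the function name `heights_and_depths` and the edge-list input are assumptions made for the example, and the recursion is only suitable for small networks.

```python
from functools import lru_cache

def heights_and_depths(edges):
    """Return (height, depth) dictionaries for a rooted DAG of (parent, child) pairs."""
    children, parents, nodes = {}, {}, set()
    for u, v in edges:
        children.setdefault(u, []).append(v)
        parents.setdefault(v, []).append(u)
        nodes.update((u, v))

    @lru_cache(maxsize=None)
    def height(n):
        # Leaves have height 0; otherwise one more than the tallest child.
        return 0 if n not in children else 1 + max(height(c) for c in children[n])

    @lru_cache(maxsize=None)
    def depth(n):
        # The root has depth 0; otherwise one more than the deepest parent.
        return 0 if n not in parents else 1 + max(depth(p) for p in parents[n])

    return {n: height(n) for n in nodes}, {n: depth(n) for n in nodes}

toy_edges = [("r", "a"), ("r", "b"), ("a", "h"), ("b", "h"),
             ("a", "1"), ("b", "3"), ("h", "2")]
heights, depths = heights_and_depths(toy_edges)
print(heights["r"], depths["2"])
```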
Let $X$ be a set of taxa. A rooted phylogenetic network on $X$ is a rooted DAG such that (i) no tree node has outdegree 1; (ii) its leaves are labeled by the elements of $X$ through a bijective mapping.
We use the notation (or ) for the rooted phylogenetic network and the notation for its leaf set.
Definition 1. Two rooted phylogenetic networks and are isomorphic if and only if there is a bijection from to such that (i) is an edge in if and only if is an edge in ;(ii) for all .
Moret et al. (2004) discussed the concept of reduced phylogenetic networks from a reconstruction standpoint. Subsequently, we briefly review the concept of reduced phylogenetic networks and introduce a new definition of partly reduced phylogenetic networks. In the following section, we present a metric on the space of all partly reduced phylogenetic networks. First we review the concept of a maximal convergent set that has been given in [7, 11].
Definition 2. Given a network $N$, we say that a set $U$ of internal nodes in $N$ is convergent if $|U| \geq 2$ and every leaf reachable from some node in $U$ is reachable from all nodes in $U$. If there is no convergent set properly containing $U$, we say that $U$ is a maximal convergent set.
Here the set of leaves reachable from the nodes in a convergent set $U$ is called the leaf set of $U$.
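The convergence condition of Definition 2 amounts to all nodes of the set reaching exactly the same leaves. The sketch below assumes that the size condition in Definition 2 is "at least two nodes", as in the cited definition of Moret et al.; the names `reachable_leaves` and `is_convergent` and the child-list dictionary are hypothetical.

```python
def reachable_leaves(children, node):
    """Collect the leaves reachable from `node` by depth-first search."""
    seen, stack, leaves = set(), [node], set()
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        kids = children.get(n, [])
        if not kids:
            leaves.add(n)
        stack.extend(kids)
    return leaves

def is_convergent(children, candidate):
    """Check whether a set of internal nodes is convergent."""
    candidate = set(candidate)
    if len(candidate) < 2:                      # assumed size condition
        return False
    if any(not children.get(n) for n in candidate):
        return False                            # only internal nodes qualify
    leaf_sets = [frozenset(reachable_leaves(children, n)) for n in candidate]
    return all(ls == leaf_sets[0] for ls in leaf_sets)

# a and b both reach leaves 1 and 2, so {a, b} is convergent.
net = {"r": ["a", "b"], "a": ["x", "y"], "b": ["x", "y"], "x": ["1"], "y": ["2"]}
print(is_convergent(net, {"a", "b"}))
print(is_convergent(net, {"x", "y"}))
```

Testing maximality would additionally require checking that no further internal node can be added while keeping the set convergent.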
We will take Figure 1 as an example in the following. The two networks , on are adapted from refinements () and () in Table in [11].
Example 3. Consider the networks in Figure 1. The set is the only maximal convergent set of and the set is the only maximal convergent set of .
For a phylogenetic network $N$ on $X$, the reduced version of $N$ can be obtained by the following reduction procedures:
(1) For each maximal pendant subtree $t$ (i.e., a maximal clade that includes no reticulate nodes), rooted at node $r$, create a new node $r'$ and an edge $(p, r')$, where $p$ is the parent of $r$, delete the edge $(p, r)$ and the subtree $t$, and label $r'$ as $t$. We denote the resulting network as $N'$.
(2) Repeat the following two steps on $N'$ until no change occurs:
(I) For each maximal convergent set $U$ with leaf set $L(U)$, remove all nodes and edges on the paths from a node in $U$ to the parent of a leaf in $L(U)$, including all nodes in $U$ and excluding the parents of the leaves in $L(U)$. For each edge $(u, v)$, where $u$ lies outside the deleted set and $v$ lies inside the deleted set, replace it with the set of edges $\{(u, q) : q \text{ is the parent of a leaf in } L(U)\}$.
(II) For each node $v$ in the network with $\mathrm{indeg}(v) = \mathrm{outdeg}(v) = 1$, remove the edges $(u, v)$, $(v, w)$ and the node $v$, and add an edge $(u, w)$, where $u$ is the parent of $v$ and $w$ is the child of $v$. Repeat this step until no such node can be removed.
(3) Replace each leaf labeled by the subtree $t$ by its root $r$.
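Step (II) of the procedure above, the suppression of nodes with indegree 1 and outdegree 1, is the simplest part to illustrate. The following is a minimal sketch and not the author's implementation; the function name `suppress_unary_nodes` and the edge-set representation are assumptions. Because edges are stored in a set, a bypass edge that already exists is merged rather than duplicated.

```python
def suppress_unary_nodes(edges):
    """Repeatedly bypass nodes that have exactly one parent and one child."""
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        nodes = {n for e in edges for n in e}
        for v in nodes:
            incoming = [e for e in edges if e[1] == v]
            outgoing = [e for e in edges if e[0] == v]
            if len(incoming) == 1 and len(outgoing) == 1:
                u, w = incoming[0][0], outgoing[0][1]
                edges -= {incoming[0], outgoing[0]}   # remove (u, v) and (v, w)
                edges.add((u, w))                     # add the bypass edge (u, w)
                changed = True
                break                                 # degrees changed, so rescan
    return edges

# The chain r -> a -> b -> 1 collapses to the single edge r -> 1.
print(sorted(suppress_unary_nodes([("r", "a"), ("a", "b"), ("b", "1"), ("r", "2")])))
```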
Figure 2 shows the results of applying the reduction procedures to the network . For the networks in Figure 1, their reduced versions are the same (see Figure 3). The reduced versions of all rooted phylogenetic networks form the space of reduced phylogenetic networks. Nakhleh has introduced a polynomial-time computable metric on this space [11]. In order to enlarge the space on which a polynomial-time computable metric can be defined, we will introduce a new metric and a new space that contains the space of reduced phylogenetic networks.
Definition 4. Given a network , let be the set of parents of a node in . We say that is a super convergent set, if(i) is a convergent set;(ii) for any two nodes ;(iii) is a convergent set for a node , if .
Example 5. The set is the only superconvergent set for any one network in Figure 4, while the networks in Figure 1 have no superconvergent set.
We obtain new reduction procedures, called partial reduction procedures, from the above reduction procedures by processing superconvergent sets rather than maximal convergent sets in step (I) of step (2). Applying the partial reduction procedures to a rooted phylogenetic network yields its partly reduced version. The partly reduced versions of all rooted phylogenetic networks form the space of partly reduced phylogenetic networks. This space contains the space of reduced phylogenetic networks, but the two spaces are not identical. Next we will introduce a polynomial-time computable metric for the partly reduced phylogenetic networks.
We begin with the notion of node semiequivalence. For the sake of simplicity, we will hereafter refer to rooted phylogenetic networks simply as networks.
3. A Metric
Definition 6. Given a network , we say that two nodes (not necessarily different) are semiequivalent, denoted by , if (i) and or(ii)node has children ; node has children , and for .
It follows from the definition that the semiequivalence of nodes is an equivalence relation (that is, it is reflexive, symmetric, and transitive) and that semiequivalent nodes must have the same height.
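Because semiequivalence is an equivalence relation defined recursively through children, one way to compute it is to assign each node a class identifier bottom-up: a leaf is identified by its label, and an internal node by the sorted multiset of its children's classes. This is a sketch under that reading of Definition 6, not the author's algorithm; `semiequivalence_classes`, the child-list dictionary, and the `leaf_label` dictionary are hypothetical names.

```python
def semiequivalence_classes(children, leaf_label):
    """Map each node to an integer class; equal classes mean semiequivalent nodes."""
    class_ids, node_class = {}, {}

    def visit(n):
        if n not in node_class:
            kids = children.get(n, [])
            if not kids:
                signature = ("leaf", leaf_label[n])
            else:
                # Children are matched pairwise, so only the multiset of classes matters.
                signature = ("node", tuple(sorted(visit(c) for c in kids)))
            node_class[n] = class_ids.setdefault(signature, len(class_ids))
        return node_class[n]

    all_nodes = set(children) | {c for kids in children.values() for c in kids}
    for n in all_nodes:
        visit(n)
    return node_class

net = {"r": ["a", "b"], "a": ["x", "y"], "b": ["x", "y"], "x": ["1"], "y": ["2"]}
cls = semiequivalence_classes(net, {"1": "1", "2": "2"})
print(cls["a"] == cls["b"], cls["x"] == cls["y"])
```

In this toy network the nodes `a` and `b` share the same children and therefore fall into one class, while `x` and `y` lead to different leaves and do not.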
Example 7. Consider the network in Figure 1. For any node , is only semiequivalent to itself, while the nodes and are semiequivalent.
Property 1. If are semiequivalent from the network , then are the same nodes or there are the nodes ( or a descendant of ), ( or a descendant of ( or a descendant of ) such that have the same children. See Figure 5.
Proof. We use induction on the height of to prove it. If , obviously are the only leaf. Thus, in this case, the property holds. We assume that the result is tenable when , and let . Then the children of are semiequivalent, respectively (let the children of be for ; then are semiequivalent for ), and their height is at most by the property of node height. By the induction hypothesis, the children of satisfy the property. The descendants of children of are the descendants of . Thus, the property holds.
Definition 8. Given a network , we say that two nodes (not necessarily different) are equivalent, denoted by , if , and(i) are the root or(ii)node has parents ; node has parents , and for .
For any node in , it is equivalent to itself. The equivalence of nodes is also an equivalence relation. The equivalent nodes have the same height and depth.
Example 9. Consider the network in Figure 1. For any node , it is equivalent to itself. Consider the network in Figure 4. For any node , it is equivalent to itself, while the nodes and are equivalent to each other.
Property 2. If are equivalent in the network , then are the same nodes or there are the nodes ( or an ancestor of ), or an ancestor of ( or an ancestor of ) such that have the same parents. See Figure 6.
Proof. We use induction on the depth of to prove it. If , then are the unique root node. Thus, in this case, the property holds. We assume that the result is tenable when , and let . Then the parents of are equivalent, respectively (let the parents of be for ; then are equivalent for ), and their depth is at most by the property of node depth. By the induction hypothesis, the parents of satisfy the property. The ancestors of the parents of are the ancestors of . Thus, the property holds.
In this paper, we are mainly concerned with comparing networks, so the notions of node semiequivalence and equivalence are extended to nodes from two different networks through the semiequivalence and equivalence mappings of Definitions 10 and 13, respectively.
Given a set , we use to denote the set of all subsets of .
Definition 10. Let and be two networks on . We define the semiequivalence mapping between and , , such that , for and , if(i), , and or(ii)node has children ; node has children , and for .
Further, while inequation holds in phylogenetic trees, it is not always the case for general phylogenetic networks.
Example 11. Consider the networks in Figure 1. is a semiequivalence mapping between and . For the reticulate nodes and in , and . For the other nodes in , , , , , and .
Theorem 12. Let and be two networks on , and let be two nodes in and a semiequivalence mapping between and . Assume that and . Then, if and only if , for and .
Proof. For the “only if” direction, let , , and . Obviously, , , , and have the same height . Then, we use induction on such height to prove . In particular, if , that is, , and , then and . Thus, in this case, . We assume that the result is tenable when , and let . We assume that node has children . Due to , it follows that node has children , and (). Due to and , it follows that has children , and (), has children , and (). The height of , , , and is at most . By the induction hypothesis, . Thus, .
For the “if” direction, let , , and . Similarly, we also use induction on the same height of , , , and to prove . If , that is, , and , then and . Thus, in this case, . We assume that the result is tenable when , and let . We assume that node has children . Since , node has children , and (). Since and , has children , and (), has children , and (). The height of , , , and is at most by the property of node height. By the induction hypothesis, . Thus, .
Theorem 12 tells us that the semiequivalence mapping keeps the semiequivalence of nodes. Thus, all nodes in are semiequivalent. Sometimes we use to denote an arbitrary node in the set. We say that the nodes in are semiequivalent with .
Definition 13. Let and be two networks on . We define the equivalence mapping between and , , such that , for and , if , and(i) are the roots or(ii)node has parents ; node has parents , and , for , where is a semiequivalence mapping between and .
Example 14. Consider the networks in Figure 1. is the semiequivalence mapping between and discussed in Example 11. is an equivalence mapping between and defined in Definition 13. For any node , , while when
Theorem 15. Let and be two networks on , and let be two nodes in . is an equivalence mapping between and . Assume that and . Then, if and only if , for and .
Proof. Let , . Then , based on Definition 13. For the “only if” direction, let . We can deduce that according to Theorem 12, and , and and have the same depth . Then, we use induction on to prove that . If , that is, , are the unique root node of , then , are the unique root node of . Thus, in this case, . We assume that the result is tenable when , and let . We assume that node has parents . Due to , node has parents , and (). Due to and , has parents , and (), has parents , and (). The depth of , , , and is at most by the property of node depth. By the induction hypothesis, . Thus, .
For the “if” direction, let , , and . We can deduce first that according to Theorem 12. Similarly, we also use induction on the same depth of , and , to prove that . If , that is, , are the unique root node of , then , are the unique root node of . Thus, in this case, . We assume that the result is tenable when , and let . We assume that node has parents . Due to , node has parents , and (). Due to and , has parents , and (), has parents , and (). The depth of , , , and is at most . So, by the induction hypothesis, . Thus, .
Theorem 15 tells us that the equivalence mapping keeps the equivalence of nodes. Thus, all nodes in are equivalent. Sometimes we use to denote an arbitrary node in the set. We say that the nodes in are equivalent to .
Lemma 16. Let be a network and , two equivalent nodes. Then , belong to a superconvergent set.
Proof. This lemma is obtained easily from Properties 1 and 2.
Lemma 17. Let be a partly reduced phylogenetic network. Then for any two nodes .
Proof. From the partial reduction procedures of the network, we have that all superconvergent sets in a partly reduced network have been deleted.
Given two networks and , assume that . The unique nodes of , denoted by , is defined by the following processes. First let . Then for each one node , if there exists no node such that , add to . We define in a similar way. Further for each node , we define and similarly for each node . We define for any network . When the context is clear, we drop the subscript of . We are now in a position to define the measure on pairs of partly reduced phylogenetic networks.
Definition 18. Let and be two phylogenetic networks on . Then equals where is a node in that is equivalent to , and if no such equivalent node exists, then .
Lemma 19. If for two networks and , then .
Proof. Let and be two equivalence mappings from Definition 13. Since , it follows that (where denotes a node , which is equivalent to and in along with for all and (where denotes a node , which is equivalent to and in along with for all . From this and Theorem 15, we have that (due to ) and (due to ). Thus .
Theorem 20. Let and be two partly reduced networks. Then, and are isomorphic if and only if .
Proof. Let be an equivalence mapping, as given in Definition 13. From Lemma 19, it follows that and for all . From Lemmas 16 and 17, we have that is defined and unique for each . We now prove that if , then , where and . Given that , that is, and are equivalent, this implies that and have equivalent parents. Since is defined and unique, is a parent of . Thus, . It shows that the mapping g is bijective, which also preserves the labels of the leaves and the edges of networks. Thus, and are isomorphic.
The converse implication is obvious.
From the definition of the measure, the symmetry property follows immediately.
Lemma 21. For any pair of networks and , one has .
The measure can be viewed as half of the symmetric difference of two multisets on the same set of elements, where the multiplicity of element in is and similarly for . Since the symmetric difference defines a metric on multisets [12], we have the following triangle inequality.
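The multiset view described above can be illustrated with Python's `collections.Counter`, whose subtraction keeps only positive multiplicities. This sketch shows only the "half of the symmetric difference" computation on arbitrary multisets; the function name and the toy elements are hypothetical, and the elements stand in for equivalence classes of nodes.

```python
from collections import Counter

def half_multiset_symmetric_difference(a, b):
    """Half the size of the symmetric difference of two multisets."""
    a, b = Counter(a), Counter(b)
    # (a - b) keeps elements with surplus multiplicity in a, and vice versa.
    difference = (a - b) + (b - a)
    return sum(difference.values()) / 2

first = ["p", "p", "q"]
second = ["p", "q", "q", "s"]
print(half_multiset_symmetric_difference(first, second))
```

Here the first multiset has one surplus `p`, and the second has one surplus `q` and one `s`, so the symmetric difference has size 3 and the result is 1.5.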
Lemma 22. Let , , and be three networks. Then, .
From Theorem 20 and Lemmas 21 and 22, we have the following main result.
Theorem 23. The measure is a metric on the space of partly reduced phylogenetic networks.
Proof. It follows from Theorem 20 and Lemmas 21 and 22 and the fact that max.
Let and be two phylogenetic networks. For a node in , we refer to its semiequivalent nodes from as internal semiequivalence (equivalence) nodes and its semiequivalent (equivalence) nodes from as external semiequivalence (equivalence) nodes. When computing the distance between two networks, we first compute internal and external equivalence nodes for every node in the two networks; subsequently by formula (1) we obtain the distance between the two considered networks. The maximum of measure is , when any node in and in has no external equivalence nodes.
In order to show the results of the distance computed by formula (1), we give an example as follows.
Example 24. Consider the networks in Figure 1. , are two different networks on . However, in [11], they are indistinguishable and their distance [11] is 0. Now, we compute the distance between them: (see Example 14).
4. Computational Aspects
From the definition of semiequivalent nodes, whether in the same network or in two different networks, the semiequivalent nodes can be computed by means of a bottom-up technique. Similarly, the equivalent nodes can be computed by means of a top-down technique. Let and be two phylogenetic networks. For a pair of nodes and , whether in the same network or in different networks, Algorithm 1 decides whether they are semiequivalent to each other, Algorithm 2 decides whether they are equivalent to each other, and Algorithm 3 computes the distance for a pair of networks (where ISE abbreviates the set of internal semiequivalent nodes, ESE the set of external semiequivalent nodes, IE the set of internal equivalent nodes, and EE the set of external equivalent nodes). If two nodes and from the same network are semiequivalent, then we add to the ISE of and add to the ISE of . This decision costs at most time, where . So, it takes time in total to find all internal and external semiequivalent nodes for every node in the two networks. In a similar way, it also takes time to find all internal and external equivalent nodes for every node in the two networks. Subsequently we spend time computing formula (1). In conclusion, it costs time in total to compute the distance between two networks, where is the maximum of their node numbers.
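The bottom-up and top-down passes described above can be sketched as follows. This is not the pseudocode of Algorithms 1-3 and makes no claim about matching the stated running time. It assumes that equivalence means "semiequivalent, with parents that are pairwise equivalent (or both nodes are roots)", and that the distance equals half the symmetric difference of the two multisets of equivalence classes, following the multiset description of the measure. All function and variable names are hypothetical.

```python
from collections import Counter

def equivalence_multiset(children, leaf_label, semi_ids, equiv_ids):
    """Count nodes per equivalence class; the id tables are shared across networks."""
    nodes = set(children) | {c for kids in children.values() for c in kids}
    parents = {n: [] for n in nodes}
    for u, kids in children.items():
        for c in kids:
            parents[c].append(u)
    semi, equiv = {}, {}

    def semi_class(n):
        # Bottom-up: a leaf is its label, an internal node its children's classes.
        if n not in semi:
            kids = children.get(n, [])
            sig = (("leaf", leaf_label[n]) if not kids
                   else ("node", tuple(sorted(semi_class(c) for c in kids))))
            semi[n] = semi_ids.setdefault(sig, len(semi_ids))
        return semi[n]

    def equiv_class(n):
        # Top-down: semiequivalence class plus the classes of all parents.
        if n not in equiv:
            sig = (semi_class(n), tuple(sorted(equiv_class(p) for p in parents[n])))
            equiv[n] = equiv_ids.setdefault(sig, len(equiv_ids))
        return equiv[n]

    return Counter(equiv_class(n) for n in nodes)

def network_distance(net1, labels1, net2, labels2):
    """Assumed form of the measure: half the multiset symmetric difference."""
    semi_ids, equiv_ids = {}, {}
    m1 = equivalence_multiset(net1, labels1, semi_ids, equiv_ids)
    m2 = equivalence_multiset(net2, labels2, semi_ids, equiv_ids)
    return sum(((m1 - m2) + (m2 - m1)).values()) / 2

labels = {"1": "1", "2": "2", "3": "3"}
tree_a = {"r": ["a", "3"], "a": ["1", "2"]}   # ((1,2),3)
tree_b = {"r": ["a", "2"], "a": ["1", "3"]}   # ((1,3),2)
print(network_distance(tree_a, labels, tree_a, labels))
print(network_distance(tree_a, labels, tree_b, labels))
```

For the two toy trees no node of one has an equivalent node in the other, so the sketch returns half the total number of nodes.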



5. Conclusion
In [11], Nakhleh introduced a polynomial-time computable m-distance on the space of reduced phylogenetic networks. In order to enlarge the space of phylogenetic networks we can compare, we devised a polynomial-time computable distance on the space of partly reduced phylogenetic networks, which can be viewed as half of the symmetric difference of two multisets on the same set of elements. To our knowledge, this space is the largest space that has a polynomial-time computable metric. Our distance is also a metric on the space of reduced phylogenetic networks, which is included in the space of partly reduced phylogenetic networks. In general, for two phylogenetic networks, their distance is larger than their m-distance. From [12], we have that the distance is also a metric on the space of tree-child phylogenetic networks, semibinary tree-sibling time-consistent phylogenetic networks, and multilabeled phylogenetic trees. However, the distance is not a metric on the space of all rooted phylogenetic networks; for example, the two phylogenetic networks in Figure 4 have distance 0, but they are not isomorphic.
The distance can also be applied to computing the dissimilarity of other types of networks, such as spiking neural networks [18–20], which will be a direction of further research.
Competing Interests
The author declares that they have no competing interests.
Acknowledgments
This work was supported by the Natural Science Foundation of Inner Mongolia province of China (2015BS0601).