## Scalable Data Mining Algorithms in Computational Biology and Biomedicine

Research Article | Open Access

Juan Wang, Zhibin Zhang, Yanjuan Li, "Constructing Phylogenetic Networks Based on the Isomorphism of Datasets", *BioMed Research International*, vol. 2016, Article ID 4236858, 7 pages, 2016. https://doi.org/10.1155/2016/4236858

# Constructing Phylogenetic Networks Based on the Isomorphism of Datasets

**Academic Editor:** Yungang Xu

#### Abstract

Constructing rooted phylogenetic networks from rooted phylogenetic trees has become an important problem in molecular evolution. Many methods have been proposed for this problem, and the most efficient ones, such as CASS, LNETWORK, and BIMLR, are based on the incompatibility graph. This paper studies what the methods based on the incompatibility graph have in common, the relationship between the incompatibility graph and the phylogenetic network, and the topologies of incompatibility graphs. For a given topology, we can find all of the simplest datasets and construct a network for each of them. For any dataset $\mathcal{C}$, a network can then be computed from the network representing the simplest dataset that is isomorphic to $\mathcal{C}$. This saves time for algorithms that construct networks.

#### 1. Introduction

The evolutionary history of species is usually represented as a (rooted) phylogenetic tree, in which each species has only one parent. In reality, evolution also involves reticulate events such as hybridization, horizontal gene transfer, and recombination [1–5], so a species may have more than one parent. Phylogenetic trees therefore cannot fully describe the evolutionary history of species. Phylogenetic networks, which generalize phylogenetic trees, can represent reticulate events. They can also represent conflicting evolutionary information that may come from different datasets or different trees [6–9].

Phylogenetic networks can be classified into unrooted [10–12] and rooted networks [4, 13–19]. An unrooted phylogenetic network is an unrooted graph whose leaves are bijectively labelled by the taxa. A rooted phylogenetic network is a rooted directed acyclic graph (DAG for short) whose leaves are bijectively labelled by taxa [20–22]. The rooted phylogenetic networks have been studied widely for representing the evolution of taxa, as evolution of species is inherently directed. The paper will study relevant properties of the rooted phylogenetic networks constructed from the rooted trees.

The algorithms constructing rooted phylogenetic networks from rooted phylogenetic trees are mainly classified into three types: the cluster network [17] based on the Hasse diagram; the galled network [16] based on the seed-growing algorithm; the CASS [23], the LNETWORK [24], and the BIMLR [25] based on the decomposition property of networks. In particular, the third type of methods (CASS, LNETWORK, and BIMLR) can construct more precise networks than the other methods. In the following, unless otherwise specified, we refer to rooted phylogenetic networks as networks.

Let $\mathcal{X}$ be a set of taxa. A proper subset $C$ of $\mathcal{X}$ (excluding both $\emptyset$ and $\mathcal{X}$) is called a cluster. A cluster $C$ is trivial if $|C| = 1$; otherwise, it is nontrivial. Let $T$ be a rooted phylogenetic tree on $\mathcal{X}$; if there is an edge $e = (u, v)$ in $T$ such that the set of taxa that are descendants of $v$ equals $C$, we say that $T$ represents $C$. Figure 1 shows two rooted phylogenetic trees $T_1$ and $T_2$ and all nontrivial clusters represented by $T_1$ and $T_2$; trivial clusters are not listed. Given a network $N$ and a cluster $C$, suppose that, for each reticulate node (i.e., a node with more than one incoming edge), exactly one incoming edge is kept and all other incoming edges are disconnected. If there is then a tree edge $e = (u, v)$ (i.e., the incoming edge of $v$) in $N$ such that the set of taxa that are descendants of $v$ equals $C$, we say that $N$ represents $C$ in the softwired sense. On the other hand, if there is a tree edge $e = (u, v)$ in $N$ such that, with no edges disconnected, the set of taxa that are descendants of $v$ equals $C$, we say that $N$ represents $C$ in the hardwired sense.
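The notion of a tree representing a cluster can be illustrated with a minimal sketch. The nested-tuple encoding of trees and the two example trees below are assumptions made for illustration only; they are not the trees of Figure 1.

```python
def clusters_of_tree(tree):
    """Return all clusters (frozensets of taxa) represented by a rooted tree.

    Assumed encoding: a leaf is a string, an internal node is a tuple of subtrees.
    The full taxon set is excluded because a cluster is a proper subset.
    """
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            taxa = frozenset([node])
        else:
            taxa = frozenset().union(*(leaves_below(child) for child in node))
        found.add(taxa)
        return taxa

    all_taxa = leaves_below(tree)
    found.discard(all_taxa)
    return found


def nontrivial(clusters):
    """Keep only clusters with more than one taxon."""
    return {c for c in clusters if len(c) > 1}


# Two hypothetical rooted trees on the taxa a, b, c, d.
t1 = ((("a", "b"), "c"), "d")
t2 = (("a", ("b", "c")), "d")

print(sorted(sorted(c) for c in nontrivial(clusters_of_tree(t1))))
print(sorted(sorted(c) for c in nontrivial(clusters_of_tree(t2))))
```

In this toy input the two trees disagree on whether `b` groups with `a` or with `c`, which is the kind of conflict a network is meant to represent.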

The three types of methods above are all based on clusters; that is, they first compute all of the clusters represented by the input trees and then construct a network representing all of these clusters in the softwired sense. In this process, the third type of methods (CASS, LNETWORK, and BIMLR) relies on the incompatibility graph, which is defined in Section 2. This paper discusses the relationship between incompatibility graphs and the constructed networks.

#### 2. Preliminaries

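A rooted phylogenetic network $N = (V, E)$ on $\mathcal{X}$ is a rooted DAG whose leaves are bijectively labelled by $\mathcal{X}$. The indegree of a node $v$ is denoted by $\delta^-(v)$. A node $v$ with $\delta^-(v) \geq 2$ is called a reticulate node, a node $v$ with $\delta^-(v) \leq 1$ is called a tree node, and the tree node with indegree 0 is the root node. The reticulation number of a network $N$ is $r(N) = \sum_{v \in V:\, \delta^-(v) > 0} (\delta^-(v) - 1) = |E| - |V| + 1$.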
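As a small illustration, the reticulation number can be computed from an edge list. This sketch assumes the usual definition (each node with indegree greater than 1 contributes its indegree minus one, which equals $|E| - |V| + 1$ for a rooted DAG with a single root); the edge-list encoding and the example network are hypothetical.

```python
from collections import Counter


def reticulation_number(edges):
    """Sum of (indegree - 1) over all nodes with indegree greater than 1."""
    indegree = Counter(v for _, v in edges)
    return sum(d - 1 for d in indegree.values() if d > 1)


# Hypothetical network: root 0, tree nodes 1 and 2, reticulate node 3, leaves a, b, c.
edges = [(0, 1), (0, 2), (1, "a"), (1, 3), (2, 3), (2, "c"), (3, "b")]
nodes = {n for edge in edges for n in edge}

print(reticulation_number(edges))
print(len(edges) - len(nodes) + 1)  # same value via |E| - |V| + 1
```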

Given a set of taxa $\mathcal{X}$, two clusters $C_1$ and $C_2$ on $\mathcal{X}$ are called compatible if they are disjoint or one contains the other; that is, $C_1 \cap C_2 = \emptyset$ or $C_1 \subseteq C_2$ or $C_2 \subseteq C_1$; otherwise, they are incompatible. Obviously, a trivial cluster is compatible with every cluster. Given two incompatible clusters $C_1$ and $C_2$, $C_1 \cap C_2$ is called the incompatible taxa with respect to $C_1$ and $C_2$. A set of clusters $\mathcal{C}$ on $\mathcal{X}$ is called compatible if its clusters are pairwise compatible; otherwise, it is incompatible. For a set of clusters $\mathcal{C}$, its incompatibility graph $IG(\mathcal{C}) = (V, E)$ is an undirected graph with node set $V = \mathcal{C}$ and edge set $E$, where an edge connects two incompatible clusters.
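The compatibility test and the incompatibility graph translate directly into code. The following is a minimal sketch in which clusters are encoded as `frozenset` objects; the function names and the example clusters are illustrative assumptions.

```python
from itertools import combinations


def compatible(c1, c2):
    """Two clusters are compatible if they are disjoint or one contains the other."""
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1


def incompatibility_graph(clusters):
    """Return (nodes, edges); each edge is a frozenset of two incompatible clusters."""
    nodes = list(clusters)
    edges = {
        frozenset([c1, c2])
        for c1, c2 in combinations(nodes, 2)
        if not compatible(c1, c2)
    }
    return nodes, edges


ab, bc, abc = frozenset("ab"), frozenset("bc"), frozenset("abc")
nodes, edges = incompatibility_graph([ab, bc, abc])
print(len(edges))  # only {a, b} and {b, c} conflict in this toy input
```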

Given a cluster set $\mathcal{C}$ on $\mathcal{X}$ and a subset $S$ of $\mathcal{X}$, the result of removing all elements in $\mathcal{X} \setminus S$ from each cluster in $\mathcal{C}$ is called the restriction of $\mathcal{C}$ to $S$, denoted by $\mathcal{C}|_S$. If $S$ (where $S \subseteq \mathcal{X}$) is compatible with every cluster $C \in \mathcal{C}$ and $\mathcal{C}|_S$ is also compatible, then we say that $S$ is an ST-set (Strict Tree Set) with respect to $\mathcal{C}$. If no other ST-set contains $S$, we say that $S$ is maximal. For a maximal ST-set $S$, there is a subtree constructed from the set of clusters $\mathcal{C}|_S$.
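A minimal sketch of the ST-set test follows. It assumes that the restriction to a subset $S$ intersects every cluster with $S$ and drops empty results; the function names and example clusters are hypothetical.

```python
from itertools import combinations


def compatible(c1, c2):
    """Two clusters are compatible if they are disjoint or one contains the other."""
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1


def restrict(clusters, s):
    """Restriction of a cluster set to s: intersect each cluster with s, drop empties."""
    return {c & s for c in clusters if c & s}


def is_st_set(s, clusters):
    """s is an ST-set if it is compatible with every cluster and the restriction is compatible."""
    if not all(compatible(s, c) for c in clusters):
        return False
    return all(compatible(c1, c2) for c1, c2 in combinations(restrict(clusters, s), 2))


clusters = [frozenset("ab"), frozenset("abc"), frozenset("cd")]
print(is_st_set(frozenset("ab"), clusters))  # compatible with everything
print(is_st_set(frozenset("bc"), clusters))  # conflicts with {a, b}

# Here {a, b, c} is compatible with every cluster, but its restriction is not compatible.
conflicting = [frozenset("ab"), frozenset("bc"), frozenset("abcd")]
print(is_st_set(frozenset("abc"), conflicting))
```

Finding the maximal ST-sets would additionally require searching over candidate subsets, which this sketch does not attempt.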

For each maximal ST-set with respect to , after collapsing it into a single taxon , the result set is denoted as Collapse. For example, is the only maximal ST-set; then, Collapse. Then, the taxa of Collapse are , denoted as (Collapse). A set of clusters is called the simplest if it has no maximal ST-set with respect to .

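Let $\mathcal{C}$ be a set of clusters on $\mathcal{X}$ and let $N$ be a network representing $\mathcal{C}$. In general, a tree edge in $N$ can represent more than one cluster in $\mathcal{C}$, and a cluster in $\mathcal{C}$ can be represented by more than one tree edge in $N$. A mapping $\epsilon$ is defined from $\mathcal{C}$ to the set of tree edges of $N$ such that $\epsilon(C)$ is a tree edge of $N$ that represents $C$ for every cluster $C \in \mathcal{C}$. A network $N$ is decomposable with respect to $\mathcal{C}$ if there exists a mapping $\epsilon : \mathcal{C} \to E'$ (where $E'$ is the set of tree edges of $N$) such that

(i) for any two clusters $C_1, C_2 \in \mathcal{C}$, $C_1$ and $C_2$ lie in the same connected component of the incompatibility graph $IG(\mathcal{C})$ if and only if the two tree edges $\epsilon(C_1)$ and $\epsilon(C_2)$ are contained in the same biconnected component of $N$.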

In this case, we also say that the network has the decomposition property. The decomposition property allows the network to be constructed by a divide-and-conquer (DC for short) strategy: first construct a subnetwork for each connected component of the incompatibility graph, and then merge all subnetworks into a whole network. The resulting network is called a DC network, and the algorithms are called DC algorithms. It has been proven in [23] that DC networks satisfy the decomposition property.

Given a set of clusters $\mathcal{C}$, the DC algorithms first compute the incompatibility graph $IG(\mathcal{C})$. Then, for each biconnected component of $IG(\mathcal{C})$, they collapse each maximal ST-set into one taxon and compute a subnetwork for the resulting set. Next, they "decollapse," that is, replace each leaf labelled by a maximal ST-set by a maximal subtree, and finally they integrate the subnetworks into a final network. It has been proven in [25] that there exists a DC network for any set of clusters $\mathcal{C}$. Figure 2 shows the construction process of the DC algorithms for the set of clusters in Figure 1, in which constructing a subnetwork for each connected component (i.e., Step 2) is the crucial step.
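The "divide" step of this process can be sketched as grouping the clusters by connected component of the incompatibility graph. The code below covers only that grouping; the subnetwork construction itself is specific to each algorithm (CASS, LNETWORK, BIMLR) and is not reproduced here. The function names and example clusters are assumptions for illustration.

```python
from itertools import combinations


def compatible(c1, c2):
    """Two clusters are compatible if they are disjoint or one contains the other."""
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1


def incompatibility_components(clusters):
    """Group clusters by connected component of the incompatibility graph."""
    clusters = list(clusters)
    adjacency = {c: set() for c in clusters}
    for c1, c2 in combinations(clusters, 2):
        if not compatible(c1, c2):
            adjacency[c1].add(c2)
            adjacency[c2].add(c1)

    seen, components = set(), []
    for start in clusters:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # iterative depth-first search
            c = stack.pop()
            if c in seen:
                continue
            seen.add(c)
            component.add(c)
            stack.extend(adjacency[c] - seen)
        components.append(component)
    return components


example = [frozenset("ab"), frozenset("bc"), frozenset("abc"),
           frozenset("de"), frozenset("ef")]
components = incompatibility_components(example)
# Components with a single cluster contain no conflict and need no reticulation.
print(sorted(len(component) for component in components))
```

Each component with more than one cluster would then be handed to the subnetwork construction step.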

CASS, LNETWORK, and BIMLR are DC algorithms, and they can construct networks with fewer reticulations than other algorithms. The networks constructed by BIMLR and LNETWORK contain fewer redundant clusters (clusters other than the input clusters) than those of other available methods. When constructing phylogenetic networks, BIMLR and LNETWORK are faster than CASS, and the constructed networks are more stable; that is, the difference between the networks constructed for the same dataset under different input orders is smaller than for CASS. Figure 3 shows three networks constructed by CASS for the same dataset with different input orders, whereas BIMLR and LNETWORK each construct only one network for the dataset regardless of input order [25].

#### 3. Topologies of Incompatibility Graphs

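*Definition 1.* Two networks $N_1 = (V_1, E_1)$ and $N_2 = (V_2, E_2)$ on $\mathcal{X}$ are isomorphic if and only if there exists a bijection $f$ from $V_1$ to $V_2$ such that

(i) $(u, v)$ is an edge in $N_1$ if and only if $(f(u), f(v))$ is an edge in $N_2$;

(ii) the label of $v$ is equal to the label of $f(v)$ for every leaf $v \in V_1$.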

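Given two sets of clusters $\mathcal{C}_1$ on $\mathcal{X}_1$ and $\mathcal{C}_2$ on $\mathcal{X}_2$, let $\mathcal{C}_1'$ on $\mathcal{X}_1'$ and $\mathcal{C}_2'$ on $\mathcal{X}_2'$ be the results after collapsing all maximal ST-sets of $\mathcal{C}_1$ and $\mathcal{C}_2$, respectively.

*Definition 2.* $\mathcal{C}_1$ and $\mathcal{C}_2$ are isomorphic if and only if there is a bijection $f$ from $\mathcal{X}_1'$ to $\mathcal{X}_2'$ such that

(i) $x$ and $y$ are in the same cluster $C_1 \in \mathcal{C}_1'$ if and only if $f(x)$ and $f(y)$ are in the same cluster $C_2 \in \mathcal{C}_2'$.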

By Definition 2, we have that the isomorphism of the cluster sets is an equivalence relation; that is, it is reflexive, symmetric, and transitive.
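For small cluster sets, isomorphism can be checked by brute force. The sketch below adopts one reading of Definition 2 as an assumption: a bijection between the taxa is accepted when it maps the first cluster set exactly onto the second. It also assumes that both inputs have already had their maximal ST-sets collapsed. The function name is hypothetical.

```python
from itertools import permutations


def find_isomorphism(clusters1, clusters2):
    """Return a taxon bijection (dict) mapping clusters1 onto clusters2, or None.

    Both arguments are sets of frozensets, assumed to be already collapsed.
    The search is exponential in the number of taxa, so it suits small sets only.
    """
    taxa1 = sorted(set().union(*clusters1))
    taxa2 = sorted(set().union(*clusters2))
    if len(taxa1) != len(taxa2) or len(clusters1) != len(clusters2):
        return None
    target = set(clusters2)
    for image in permutations(taxa2):
        f = dict(zip(taxa1, image))
        if {frozenset(f[x] for x in c) for c in clusters1} == target:
            return f
    return None


first = {frozenset("ab"), frozenset("bc")}
second = {frozenset("xy"), frozenset("yz")}
nested = {frozenset("xy"), frozenset("xyz")}

print(find_isomorphism(first, second))  # a bijection sending b to y
print(find_isomorphism(first, nested))  # None: the cluster sizes cannot match
```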

Lemma 3. *Given a DC network $N$ representing the set of clusters $\mathcal{C}$, any maximal ST-set with respect to $\mathcal{C}$ is a maximal subtree in $N$.*

*Proof.* This follows directly from the construction process of DC networks.

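Lemma 4. *Let $\mathcal{C}_1$ and $\mathcal{C}_2$ be two isomorphic sets of clusters on $\mathcal{X}_1$ and $\mathcal{X}_2$, respectively. There exists a DC network representing $\mathcal{C}_1$ if and only if there exists a DC network representing $\mathcal{C}_2$.*

*Proof.* There must exist a DC network $N_1$ for $\mathcal{C}_1$. Given a tree edge $e = (u, v)$, the subtree rooted at $v$ in $N_1$ is a maximal subtree if and only if the set of taxa $S$ is a maximal ST-set with respect to $\mathcal{C}_1$, where the taxa in $S$ are the labels of the leaves that are descendants of $v$. Replace each maximal subtree of $N_1$ by a node, and denote the resulting network as $N_1'$. Obviously, $N_1'$ represents the set of clusters $\mathcal{C}_1'$. From Definition 2, there exists a bijection $f$ from $\mathcal{X}_1'$ to $\mathcal{X}_2'$ such that $x$ and $y$ are in the same cluster $C_1 \in \mathcal{C}_1'$ if and only if $f(x)$ and $f(y)$ are in the same cluster $C_2 \in \mathcal{C}_2'$.

Then, we can obtain a network $N_2'$ from $N_1'$ by replacing each taxon $x$ in $N_1'$ by $f(x)$. Obviously, $N_2'$ represents $\mathcal{C}_2'$. Finally, we replace each leaf of $N_2'$ labelled by a maximal ST-set with respect to $\mathcal{C}_2$ by a maximal subtree, and the resulting network, denoted as $N_2$, represents $\mathcal{C}_2$. The converse direction follows by exchanging the roles of $\mathcal{C}_1$ and $\mathcal{C}_2$.

For two isomorphic sets of clusters $\mathcal{C}_1$ and $\mathcal{C}_2$, let $N_1$ be a DC network representing $\mathcal{C}_1$. Lemma 4 tells us that there is a DC network $N_2$ representing $\mathcal{C}_2$, which can be obtained from $N_1$.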
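The relabelling step in the proof of Lemma 4 can be sketched as follows. The example assumes a network stored as a list of directed edges with integer internal nodes and string leaf labels; the network, the bijection, and the function names are hypothetical. The softwired clusters are enumerated by brute force over all choices of one kept incoming edge per reticulate node, simply to check that the relabelled network represents the relabelled clusters.

```python
from itertools import product


def softwired_clusters(edges):
    """Clusters below tree edges, over every choice of one kept parent per reticulate node."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, []).append(u)
    sources = {u for u, _ in edges}
    leaves = {v for _, v in edges if v not in sources}
    reticulations = [v for v, ps in parents.items() if len(ps) > 1]

    found = set()
    for choice in product(*(parents[v] for v in reticulations)):
        kept_parent = dict(zip(reticulations, choice))
        children = {}
        for u, v in edges:
            # Keep every tree edge, and only the chosen incoming edge of a reticulate node.
            if kept_parent.get(v, u) == u:
                children.setdefault(u, []).append(v)

        def below(node):
            if node in leaves:
                return frozenset([node])
            return frozenset().union(*(below(c) for c in children.get(node, [])))

        for _, v in edges:
            if v not in kept_parent:  # v is a tree node, so its incoming edge is a tree edge
                taxa = below(v)
                if taxa:
                    found.add(taxa)
    return found


def relabel_leaves(edges, bijection):
    """Replace each leaf label x by bijection[x]; internal nodes are left unchanged."""
    return [(bijection.get(u, u), bijection.get(v, v)) for u, v in edges]


# Hypothetical network representing {a, b} and {b, c} in the softwired sense.
network = [(0, 1), (0, 2), (1, "a"), (1, 3), (2, 3), (2, "c"), (3, "b")]
relabelled = relabel_leaves(network, {"a": "x", "b": "y", "c": "z"})
print(sorted(sorted(c) for c in softwired_clusters(relabelled)))
```

Relabelling touches only leaf labels, so the cost is linear in the number of edges, whereas constructing the network from scratch requires a full run of a DC algorithm.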

Lemma 5. *Let , where is a biconnected component with two nodes. Then, any one element in is isomorphic to .*

*Proof. *Any one element has two incompatible clusters. Let and be two sets of clusters in , where and are incompatible and and are incompatible. Let be the incompatible taxa with respect to and , and let be the incompatible taxa with respect to and . Let , , , and ; then, and .

Each one of , , , , , and is a maximal ST-set if it contains more than one taxon; then, we can collapse it into one taxon which is also denoted by itself. Denote the set of clusters after collapsing all maximal ST-sets as and . Obviously, there is a bijection from to , and any two taxa are in the same cluster in if and only if and are in the same cluster in . Hence, and are isomorphic. Accordingly, any one set of clusters is isomorphic to because .

For a cluster set , there may be several cluster sets isomorphic to , but the simplest set of clusters isomorphic to is only one, denoted as . Let be the DC network representing . Then, we can obtain a DC network representing from . Lemmas 4 and 5 show there is a DC network for any one set of clusters whose incompatible graph is a biconnected component with two nodes, and it is obtained from the network (see Figure 3) representing .

Lemma 6. *Let , where is a linear biconnected component with three nodes (see Figure 4). Let , , , and . Then, any one set of clusters is isomorphic to one of , , , and .*

*Proof. *Figure 4 shows the topology of the linear biconnected component with three nodes. is the simplest set of clusters, and its incompatible graph is the topology in Figure 4. Next, we will prove that are all simplest sets of clusters for the topology in Figure 4.

Any one set of clusters in has three clusters denoted as , , and . Let be the incompatible taxa with respect to and , and let be the incompatible taxa with respect to and ; then and have the following cases: (i) ; (ii) ; (iii) ; (iv) ; (v) , and .

*Case (i).* Since there is no edge between and , and are compatible; that is, , or , or . Because and , we have that . Therefore, or . Then, we have the simplest set of clusters , and any one set of clusters in this case is isomorphic to .

*Case (ii).* Assume that . This case is similar to case (i), and we have that . Then, the simplest set of clusters is , and any one set of clusters in this case is isomorphic to .

*Case (iii).* This case is similar to case (ii). The sets of clusters are in case (ii) if and only if they are in case (iii). Hence, any one set of clusters in case (iii) and are isomorphic.

*Case (iv).* Then, . We have that and in the simplest set of clusters, since they can be collapsed if or . Assume that and . We have that and in the simplest set of clusters, since they can be collapsed if or . Then, and in the simplest set of clusters. and are the simplest sets of clusters in this case. Therefore, any one set of clusters in this case is isomorphic to or .

*Case (v).* Let and , where , , and are not empty. We have , and or . If , then . So , which contradicts the case that . Similarly, we can get the contradiction when . Thus, there exists no set of clusters in this case.

Figure 5 shows the DC networks for the simplest sets of clusters , , , and , respectively.

Lemma 7. *Let , where is a nonlinear biconnected component with three nodes (see Figure 6). Let , , , , , , , , , , , and . Then, any one set of clusters in is isomorphic to one of .*

*Proof.* Figure 6 shows the topology of the nonlinear biconnected component with three nodes. Here, , , and are the clusters, and , , and are the incompatible taxa corresponding to them. All cases are as follows: (i) ; then, or ; (ii) ; then, , and ; (iii) ; then, and ; (iv) , , ; then, and .

*Case (i).* If , then , , and . We have in the simplest set of clusters; otherwise, can be collapsed into one taxon. Similarly, we have in the simplest set of clusters. Let and ; then, we can obtain the only simplest set of clusters . Any one set of clusters meeting this case will be isomorphic to .

If , then . There is in the simplest set of clusters; otherwise, can be collapsed into one taxon. Let ; then, we can obtain the only simplest set of clusters . Any one set of clusters in this case will be isomorphic to .

*Case (ii).* Then, we can obtain the simplest sets of clusters and . Any one set of clusters in this case will be isomorphic to or .

*Case (iii).* Then, we can obtain the simplest sets of clusters and and and . Any one set of clusters in this case will be isomorphic to one of , , , and .

*Case (iv).* Let ; then, and . We have in the simplest set of clusters; otherwise, can be collapsed into one taxon. Let . Then, , , and . For the first case, we can obtain the simplest sets of clusters and and and . Any one set of clusters in this case will be isomorphic to one of them.

Figure 7 shows the DC networks for the simplest sets of clusters listed in Lemma 7, respectively. Lemmas 5, 6, and 7 compute all simplest sets of clusters whose incompatibility graphs are biconnected components with two or three nodes. Figures 5 and 7 show the DC networks constructed by the BIMLR algorithm for these simplest sets of clusters. The DC network for a set of clusters $\mathcal{C}$ can therefore be obtained from the DC network representing the simplest set of clusters that is isomorphic to $\mathcal{C}$; that is, it does not need to be constructed again from scratch. This allows DC algorithms to reuse precomputed networks instead of repeating the construction.

#### 4. Conclusion

This paper computes all simplest sets of clusters for the topologies of incompatibility graphs with two and three nodes. The DC networks for those simplest sets of clusters can be constructed once and stored. When constructing a DC network for any set of clusters $\mathcal{C}$, an algorithm only needs to read the stored DC network $N_s$ of the simplest set of clusters $\mathcal{C}_s$ isomorphic to $\mathcal{C}$ and then compute the DC network $N$ for $\mathcal{C}$ from $N_s$ by replacing the labels of the leaves in $N_s$ by the taxa in $\mathcal{C}$, which saves the time of constructing the network from scratch.
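One possible way to organize such a store is a dictionary keyed by a canonical form of the cluster set. This is an illustrative design choice, not a method described in the paper. The sketch below assumes that the input has already been collapsed to a simplest set, that canonical forms are found by brute force over taxon numberings, and that stored networks label their leaves with canonical integers and their internal nodes with strings. The single library entry is a hypothetical network, not one taken from the figures.

```python
from itertools import permutations


def canonical_form(clusters):
    """Return (key, numbering): the smallest integer relabelling of the cluster set."""
    taxa = sorted(set().union(*clusters))
    best_key, best_numbering = None, None
    for order in permutations(range(len(taxa))):
        numbering = dict(zip(taxa, order))
        key = tuple(sorted(tuple(sorted(numbering[x] for x in c)) for c in clusters))
        if best_key is None or key < best_key:
            best_key, best_numbering = key, numbering
    return best_key, best_numbering


# Hypothetical library: canonical key -> stored network (edge list).
# Leaves are canonical integers; internal nodes are strings, so taxa must not reuse these names.
LIBRARY = {
    ((0, 1), (0, 2)): [("r", "u"), ("r", "v"), ("u", 1), ("u", "h"),
                       ("v", "h"), ("v", 2), ("h", 0)],
}


def network_for(clusters):
    """Look up the stored network and relabel its leaves with the actual taxa."""
    key, numbering = canonical_form(clusters)
    stored = LIBRARY.get(key)
    if stored is None:
        return None  # not in the library: fall back to a full DC algorithm
    taxon_of = {number: taxon for taxon, number in numbering.items()}
    return [(taxon_of.get(u, u), taxon_of.get(v, v)) for u, v in stored]


print(network_for({frozenset("xy"), frozenset("yz")}))
print(network_for({frozenset("xy"), frozenset("xyz")}))  # None, not stored
```

Isomorphic cluster sets share the same canonical key under this encoding, so a lookup replaces a pairwise isomorphism test against every stored set.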

We will compute the simplest sets of clusters for more topologies of incompatible graph in the future.

#### Competing Interests

The authors declare that they have no competing interests.

#### Acknowledgments

The work was supported by the Natural Science Foundation of Inner Mongolia Province of China (2015BS0601) and the National Natural Science Foundation of China (61300098, 31360289).

#### References

1. Q. Zou, Q. Hu, M. Guo, and G. Wang, “Halign: fast multiple similar DNA/RNA sequence alignment based on the centre star strategy,” *Bioinformatics*, vol. 31, no. 15, pp. 2475–2481, 2015.
2. D. Mrozek, M. Brozek, and B. Małysiak-Mrozek, “Parallel implementation of 3D protein structure similarity searches using a GPU and the CUDA,” *Journal of Molecular Modeling*, vol. 20, no. 2, pp. 1–17, 2014.
3. D. Gusfield, D. Hickerson, and S. Eddhu, “An efficiently computed lower bound on the number of recombinations in phylogenetic networks: theory and empirical study,” *Discrete Applied Mathematics*, vol. 155, no. 6-7, pp. 806–830, 2007.
4. Q. Zou, X.-B. Li, W.-R. Jiang, Z.-Y. Lin, G.-L. Li, and K. Chen, “Survey of MapReduce frame operation in bioinformatics,” *Briefings in Bioinformatics*, vol. 15, no. 4, pp. 637–647, 2014.
5. Y. Liu, X. Zeng, Z. He, and Q. Zou, “Inferring microRNA-disease associations by random walk on a heterogeneous network with multiple data sources,” *IEEE/ACM Transactions on Computational Biology and Bioinformatics*, 2016.
6. D. H. Huson, R. Rupp, and C. Scornavacca, *Phylogenetic Networks: Concepts, Algorithms and Applications*, Cambridge University Press, New York, NY, USA, 2011.
7. Q. Zou, Q. Hu, M. Guo, and G. Wang, “HAlign: fast multiple similar DNA/RNA sequence alignment based on the centre star strategy,” *Bioinformatics*, vol. 31, no. 15, pp. 2475–2481, 2015.
8. J. Wang, M.-Z. Guo, and L. L. Xing, “FastJoin, an improved neighbor-joining algorithm,” *Genetics and Molecular Research*, vol. 11, no. 3, pp. 1909–1922, 2012.
9. Q. Zou, J. Zeng, L. Cao, and R. Ji, “A novel features ranking metric with application to scalable visual and bioinformatics data classification,” *Neurocomputing*, vol. 173, pp. 346–354, 2016.
10. Q. Zou, J. Li, L. Song, X. Zeng, and G. Wang, “Similarity computation strategies in the microRNA-disease network: a survey,” *Briefings in Functional Genomics*, vol. 15, no. 1, pp. 55–64, 2016.
11. D. Bryant and V. Moulton, “Neighbor-net: an agglomerative method for the construction of phylogenetic networks,” *Molecular Biology and Evolution*, vol. 21, no. 2, pp. 255–265, 2004.
12. D. H. Huson, T. Dezulian, T. Klopper, and M. A. Steel, “Phylogenetic super-networks from partial trees,” *IEEE/ACM Transactions on Computational Biology and Bioinformatics*, vol. 1, no. 4, pp. 151–158, 2004.
13. D. H. Huson and T. H. Kloepper, “Computing recombination networks from binary sequences,” *Bioinformatics*, vol. 21, supplement 2, pp. ii159–ii165, 2005.
14. L. van Iersel, S. Kelk, R. Rupp, and D. Huson, “Phylogenetic networks do not need to be complex: using fewer reticulations to represent conflicting clusters,” *Bioinformatics*, vol. 26, no. 12, pp. i124–i131, 2010.
15. Y. Wu, “Close lower and upper bounds for the minimum reticulate network of multiple phylogenetic trees,” *Bioinformatics*, vol. 26, no. 12, Article ID btq198, pp. i140–i148, 2010.
16. D. H. Huson, R. Rupp, V. Berry, P. Gambette, and C. Paul, “Computing galled networks from real data,” *Bioinformatics*, vol. 25, no. 12, pp. i85–i93, 2009.
17. D. H. Huson and R. Rupp, “Summarizing multiple gene trees using cluster networks,” in *Algorithms in Bioinformatics*, K. A. Crandall and J. Lagergren, Eds., vol. 5251, pp. 296–305, Springer, New York, NY, USA, 2008.
18. L. van Iersel, J. Keijsper, S. Kelk, L. Stougie, F. Hagen, and T. Boekhout, “Constructing level-2 phylogenetic networks from triplets,” *IEEE/ACM Transactions on Computational Biology and Bioinformatics*, vol. 6, no. 4, pp. 667–681, 2009.
19. L. van Iersel and S. Kelk, “Constructing the simplest possible phylogenetic network from triplets,” *Algorithmica*, vol. 60, no. 2, pp. 207–235, 2011.
20. J. Wang, “A new algorithm to construct phylogenetic networks from trees,” *Genetics and Molecular Research*, vol. 13, no. 1, pp. 1456–1464, 2014.
21. D. Mrozek, *High-Performance Computational Solutions in Protein Bioinformatics*, Springer Publishing Company, Incorporated, 2014.
22. D. Mrozek, P. Gosk, and B. Małysiak-Mrozek, “Scaling ab initio predictions of 3D protein structures in Microsoft Azure cloud,” *Journal of Grid Computing*, vol. 13, no. 4, pp. 561–585, 2015.
23. L. van Iersel, S. Kelk, R. Rupp, and D. Huson, “Phylogenetic networks do not need to be complex: using fewer reticulations to represent conflicting clusters,” *Bioinformatics*, vol. 26, no. 12, Article ID btq202, pp. i124–i131, 2010.
24. J. Wang, M. Guo, X. Liu et al., “Lnetwork: an efficient and effective method for constructing phylogenetic networks,” *Bioinformatics*, vol. 29, no. 18, pp. 2269–2276, 2013.
25. J. Wang, M. Guo, L. Xing, K. Che, X. Liu, and C. Wang, “BIMLR: a method for constructing rooted phylogenetic networks from rooted phylogenetic trees,” *Gene*, vol. 527, no. 1, pp. 344–351, 2013.

#### Copyright

Copyright © 2016 Juan Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.