International Journal of Genomics

Volume 2018, Article ID 7329576, 11 pages

https://doi.org/10.1155/2018/7329576

## Gene Coexpression Network Comparison via Persistent Homology

^{1}Department of Mathematics and Statistics, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia

^{2}Department of Systems Engineering, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia

Correspondence should be addressed to Ali Nabi Duman; aliduman@kfupm.edu.sa

Received 2 May 2018; Revised 21 July 2018; Accepted 26 July 2018; Published 19 September 2018

Academic Editor: Atsushi Kurabayashi

Copyright © 2018 Ali Nabi Duman and Harun Pirim. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Persistent homology, a topological data analysis (TDA) method, is applied to microarray data sets. Although a few papers refer to TDA methods in microarray analysis, to the best of our knowledge persistent homology has not previously been used to compare several weighted gene coexpression networks (WGCNs). We calculate the persistent homology of weighted networks constructed from 38 Arabidopsis microarray data sets to test the relevance and the success of this approach in distinguishing the stress factors. We quantify multiscale topological features of each network using persistent homology and apply a hierarchical clustering algorithm to the distance matrix whose entries are the pairwise bottleneck distances between the networks. The immunoresponses to different stress factors are distinguishable by our method. The networks of similar immunoresponses are found to be close with respect to bottleneck distance, indicating that the WGCNs have similar topological features. This computationally efficient technique for analyzing networks provides a quick test for advanced studies.

#### 1. Introduction

Quantitative skills have become much more essential for distilling meaning from vast and increasingly diverse data sets since the advances in DNA sequencing technology at the end of the 20th century. Modern high-throughput technologies such as microarrays and RNA-sequencing can generate terabytes of data in a short amount of time. The data generated include levels of RNA abundance, quantification of protein-protein interactions (PPI), and many other biological molecular interactions. These data are used for statistical inference and computational analysis, including low-level data processing and high-level algorithmic analysis with machine learning techniques. Making use of the data is a reverse engineering approach. Gene expression microarrays measure the activity of thousands of genes simultaneously. In network terms, the nodes of the coexpression network represent the gene products and the edges represent the relationships between the products (usually expressed by correlations). After chip scanning and image processing, a matrix of expression values is obtained. The rows of the matrix refer to the gene products while the columns refer to the experiments/samples/tissues. The numeric values of the matrix are the expression values of genes across the experiments. The experiments may be “control versus treated” or “time course.”
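The step from an expression matrix to correlation-based edge weights can be illustrated with a minimal sketch. It assumes the absolute Pearson correlation as the similarity measure, which is one common choice and is not confirmed here as the measure used in this study. The gene names, the toy values, and the function names `pearson` and `coexpression_weights` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpression_weights(expression):
    """Map every unordered gene pair to |Pearson correlation| of its rows.

    Assumption: absolute correlation is used as the edge weight.
    """
    genes = sorted(expression)
    weights = {}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            weights[(g, h)] = abs(pearson(expression[g], expression[h]))
    return weights


# Hypothetical toy data: rows are genes, columns are samples.
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 2.0, 4.0],
}
toy_weights = coexpression_weights(toy_expression)
```

Each key of `toy_weights` is an edge of the weighted network and each value is its weight, so the dictionary is a complete weighted graph on the genes.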

Networks constructed from gene expression similarity are called gene coexpression networks [1]. They are also known as association, correlation, or influence networks [2]. Coexpression network analysis requires the selection of a similarity measure between genes and a clustering algorithm to decompose the network into functional clusters/modules following a meaningful experiment design [3, 4]. However, some clustering algorithms take the network itself as input rather than a distance matrix (e.g., some community structure finding algorithms). The modules found by running a clustering algorithm require biological inference.

A high-level overview of coexpression network construction and analysis can be summarized in a few common processes [5]:
(i) Obtaining filtered data
(ii) Making use of network inference or guilt by association, as in clustering
(iii) Enrichment analysis to see the biological relevance of computational outputs
(iv) Extension of the model(s) by integrating multiple data types such as mRNA and miRNA data from RNA-seq, TF DNA-binding data from ChIP-seq, and protein interaction data from mass spectrometry

Ideally, network decomposition results in tight clusters/modules with dense intracluster and sparse intercluster connections. Tight clusters are expected to contain genes that are biologically related, either through shared function or by residing in the same pathway.

One of the widely used steps in constructing a gene coexpression network is trimming some of the edges based on a threshold [6–8]. Persistent homology, which was first developed to explore the topological features of point cloud data, provides topological invariants that are computed across all thresholds, and it therefore addresses the problem of choosing a reasonable threshold. Our method employs persistent homology once the correlation similarity is calculated on the filtered networks.

Persistent homology is a new tool for studying the shape of a point cloud in application areas such as digital images [9, 10], dynamical systems [11], biomolecules [12], and high-dimensional data mining [13]. The persistent homological framework enables us to analyze multiscale networks in a consistent manner [14–16]. The output of the persistent homology of a network can be summarized visually using a *persistent diagram*, and the distance between two persistent diagrams can be measured via the *bottleneck distance*. Here, we use the persistent homology framework to perform a threshold-free analysis of weighted gene coexpression networks constructed from 38 Arabidopsis microarray data sets. We list several advantages of our method:
(i) Our method does not require a choice of fixed threshold, as it considers the networks at every possible threshold.
(ii) It gives a more robust result than an analysis of unweighted networks, for which the results might depend on the choice of the threshold.
(iii) Persistent diagrams can be used for a standard data analysis method such as cluster analysis.
(iv) Our method eliminates the computational burden of analyzing many networks obtained for different thresholds.
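To make the threshold-free idea concrete, the following minimal sketch computes only the zero-dimensional part of persistent homology (connected components) of a weighted graph with a union-find structure. It assumes that edge weights have already been converted to distances, for example as one minus the absolute correlation. That conversion, the toy edges, and the function names are illustrative assumptions, not the pipeline used in this study.

```python
def zero_dim_persistence(n_nodes, edges):
    """Return the sorted death values of connected components.

    edges: iterable of (distance, u, v) with nodes labelled 0..n_nodes-1.
    Every node is born at filtration value 0. A component dies at the
    distance of the edge that merges it into another component.
    Components that never merge (disconnected graph) have no death value.
    """
    parent = list(range(n_nodes))

    def find(x):
        # Find the component representative, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    deaths = []
    for dist, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            deaths.append(dist)
    return deaths


def components_at(n_nodes, deaths, threshold):
    """Number of components when all edges with distance <= threshold are kept."""
    return n_nodes - sum(1 for d in deaths if d <= threshold)


# Hypothetical toy network on four nodes, given directly as distances.
toy_edges = [(0.1, 0, 1), (0.2, 1, 2), (0.3, 0, 2), (0.7, 2, 3)]
toy_deaths = zero_dim_persistence(4, toy_edges)
```

A single pass over the sorted edges yields the component count for every threshold at once through `components_at`, which is the sense in which no fixed threshold has to be chosen.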

Topological data analysis (TDA) has been applied to biological data before. Arsuaga et al. [17] associate a two-dimensional (2D) point cloud with each array comparative genomic hybridization (aCGH) profile and generate a sequence of simplicial complexes. They use these mathematical objects to identify DNA copy number aberrations by interrogating the topological properties. Camara et al. [18] use TDA for mapping meiotic recombination at fine scales. Compared with standard linkage-based methods, they find that TDA can deal with a larger number of genomes in a computationally efficient way. Cang et al. [19] propose a support vector machine algorithm for protein classification. They choose the machine learning feature vectors from the persistent homology of the protein structure. Chan et al. [20] use persistent homology to capture both vertical and horizontal evolution. They show that horizontal evolution exhibits nontrivial topology of dimension greater than zero. Nicolau et al. [21] introduce a topological method that identifies a unique subgroup of estrogen receptor-positive (ER+) breast cancers that express high levels of c-MYB and low levels of innate inflammatory genes. Perea et al. [22] present a novel method based on persistent homology to classify periodic or nonperiodic signals of microarray time series data. Their method successfully identifies the periodic genes in microarray data from the yeast cell cycle.

Here are the main results of the paper:
(i) We quantify topological features of WGCNs using persistent homology and apply the hierarchical clustering algorithm to the distance matrix whose entries are pairwise bottleneck distance values between the networks.
(ii) The immunoresponses to different stress factors are distinguishable by our method. The networks of similar immunoresponses are found to be close with respect to bottleneck distance, indicating the similar topological features of WGCNs. Hence, persistent diagrams of the networks can be used to determine the topological and biological similarities.
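The distance matrix mentioned in the first result can be illustrated with a brute-force sketch of the bottleneck distance. It assumes diagrams given as lists of finite (birth, death) pairs and tries every matching, so it is only practical for very small diagrams; dedicated TDA software uses far more efficient algorithms. The three toy diagrams and the function names are hypothetical.

```python
from itertools import permutations


def bottleneck_distance(diagram_a, diagram_b):
    """Exact bottleneck distance by exhaustive search (tiny diagrams only)."""
    # Pad each diagram with diagonal slots (None) so that any point may be
    # matched to the diagonal instead of to a point of the other diagram.
    left = list(diagram_a) + [None] * len(diagram_b)
    right = list(diagram_b) + [None] * len(diagram_a)

    def cost(p, q):
        if p is None and q is None:
            return 0.0
        if p is None:
            return (q[1] - q[0]) / 2.0  # L-infinity distance to the diagonal
        if q is None:
            return (p[1] - p[0]) / 2.0
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

    best = float("inf")
    for perm in permutations(range(len(right))):
        worst = max(
            (cost(left[i], right[j]) for i, j in enumerate(perm)),
            default=0.0,
        )
        best = min(best, worst)
    return best


def distance_matrix(diagrams):
    """Pairwise bottleneck distances between named diagrams."""
    names = sorted(diagrams)
    matrix = [
        [bottleneck_distance(diagrams[a], diagrams[b]) for b in names]
        for a in names
    ]
    return names, matrix


# Hypothetical zero-dimensional diagrams for three toy networks.
toy_diagrams = {
    "net1": [(0.0, 0.1), (0.0, 0.2), (0.0, 0.7)],
    "net2": [(0.0, 0.1), (0.0, 0.25), (0.0, 0.7)],
    "net3": [(0.0, 0.5), (0.0, 0.6), (0.0, 0.9)],
}
toy_names, toy_matrix = distance_matrix(toy_diagrams)
```

The resulting symmetric matrix is the kind of input a standard hierarchical clustering routine accepts. In this toy case `net1` and `net2` are much closer to each other than either is to `net3`.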

#### 2. Methods

Topological methods address several problems that arise in biological data analysis [13]. We now summarize three of these which are related to the analysis of our microarray data sets. The first of these problems is to extract qualitative information from a given set of data prior to applying quantitative methods. This might include studying the characteristics of the data space such as determining the connected components, loops, and higher dimensional surfaces. In the biological context, these methods have already been used, for example, in identifying a novel subgroup of a certain disease [17, 21], cataloguing the type of exchange of genomic material [20], classifying protein domains [23], and discovering periodicity in gene expression time series data [22].

The second issue in biological data analysis is the choice of a natural coordinate system. A particular choice of a coordinate system might not have an essential meaning during the analysis. Topological methods, which are coordinate-free and depend only on the chosen metric, enable us to compare data sets given in different coordinate systems whenever a general notion of similarity is available, not only the Euclidean metric.

Thirdly, fixing an optimal parameter in conventional clustering algorithms might not reveal sufficient information about the data set under consideration. Hence, it is preferable to consider the entire set of parameters at once. This raises the question of how the information obtained from different parameters is related. Topology deals with this problem via the concept of *functoriality*, which is used to compute the topological invariants from discrete approximations.

Topology ideally aims to find the homeomorphism type of a topological space. Roughly speaking, we would like to classify spaces up to stretching and bending but not tearing and gluing. However, in most cases, it is very hard to determine the homeomorphism type of a space. Hence, we need to consider other invariants: homotopy, homology, cohomology, and so on. In order to find these topological invariants of a data space, we need to construct a combinatorial approximation of the space called a *simplicial complex*:

*Definition 1.* A *simplicial complex* $K$ consists of a set of objects, $V$, called *vertices* and a set, $S$, of finite nonempty subsets of $V$, called *simplices*, such that (i) any nonempty subset of a simplex is also a simplex, (ii) every one-element set $\{v\}$, where $v \in V$, is a simplex, and (iii) the intersection of any two simplices, if nonempty, is also a simplex.

The *dimension of a simplex* $\sigma$ is defined as $|\sigma| - 1$: simplices consisting of a single element are zero-dimensional, simplices consisting of two elements are one-dimensional, and so on. The *dimension of the complex* $K$ is defined as the largest dimension of any of its simplices.
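The conditions of Definition 1 and the notion of dimension can be checked mechanically on small examples. The sketch below represents simplices as sets of vertex labels. The function names and the toy triangle are hypothetical, and condition (iii) is not tested separately because it follows from condition (i) whenever the intersection is nonempty.

```python
from itertools import combinations


def is_simplicial_complex(vertices, simplices):
    """Check the defining conditions on a finite family of simplices."""
    vertex_set = set(vertices)
    simplex_set = {frozenset(s) for s in simplices}

    # Simplices must be nonempty subsets of the vertex set.
    if any(not s or not s <= vertex_set for s in simplex_set):
        return False
    # Condition (ii): every one-element set is a simplex.
    if any(frozenset([v]) not in simplex_set for v in vertex_set):
        return False
    # Condition (i): every nonempty proper subset of a simplex is a simplex.
    for s in simplex_set:
        for size in range(1, len(s)):
            for face in combinations(s, size):
                if frozenset(face) not in simplex_set:
                    return False
    return True


def complex_dimension(simplices):
    """Largest simplex dimension, where dim(simplex) = number of elements - 1."""
    return max(len(s) - 1 for s in simplices)


# Hypothetical example: a filled triangle on the vertices a, b, c.
triangle_vertices = ["a", "b", "c"]
filled_triangle = [
    {"a"}, {"b"}, {"c"},
    {"a", "b"}, {"a", "c"}, {"b", "c"},
    {"a", "b", "c"},
]
# Removing one edge breaks condition (i) for the two-dimensional simplex.
missing_edge = [s for s in filled_triangle if s != {"b", "c"}]
```

Here `filled_triangle` passes the check and has dimension 2, while `missing_edge` fails because the triangle would contain a face that is not in the family.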

One can construct a simplicial complex from a data set in $\mathbb{R}^n$ using a *Rips complex* (see Figure 1).

*Definition 2.* Let $X$ be a set of points in $\mathbb{R}^n$. The *Rips complex* $R_\epsilon(X)$ (also called the Vietoris-Rips complex) is the simplicial complex whose $k$-simplices correspond to $(k+1)$-tuples of points which are pairwise within distance $\epsilon$.
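Definition 2 translates almost directly into code. The following minimal sketch enumerates the simplices of a Rips complex up to a chosen dimension. It assumes Euclidean distance and reads "within distance" as less than or equal to the scale parameter (some texts use a different convention). The function name, the `max_dim` cutoff, and the toy points are hypothetical.

```python
from itertools import combinations
from math import dist


def rips_complex(points, epsilon, max_dim=2):
    """List the simplices (as index tuples) of the Rips complex at scale epsilon.

    A set of k + 1 points spans a k-simplex when all of its pairs are
    within distance epsilon. Only simplices up to max_dim are listed.
    """
    n = len(points)
    close = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if dist(points[i], points[j]) <= epsilon
    }
    simplices = [(i,) for i in range(n)]  # every point is a 0-simplex
    for k in range(1, max_dim + 1):
        for candidate in combinations(range(n), k + 1):
            if all(pair in close for pair in combinations(candidate, 2)):
                simplices.append(candidate)
    return simplices


# Hypothetical point cloud: three nearby points and one distant point.
toy_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
small_scale = rips_complex(toy_points, 0.5)
medium_scale = rips_complex(toy_points, 1.2)
large_scale = rips_complex(toy_points, 1.5)
```

As the scale grows from 0.5 to 1.2 to 1.5, the complex gains first two edges and then the third edge together with the filled triangle, while the distant point stays isolated. Persistent homology tracks how such features appear and disappear across scales.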