International Journal of Genomics

Volume 2015 (2015), Article ID 197895, 6 pages

http://dx.doi.org/10.1155/2015/197895

## A New Binning Method for Metagenomics by One-Dimensional Cellular Automata

^{1}Masters Program in Biomedical Informatics and Biomedical Engineering, Feng Chia University, No. 100, Wenhwa Road, Seatwen, Taichung 40724, Taiwan

^{2}Department of Applied Mathematics, Feng Chia University, No. 100, Wenhwa Road, Seatwen, Taichung 40724, Taiwan

Received 7 January 2015; Accepted 9 February 2015

Academic Editor: Hai Jiang

Copyright © 2015 Ying-Chih Lin. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Increasingly mature and inexpensive next-generation sequencing (NGS) technologies allow us to extract vast amounts of sequence data from a sample containing multiple species. Characterizing the taxonomic diversity of such planet-scale data plays an important role in metagenomic studies, and a crucial step in these studies is the *binning* process, which groups sequence reads from similar species or taxonomic classes. Metagenomic binning remains a challenging task because of both the various kinds of read noise and the tremendous data volume. In this work, we propose an unsupervised binning method for NGS reads based on the one-dimensional cellular automaton (1D-CA). Our binning method helps reduce memory usage because a 1D-CA requires only linear space. Experiments on a synthetic dataset show that our method helps identify species of lower abundance compared with a previously proposed tool.

#### 1. Introduction

With the rapid development of next-generation sequencing (NGS) technologies, the ability to generate experimental data has far surpassed the capacity to analyze it. A high-throughput NGS machine can sequence millions to even billions of reads (short DNA fragments) in parallel from a sample containing many species. At a reasonable cost, an individual laboratory can generate terabase-scale sequencing data within a day [1], which has also inspired many mining tools for interpreting these data [2]. In contrast to traditional studies of the microbial genome of an individual bacterial strain, NGS technologies make it much easier for researchers to study the genomes of multiple microorganisms from environmental samples, a field known as *metagenomics*. Several metagenomic projects have offered valuable insights into diverse microbial communities, such as those of the soil [3] and the human gut [4].

An important step in metagenomic analysis is the *binning* procedure, which groups reads from similar species or taxonomic classes. There are two major classes of binning algorithms: supervised and unsupervised methods [5]. The former are taxonomy-dependent and similarity-based: individual reads are taxonomically assigned by aligning them to known genomes in reference databases, and reads aligned to similar genomes are subsequently grouped into bins. However, in a typical metagenomic scenario, most reads (up to [6]) come from genomes of hitherto unknown organisms, which are therefore absent from current reference databases. Taxonomy-dependent binning methods fail to identify such reads and generally categorize them as unassigned. An alternative approach is to align taxonomic marker genes, for example, recA, rpoB, and 16S ribosomal RNA (rRNA) [7], or particular genomic regions, for example, the internal transcribed spacer (ITS) regions [8].

The unsupervised method, in contrast, is taxonomy-independent and groups reads based on genomic signatures, such as *k*-mer distribution, G + C content, and codon usage [5, 9], which can be extracted directly from the nucleotide sequences. Based on different signatures or observations, a number of composition-based binning tools have been proposed. AbundanceBin [10] uses *k*-mer frequencies to group reads, while TOSS [11] is based on sufficiently long mers and integrates AbundanceBin to separate reads from species with different abundances. Both fail to handle reads from different species with similar abundance ratios [12]. The MetaCluster series of unsupervised binning tools [12] was developed according to multiple observations, and MetaCluster 5.0 can compute the number of species represented by the sequence reads. However, a performance comparison with the binning tool MCluster [13] indicates that it often gives an inaccurate number of species when the dataset contains a relatively large number of species.
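As an illustration of composition-based signatures, the following minimal Python sketch computes two of the features mentioned above for a single read: a normalized *k*-mer frequency vector and the G + C content. The function names `kmer_frequencies` and `gc_content`, the default `k = 4`, and the choice to ignore *k*-mers containing symbols other than A, C, G, and T are assumptions made for this example; they are not taken from any of the cited tools.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(read: str, k: int = 4) -> list[float]:
    """Return the normalized k-mer frequency vector of a read.

    The vector has 4**k entries, ordered lexicographically over A, C, G, T.
    k-mers containing any other symbol (for example N) are ignored.
    """
    read = read.upper()
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    alphabet = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in alphabet)
    return [counts[kmer] / total if total else 0.0 for kmer in alphabet]


def gc_content(read: str) -> float:
    """Return the fraction of G and C bases in a read (0.0 for an empty read)."""
    read = read.upper()
    return (read.count("G") + read.count("C")) / len(read) if read else 0.0
```

Two reads can then be compared through any distance between their signature vectors, which is the kind of quantity a composition-based binning method works with.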

A *cellular automaton* (CA), on the other hand, is a discrete computational model used to study complex systems in mathematics, computer science, economics, biology, and other fields. It consists of a regular array of cells, each of which is a finite state automaton (FSA), and the array can have any positive number of dimensions. The state of a cell at time *t* is a function of the states of its neighboring cells at time *t* − 1, where the function is given by a set of *transition rules*. A one-dimensional CA places its cells on a one-dimensional array and has been used for synchronization problems [14], prime generation [15], data clustering [16], real-time language recognition [17], and so on. In this work, we propose a new binning approach for NGS reads from metagenomic sequences based on the one-dimensional CA, extending previous work [16]. Since a one-dimensional CA requires only linear memory space when running, our binning method reduces the tremendous memory usage caused by NGS data. In addition, we conduct experiments to evaluate its performance and compare it with a previously proposed tool.

This paper is organized as follows. Section 2 introduces the one-dimensional CA and our binning method step by step. Subsequently, we use a simulated dataset to assess the performance in Section 3. Finally, Section 4 presents our conclusions.

#### 2. Binning by One-Dimensional Cellular Automaton

##### 2.1. One-Dimensional Cellular Automaton

Cellular automata are discrete models of dynamic systems; they were originally introduced as a computational medium for machine self-replication guided by a set of rules. The classical version of a CA is based on a regular array, local variables, and a function working over a neighborhood. More formally, the regular grid of a CA is a set of locally interconnected FSAs that is typically placed over a regular *d*-dimensional lattice [18]. Taking the two-dimensional CA as an example, it consists of a lattice of squares called *cells*, each of which is in one of a finite number of states. The *neighborhood* of a cell is the set of topologically neighboring cells around it, and a *transition rule* applied to the cell defines the change of each cell in its neighborhood from its current state to a new one. At each iteration, the transition rule is applied to all cells. Though CA applications to engineering problems are relatively few, CA has been widely used in simulations of complex systems [18].
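To make the notions of cell, neighborhood, and transition rule concrete, the sketch below performs one synchronous update of a generic two-state, one-dimensional CA. It is only an illustration of the classical model, not the binning automaton of this paper. The encoding of the rule as an integer between 0 and 255 (the common "elementary CA" numbering), the wrap-around at both ends, and the function name `step` are assumptions of this example.

```python
def step(cells: list[int], rule: int) -> list[int]:
    """Apply a two-state, radius-1 transition rule to every cell at once.

    `rule` is an integer in 0..255 whose bit p gives the new state for the
    neighborhood pattern p = 4*left + 2*centre + right.
    """
    n = len(cells)
    new_cells = []
    for i in range(n):
        # Neighbors are taken modulo n, so the array behaves like a ring.
        left, centre, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        pattern = (left << 2) | (centre << 1) | right
        new_cells.append((rule >> pattern) & 1)
    return new_cells


# Rule 90 sets each cell to the XOR of its two neighbors.
print(step([0, 0, 1, 0, 0], rule=90))
```

Every new state is computed from the previous configuration only, which is what "the transition rule is applied to all cells at each iteration" means in the synchronous setting.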

We introduce herein a mathematical model based on the one-dimensional CA, whose cells are placed over a linear lattice , to describe the binning procedure for metagenomic sequences. The number of cells in the discrete lattice equals the number of data items in the dataset. At the th iteration, each cell for is the th cell of and is associated with a specific item in the dataset. For a particular , applying the transition rule to the cell updates the neighborhood of within the range , where the parameter can be calculated from the number of cells . A greater value of allows larger clusters. Moreover, the boundary condition of a one-dimensional CA can be periodic, fixed, or reflecting, among others [18]. Here, we use the periodic condition to simulate a circular boundary; that is, .

##### 2.2. Transition Rules

At the beginning of , each data item is randomly assigned to a cell in , and then each cell evolves according to a function of its current state and its neighboring cells, which is specified by the transition rule for . The value of starts from 3 because at least three neighboring cells are required. We say that an iteration is finished once all transition rules have been applied to each cell in . There are two common termination criteria for the whole process: one is a user-defined maximum number of iterations, and the other is the convergence of to a stable state; that is, , . We adopt the latter criterion in this work. In other words, our algorithm terminates when there is no state change between two consecutive iterations.
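The two termination criteria can be sketched as a small driver loop. This is a hedged illustration rather than the authors' implementation: `run_until_stable`, its parameter `sweep` (any function that performs one full iteration and returns the new arrangement), and the default cap `max_iterations = 1000` are hypothetical names and values introduced here.

```python
from typing import Callable, Sequence


def run_until_stable(
    items: Sequence,
    sweep: Callable[[list], list],
    max_iterations: int = 1000,
) -> tuple[list, int]:
    """Repeat `sweep` until the lattice stops changing or the cap is reached.

    Returns the final lattice and the number of iterations performed.
    """
    lattice = list(items)
    for iteration in range(1, max_iterations + 1):
        updated = sweep(list(lattice))  # pass a copy so the comparison is fair
        if updated == lattice:          # no state change between iterations
            return lattice, iteration
        lattice = updated
    return lattice, max_iterations      # criterion 1: iteration cap reached
```

Keeping the iteration cap alongside the convergence test is a defensive choice for this sketch: it guarantees that the loop ends even for a `sweep` that never reaches a stable state.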

At the th iteration, let be the data item at the cell of . The transition rule applied to the cell concentrates on three items of , and by comparing their distances, where a distance measures the relation between two items. Let be the *distance* between two data items and . The rule applied to swaps and , provided that , as illustrated in Figure 1; otherwise, leaves the data items on and unchanged. For example, the rule compares the distance with when applied to , with on and so on. When is applied to the cell , it considers the distances and , due to the periodic boundary condition.
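The following is a minimal sketch of one possible reading of the swap rule, and it should not be taken as the exact rule of this paper. It assumes that the rule with parameter `k`, applied at cell `i`, looks at the items in cells `i`, `i + 1`, and `i + k - 1` (indices taken modulo the number of cells because of the periodic boundary) and swaps the latter two when the farther item is closer to the item in cell `i` than its immediate neighbor is. The function name `apply_rule`, this choice of indices, the strict inequality, and the absolute-difference distance in the usage lines are all assumptions.

```python
from typing import Callable


def apply_rule(lattice: list, i: int, k: int, dist: Callable) -> bool:
    """Apply the assumed rule with parameter k at cell i, in place.

    Returns True if two items were swapped, False otherwise.
    """
    n = len(lattice)
    near = (i + 1) % n       # immediate neighbor of cell i
    far = (i + k - 1) % n    # last cell of the assumed window of k cells
    if dist(lattice[i], lattice[far]) < dist(lattice[i], lattice[near]):
        lattice[near], lattice[far] = lattice[far], lattice[near]
        return True
    return False


# Usage with numbers and absolute difference as a stand-in distance.
items = [0, 10, 1, 11]
apply_rule(items, i=0, k=3, dist=lambda a, b: abs(a - b))
print(items)  # 1 is closer to 0 than 10 is, so the items in cells 1 and 2 swap
```

Under this reading, similar items drift into adjacent cells, so a cluster can later be read off as a run of consecutive cells; for sequence reads, the distance would be computed on a signature such as a *k*-mer frequency vector instead of on plain numbers.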