BioMed Research International

Volume 2017 (2017), Article ID 5404180, 7 pages

https://doi.org/10.1155/2017/5404180

## Construction of Multilevel Structure for Avian Influenza Virus System Based on Granular Computing

School of Science, Jiangnan University, Wuxi 214122, China

Correspondence should be addressed to Ping Zhu

Received 11 September 2016; Revised 1 December 2016; Accepted 14 December 2016; Published 16 January 2017

Academic Editor: Hao-Teng Chang

Copyright © 2017 Yang Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The genetic structure of influenza viruses, whose epidemics cause morbidity and mortality worldwide, attracts attention in molecular ecology and medical genetics. Rapid variation in the RNA strands and changes in the protein structure of the virus lead to low-accuracy subtype identification and make it difficult to develop effective drugs and vaccines. This paper constructs the evolutionary structure of the avian influenza virus system, considering both hemagglutinin and neuraminidase protein fragments. An optimization model based on the fuzzy hierarchical evaluation index was established to determine the rational granularity of the virus system and to explore the intrinsic relationships among the subtypes, and an algorithm was presented to extract the rational structure. Furthermore, to reduce the systematic and computational complexity, the granular signatures of the virus system were identified based on the coarse-graining idea, and their performance was evaluated with a purpose-designed classifier. The results showed that the obtained virus signatures could approximate and reflect the whole avian influenza virus system, indicating that the proposed method could identify effective virus signatures. Once a new virus is detected, its homologous virus can be identified efficiently in a hierarchical manner.

#### 1. Introduction

The genetic structure of biological populations is a focus of population biology, molecular ecology, and medical genetics [1]. Influenza A virus is a negative-strand RNA virus that encodes 8 structural proteins and 2 nonstructural proteins. Over the past several decades, some subtypes of influenza virus have been found to infect humans, and their epidemics cause morbidity and mortality worldwide [2, 3]. Subtype identification of a virus is typically based on the viral hemagglutinin (HA) and neuraminidase (NA) fragments among the 10 encoded proteins [4, 5]. So far, dozens of subtypes, combinations of the 16 HA and 9 NA types, make up the whole viral system, and it has been verified from microscopic structural features and genome organization analysis that differently labeled viruses descend from the same ancestor [6]. Evolutionary forces, such as natural selection acting upon rapidly mutating viral populations, are treated as the most important molecular mechanisms and can shape the genetic structure of influenza viruses in different hosts, geographic regions, and periods of time [7]. In addition, influenza viruses undergo antigenic changes, known as antigenic shifts among different subtypes, which result in structural changes that allow the virus to escape immunity [8]. Identifying the subtypes and analyzing their evolutionary relationships is therefore crucial for developing antiviral drugs and vaccines. Thus, timely access to viral genomes and effective analysis methods are urgently needed.

The dramatic progress in sequencing technologies provides unprecedented prospects for exploring viral homology and mutation trajectories in space and time. The understanding of influenza virus evolution has benefited from phylogenetic reconstructions of the hemagglutinin protein [9]. In an alternative approach, Lapedes and Farber [10] applied multidimensional scaling to study the antigenic evolution of influenza. Plotkin et al. [8] clustered hemagglutinin protein sequences using the single-linkage clustering algorithm and found that influenza viruses group into several clusters. He and Deem [11] used a dimensional projection technique to characterize hemagglutination inhibition (HI) data and presented a low-dimensional clustering method that can detect clusters containing an incipient dominant strain. However, these works focused on a single fragment, mostly the HA protein, to explore the evolutionary relationships. Moreover, the large volume of data poses daunting challenges for exploring the structure of the complex system and its intrinsic relationships. Therefore, less computationally intensive methods are needed.

In recent years, granular computing (GrC) theory has become a hotspot in artificial intelligence and machine learning; it stems from the idea that people solve problems at different levels and from different views [12]. Clustering is an effective way to generate the granules of a complex system. Y. Y. Yao and J. T. Yao carried out a series of studies applying the theory to data mining and other fields [13]. Hartmann et al. [14] proposed supervised hierarchical clustering for fuzzy model identification using hierarchical tree construction. Tang et al. [15, 16] introduced the granular space to describe hierarchical structural information using algebraic topology, based on fuzzy quotient space theory [12]. They also studied the hierarchical clustering structure and analyzed the fuzzy equivalence (or proximity) relation based on the fuzzy granular space. The goals are to construct the hierarchical structure of a complex system and to extract the essential information among the granules at different granularities.

In this paper, we aim to explore the evolutionary relationships of avian influenza viruses both within the same subtype and among subtypes, considering both the HA and NA fragments in the virus system. Moreover, with thousands of samples in the dataset, the complex virus system should be reduced for further exploration. Feature vectors are extracted from the HA and NA proteins separately and then joined to label each specific virus. Furthermore, the granular signatures in the viral granules are identified from the obtained features to reduce the systematic and computational complexity, and their performance is then evaluated. This supports the rationality of subtype identification. Once a new virus is detected, it can be analyzed against the obtained viral signatures, and the prevention and treatment measures applied to the matching signature virus can then be followed.

#### 2. Materials and Methods

##### 2.1. Materials

The influenza virus dataset was downloaded from the NCBI Influenza Virus Resource (http://www.ncbi.nlm.nih.gov/genomes/FLU/) [17]. The influenza virus contains eight linear negative-strand RNA fragments, which encode 10 viral proteins, that is, PB1, PB2, PA, HA, NP, NA, M1, M2, NS1, and NS2, all of which are structural proteins except NS1 and NS2. Notably, the HA and NA fragments play direct and important roles in viral subtype identification and in viral function [18]. Eight subtypes of avian influenza virus (H5N1, H5N2, H7N2, H7N3, H7N7, H9N2, H10N7, and H7N9) have been verified to infect humans, with occurrences recorded around the world from 1902 to 2015. The avian influenza viruses are labeled with unambiguous descriptors such as host, outbreak time, and detection site. After removing ambiguous and incomplete records, 8274 influenza viruses retain both HA and NA protein fragments (out of 13143 HA protein fragments and 9401 NA protein fragments); these compose the whole avian virus protein system, denoted as . According to their physicochemical properties [19], amino acids are divided into four types, namely, polar and hydrophilic (*pq*), polar and hydrophobic (*pr*), nonpolar and hydrophilic (*sq*), and nonpolar and hydrophobic (*sr*). Using adjacency statistics, a 16-dimension feature vector is extracted from each protein sequence by calculating the frequencies of the 4 × 4 = 16 ordered pairs of adjacent amino acid types. Therefore, a 32-dimension feature vector (16 dimensions from HA and 16 from NA) is extracted to represent a virus molecule.
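The feature extraction described above can be illustrated with a minimal Python sketch. It assumes that each of the 16 features is the relative frequency of one ordered pair of adjacent amino acid types and that the HA and NA vectors are simply concatenated; the function names and the normalization are our assumptions. The assignment of real amino acids to the four types follows [19] and is not reproduced here, so the example uses a hypothetical four-letter toy alphabet.

```python
from itertools import product

TYPES = ("pq", "pr", "sq", "sr")
# The 16 ordered pairs of adjacent types, in a fixed order.
TYPE_PAIRS = list(product(TYPES, repeat=2))


def adjacency_features(sequence, type_of):
    """Return 16 relative frequencies of adjacent type pairs.

    `type_of` maps each residue letter to one of the four types.
    Normalizing by the number of adjacent pairs is an assumption.
    """
    typed = [type_of[residue] for residue in sequence]
    counts = dict.fromkeys(TYPE_PAIRS, 0)
    for left, right in zip(typed, typed[1:]):
        counts[(left, right)] += 1
    total = max(len(typed) - 1, 1)  # avoid division by zero
    return [counts[pair] / total for pair in TYPE_PAIRS]


def virus_features(ha_seq, na_seq, type_of):
    """Concatenate the HA and NA vectors into one 32-dimension vector."""
    return adjacency_features(ha_seq, type_of) + adjacency_features(na_seq, type_of)


# Hypothetical toy alphabet, NOT the real amino acid classification of [19].
TOY_TYPES = {"A": "pq", "B": "pr", "C": "sq", "D": "sr"}
vector = virus_features("ABCD", "AAAA", TOY_TYPES)
print(len(vector))
```

With a real dataset, `type_of` would be replaced by the 20-letter mapping taken from [19], and each virus would contribute one such 32-dimension vector.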

##### 2.2. The Optimization Model for Extracting the Hierarchical Structure

A relation on a universe is a fuzzy proximity (FP) relation if it satisfies the reflexivity and symmetry [16, 20]. Furthermore, if is an FP relation on the universe and satisfies the separable condition (, ), then is called a separable FP relation (or SFP relation).

In [16], the granular space of FP (or SFP) relations on the universe was introduced, and then their properties were explored. Let be an FP (or SFP) relation on a finite universe , where is a dataset of -dimension space. For any , we define a relation , where is a crisp proximity relation that satisfies the reflexivity and symmetry. Then, the equivalent classes of the transitive closure can be marked by , which is derived by , and then is a granularity corresponding to . The set represents a fuzzy granular space on , which is an ordered set, and satisfies that the bigger the threshold is, the finer the granularity is, denoted by [16].
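As a rough illustration of how a threshold yields a granularity, the sketch below cuts a small proximity matrix at a threshold and returns the equivalence classes of the transitive closure, computed as connected components with a union-find structure. The cut rule (similarity greater than or equal to the threshold), the function name `granulate`, and the example matrix are assumptions made for illustration only.

```python
def granulate(relation, threshold):
    """Equivalence classes of the transitive closure of a thresholded relation.

    `relation` is a symmetric matrix with ones on the diagonal (reflexive,
    symmetric). Two items are linked when their similarity is >= threshold
    (an assumed cut rule); classes are the connected components.
    """
    n = len(relation)
    parent = list(range(n))

    def find(i):
        # Follow parents to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if relation[i][j] >= threshold:
                parent[find(i)] = find(j)

    classes = {}
    for i in range(n):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())


# Hypothetical proximity matrix on four items.
R = [
    [1.0, 0.9, 0.4, 0.2],
    [0.9, 1.0, 0.5, 0.2],
    [0.4, 0.5, 1.0, 0.8],
    [0.2, 0.2, 0.8, 1.0],
]
for lam in (0.3, 0.6, 0.95):
    print(lam, granulate(R, lam))
```

In this toy case the three thresholds give one, two, and four classes, which matches the statement that a bigger threshold produces a finer granularity.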

The granularity derived by is marked as , where satisfying the conditions that ( stands for the number of the elements in a set) and . Some properties are explored, such as () is the center of granule and the center of is . From the perspective of statistical theory, two indexes are introduced to measure the deviations within the classes and among the classes on the granulation [18, 20], defined, respectively, as follows:

where stands for the 2-norm in -dimension space.
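The two deviation indexes can be illustrated with a short Python sketch. It assumes the usual analysis-of-variance forms: the within-class deviation sums the squared 2-norm distances from each sample to the center of its granule, and the among-class deviation sums the squared distances from each granule center to the overall center, weighted by granule size. These forms and all names are assumptions, not a transcription of the original formulas.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def sq_dist(p, q):
    """Squared Euclidean (2-norm) distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def deviations(granules):
    """Return (within-class, among-class) deviations of a partition."""
    all_points = [p for granule in granules for p in granule]
    overall = centroid(all_points)
    within = 0.0
    among = 0.0
    for granule in granules:
        center = centroid(granule)
        within += sum(sq_dist(p, center) for p in granule)
        among += len(granule) * sq_dist(center, overall)
    return within, among


# Hypothetical samples, partitioned from coarse to fine.
points = [(0, 0), (0, 2), (4, 0), (4, 2)]
coarse = [points]
middle = [points[:2], points[2:]]
fine = [[p] for p in points]
for partition in (coarse, middle, fine):
    print(deviations(partition))
```

On this toy data the within-class deviation falls and the among-class deviation rises as the partition is refined, while their sum stays the same, which is the behaviour the text relies on.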

By analyzing the variance within and among the classes in statistics [21], is monotone increasing, with the granularity changing from the coarse to the fine, while is gradually decreasing. Notably, the total deviation () is always constant . Additionally, and . Therefore, a fuzzy hierarchical evaluation index (FHEI) based on the fuzzy granular space is proposed as follows:

We establish an optimization model to determine the reasonable granulation in the granular space with the minimal objective; that is, reaches the minimum. There exists only one to meet the optimization model, marked as Model (2):

*Remark 1.* Model (2) is a global optimization model without constraints on the hierarchical structure of the finite universe . By contrast, the model in [18] for determining the optimal hierarchical clustering has the restriction .

Given an FP relation (or SFP relation) on the finite set and , satisfying , an algorithm is presented to detect the optimized hierarchical clustering and construct the hierarchy of complex system based on the fuzzy granular space [16].

*Algorithm A.*

Input: an FP relation (or SFP relation).

Output: the optimized hierarchical structure and the corresponding threshold.

*Step 1.*

*Step 2.*

*Step 3.*

*Step 4.*

*Step 5.* For any , , .

*Step 6.* For , if , satisfying , , .

*Step 7.*

*Step 8.* If , ; otherwise, go to Step 5.

*Step 9.* If , ; otherwise, go to Step 2.

*Step 10.* If , , go to Step 2.

*Step 11.* Output , and .

The computational complexity of Algorithm A is . Concrete problems are decomposed hierarchically, which is consistent with the core idea of GrC. Given an FP (or SFP) relation on the finite set , the optimized clustering structure constructed by Algorithm A is its first-level structure. Its second-level structure is obtained by applying Algorithm A again to each equivalence class in the first-level structure. Therefore, Algorithm A can be used to construct a multilevel structure in practical applications.

##### 2.3. Identification of Granular Signature

Once the optimal granularity of the complex system is determined, it is crucial to construct information granules that abstract the original samples. Generally, granules are obtained according to the principle that samples with the same features assemble in one granule, and the average of all samples in a class, or the center of the class, is effective for representing its core information. Suppose that a multilevel structure (or granularity) is constructed, where . To reduce the complexity of the system, feature viruses (or signature viruses) can be extracted to approximately represent each equivalence class. According to the nearest-to-center principle, an objective function for selecting the signature is established, formulated as follows:

where is the signature item of the granule and is a signature set of the granularity . In this sense, the signature set can be used to approximately represent the complex system .
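The nearest-to-center principle can be sketched in a few lines of Python. The sketch assumes that the center of a granule is the mean of its feature vectors and that closeness is measured by Euclidean distance; the function names and the sample data are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def select_signatures(granules):
    """Pick, for each granule, the member closest to the granule center.

    The signature is always an actual sample, not the computed mean.
    Ties are resolved in favour of the first member encountered.
    """
    signatures = []
    for granule in granules:
        center = centroid(granule)
        signatures.append(min(granule, key=lambda p: sq_dist(p, center)))
    return signatures


# Hypothetical two-dimensional granules.
granules = [
    [(0, 0), (1, 0), (5, 0)],
    [(10, 10), (11, 10), (13, 10)],
]
print(select_signatures(granules))
```

Choosing a real sample rather than the mean keeps each signature interpretable as an observed virus, which is what allows the signature set to stand in for the whole system.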

##### 2.4. Validation of Granular Signature Set

To evaluate the performance of the selected signature set , a classifier, marked as Model (3), is designed to classify the remaining samples of the corresponding classes according to the principle of maximum similarity. Given a virus (), the classifier is designed as follows:

where , , and is the class the virus belongs to.

Model (3) states that the signature viruses are treated as the classifying targets and the other samples in are assigned to classes. All samples in are divided into classes according to Model (3), marked as , . The accuracy ratio is introduced to measure the efficiency of signature set for constructing the multilevel structure . It is defined as

Formula (11) defines the overlapped ratio, which measures how well the obtained signatures represent the whole virus system; the larger the value, the better the result.
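The validation procedure can be illustrated with a small Python sketch under two stated assumptions: maximum similarity is taken to mean minimum Euclidean distance to a signature, and the accuracy ratio is taken to be the fraction of all samples that the classifier returns to their original granule. Neither choice is confirmed by the formulas of this section, and the function names and data are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def classify(sample, signatures):
    """Index of the signature most similar to `sample`.

    Similarity is assumed to be inverse Euclidean distance.
    """
    return min(range(len(signatures)), key=lambda k: sq_dist(sample, signatures[k]))


def accuracy_ratio(granules, signatures):
    """Fraction of samples assigned back to their original granule.

    `signatures[k]` is the signature of `granules[k]`.
    """
    total = sum(len(granule) for granule in granules)
    hits = sum(
        1
        for index, granule in enumerate(granules)
        for sample in granule
        if classify(sample, signatures) == index
    )
    return hits / total


# Hypothetical granules and their signatures.
granules = [
    [(0, 0), (1, 0), (7, 0)],
    [(10, 0), (11, 0), (13, 0)],
]
signatures = [(1, 0), (11, 0)]
print(classify((7, 0), signatures), accuracy_ratio(granules, signatures))
```

In this toy case the sample (7, 0) lies closer to the second signature than to its own, so one of six samples is misassigned; a ratio near 1 would indicate that the signatures reproduce the original structure well.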

#### 3. Results and Analysis

In this section, we apply the proposed model to the avian influenza virus system to construct its evolutionary structure. The system contains 8274 viruses that have both HA and NA protein fragments, spread across 8 subtypes, as listed in Table 1.