Computational and Mathematical Methods in Medicine

Volume 2016 (2016), Article ID 1782732, 8 pages

http://dx.doi.org/10.1155/2016/1782732

## Use of a Novel Grammatical Inference Approach in Classification of Amyloidogenic Hexapeptides

^{1}Faculty of Computer Science and Materials Science, University of Silesia, Ulica Zytnia 12, 41-200 Sosnowiec, Poland

^{2}Department of Computer Engineering, Faculty of Electronics, Wroclaw University of Science and Technology, Wybrzeże Wyspianskiego 27, 50-370 Wroclaw, Poland

Received 22 October 2015; Accepted 17 February 2016

Academic Editor: Humberto González-Díaz

Copyright © 2016 Wojciech Wieczorek and Olgierd Unold. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

This paper contributes to bioinformatics by applying grammatical inference to the analysis of data. We developed an algorithm for generating star-free regular expressions, which turned out to be good recommendation tools, as they are characterized by a relatively high correlation coefficient between the observed and predicted binary classifications. The experiments were performed on three datasets of amyloidogenic hexapeptides, and our results are compared with those obtained using graph approaches, the current state-of-the-art methods in heuristic automata induction, and the support vector machine. The results show the superior performance of the new grammatical inference algorithm on fixed-length amyloid datasets.

#### 1. Introduction

Grammatical inference (GI) is an intensively studied area of research that sits at the intersection of several fields including formal languages, machine learning, language processing, and learnability theory. The main task of the field is to find some unknown rule from given elements: examples and counterexamples. This presentation of elements may be finite (in practice) or infinite (in theory). As this study focuses on obtaining a regular expression from finite positive and negative data, the various models of incremental learning and their decidability questions are not discussed. The book by de la Higuera [1] is a good reference on such theoretical aspects of grammatical inference.

Here and subsequently $S = (S_+, S_-)$ stands for a sample where $S_+$ is the set of examples and $S_-$ is the set of counterexamples over a fixed alphabet $\Sigma$. Our aim is to obtain a compact description of a finite language $L$ satisfying all the following conditions: (i) $S_+ \subseteq L$, (ii) $S_- \cap L = \emptyset$, and (iii) . We will consider a star-free regular expression (i.e., without the Kleene closure operator) as the compact description of a language $L$. It is worth emphasizing that such a formulation of an induction problem is justified by intended applications in bioinformatics. A sample in biological or medical domains consists of positive and negative objects (mainly proteins) with certain properties, whereas a star-free regular expression may serve to predict new objects. The data explored by Tian et al. [2] and Maurer-Stroh et al. [3] are good illustrations. They consist of examples and counterexamples of amyloids, that is, proteins which have been associated with the pathology of more than 20 serious human diseases. In the experimental part of the present paper, we examine the binary classification efficiency for selected real biological/medical data. By binary classification, we mean mapping a string to one of two classes by means of induced regular expressions (regex). For classification, especially for two-class problems, a variety of measures has been proposed. Since our experiments lie in a (bio)medical context, the Matthews Correlation Coefficient is regarded as the primary score, as the goal of the whole process is to predict new strings that are likely to be positive.
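For reference, the Matthews Correlation Coefficient can be computed directly from the four confusion-matrix counts. The following Python sketch is illustrative only: the function name `mcc` and its parameter names are assumptions introduced for this illustration, and returning 0 when the denominator vanishes is a common convention rather than something prescribed in this paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts.

    tp/tn: correctly predicted positive/negative strings,
    fp/fn: wrongly predicted positive/negative strings.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: the score is 0 when any marginal sum is zero.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The score ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), which makes it suitable for imbalanced samples.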

A number of works are closely related to our study. Angluin showed that the problem of inferring a minimum-size regular expression satisfying (i), (ii), and (iii) remains NP-complete even if a regex is required to be star-free (containing no “$*$” operations) [4]. In our previous work [5] similar bioinformatics datasets were analyzed, but with different acceptors, namely directed acyclic word graphs. Some classical automata learning algorithms like ECGI [6], $k$-RI [7], and $k$-TSSI [8] could be applied to the problem, but they do not make use of counterexamples. Many authors have advocated the benefit of viewing biological sequences as sentences derived from a formal grammar or automaton. As a good bibliographical starting point, see the articles by Coste and Kerbellec [9], Sakakibara [10], and Searls [11]. In connection with this problem of data classification, it is worth remembering that another field of computer science can also be involved, namely, machine learning (ML), which includes such methods as classification trees, clustering, the support vector machine [12], and rough sets [13]. All the above-mentioned ML methods aim at a compact description of input data, though in various ways. In view of our applications, they have, however, a drawback: they are not suited for variable-length data.

In the present algorithm a star-free regular expression (SFRE) is obtained from a learning sample containing the examples and counterexamples (these are also called positive and negative words). It is a two-phase procedure. In the first phase an initial graph is built in order to reveal possible substring interchanges. In the second phase all maximal cliques of the graph are generated to build a SFRE. We have implemented our induction algorithm of a SFRE and applied it to a real bioinformatics task, that is, classification of amyloidogenic hexapeptides. Amyloids are proteins capable of forming fibrils instead of the functional structure of a protein [14] and are responsible for a group of diseases called amyloidosis, such as Alzheimer’s, Huntington’s disease, and type II diabetes [15]. Furthermore, it is believed that short segments of proteins, like hexapeptides (6-residue fragments), can be responsible for amyloidogenic properties [16]. Since it is not possible to experimentally test all such sequences, several computational tools for predicting amyloid chains have emerged, inter alia, based on physicochemical properties [17] or using machine learning approaches [18–21].

To test the performance of our SFRE approach, the following six additional programs have been used in experiments: the implementation of the Trakhtenbrot-Barzdin state merging algorithm, as described in [22]; the implementation of Rodney Price’s Abbadingo winning idea of evidence-driven state merging [23]; a program based on the Rlb state merging algorithm [24]; ADIOS (for Automatic Distillation of Structure)—a context-free grammar learning system, which relies on a statistical method for pattern extraction and on structured generalization [25]; our previous approach with directed acyclic word graphs [5]; and, as an instance of ML methods, the support vector machine [26].

A rigorous statistical procedure has been applied to compare all the above methods in terms of the correlation between the observed and predicted binary classification (Matthews Correlation Coefficient, MCC). The proposed approach significantly outperforms both the GI-based methods and the ML algorithm on fixed-length amyloid datasets.

#### 2. Materials and Methods

##### 2.1. Datasets

The algorithm for generating star-free regular expressions (SFRE) has been tested on three recently published Hexpepset datasets, that is, Waltz [3], WALTZ-DB [27], and exPafig [5]. The first two databases consist only of experimentally asserted amyloid sequences. Note that the choice of experimentally verified short peptides is very limited since very few data are available. The Waltz dataset was published in 2010 and is composed of 116 hexapeptides known to induce amyloidosis ($S_+$) and 161 hexapeptides that do not induce amyloidosis ($S_-$). The WALTZ-DB was prepared by the same team in the Switch Lab at KU Leuven and published in 2015. This dataset expands the Waltz set to a total of 1089 hexapeptides. According to Beerten et al. [27], an additional 720 hexapeptides were derived from 63 different proteins and combined with 89 peptides taken from the literature. In the WALTZ-DB database, 244 hexapeptides are regarded as positive for amyloid formation ($S_+$) and 845 hexapeptides as negative for amyloid formation ($S_-$).

The SFRE algorithm was also trained and validated on a database (denoted by exPafig), which was computationally obtained with the Pafig method [2] and then statistically processed [5]. exPafig consists of 150 amyloid-positive hexapeptides ($S_+$) and 2259 negative hexapeptides ($S_-$). As can be seen, the database is strongly imbalanced.

##### 2.2. An Algorithm for the Induction of a SFRE

###### 2.2.1. Definitions

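*Definition 1.* $\Sigma$ will be a finite nonempty set, the *alphabet*. $\Sigma^+$ will denote the set of all nonempty strings over the alphabet $\Sigma$. If $s, t \in \Sigma^+$, the concatenation of $s$ and $t$, written $st$, will denote the string formed by making a copy of $s$ and following it by a copy of $t$. If $A, B \subseteq \Sigma^+$, then

$$AB = \{\, st \mid s \in A,\ t \in B \,\}.$$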

To simplify the representations for finite languages, we define the notion of star-free regular expressions over alphabet $\Sigma$ as follows.

*Definition 2.* The set of *star-free regular expressions* (SFREs) over $\Sigma$ will be the set of strings $R$ such that (1) $\emptyset \in R$, which represents the empty set; (2) $\Sigma \subseteq R$; each element $a$ of the alphabet represents language $\{a\}$; (3) if $r_A$ and $r_B$ are SFREs representing languages $A$ and $B$, respectively, then $(r_A + r_B) \in R$ and $(r_A r_B) \in R$, representing $A \cup B$ and $AB$, respectively, where the symbols $($, $)$, $+$ are not in $\Sigma$.

We will freely omit unnecessary parentheses from SFREs, assuming that concatenation has higher priority than union. If $r$ represents language $L$, we will write $L = L(r)$.

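*Definition 3.* A *sample* $S$ over $\Sigma$ will be an ordered pair $S = (S_+, S_-)$ where $S_+$, $S_-$ are finite subsets of $\Sigma^+$ and $S_+ \cap S_- = \emptyset$. $S_+$ will be called the *positive part of* $S$, and $S_-$ the *negative part of* $S$. A star-free regular expression $r$ is *consistent* (or *compatible*) with a sample $S = (S_+, S_-)$ if and only if $S_+ \subseteq L(r)$ and $S_- \cap L(r) = \emptyset$.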
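The consistency condition of Definition 3 can be checked mechanically. The sketch below is a minimal illustration, assuming the SFRE is written with `|` for union so that it can be passed unchanged to Python's `re` module; the helper names `accepts` and `is_consistent` and the toy sample are hypothetical and not taken from the datasets.

```python
import re
from typing import Iterable


def accepts(sfre: str, word: str) -> bool:
    """True if the whole word belongs to the language of the expression."""
    return re.fullmatch(sfre, word) is not None


def is_consistent(sfre: str, positives: Iterable[str],
                  negatives: Iterable[str]) -> bool:
    """Every example is accepted and no counterexample is accepted."""
    return (all(accepts(sfre, w) for w in positives)
            and not any(accepts(sfre, w) for w in negatives))


# Toy expression over the alphabet {a, b, c, d} (illustrative only).
toy_sfre = "(a|b)(c|d)"
print(is_consistent(toy_sfre, {"ac", "bd"}, {"ca"}))  # True
print(is_consistent(toy_sfre, {"ac", "bd"}, {"ad"}))  # False
```

Using `re.fullmatch` rather than `re.match` matters here: membership in a finite language requires the entire string to match, not just a prefix.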

*Definition 4.* A *graph* $G$ is a finite nonempty set of objects called *vertexes* together with a (possibly empty) set of unordered pairs of distinct vertexes of $G$ called *edges*. The vertex set of $G$ is denoted by $V(G)$, while the edge set is denoted by $E(G)$. The edge $e = \{u, v\}$ is said to *join* the vertexes $u$ and $v$. If $e = \{u, v\}$ is an edge of a graph $G$, then $u$ and $v$ are *adjacent vertexes*. In a graph $G$, a *clique* is a subset of the vertex set $C \subseteq V(G)$ such that every two vertexes in $C$ are adjacent. By definition, a clique may also be composed of only one vertex. If a clique does not exist exclusively within the vertex set of a larger clique, then it is called a *maximal clique*.

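*Definition 5.* Let $\Sigma$ be an alphabet and let $G$ be a graph. Suppose that every vertex in $G$ is associated with an ordered pair of nonempty strings over $\Sigma$; that is, $V(G) = \{v_1, v_2, \ldots, v_n\}$, where $v_i = (u_i, w_i) \in \Sigma^+ \times \Sigma^+$ for $1 \le i \le n$. Let $C = \{v_{i_1}, v_{i_2}, \ldots, v_{i_m}\}$ be a clique in $G$. Then $(u_{i_1} + \cdots + u_{i_m})(w_{i_1} + \cdots + w_{i_m})$ is a star-free regular expression over $\Sigma$ *induced by* $C$.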

For simplicity’s sake, we also denote the set $\{u_{i_1}, \ldots, u_{i_m}\}$ by $U$ and the set $\{w_{i_1}, \ldots, w_{i_m}\}$ by $W$ in the context of $C$.
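As a small illustration of Definition 5, the following Python sketch builds the expression induced by a clique whose vertexes are given as pairs of strings, together with the language it represents. The function names `induced_sfre` and `induced_language` are hypothetical, and the union operator is rendered as `|` (Python regex syntax) purely for convenience.

```python
def induced_sfre(clique):
    """Expression (u1|...|um)(w1|...|wm) induced by a clique of (u, w) pairs."""
    firsts = sorted({u for u, _ in clique})    # left-hand strings
    seconds = sorted({w for _, w in clique})   # right-hand strings
    return "(" + "|".join(firsts) + ")(" + "|".join(seconds) + ")"


def induced_language(clique):
    """All concatenations of a left-hand string with a right-hand string."""
    return {u + w for u, _ in clique for _, w in clique}


toy_clique = [("a", "c"), ("b", "d")]
print(induced_sfre(toy_clique))              # (a|b)(c|d)
print(sorted(induced_language(toy_clique)))  # ['ac', 'ad', 'bc', 'bd']
```

The example shows the generalization effect: two vertexes spelling `ac` and `bd` induce an expression that also accepts the interchanged words `ad` and `bc`.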

###### 2.2.2. The Algorithm

In this section, we show how to generate a SFRE compatible with a given sample. These expressions do not have many theoretical properties but perform remarkably well in the analysis of some bioinformatics data in terms of classification quality.

Let $S = (S_+, S_-)$ be a sample over $\Sigma$ in which every string is at least of length 2. Construct the graph $G$ with vertex set and with edge set given by Next, find a set of cliques in $G$ such that . For this purpose one can take advantage of an algorithm proposed by Tomita et al. [28] for generating all maximal cliques. Although it takes $O(3^{n/3})$ time in the worst case for an $n$-vertex graph, computational experiments described in Section 3 demonstrate that it runs very fast in practice (a few seconds for thousands of vertexes). Finally, return the union of SFREs induced by all maximal cliques ; that is, .
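The two phases can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. In particular, `build_graph` is a hypothetical stand-in for the graph construction: it assumes that vertexes are all splits of positive words into two nonempty parts and that two vertexes $(a, b)$ and $(c, d)$ are adjacent when they are cut at the same position and neither $ad$ nor $cb$ is a counterexample. The clique enumeration is the Bron-Kerbosch scheme with the pivot choice used by Tomita et al.; union is written as `|`.

```python
from itertools import combinations


def build_graph(positives, negatives):
    """ASSUMED graph: vertexes are splits (u, w) of positive words; edges
    join same-position splits whose interchanges avoid all counterexamples."""
    vertexes = sorted({(s[:i], s[i:]) for s in positives
                       for i in range(1, len(s))})
    adj = {v: set() for v in vertexes}
    for (a, b), (c, d) in combinations(vertexes, 2):
        if len(a) == len(c) and a + d not in negatives and c + b not in negatives:
            adj[(a, b)].add((c, d))
            adj[(c, d)].add((a, b))
    return adj


def maximal_cliques(adj):
    """Yield every maximal clique (Bron-Kerbosch with pivoting)."""
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot maximizing the number of neighbors among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())


def induce_sfre(positives, negatives):
    """Union of the expressions induced by all maximal cliques."""
    parts = []
    for clique in maximal_cliques(build_graph(positives, negatives)):
        firsts = sorted({u for u, _ in clique})
        seconds = sorted({w for _, w in clique})
        parts.append("(" + "|".join(firsts) + ")(" + "|".join(seconds) + ")")
    return "|".join(sorted(parts))


print(induce_sfre({"ab", "cd"}, {"ad"}))  # (a)(b)|(c)(d)
print(induce_sfre({"ab", "cd"}, set()))   # (a|c)(b|d)
```

In the first call the counterexample `ad` blocks the edge between the two vertexes, so each positive word keeps its own clique; in the second call the single clique generalizes to the interchanged words as well.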

In order to reduce the computational complexity of the induction, instead of Tomita’s algorithm, the ensuing randomized procedure could be applied. Consecutive cliques with their catenations are determined until . The catenations emerge in the following manner. In step , a vertex for which is chosen at random. Let and . Then sets and are updated by adding words from the randomly chosen neighbor of , say , and subsequently by adding words from the randomly chosen neighbor of , and so forth. In the end, a maximal clique is obtained for which . Naturally, fulfills , and the whole procedure runs in polynomial time with respect to the input size.
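The randomized procedure can be sketched as a greedy clique cover. The Python code below is an illustration under assumptions: the graph is supplied as a hypothetical adjacency dictionary `adj` mapping each vertex `(u, w)` to the set of its neighbors, each vertex is assumed to spell the positive word `u + w`, and the names `randomized_cover` and `seed` are introduced here for the example only.

```python
import random


def randomized_cover(adj, seed=None):
    """Greedily collect maximal cliques until every word spelled by a vertex
    is generated by the catenation of some chosen clique."""
    rng = random.Random(seed)
    covered = set()   # words produced by the cliques chosen so far
    cliques = []
    while True:
        uncovered = [v for v in sorted(adj) if v[0] + v[1] not in covered]
        if not uncovered:
            return cliques
        start = rng.choice(uncovered)
        clique = {start}
        candidates = set(adj[start])  # common neighbors of the clique
        while candidates:
            nxt = rng.choice(sorted(candidates))
            clique.add(nxt)
            candidates &= adj[nxt]    # keep only vertexes adjacent to all
        cliques.append(clique)
        # Catenation of the clique: every left part with every right part.
        covered |= {u + w for u, _ in clique for _, w in clique}


toy_adj = {
    ("a", "b"): {("c", "d")},
    ("c", "d"): {("a", "b")},
    ("e", "f"): set(),
}
print(len(randomized_cover(toy_adj, seed=0)))  # 2
```

Each outer iteration covers at least the word of the starting vertex, so the number of cliques never exceeds the number of vertexes, and each clique is grown by at most one pass over shrinking candidate sets; this is what keeps the sketch polynomial in the size of the graph.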

Here are some elementary properties of a resultant expression $r$ and the complexity of the induction algorithm.

(i) $S_- \cap L(r) = \emptyset$ is implied from (4).

(ii) If all strings in a sample have equal length, let us say $\ell$, then all strings from $L(r)$ also are of the same length $\ell$.

(iii) Let . A graph $G$, based on (3) and (4), may be constructed in time. Determining a set of cliques and corresponding regular expressions also takes no more than time, assuming that the graph is represented by adjacency lists. Thus, the overall computational complexity is .

###### 2.2.3. An Illustrative Run

Suppose , is a sample (one possible explanation for the input is that each follows at least one ). A constructed graph is depicted in Figure 1. It has three maximal cliques, and regardless of which method is selected, Tomita’s or the randomized algorithm, all of them would be determined in this case. The final SFRE induced by the cliques is Among all words of length four over the alphabet it does not accept , but accepts .