Abstract

The present paper makes a novel contribution to the field of bioinformatics by applying grammatical inference to the analysis of biological data. We developed an algorithm for generating star-free regular expressions which turned out to be good recommendation tools, as they are characterized by a relatively high correlation coefficient between the observed and predicted binary classifications. The experiments have been performed on three datasets of amyloidogenic hexapeptides, and our results are compared with those obtained using the graph approaches, the current state-of-the-art methods in heuristic automata induction, and the support vector machine. The results showed the superior performance of the new grammatical inference algorithm on fixed-length amyloid datasets.

1. Introduction

Grammatical inference (GI) is an intensively studied area of research that sits at the intersection of several fields, including formal languages, machine learning, language processing, and learnability theory. The main task of the field is to find an unknown rule when given some of its elements: examples and counterexamples. This presentation of elements may be finite (in practice) or infinite (in theory). Since this study focuses on obtaining a regular expression from finite positive and negative data, the various models of incremental learning and their decidability questions are not discussed here. The book by de la Higuera [1] can be of major help on such theoretical aspects of grammatical inference.

Here and subsequently, $S = (S_+, S_-)$ stands for a sample, where $S_+$ is the set of examples and $S_-$ is the set of counterexamples over a fixed alphabet $\Sigma$. Our aim is to obtain a compact description of a finite language $L$ satisfying all the following conditions: (i) $S_+ \subseteq L$, (ii) $S_- \cap L = \emptyset$, and (iii) $|L| > |S_+|$. We will consider a star-free regular expression (i.e., one without the Kleene closure operator) as the compact description of a language $L$. It is worth emphasizing that such a formulation of the induction problem is justified by the intended applications in bioinformatics. A sample in biological or medical domains consists of positive and negative objects (mainly proteins) with certain properties, whereas a star-free regular expression may serve to predict new objects. The data explored by Tian et al. [2] and Maurer-Stroh et al. [3] are good illustrations. They consist of examples and counterexamples of amyloids, that is, proteins which have been associated with the pathology of more than 20 serious human diseases. In the experimental part of the present paper, we undertake an examination of binary classification efficiency for selected real biological/medical data. By binary classification, we mean mapping a string to one out of two classes by means of induced regular expressions (regexes). For classification, especially for two-class problems, a variety of measures has been proposed. Since our experiments lie in a (bio)medical context, the Matthews Correlation Coefficient is regarded as the primary score, as the goal of the whole process is to predict new strings that are likely to be positive.

There are a number of works closely related to our study. Angluin showed that the problem of inferring a minimum-size regular expression satisfying (i), (ii), and (iii) remains NP-complete even if the regex is required to be star-free (containing no "$*$" operations) [4]. In our previous work [5], similar bioinformatics datasets were analyzed, but with different acceptors, namely, directed acyclic word graphs. Some classical automata learning algorithms like ECGI [6], $k$-RI [7], and $k$-TSSI [8] could be applied to the problem, but they do not make use of counterexamples. Many authors have advocated the benefit of viewing biological sequences as sentences derived from a formal grammar or automaton. As a good bibliographical starting point, see the articles by Coste and Kerbellec [9], Sakakibara [10], and Searls [11]. In connection with this problem of data classification, it is worth remembering that there is a field of computer science that can also be involved, namely, machine learning (ML), which includes such methods as classification trees, clustering, the support vector machine [12], and rough sets [13]. All the above-mentioned ML methods aim at a compact description of input data, though in various ways. In view of our applications, however, they have a drawback: they are not suited to variable-length data.

In the present algorithm a star-free regular expression (SFRE) is induced from a learning sample containing examples and counterexamples (also called positive and negative words). It is a two-phase procedure. In the first phase an initial graph is built in order to reveal possible substring interchanges. In the second phase all maximal cliques of the graph are found and used to build a SFRE. We have implemented our SFRE induction algorithm and applied it to a real bioinformatics task, that is, the classification of amyloidogenic hexapeptides. Amyloids are proteins capable of forming fibrils instead of the functional structure of a protein [14] and are responsible for a group of diseases called amyloidoses, such as Alzheimer's disease, Huntington's disease, and type II diabetes [15]. Furthermore, it is believed that short segments of proteins, like hexapeptides consisting of 6-residue fragments, can be responsible for amyloidogenic properties [16]. Since it is not possible to test all such sequences experimentally, several computational tools for predicting amyloid chains have emerged, based, inter alia, on physicochemical properties [17] or on machine learning approaches [18–21].

To test the performance of our SFRE approach, the following six additional programs have been used in the experiments: the implementation of the Trakhtenbrot-Barzdin state merging algorithm, as described in [22]; the implementation of Rodney Price's Abbadingo winning idea of evidence-driven state merging [23]; a program based on the Rlb state merging algorithm [24]; ADIOS (Automatic Distillation of Structure), a context-free grammar learning system which relies on a statistical method for pattern extraction and on structured generalization [25]; our previous approach with directed acyclic word graphs [5]; and, as an instance of ML methods, the support vector machine [26].

A rigorous statistical procedure has been applied to compare all the above methods in terms of the correlation between the observed and predicted binary classifications (Matthews Correlation Coefficient, MCC). The proposed approach significantly outperforms both the GI-based methods and the ML algorithm on fixed-length amyloid datasets.

2. Materials and Methods

2.1. Datasets

The algorithm for generating star-free regular expressions (SFRE) has been tested on three recently published hexapeptide datasets, namely, Waltz [3], WALTZ-DB [27], and exPafig [5]. The first two databases consist only of experimentally asserted amyloid sequences. Note that the choice of experimentally verified short peptides is very limited, since very few data are available. The Waltz dataset was published in 2010 and is composed of 116 hexapeptides known to induce amyloidosis and 161 hexapeptides that do not induce amyloidosis. WALTZ-DB was prepared by the same research team in the Switch Lab at KU Leuven and published in 2015. This dataset expands the Waltz set to a total of 1089 hexapeptides. According to Beerten et al. (2015), an additional 720 hexapeptides were derived from 63 different proteins and combined with 89 peptides taken from the literature [27]. In the WALTZ-DB database, 244 hexapeptides are regarded as positive for amyloid formation and 845 hexapeptides as negative for amyloid formation.

The SFRE algorithm was also trained and validated on a database (denoted by exPafig) which was computationally obtained with the Pafig method [2] and then statistically processed [5]. exPafig consists of 150 amyloid-positive hexapeptides and 2259 negative hexapeptides. As can be seen, this database is strongly imbalanced.

2.2. An Algorithm for the Induction of a SFRE
2.2.1. Definitions

Definition 1. $\Sigma$ will be a finite nonempty set, the alphabet. $\Sigma^+$ will denote the set of all nonempty strings over the alphabet $\Sigma$. If $x, y \in \Sigma^+$, the concatenation of $x$ and $y$, written $xy$, will denote the string formed by making a copy of $x$ and following it by a copy of $y$. If $A, B \subseteq \Sigma^+$, then
$$AB = \{xy : x \in A,\ y \in B\}. \tag{1}$$

To simplify the representation of finite languages, we define the notion of star-free regular expressions over an alphabet $\Sigma$ as follows.

Definition 2. The set of star-free regular expressions (SFREs) over $\Sigma$ will be the set of strings such that
(1) $\emptyset$ is a SFRE, which represents the empty set;
(2) every $a \in \Sigma$ is a SFRE; each element $a$ of the alphabet represents the language $\{a\}$;
(3) if $r$ and $s$ are SFREs representing languages $R$ and $S$, respectively, then $(r + s)$ and $(r \cdot s)$ are SFREs representing $R \cup S$ and $RS$, respectively, where the symbols $($, $)$, $+$, and $\cdot$ are not in $\Sigma$.
We will freely omit unnecessary parentheses from SFREs, assuming that concatenation has higher priority than union. If $r$ represents the language $R$, we will write $L(r) = R$.

Definition 3. A sample over $\Sigma$ will be an ordered pair $S = (S_+, S_-)$, where $S_+$, $S_-$ are finite subsets of $\Sigma^+$ and $S_+ \cap S_- = \emptyset$. $S_+$ will be called the positive part of $S$, and $S_-$ the negative part of $S$. A star-free regular expression $r$ is consistent (or compatible) with a sample $S$ if and only if $S_+ \subseteq L(r)$ and $S_- \cap L(r) = \emptyset$.
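
By way of illustration (this sketch is ours, not the authors' implementation), the restricted form of SFRE used by the induction algorithm below, a union of "clique expressions", can be represented in Python together with the consistency test of Definition 3; all function names are hypothetical:

from itertools import product

def language(sfre):
    """Finite language of a SFRE given as a list of (P, S) pairs,
    where (P, S) denotes the expression (p1 + ... + pm)(s1 + ... + sm)."""
    words = set()
    for P, S in sfre:
        for p, s in product(P, S):
            words.add(p + s)
    return words

def consistent(sfre, positives, negatives):
    """Definition 3: r is consistent with (S+, S-) iff S+ is a subset
    of L(r) and S- is disjoint from L(r)."""
    lang = language(sfre)
    return set(positives) <= lang and lang.isdisjoint(set(negatives))

# (a + ab)(b) represents the language {ab, abb}.
assert language([({"a", "ab"}, {"b"})]) == {"ab", "abb"}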

Definition 4. A graph $G$ is a finite nonempty set of objects called vertexes together with a (possibly empty) set of unordered pairs of distinct vertexes of $G$ called edges. The vertex set of $G$ is denoted by $V(G)$, while the edge set is denoted by $E(G)$. The edge $e = \{u, v\}$ is said to join the vertexes $u$ and $v$. If $e = \{u, v\}$ is an edge of a graph $G$, then $u$ and $v$ are adjacent vertexes. In a graph $G$, a clique is a subset $K$ of the vertex set such that every two vertexes in $K$ are adjacent. By definition, a clique may also be composed of only one vertex. If a clique is not contained in the vertex set of a larger clique, then it is called a maximal clique.

Definition 5. Let $\Sigma$ be an alphabet and let $G$ be a graph. Suppose that every vertex $v \in V(G)$ is associated with an ordered pair of nonempty strings over $\Sigma$; that is, $v = (p_v, s_v)$, where $p_v, s_v \in \Sigma^+$ for $v \in V(G)$. Let $K = \{v_1, v_2, \ldots, v_m\}$ be a clique in $G$. Then
$$(p_{v_1} + p_{v_2} + \cdots + p_{v_m}) \cdot (s_{v_1} + s_{v_2} + \cdots + s_{v_m}) \tag{2}$$
is a star-free regular expression over $\Sigma$ induced by $K$.
For simplicity's sake, we also denote the set $\{p_{v_1}, p_{v_2}, \ldots, p_{v_m}\}$ by $P$ and the set $\{s_{v_1}, s_{v_2}, \ldots, s_{v_m}\}$ by $S$ in the context of a clique $K$.
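
Continuing the sketch above, the expression induced by a clique (Definition 5) can be obtained as follows; clique_expression is again a hypothetical helper:

def clique_expression(clique):
    """Definition 5: a clique of vertexes (p_v, s_v) induces the SFRE
    (p1 + ... + pm)(s1 + ... + sm), stored here as a (P, S) pair."""
    P = {p for p, s in clique}
    S = {s for p, s in clique}
    return (P, S)

# The two-vertex clique {(a, b), (ab, ab)} induces (a + ab)(b + ab),
# whose language is {ab, aab, abb, abab}.
assert language([clique_expression([("a", "b"), ("ab", "ab")])]) == \
    {"ab", "aab", "abb", "abab"}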

2.2.2. The Algorithm

In this section, we show how to generate a SFRE compatible with a given sample. Such expressions do not have many theoretical properties, but they perform remarkably well in the analysis of some bioinformatics data in terms of classification quality.

Let $S = (S_+, S_-)$ be a sample over $\Sigma$ in which every string is at least of length 2. Construct the graph $G$ with vertex set
$$V(G) = \{(p, s) : p, s \in \Sigma^+ \text{ and } ps \in S_+\} \tag{3}$$
and with edge set given by
$$E(G) = \{\{u, v\} : u \neq v,\ |p_u| = |p_v|,\ p_u s_v \notin S_-,\ \text{and } p_v s_u \notin S_-\}. \tag{4}$$
Next, find a set $C$ of cliques in $G$ such that $\bigcup_{K \in C} K = V(G)$. For this purpose one can take advantage of an algorithm proposed by Tomita et al. [28] for generating all maximal cliques. Although it takes $O(3^{n/3})$ time in the worst case for an $n$-vertex graph, computational experiments described in Section 3 demonstrate that it runs very fast in practice (a few seconds for thousands of vertexes). Finally, return the union of the SFREs induced by all maximal cliques $K \in C$; that is, $r = \sum_{K \in C} r_K$, where $r_K$ denotes the expression (2) induced by $K$.
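
Under the reconstruction of (3) and (4) given above, the two-phase induction admits a compact Python sketch; networkx's find_cliques enumerates all maximal cliques using the Bron-Kerbosch pivoting strategy of Tomita et al. [28]:

import networkx as nx

def induce_sfre(positives, negatives):
    """Return a SFRE, as a list of (P, S) clique expressions,
    consistent with the sample (positives, negatives)."""
    neg = set(negatives)
    G = nx.Graph()
    # Vertex set (3): every two-part split p|s of each positive word
    # (every word is assumed to have length >= 2).
    for w in positives:
        for i in range(1, len(w)):
            G.add_node((w[:i], w[i:]))
    # Edge set (4): equal split position (this preserves word length,
    # property (ii) below) and both cross catenations avoid S-.
    nodes = list(G.nodes)
    for i, (pu, su) in enumerate(nodes):
        for pv, sv in nodes[i + 1:]:
            if (len(pu) == len(pv)
                    and pu + sv not in neg and pv + su not in neg):
                G.add_edge((pu, su), (pv, sv))
    # Every vertex lies in some maximal clique, so the union of all
    # maximal cliques covers V(G) and hence generates all of S+.
    return [({p for p, s in c}, {s for p, s in c})
            for c in nx.find_cliques(G)]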

In order to reduce the computational complexity of the induction, the following randomized procedure could be applied instead of Tomita's algorithm. Consecutive cliques $K_1, K_2, \ldots$ with their catenations are determined until $S_+ \subseteq L(r_1 + r_2 + \cdots)$. The catenations emerge in the following manner. In step $i$, a vertex $v$ for which $p_v s_v \notin L(r_1 + \cdots + r_{i-1})$ is chosen at random. Let $P = \{p_v\}$ and $S = \{s_v\}$. Then the sets $P$ and $S$ are updated by adding the words of a randomly chosen neighbor of $v$, say $u$, and subsequently by adding the words of a randomly chosen common neighbor of $v$ and $u$, and so forth. In the end, a maximal clique $K_i$ is obtained, for which $r_i = (\sum_{p \in P} p)(\sum_{s \in S} s)$. Naturally, the resultant expression $r = r_1 + r_2 + \cdots$ fulfills $S_+ \subseteq L(r)$ and $S_- \cap L(r) = \emptyset$, and the whole procedure runs in polynomial time with respect to the input size.
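
The randomized variant can be sketched as follows; the bookkeeping of uncovered words and the reuse of the graph G from the previous sketch reflect our reading of the description above:

import random

def induce_sfre_randomized(G, positives):
    """Grow random maximal cliques of G until every positive word is
    generated by some clique expression (P, S). Assumes every positive
    word has length >= 2, as in the paper."""
    uncovered = set(positives)
    sfre = []
    while uncovered:
        # Step i: pick at random a vertex whose word is still uncovered.
        v = random.choice([u for u in G if u[0] + u[1] in uncovered])
        clique = [v]
        candidates = set(G[v])      # vertexes adjacent to all chosen so far
        while candidates:
            u = random.choice(sorted(candidates))
            clique.append(u)
            candidates &= set(G[u])
        P = {p for p, s in clique}
        S = {s for p, s in clique}
        sfre.append((P, S))
        # All words generated by this clique expression are now covered.
        uncovered -= {p + s for p in P for s in S}
    return sfre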

Here are some elementary properties of the resultant expression $r$ and the complexity of the induction algorithm. (i) $S_- \cap L(r) = \emptyset$ is implied by (4). (ii) If all strings in a sample have equal length, let us say $d$, then all strings in $L(r)$ are also of the same length $d$. (iii) Let $n = |V(G)|$. The graph $G$, based on (3) and (4), may be constructed in $O(n^2)$ time, assuming constant-time membership queries on $S_-$. Determining a set of cliques and the corresponding regular expressions with the randomized procedure also takes polynomial time, assuming that the graph is represented by adjacency lists; a single maximal clique is grown in no more than $O(n^2)$ time. Thus, the overall computational complexity of the randomized variant is polynomial in the input size.

2.2.3. An Illustrative Run

Suppose $S = (S_+, S_-)$ is a sample over $\{a, b\}$ (one possible explanation for the input is that each $b$ follows at least one $a$). The constructed graph is depicted in Figure 1. It has three maximal cliques, and regardless of which method is selected, Tomita's or the randomized algorithm, all of them would be determined in this case. The final SFRE is the union of the expressions induced by the three cliques. Among all the words of length four over the alphabet, the expression rejects some and accepts others, including words not present in $S_+$, which illustrates how the induced SFRE generalizes the sample.
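
Because the exact strings of this example are tied to Figure 1, the following toy run of the earlier sketches uses a hypothetical sample merely to illustrate the same mechanics:

# Hypothetical sample; not the one from the paper's Figure 1.
positives = ["abab", "aabb", "abaa"]
negatives = ["bbbb", "baba"]

sfre = induce_sfre(positives, negatives)
lang = language(sfre)

# By construction the result is consistent with the sample ...
assert consistent(sfre, positives, negatives)
# ... and typically accepts unseen words of the same length as well.
print(sorted(w for w in lang if w not in positives))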

2.3. Validation with Other Methods

The SFRE classification quality over hexapeptides from the three datasets was compared to that of three state-of-the-art tools for heuristic state merging DFA induction, namely, the Trakhtenbrot-Barzdin state merging algorithm (Traxbar) [22], Rodney Price's Abbadingo winning idea of evidence-driven state merging (Blue-fringe) [23], and the Rlb state merging algorithm (Rlb) [24], as well as to the context-free grammar learning system ADIOS [25]. The set of compared methods was extended by our previous approach based on directed acyclic word graphs (DAWG) [5] and by the support vector machine with a linear kernel function (SVM) [26].

Trakhtenbrot and Barzdin described an algorithm for constructing the smallest DFA consistent with a completely labeled training set [29]. The input to the algorithm is the prefix-tree acceptor, which directly embodies the training set. This tree is collapsed into a smaller graph by merging all pairs of states that represent compatible mappings from string suffixes to labels. This algorithm for completely labeled trees was generalized by Lang [22] to produce a (not necessarily minimum) machine consistent with a sparsely labeled tree (we used the implementations from the archive http://abbadingo.cs.nuim.ie/dfa-algorithms.tar.gz for Traxbar and for the two remaining state merging algorithms).
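
For readers unfamiliar with the data structure, a prefix-tree acceptor can be sketched generically in Python (this is an illustration, not Lang's implementation):

def prefix_tree_acceptor(positives, negatives):
    """Trie over the training set; every node carries a label:
    +1 (accepting), -1 (rejecting), or 0 (unlabeled inner state)."""
    root = {"label": 0, "children": {}}
    labeled = [(w, 1) for w in positives] + [(w, -1) for w in negatives]
    for word, label in labeled:
        node = root
        for symbol in word:
            node = node["children"].setdefault(
                symbol, {"label": 0, "children": {}})
        node["label"] = label
    return root

# State merging then collapses this tree into a small DFA by fusing
# states with compatible suffix-to-label mappings.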

The second algorithm that starts with the prefix-tree acceptor for the training set and folds it up into a compact hypothesis by merging pairs of states is Blue-fringe. This program grows a connected set of red nodes that are known to be unique states, surrounded by a fringe of blue nodes that will either be merged with red nodes or be promoted to red status. Merges only occur between red nodes and blue nodes. Blue nodes are known to be the roots of trees, which greatly simplifies the code for correctly doing a merge. The only drawback of this approach is that the pool of possible merges is small, so occasionally the program has to do a low scoring merge.

The idea that lies behind the third algorithm, Rlb, is as follows. It dispenses with the red-blue restriction and is able to do merges in any order. However, to have a practical run time, only merges between nodes that lie within a distance “window” of the root on a breadth-first traversal of the hypothesis graph are considered. This introduction of a new parameter is a drawback to this program, as is the fact that its run time scales very badly with training string length. However, on suitable problems, it works better than the Blue-fringe algorithm. The detailed description of heuristics for evaluating and performing merges can be found in Lang’s work [24].

ADIOS starts by loading the corpus (examples) onto a directed graph whose vertexes are all lexicon entries, augmented by two special symbols, begin and end. Each corpus sentence defines a separate path over the graph, starting at begin and ending at end, and is indexed by the order of its appearance in the corpus. Loading is followed by an iterative search for significant patterns, which are added to the lexicon as new units. The algorithm generates candidate patterns by traversing in each iteration a different search path, seeking subpaths that are shared by a significant number of partially aligned paths. The significant patterns are selected according to a context-sensitive probabilistic criterion defined in terms of local flow quantities in the graph. At the end of each iteration, the most significant pattern is added to the lexicon as a new unit, the subpaths it subsumes are merged into a new vertex, and the graph is rewired accordingly. The search for patterns and equivalence classes and their incorporation into the graph are repeated until no new significant patterns are found. The Java implementation of ADIOS made available to us by one of the authors was used in our experiments.

DAWG is a two-phase procedure. In the first phase, an initial directed graph is built in a way that resembles the construction of the minimal DFA, but nondeterminism is also allowed. In the second phase, the directed graph is extended in an iterative process by putting some additional labels onto the existing arcs. The order of putting new labels alters the results; hence a greedy heuristic has been proposed in order to obtain the words most consistent with a sample. We used the same implementation of DAWG as in our earlier work on classification of biological sequences [5].

SVM constructs a hyperplane or set of hyperplanes in a high-dimensional space, which can be used for classification, regression, or other tasks. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since, in general, the larger the margin, the lower the generalization error of the classifier. In the experiments, we took advantage of the SVM implementation in scikit-learn, a machine learning Python library, with default parameters [30].
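
A hedged sketch of this baseline follows; the one-hot feature encoding is our assumption, since the paper does not specify the representation of hexapeptides:

import numpy as np
from sklearn.svm import SVC

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(hexapeptide):
    """Encode a 6-residue peptide as a 6 x 20 binary feature vector."""
    vec = np.zeros(len(hexapeptide) * len(AMINO_ACIDS))
    for i, residue in enumerate(hexapeptide):
        vec[i * len(AMINO_ACIDS) + AMINO_ACIDS.index(residue)] = 1.0
    return vec

def train_svm(pos, neg):
    """Fit a linear-kernel SVM on a labeled hexapeptide sample."""
    X = np.array([one_hot(p) for p in pos + neg])
    y = np.array([1] * len(pos) + [0] * len(neg))
    return SVC(kernel="linear").fit(X, y)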

2.4. Experiment Design and Statistical Analysis

To estimate the ability of SFRE and the compared approaches to classify unseen hexapeptides, a repeated stratified $k$-fold cross-validation (cv) strategy was used. Note that the holdout method is the simplest kind of cross-validation, but multiple cv is thought to be more reliable than holdout due to its lower evaluation variance [31]. The simplest form of cross-validation is to split the data randomly into $k$ mutually exclusive folds, to build a model on all but one fold, and to evaluate the model on the skipped fold. The procedure is repeated $k$ times, each time evaluating the model on the next omitted fold. The overall assessment of the model is based on the mean of the $k$ individual evaluations. Since the cv assessment depends on the random assignment of samples, a common practice is to stratify the folds themselves [32]. In the stratified variant of cv, the pseudorandom folds are generated in such a way that each fold contains approximately the same percentage of samples of each class as the whole set. Although cv is considered one of the most utilized validation methods, it is well known that cv-based estimators have high variance and nonzero bias [33–36]. It is therefore recommended to use a repeated cross-validation approach [37].
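
The repeated stratified cv scheme maps directly onto scikit-learn's RepeatedStratifiedKFold; the evaluation wrapper below is our own sketch:

import numpy as np
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import RepeatedStratifiedKFold

def repeated_cv_mcc(clf, X, y, k=10, repeats=10, seed=0):
    """Per-fold MCC values of an r-times repeated, stratified
    k-fold cross-validation (k * repeats values in total)."""
    rskf = RepeatedStratifiedKFold(n_splits=k, n_repeats=repeats,
                                   random_state=seed)
    scores = []
    for train_idx, test_idx in rskf.split(X, y):
        model = clf.fit(X[train_idx], y[train_idx])
        scores.append(matthews_corrcoef(y[test_idx],
                                        model.predict(X[test_idx])))
    return np.array(scores)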

The main problem with (repeated) cv is that the training and test sets are not independent samples. Dietterich [31] found that comparing algorithms on the basis of repeated resampling of the same data can cause very high Type-I errors. This means that a statistical hypothesis test, like the standard paired $t$-test, incorrectly rejects a true null hypothesis (a so-called false positive). Note that cv can be viewed as a kind of random subsampling. To correct the variance estimate for dependent samples, Nadeau and Bengio [38] proposed the following statistic of the corrected resampled $t$-test:
$$t = \frac{\frac{1}{m} \sum_{j=1}^{m} x_j}{\sqrt{\left(\frac{1}{m} + \frac{n_2}{n_1}\right) \hat{\sigma}^2}},$$
where $x_j$ is the difference of the performance quality between the two compared algorithms on run $j$ ($j = 1, 2, \ldots, m$). We assume that in each run $n_1$ samples are used for training and $n_2$ samples for testing. $\hat{\sigma}^2$ stands for the variance of the differences $x_j$. This statistic obeys approximately Student's $t$-distribution with $m - 1$ degrees of freedom. The only difference to the standard $t$-test is that the factor $\frac{1}{m}$ in the denominator is enlarged by the factor $\frac{n_2}{n_1}$. The corrected resampled $t$-test has a Type-I error close to the significance level and, opposite to the McNemar test and the 5 × 2 cv test, a low Type-II error (i.e., the failure to reject a false null hypothesis). If we consider a test based on $r$-times repeated $k$-fold cv, the statistic has $k \cdot r - 1$ degrees of freedom and is called the corrected repeated $k$-fold cv test. To detect performance differentiation of the compared algorithms, we use the 10 × 10 cv scheme with 10 (instead of 99) degrees of freedom. This scheme was shown [39] to have excellent replicability. Note that, to perform multiple comparisons involving a control method (i.e., SFRE), we are supposed to control the family-wise error rate (FWER) [40, 41]. FWER is the probability of making a Type-I error when testing many null hypotheses simultaneously. Several methods for controlling the FWER have been proposed [42]. To keep the probability of rejecting any true null hypothesis small, we applied the Holm correction in our experiments [43].
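
Both the corrected resampled t-test and the Holm step-down correction are short to implement; the sketch below uses SciPy and the 10 degrees of freedom adopted above:

import numpy as np
from scipy import stats

def corrected_resampled_t_test(x, n1, n2, df=10):
    """Two-sided p value of the Nadeau-Bengio corrected t-test on the
    per-run differences x; n1/n2 are the train/test sizes per run."""
    m = len(x)
    t = np.mean(x) / np.sqrt((1.0 / m + n2 / n1) * np.var(x, ddof=1))
    return 2.0 * stats.t.sf(abs(t), df)

def holm(p_values):
    """Holm step-down adjustment of a family of p values."""
    p = np.asarray(p_values, dtype=float)
    order = np.argsort(p)
    adjusted = np.empty_like(p)
    running = 0.0
    for rank, idx in enumerate(order):
        running = max(running, (len(p) - rank) * p[idx])
        adjusted[idx] = min(1.0, running)
    return adjusted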

The predictive performance of the algorithms was evaluated with the confusion matrix and some of the figures of merit associated with it. First, the following four scores were defined as $TP$, $FP$, $FN$, and $TN$, representing the numbers of true positives (correctly recognized amyloids), false positives (nonamyloids recognized as amyloids), false negatives (amyloids recognized as nonamyloids), and true negatives (correctly recognized nonamyloids), respectively. The following three figures of merit were considered here, since they are widely used.

Sensitivity, also known as the true positive rate, represents the percentage of correctly identified positive cases and is defined as
$$\text{Sen} = \frac{TP}{TP + FN}.$$

Specificity, also known as the true negative rate, represents the percentage of correctly identified negative cases and is calculated as
$$\text{Spe} = \frac{TN}{TN + FP}.$$

The Matthews Correlation Coefficient is defined as
$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$

Note that several other scores derived from the confusion matrix can be used for estimating the prediction reliability. These three figures of merit, that is, Sensitivity, Specificity, and the Matthews Correlation Coefficient, seem indispensable for the following reasons. Sensitivity and Specificity tend to be anticorrelated and monitor different aspects of the prediction process. Both of them may range from 0 to 1, where 1 means perfect prediction. Second, the Matthews Correlation Coefficient [44] considers both the true positives and the true negatives as successful predictions. MCC always lies between −1 and +1. A value of −1 indicates total disagreement, 0 random prediction, and +1 perfect prediction. What is important in our case is that MCC is robust to imbalanced datasets.
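
For completeness, the three figures of merit computed directly from the four confusion matrix entries:

import math

def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient; 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0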

3. Results and Discussion

Figure 2 and Table 1, Figure 3 and Table 2, and Figure 4 and Table 3 summarize the performances of the SFRE algorithm and the compared methods on the Waltz, WALTZ-DB, and exPafig databases, respectively. The figures present boxplots of the MCC values obtained from the 10 × 10 cross-validation, whereas the tables give unadjusted and Holm-adjusted $p$ values for the comparison of the SFRE algorithm (the control method) with the remaining algorithms. Note that the adjusted $p$ value for each method and each database is lower than the desired significance level $\alpha = 0.05$ used in our experiments. These values indicate that there are significant performance differences between the SFRE algorithm and the compared methods.

The SFRE algorithm outperforms all the other compared methods in terms of MCC over both experimentally asserted datasets, Waltz and WALTZ-DB, as well as the computationally generated exPafig. It is worth noting that all $p$ values, except those for the comparison with the SVM algorithm, are lower than not only 0.05 but also the often used 0.01, hence confirming the superiority of SFRE.

A comparative analysis of the three figures of merit (Sensitivity, Specificity, and Matthews Correlation Coefficient) is summarized in Table 4. These quantities are reported for the seven compared predictors and the three databases (Waltz, WALTZ-DB, and exPafig). The numerical results reported in Table 4 show that SFRE has the highest average MCC (0.40), followed by SVM (0.31), ADIOS and Traxbar (0.25), Blue-fringe (0.22), and DAWG and Rlb (0.19). Furthermore, SFRE has the highest MCC score compared to the other predictors on each dataset (0.37, 0.38, and 0.44, resp.). Although the MCC scores may not seem high (at the level of 0.40), it should be noted that many amyloid predictors are reported to have similar or lower values [45]. It is also worth mentioning that all methods gained their highest MCC values on the computationally generated exPafig dataset.

SFRE has a higher Specificity score than the other methods, except for SVM in the case of the WALTZ-DB (0.95 versus 0.98, resp.) and exPafig databases (both with Spe of 1.00). These two predictors have a very good capacity for predicting nonamyloid hexapeptides, with Spe higher than 0.90 for each database. The counterpart is their poor Sensitivity. Concerning the Sen score, DAWG, our earlier proposal, has the highest value on each database (0.90, 0.81, and 0.73, resp.). The SFRE algorithm showed a low Sensitivity for each tested dataset (0.30, 0.33, and 0.25, resp.).

The evaluation of SFRE on three amyloidogenic hexapeptide datasets revealed its accuracy in predicting nonamyloid segments. We showed that the new grammatical inference algorithm gives the best Matthews Correlation Coefficient in comparison to six other methods, including the support vector machine.

4. Conclusions

In the present paper, the way in which regex induction may support the prediction of new hexapeptides has been revealed. We, therefore, studied the following problem: given a sample $S = (S_+, S_-)$, find a "general" star-free regular expression $r$ such that $S_+ \subseteq L(r)$, $S_- \cap L(r) = \emptyset$, and $L(r) \setminus S_+$ contains only strings of "similar characteristics" to those of $S_+$. To this end, a new GI method has been proposed which is especially suited to fixed-length datasets. The conducted experiments showed that our algorithm outperforms the compared methods in terms of the correlation between the observed and predicted binary classifications (MCC) on real datasets taken from a biomedical domain.

The proposed idea is not free from objections. Among the most serious complications is the exponential computational complexity of generating all maximal cliques, which constitutes the second phase of the algorithm. However, this can be overcome by using the proposed randomized procedure instead. Our first experiments on larger datasets indicated that this is a good direction for future research.

The high Sensitivity of the DAWG approach and the high Specificity of the SFRE method over the tested databases suggest a second direction for future research. These two classifiers could be combined into a metapredictor having, hopefully, both good Sensitivity and Specificity. Such meta-approaches are often reported to gain better results in terms of aggregate indicators (such as MCC) than individual predictors [45].

Competing Interests

The authors declare that they have no competing interests.

Authors’ Contributions

Wojciech Wieczorek proposed and implemented SFRE algorithm; Olgierd Unold designed the methodology and experiments. Wojciech Wieczorek conceived and performed the experiments; Olgierd Unold designed and performed the statistical data analysis. Both authors wrote and approved the final paper.

Acknowledgments

This research was supported by the National Science Center (Grant DEC-2011/03/B/ST6/01588) and by a statutory grant of the Wroclaw University of Technology.