Review Article

Navigating the Human Metabolome for Biomarker Identification and Design of Pharmaceutical Molecules

Table 1

Machine-learning algorithms often used in metabolomics.


PCA: Principal Component Analysis (PCA) is a frequently used method for extracting the systematic variance in a data matrix. It provides an overview of the dominant patterns and major trends in the data. The aim of PCA is to create a set of latent variables that is smaller than the set of original variables but still explains most of the variance of the original variables. In mathematical terms, PCA transforms a number of correlated variables into a smaller number of uncorrelated variables, the so-called principal components.
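
To make this concrete, here is a minimal sketch of PCA with scikit-learn; the 50 x 200 matrix below is a random stand-in for a sample-by-metabolite intensity table, not real metabolomics data, and the choice of two components is purely illustrative.

# A minimal PCA sketch on simulated data (scikit-learn).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))               # 50 samples x 200 metabolite features

X_scaled = StandardScaler().fit_transform(X)  # autoscale each feature
pca = PCA(n_components=2)                     # keep two principal components
scores = pca.fit_transform(X_scaled)          # latent-variable scores per sample

print(scores.shape)                           # (50, 2)
print(pca.explained_variance_ratio_)          # variance captured by PC1 and PC2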

PLS: Partial Least Squares (PLS), also called Projection to Latent Structures, is a linear regression method that can be used to build a predictive model even when the predictor variables are highly correlated. The X variables (the predictors) are reduced to principal components, as are the Y variables (the dependents). The components of X are used to predict the scores on the Y components, and the predicted Y component scores are used to predict the actual values of the Y variables. In constructing the principal components of X, the PLS algorithm iteratively maximizes the strength of the relation of successive pairs of X and Y component scores by maximizing the covariance of each X-score with the Y variables. This strategy means that while the original X variables may be multicollinear, the X components used to predict Y will be orthogonal. In addition, the X variables may have missing values, but there will be a computed score for every case on every X component. Finally, since only a few components (often two or three) are used in prediction, PLS coefficients can be computed even when there are more original X variables than observations.
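
A minimal sketch with scikit-learn's PLSRegression follows; the data are simulated, with more predictors (300) than observations (40) to illustrate the point above, and the response is constructed to depend on a few columns of X.

# A minimal PLS regression sketch on simulated data (scikit-learn).
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 300))                # 40 samples, 300 correlated predictors
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=40)  # y depends on a few columns

pls = PLSRegression(n_components=3)           # only a few latent components
pls.fit(X, y)
y_pred = pls.predict(X)                       # predictions from the component scores
print(pls.score(X, y))                        # R^2 of the fitted model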

O-PLS: Orthogonal Projections to Latent Structures (O-PLS) is a linear regression method similar to PLS. However, interpretation of the models is improved because the structured noise is modeled separately from the variation common to X and Y. The O-PLS loadings and regression coefficients therefore allow a more realistic interpretation than PLS, which models the structured noise together with the correlated variation between X and Y. Furthermore, the orthogonal loading matrices provide the opportunity to interpret the structured noise.
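
scikit-learn offers no O-PLS estimator, so the sketch below hand-codes, in NumPy, the removal of a single Y-orthogonal (structured-noise) component from a centered X, in the spirit of the single-y O-PLS algorithm; the function name remove_orthogonal_component is ours, the data are simulated, and an ordinary PLS model would then be fitted on the filtered matrix.

# A minimal single-component O-PLS filtering sketch (NumPy, simulated data).
import numpy as np

def remove_orthogonal_component(X, y):
    """Return X with one structured-noise component (orthogonal to y) removed."""
    w = X.T @ y / (y @ y)                   # weights correlated with y
    w /= np.linalg.norm(w)
    t = X @ w                               # predictive scores
    p = X.T @ t / (t @ t)                   # loadings
    w_orth = p - (w @ p) * w                # part of p orthogonal to w
    w_orth /= np.linalg.norm(w_orth)
    t_orth = X @ w_orth                     # orthogonal (structured-noise) scores
    p_orth = X.T @ t_orth / (t_orth @ t_orth)
    return X - np.outer(t_orth, p_orth)     # filtered X

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 100))
y = X[:, 0] + rng.normal(scale=0.1, size=30)
X_centered = X - X.mean(axis=0)
X_filtered = remove_orthogonal_component(X_centered, y - y.mean())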

PLS-DA: PLS Discriminant Analysis (PLS-DA) is a frequently used classification method based on the PLS approach, in which the dependent variable is chosen to represent class membership. PLS-DA makes it possible to rotate the projection so that the latent variables focus on class separation. The objective of PLS-DA is to find a model that separates classes of objects on the basis of their X-variables. This model is developed from a training set of objects of known class membership. The X-matrix consists of the multivariate characterization data of the objects. To encode class identity, one uses as Y-data a matrix of dummy variables, which describe the class membership. A dummy variable is an artificial variable that assumes a discrete numerical value in the class description. The dummy matrix Y has G columns (for G classes) filled with ones and zeros, such that for observations of class g the entry in the gth column is one and the entries in the other columns are zero.
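
scikit-learn has no dedicated PLS-DA class; a common workaround, sketched below on simulated two-class data, is to fit PLSRegression against exactly the dummy matrix Y described above and assign each observation to the class with the largest predicted score.

# A minimal PLS-DA sketch: PLS regression on a dummy class matrix (scikit-learn).
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=0.0, size=(20, 50)),
               rng.normal(loc=1.0, size=(20, 50))])  # two simulated classes
labels = np.array([0] * 20 + [1] * 20)

Y = np.eye(2)[labels]                     # dummy matrix: one column per class
plsda = PLSRegression(n_components=2).fit(X, Y)

predicted = plsda.predict(X).argmax(axis=1)  # class with the highest predicted score
print((predicted == labels).mean())          # training accuracy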

ANN: Artificial Neural Networks (ANN) are a method, or more precisely a set of methods, based on a system of simple, identical mathematical functions that, working in parallel, yield a single- or multi-response answer for each multivariate input X. ANN methods can only be used if a comparatively large set of multivariate data is available, which enables training by example, and they work best when dealing with nonlinear relationships between complex inputs and outputs. The main component of a neural network is the neuron. Each neuron has an activation threshold and a series of weighted connections to other neurons. If the aggregate activation a neuron receives from the neurons connected to it exceeds its activation threshold, the neuron fires and relays its activation to the neurons to which it is connected. The weights associated with these connections can be modified by training the network to perform a certain task; this modification accounts for learning. ANN are often organized into layers, with each layer receiving input from one adjacent layer and sending output to another. Layers are categorized as input layers, output layers, and hidden layers. The input layer is initialized to a certain set of values, and the computations performed by the hidden layers update the values of the output layer, which comprises the output of the whole network. Learning is accomplished by updating the weights between connected neurons. The most common method for training neural networks is backpropagation, a statistical method for updating the weights based on how far the network's output is from the desired output. To search for the optimal set of weights, various algorithms can be used. The most common is gradient descent, an optimization method that, at each step, searches in the direction that appears to come nearest to the goal.
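
As a minimal sketch, scikit-learn's MLPClassifier below trains such a layered network by backpropagation with stochastic gradient descent; the data are simulated with a deliberately nonlinear class boundary, and the layer size and learning rate are illustrative choices, not recommendations.

# A minimal feed-forward network sketch on simulated data (scikit-learn).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 30))                # 200 samples, 30 features
y = (X[:, 0] * X[:, 1] > 0).astype(int)       # a nonlinear class boundary

net = MLPClassifier(hidden_layer_sizes=(16,), # one hidden layer of 16 neurons
                    activation="relu",
                    solver="sgd",             # gradient descent on the weights
                    learning_rate_init=0.05,
                    max_iter=2000,
                    random_state=0)
net.fit(X, y)                                 # weights updated by backpropagation
print(net.score(X, y))                        # training accuracy
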
SOM: Self-Organizing Maps (SOM), also known as Kohonen networks, are an unsupervised neural network method with both clustering and visualization properties. They can be used to classify a set of input vectors according to their similarity. The result of such a network is usually a two-dimensional map; thus, SOM is a method for projecting objects from a high-dimensional data space onto a two-dimensional space. This projection enables the input data to be partitioned into "similar" clusters while preserving their topology, that is, points that are close to one another in the multidimensional space are neighbors in the two-dimensional space as well.
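
scikit-learn ships no SOM, so the following is a hand-coded NumPy sketch of a small Kohonen map on simulated data: each input is matched to its best-matching unit (BMU) on a 5 x 5 grid, and that unit and its neighbours are nudged toward the input; grid size, learning rate, and neighbourhood schedule are all illustrative.

# A minimal Kohonen self-organizing map sketch (NumPy, simulated data).
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(size=(100, 10))             # 100 objects, 10 dimensions

rows, cols = 5, 5
weights = rng.normal(size=(rows, cols, 10))   # one codebook vector per grid node
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

for step in range(1000):
    lr = 0.5 * (1 - step / 1000)              # decaying learning rate
    sigma = 2.0 * (1 - step / 1000) + 0.5     # decaying neighbourhood radius
    x = data[rng.integers(len(data))]
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(dists.argmin(), dists.shape)  # best-matching unit
    # Gaussian neighbourhood around the BMU on the 2-D grid
    grid_dist = np.linalg.norm(grid - np.array(bmu), axis=-1)
    h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
    weights += lr * h[..., None] * (x - weights)

# Project each object to its winning node, i.e., its 2-D map coordinate
coords = [np.unravel_index(np.linalg.norm(weights - x, axis=-1).argmin(),
                           (rows, cols)) for x in data]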

SVM: Support Vector Machines (SVM) perform classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories. An SVM model using a sigmoid kernel function is equivalent to a two-layer perceptron neural network. The task of choosing the most suitable representation of the objects is known as feature selection, and a set of features that describes one object is called a vector. The goal of SVM modeling is to find the optimal hyperplane that separates clusters of vectors in such a way that objects with one category of the target variable are on one side of the plane and objects with the other category are on the other side of the plane. The vectors near the hyperplane are the support vectors.
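
A minimal sketch with scikit-learn's SVC on simulated two-class data follows; after fitting, the model exposes exactly the training vectors nearest the separating hyperplane as its support vectors.

# A minimal linear SVM sketch on simulated data (scikit-learn).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=-1.0, size=(25, 8)),
               rng.normal(loc=1.0, size=(25, 8))])  # two simulated categories
y = np.array([0] * 25 + [1] * 25)

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # hyperplane separating the classes
print(clf.support_vectors_.shape)             # the vectors near the hyperplane
print(clf.predict(X[:5]))                     # which side of the plane each falls on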

k-means: k-means is a classic clustering technique that aims to partition objects into k clusters. First, one specifies k, that is, how many clusters are being sought. Then, k points are chosen at random as cluster centers, and all objects are assigned to their closest cluster center according to the ordinary Euclidean distance metric. Next, the centroid, or mean, of the objects in each cluster is calculated, and these centroids are taken to be the new center values for their respective clusters. Finally, the whole process is repeated with the new cluster centers. Iteration continues until the same points are assigned to each cluster in consecutive rounds, at which stage the cluster centers have stabilized.
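
The NumPy sketch below mirrors these steps directly on simulated data: random initial centers, assignment by Euclidean distance, centroid update, and repetition until the centers stop moving (scikit-learn's KMeans would serve the same purpose in practice).

# A minimal k-means sketch mirroring the steps above (NumPy, simulated data).
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 4))                        # 60 objects, 4 variables
k = 3
centers = X[rng.choice(len(X), size=k, replace=False)]  # k random cluster centers

while True:
    # assign every object to its closest center (Euclidean distance)
    labels = np.linalg.norm(X[:, None] - centers[None], axis=-1).argmin(axis=1)
    # recompute each centroid as the mean of its assigned objects
    # (keep the old center if a cluster happens to empty out)
    new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    if np.allclose(new_centers, centers):           # assignments have stabilized
        break
    centers = new_centers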

Genetic Algorithms: Genetic algorithms are nondeterministic stochastic search/optimization methods that bring the evolutionary concepts of selection, recombination (crossover), and mutation into data processing to solve a complex problem dynamically. Possible solutions to the problem are encoded as so-called artificial chromosomes, which are changed and adapted throughout the optimization process until an optimal solution is obtained. A set of chromosomes is called a population, and the creation of a population from a parent population is called a generation. In a first step, the original population is created. For each chromosome the fitness is determined, and a selection algorithm is applied to choose chromosomes for mating. These chromosomes are then subjected to the crossover and mutation operators, which finally yield a new generation of chromosomes.
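
The NumPy sketch below walks through one such loop on binary chromosomes, with fitness-proportional selection, one-point crossover, and bit-flip mutation; the toy fitness (count of ones) is only a placeholder for a real objective, and all parameter values are illustrative.

# A minimal genetic-algorithm sketch (NumPy, toy fitness function).
import numpy as np

rng = np.random.default_rng(8)
n_genes, pop_size, n_generations = 20, 30, 50

population = rng.integers(0, 2, size=(pop_size, n_genes))  # original population

def fitness(chrom):
    return chrom.sum()                        # toy objective: maximize the ones

for _ in range(n_generations):
    scores = np.array([fitness(c) for c in population])
    probs = scores / scores.sum()             # selection probability ~ fitness
    children = []
    for _ in range(pop_size // 2):
        i, j = rng.choice(pop_size, size=2, p=probs)  # choose parents for mating
        cut = rng.integers(1, n_genes)                # one-point crossover
        a = np.concatenate([population[i][:cut], population[j][cut:]])
        b = np.concatenate([population[j][:cut], population[i][cut:]])
        children.extend([a, b])
    population = np.array(children)                   # the new generation
    flips = rng.random(population.shape) < 0.01       # bit-flip mutation
    population = np.where(flips, 1 - population, population)

print(fitness(max(population, key=fitness)))          # best chromosome found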