Research Article  Open Access
A Multikernel-Like Learning Algorithm Based on Data Probability Distribution
Abstract
In kernel-trick-based machine learning, one variable of a kernel function is often fixed at the given samples to produce the basis functions of the solution space of a learning problem. If the collection of given samples deviates from the data distribution, the solution space spanned by these basis functions will also deviate from the real solution space of the learning problem. In this paper, a multikernel-like learning algorithm based on data probability distribution (MKDPD) is proposed, in which the parameters of a kernel function are locally adjusted according to the data probability distribution, thus producing different kernel functions. These different kernel functions generate different Reproducing Kernel Hilbert Spaces (RKHS). The direct sum of subspaces of these RKHS constitutes the solution space of the learning problem. Furthermore, based on the proposed MKDPD algorithm, a new algorithm for labeling new coming data is proposed, in which the basis functions are retrained according to the new coming data, while the coefficients of the retrained basis functions remain unchanged for labeling the new coming data. The experimental results presented in this paper show the effectiveness of the proposed algorithms.
1. Introduction
(a) Data Spaces and Data Distributions. Let $x$ represent the data and $\mathcal{X}$ the data space. In mathematics, the data can be regarded as a random variable/vector/matrix defined on the data space $\mathcal{X}$. There can be different kinds of data on the same data space. For example, the data space $\mathcal{X}$ can be the one consisting of all images of a given number of pixels, while one kind of data represents all face images of that size and another represents all landscape images of that size; both are defined on the same data space $\mathcal{X}$ but are subject to different probability distributions.
If the data is regarded as a random variable, the samples of the data can then be regarded as concrete realizations of that random variable. In machine learning, the samples can be exploited to estimate the probability distribution of the data (the probabilistic modeling of data). There is a large body of research on the probabilistic modeling of data [1–3].
(b) Classification/Labels and Classifiers/Label Functions. The classification of the data may be different when the data are used in different applications. For example, let the data be the face images. In the application of identification recognition, the face images of the same person are all grouped into the same class, even though the expressions and postures of these face images may be different. However, in the application of expression recognition, the face images of the same expression are all grouped into the same class, even though these face images belong to different persons.
The classifier of data is a machine indicating the class to which a data point belongs. The classifiers of data are trained with the data samples [4, 5]. The classes of data are also called the labels of data and the classifiers of data are also called the label functions of data. In this paper, we adopt the terminology of labels and label functions of data.
(c) Kernel Tricks in Machine Learning. The applications of kernel tricks in machine learning can be roughly divided into two categories: the transformation of data spaces and the construction of label functions. In the transformation of data spaces, kernel functions are used to transform data spaces into other spaces where the data can be linearly separated. The famous Kernel PCA [6] and kernel Fisher discriminant (KFD) [7] belong to this category. In the construction of label functions, kernel functions serve as the basis functions of the label functions. The famous manifold regularization learning [8, 9] belongs to this category. In this paper we address the problems involved in the construction of label functions.
In the construction of label functions, the label function is expressed as $f(x) = \sum_{i=1}^{l+u} \alpha_i k(x, x_i)$, where $k$ represents a kernel function, $x_1, \dots, x_l$ the labeled samples, and $x_{l+1}, \dots, x_{l+u}$ the unlabeled samples. The coefficients $\alpha_i$ can be derived by solving the following learning problem:

$$f^{*} = \arg\min_{f \in \mathcal{H}} C\bigl(f(x_1), \dots, f(x_l); y_1, \dots, y_l\bigr). \quad (1)$$

In the above equation, $\mathcal{H}$ represents the solution space of the learning problem; $y_1, \dots, y_l$ represent the labels of the labeled samples; $K = [k(x_i, x_j)]$ represents the kernel matrix (so that $(f(x_1), \dots, f(x_{l+u}))^{T} = K\boldsymbol{\alpha}$); and $C$ represents the cost function. We want to find a label function that makes the cost as small as possible.
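The kernel-expansion form of the label function can be sketched as follows; this is a minimal illustration in which the Gaussian kernel and all variable names are our assumptions, not the paper's notation:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return float(np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2)))

def label_function(x, samples, alpha, sigma=1.0):
    """f(x) = sum_i alpha_i * k(x, x_i): a kernel expansion over the given samples."""
    return sum(a * gaussian_kernel(x, xi, sigma) for a, xi in zip(alpha, samples))

# Toy usage: three 2-D samples with arbitrary coefficients.
samples = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
alpha = [1.0, -0.5, 0.25]
value = label_function(np.array([0.5, 0.5]), samples, alpha)
```

The sign of $f(x)$ (or its largest component, in the multiclass case) then serves as the predicted label.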
There are two kinds of properties of data. The first kind is the natural properties; the probability distribution of the data is an example. The second kind is the semantic properties; the data labels are an example. Clearly, in the framework of the learning problem shown in (1), the semantic properties hidden in the data labels are fully utilized, while the natural properties hidden in the data probability distribution are not deeply exploited. At present, the usual way to learn information beyond the data labels is to add various regularization terms to the cost function. For example, in manifold regularization learning [8–11], a manifold regularization term is added to the cost function:

$$f^{*} = \arg\min_{f \in \mathcal{H}} C\bigl(f(x_1), \dots, f(x_l); y_1, \dots, y_l\bigr) + \gamma_I \mathbf{f}^{T} L \mathbf{f}, \quad (2)$$

where $\gamma_I \mathbf{f}^{T} L \mathbf{f}$ is the so-called manifold regularization term, in which $\mathbf{f} = (f(x_1), \dots, f(x_{l+u}))^{T}$ and $L$ is the Laplacian matrix reflecting the adjacency relations of the data samples [12].
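The manifold regularization term can be illustrated by building a graph Laplacian from a nearest-neighbor graph and evaluating the penalty $\mathbf{f}^{T} L \mathbf{f}$; the kNN construction below is a hedged sketch, not the paper's specific graph:

```python
import numpy as np

def knn_laplacian(X, k=1):
    """Unnormalized graph Laplacian L = D - W from a symmetrized kNN graph."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(d2[i])[1:k + 1]      # k nearest neighbors, excluding self
        W[i, idx] = 1.0
    W = np.maximum(W, W.T)                    # symmetrize the adjacency
    return np.diag(W.sum(1)) - W

def manifold_penalty(f_vals, L):
    """f^T L f = 0.5 * sum_ij W_ij (f_i - f_j)^2: small for graph-smooth labelings."""
    return float(f_vals @ L @ f_vals)

# Two well-separated clusters; a labeling constant on each cluster has zero penalty.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 5.0]])
L = knn_laplacian(X, k=1)
smooth = manifold_penalty(np.array([1.0, 1.0, 1.0, -1.0, -1.0]), L)
```

A labeling that oscillates within a cluster yields a strictly positive penalty, which is what the regularizer discourages.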
However, the addition of too many regularization terms to the cost function makes the learning problem complicated and difficult to solve. In this paper, rather than adding regularization terms, an alternative way to exploit the data probability distribution in the learning problem is proposed. For the convenience of description, let us denote the kernel function as $k_\sigma(x, y)$, where $\sigma$ represents the parameter of the kernel function. In the proposed algorithm, the parameters of the basis functions are adjusted according to the data distribution sample by sample; that is, $\sigma_i = \sigma(p(x_i))$, where $p(x)$ is the probability distribution of the data, $i = 1, \dots, l+u$. These basis functions $k_{\sigma_i}(x, x_i)$ are then used to span the solution space of the learning problem: $\mathcal{H}_p = \mathrm{span}\{k_{\sigma_i}(\cdot, x_i)\}_{i=1}^{l+u}$. It is clear that the probability distribution of the data is integrated into the basis functions.
According to the theory of kernel functions, if $\sigma_i \neq \sigma_j$, then $k_{\sigma_i}$ and $k_{\sigma_j}$ are two different kernel functions and will generate two different RKHS. Now let $\mathcal{H}^{(i)}$ denote the RKHS generated by the kernel function $k_{\sigma_i}$, $i = 1, \dots, l+u$; then, theoretically speaking, the solution space $\mathcal{H}_p$ can be regarded as a subspace of the direct sum space $\mathcal{H}^{(1)} \oplus \cdots \oplus \mathcal{H}^{(l+u)}$. Therefore, the proposed algorithm can be regarded as a kind of multikernel learning algorithm, but it is quite different from the commonly used multikernel learning algorithms [13–16]. We therefore call the proposed algorithm the multikernel-like learning algorithm based on data probability distribution, referred to as the MKDPD algorithm.
How to label the new coming data (the out-of-sample data) is a key topic in machine learning [12, 17, 18]. There are two extreme algorithms. One algorithm uses the original label function to label the new coming data directly; this algorithm is best in efficiency, but worst in accuracy. The other algorithm regards the new coming data as unlabeled samples and mixes them with the original samples to retrain a new label function, which then labels the new coming data; this algorithm is best in accuracy, but worst in efficiency. Various algorithms for labeling new coming data are tradeoffs between these two extremes. In the proposed MKDPD algorithm, there are two learning stages. In the first stage, the basis functions of the label function are trained. In the second stage, the weights of the basis functions in the label function are solved for. Accordingly, a new algorithm for labeling new coming data is proposed: the new coming data are exploited in the first stage to retrain the basis functions, while the weights of the basis functions remain unchanged and are combined with the retrained basis functions to label the new coming data. The proposed labeling algorithm achieves a better tradeoff between computational efficiency and accuracy.
The rest of the paper is arranged as follows. In Section 2, the literature related to our work is reviewed briefly. In Section 3, the main theories of kernel functions and RKHS are introduced. In Section 4, the MKDPD algorithm is proposed. In Section 5, an MKDPD-based algorithm for labeling new coming data is proposed. In Section 6, experimental results on toy and real-world data are presented to show the performance of the MKDPD algorithms. In Section 7, conclusions are drawn.
2. Related Works
Learning from the given data is the core process of machine learning, so how to make full use of the given samples is the key to successful learning. In general, supervised learning assumes sufficient labeled samples, and such algorithms are suitable for classification problems, such as the representative linear discriminant analysis (LDA) [19] and KFD [7]. In practice, however, a large number of samples are unlabeled and only a small number are labeled. In this case, supervised learning algorithms cannot effectively make use of the information hidden in the unlabeled samples. To tackle this issue, semi-supervised learning [18] was introduced, and a wide range of semi-supervised learning algorithms have been proposed and widely applied in many areas of machine learning.
In recent years, the study of semi-supervised learning has not been limited to the simple introduction of unlabeled data. Many researchers have paid attention to exploiting the intrinsic geometry of the data and have introduced kernel learning into semi-supervised methods. For example, manifold regularization learning proposed by Belkin et al. [8] exploits the underlying data structure by adding a manifold regularization term to a general-purpose learner. Following it, a series of algorithms have been proposed. Sindhwani et al. [20] proposed a linear MR (LMR) algorithm, in which a global linear mapping between the samples and their labels is constructed for labeling novel samples. Inspired by Gaussian fields and harmonic functions (GFHF) [21], local and global consistency (LGC) [22], and LMR [20], Nie et al. [23] extended the LMR algorithm to the flexible manifold embedding algorithm (FMA). FMA relaxes the hard linear constraint in LMR by adding a flexible regression residue. Geng et al. [10] proposed ensemble manifold regularization (EMR) to deal with the aforementioned problems by learning an optimal graph Laplacian from a set of candidate graph Laplacians. Under the sparsity assumption, Fan et al. [24] replaced the manifold regularizer with a sparse regularizer under the MR framework. Luo et al. [11] applied the MR framework to the problem of multi-label image classification by learning a discriminative subspace.
Introducing the kernel trick into semi-supervised learning methods is an important advance in machine learning. The kernel function [17] is either used to map input samples into a high-dimensional kernel space to nonlinearize the learning problem, or used to span a RKHS for learning the label function. Taking MR learning as an example, the label function (classifier function) in MR is a linear combination of a single kernel function placed on the labeled and unlabeled samples, and the performance of MR algorithms strongly depends on this label function.
The theory of RKHS plays an important role in kernel methods, and RKHS have found a wide range of applications, such as minimum-variance unbiased estimation of regression coefficients, least-squares estimation of random variables, detection of signals in Gaussian noise, and problems in optimal approximation [25]. Some recent RKHS-based learning algorithms find applications in online classification or regression [26–28], while others find applications in the classification of hyperspectral images [29, 30]. Gurram and Kwon [29] obtained the weights of SVM separating hyperplanes by combining both local spectral and spatial information. Gu et al. [30] introduced the concept of Multiple-Kernel Hilbert Space (MKHS) to analyze spectral unmixing problems, and the resulting algorithm performs well on nonlinear problems.
In theory, a RKHS can be generated from specific functions called kernel functions, such as the Gaussian, Laplacian, and polynomial kernels [31]. Modifying kernel functions is one way to improve the performance of kernel methods; for example, Wu and Amari [32] extended the conformal transformation of kernel functions to improve the performance of Support Vector Machine classifiers, Gurram and Kwon [33] defined a new inner product to warp the RKHS structure to reflect the intrinsic geometry of the given samples, and the works [33, 34] obtained the best kernel parameters by calculating the derivatives of objective functions. The application of multiple kernels is a hot topic in kernel methods, and multikernel learning (MKL) [35] is a successful method that enhances the interpretability of the classifier through a combination of base kernels and improves the performance of kernel methods. MKL algorithms have been widely investigated [13–16, 35–37], and reviews of MKL algorithms can be found in [13, 14]. MKL offers a feasible scheme for ensembling multiple kernels, but the high computational cost of its optimization procedure is a serious limitation when processing large-scale data or large numbers of kernels. Therefore, one main research direction in MKL is how to solve the MKL problem effectively. Many MKL algorithms have been proposed; for example, SimpleMKL [36], proposed by Rakotomamonjy et al., is one of the state-of-the-art algorithms for the MKL problem and is based on a simple subgradient descent method. However, the MKL task remains challenging because, on the one hand, it must learn an optimal combination of multiple kernels and determine the optimal classifier in each iteration and, on the other hand, it must ensure that the two optimization procedures are feasible.
3. RKHS and Its Application to Machine Learning
3.1. RKHS
In machine learning RKHS are often used as the solution spaces of learning problems. The definition of RKHS is as follows.
Definition 1. Let $\mathcal{H}$ be a Hilbert space of functions defined on a data space $\mathcal{X}$; if there is a function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that
(a) $k(\cdot, x) \in \mathcal{H}$, $\forall x \in \mathcal{X}$;
(b) $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$, $\forall f \in \mathcal{H}$, $\forall x \in \mathcal{X}$;
then $\mathcal{H}$ is said to be a Reproducing Kernel Hilbert Space (RKHS) and the function $k$ is called the reproducing kernel of $\mathcal{H}$.
Note that in Definition 1, $\mathcal{X}$ is a data space, $\mathcal{H}$ is a linear space of functions defined on the data space $\mathcal{X}$, and $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is an inner product defined on $\mathcal{H}$. According to the theory of RKHS, a RKHS can be generated from a kernel function. The kernel function is defined as follows.
Definition 2. Let $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a function such that
(a) symmetric: $k(x, y) = k(y, x)$, $\forall x, y \in \mathcal{X}$;
(b) positive definite: for every positive integer $n$ and all data $x_1, \dots, x_n \in \mathcal{X}$, the matrix $K = [k(x_i, x_j)]_{n \times n}$ is positive definite;
then the function $k$ is called a kernel function.
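The two conditions of Definition 2 can be checked numerically for a concrete kernel; the sketch below verifies symmetry and positive definiteness of a Gaussian Gram matrix on random points (the bandwidth and the point set are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))              # 6 random points in R^3
sigma = 1.0
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * sigma ** 2))       # Gram matrix K_ij = k(x_i, x_j)

symmetric = bool(np.allclose(K, K.T))
min_eig = float(np.linalg.eigvalsh(K).min())   # strictly > 0 for distinct points
```

For the Gaussian kernel the Gram matrix is strictly positive definite whenever the points are distinct, so the smallest eigenvalue is positive.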
A kernel function can be used to generate a RKHS such that the kernel function is the reproducing kernel of RKHS. The generating procedure is as follows.
First, a linear space can be generated from the kernel function $k$: $\mathcal{H}_0 = \left\{ \sum_{i=1}^{n} a_i k(\cdot, x_i) : n \in \mathbb{N},\ a_i \in \mathbb{R},\ x_i \in \mathcal{X} \right\}$, where $\mathbb{N}$ is the set of all positive integers.
Second, an inner product is defined on $\mathcal{H}_0$; that is, for all $f, g \in \mathcal{H}_0$, since $f = \sum_{i} a_i k(\cdot, x_i)$ and $g = \sum_{j} b_j k(\cdot, y_j)$, the inner product of $f$ and $g$ is defined as $\langle f, g \rangle = \sum_{i} \sum_{j} a_i b_j k(x_i, y_j)$. It is easy to prove that this definition meets the requirements of an inner product and, therefore, $\mathcal{H}_0$ is an inner product space.
It is worth noting that for all $f = \sum_{i} a_i k(\cdot, x_i) \in \mathcal{H}_0$, $\langle f, k(\cdot, x) \rangle = \sum_{i} a_i k(x_i, x) = f(x)$. That is to say, the functions in the inner product space $\mathcal{H}_0$ can be reproduced with the kernel function $k$.
Third, the inner product space $\mathcal{H}_0$ can be completed if it is not complete. The completion of $\mathcal{H}_0$, denoted by $\mathcal{H}$, is then a RKHS, and the kernel function $k$ is the reproducing kernel of $\mathcal{H}$. By the way, it can be seen from the completion that the inner product space $\mathcal{H}_0$ is dense in the RKHS $\mathcal{H}$.
3.2. Solution Spaces of Machine Learning Problems
In practice, it is impossible to take the space $\mathcal{H}$ as the solution space of a learning problem because it is infinite-dimensional. It is reasonable to require that the solution space be both finite-dimensional and sample-dependent. Thus, for the given samples $x_1, \dots, x_n$, a linear space can be generated as follows: $\mathcal{H}_s = \left\{ \sum_{i=1}^{n} a_i k(\cdot, x_i) : a_i \in \mathbb{R} \right\}$ (5). It is clear that $\mathcal{H}_s$ is both finite-dimensional and sample-dependent. Furthermore, $\mathcal{H}_s$ is exactly a subspace of $\mathcal{H}_0$ and therefore a subspace of $\mathcal{H}$. Since $\mathcal{H}_s$ is finite-dimensional, it is complete; that is, $\mathcal{H}_s$ is a Hilbert space. However, $\mathcal{H}_s$ is no longer a RKHS.
For all functions $f, g \in \mathcal{H}_s$, since $f = \sum_{i} a_i k(\cdot, x_i)$ and $g = \sum_{j} b_j k(\cdot, x_j)$, then, according to (6), we have $\langle f, g \rangle = \mathbf{a}^{T} K \mathbf{b}$, where $\mathbf{a} = (a_1, \dots, a_n)^{T}$, $\mathbf{b} = (b_1, \dots, b_n)^{T}$, and $K = [k(x_i, x_j)]_{n \times n}$ is the kernel matrix on the samples. Since the matrix $K$ is symmetric and positive definite, the inner product on $\mathcal{H}_s$ can be defined by $K$ itself, not necessarily inherited from $\mathcal{H}$. In fact, for all $f, g \in \mathcal{H}_s$, the inner product can be defined as $\langle f, g \rangle_{\mathcal{H}_s} = \mathbf{a}^{T} K \mathbf{b}$. It can be easily proven that this definition meets the requirements of an inner product and therefore $\mathcal{H}_s$ is an inner product space. Again, since $\mathcal{H}_s$ is finite-dimensional, $\mathcal{H}_s$ is complete. In machine learning, it is the space $\mathcal{H}_s$ that is taken as the solution space of the learning problem.
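The Gram-matrix inner product $\langle f, g \rangle = \mathbf{a}^{T} K \mathbf{b}$ can be sketched numerically; the snippet also checks the reproducing-style identity $\langle f, k(\cdot, x_j) \rangle = f(x_j)$, which still holds inside the span because $k(\cdot, x_j)$ has coefficient vector $e_j$ (all names are illustrative):

```python
import numpy as np

def gram(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
K = gram(X)

a = rng.normal(size=5)                # f = sum_i a_i k(., x_i)
b = rng.normal(size=5)                # g = sum_j b_j k(., x_j)
inner_fg = a @ K @ b                  # <f, g> = a^T K b

# <f, k(., x_2)> = a^T K e_2 should equal f(x_2) = sum_i a_i k(x_i, x_2).
e2 = np.eye(5)[2]
f_at_x2 = a @ K[:, 2]
```

Symmetry of $K$ also makes the inner product symmetric, as the test below confirms.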
4. A Multikernel-Like Learning Algorithm Based on Data Probability Distribution (MKDPD)
4.1. Motivation
As shown in (5), the space of label functions is $\mathcal{H}_s = \mathrm{span}\{k(\cdot, x_i)\}_{i=1}^{n}$. This means that the functions $k(\cdot, x_i)$ play the role of basis functions of $\mathcal{H}_s$. Obviously, these basis functions depend only on the locations of the given samples and seem too simple to adapt to various probability distributions of data. Taking the Gaussian kernel function as an example, the basis functions generated from it are identical to each other and differ only in their locations in the data space. A basis function can be derived from another basis function only by translation in the data space $\mathcal{X}$. In fact, if $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$, then every $k(\cdot, x_i)$ is just a translate of $k(\cdot, 0)$. Furthermore, since the label function is $f = \sum_{i} \alpha_i k(\cdot, x_i)$, then for all $x$ with $p(x) > 0$, $f$ should give the label of $x$. This means that $\mathrm{supp}(p) \subseteq \bigcup_{i} \mathrm{supp}(k(\cdot, x_i))$, where $\mathrm{supp}(p)$ is the support of the data distribution and $\mathrm{supp}(k(\cdot, x_i))$ is the support of the basis function. If this relation does not hold, there would be $x$ such that $p(x) > 0$ but $f(x) = 0$; that is, $f$ cannot give the label of $x$.
However, the union $\bigcup_{i} \mathrm{supp}(k(\cdot, x_i))$ depends on the locations of the given samples, not on the data probability distribution $p(x)$. In practice, kernel functions are often compactly supported (or effectively so) and the data are not evenly distributed over the data space. In these cases, the label function will be overfitted in areas where too many samples are collected, underfitted in areas where too few samples are collected, or not fitted at all in areas that the union fails to cover.
Based on the above considerations, a learning algorithm based on the data probability distribution is proposed in this paper. In the proposed algorithm, the union of the supports of the basis functions depends not only on the locations of the given samples, but also on the data probability distribution.
4.2. Construction of Solution Spaces
For the convenience of description, let $k_\sigma(x, y)$ denote the kernel function, where $\sigma$ represents the parameter of the kernel function. Thus, for the given data samples $x_1, \dots, x_n$ and data probability distribution $p(x)$, the basis functions of the solution space generated from the kernel function are expressed as $k_{\sigma_i}(x, x_i)$, where $\sigma_i = \sigma(p(x_i))$, $i = 1, \dots, n$.
With these basis functions, we can span a linear space as follows: $\mathcal{H}_p = \left\{ \sum_{i=1}^{n} a_i k_{\sigma_i}(\cdot, x_i) : a_i \in \mathbb{R} \right\}$.
It is clear that $\mathcal{H}_p$ is a finite-dimensional linear space. Further, in order to define an inner product on $\mathcal{H}_p$, we first define a symmetric and positive definite matrix: $\widetilde{K} = M^{T} M + \epsilon I$, where $\epsilon > 0$, $I$ is a unit matrix, and $M$ is the matrix of basis-function evaluations, $M_{ij} = k_{\sigma_j}(x_i, x_j)$. Note that, since $\sigma_i \neq \sigma_j$ in general, the matrix $M$ is not symmetric and positive definite. However, $\widetilde{K}$ is symmetric and positive definite and can be used to define an inner product on $\mathcal{H}_p$: for all $f, g \in \mathcal{H}_p$, since $f = \sum_{i=1}^{n} a_i k_{\sigma_i}(\cdot, x_i)$ and $g = \sum_{j=1}^{n} b_j k_{\sigma_j}(\cdot, x_j)$, then $\langle f, g \rangle_{\mathcal{H}_p} = \mathbf{a}^{T} \widetilde{K} \mathbf{b}$, where $\mathbf{a} = (a_1, \dots, a_n)^{T}$ and $\mathbf{b} = (b_1, \dots, b_n)^{T}$.
It can be easily proven that $\langle \cdot, \cdot \rangle_{\mathcal{H}_p}$ meets the requirements of an inner product and therefore $\mathcal{H}_p$ is an inner product space. Furthermore, since $\mathcal{H}_p$ is finite-dimensional, $\mathcal{H}_p$ is complete; that is, $\mathcal{H}_p$ is a Hilbert space. However, it is worth noting that $\mathcal{H}_p$ is neither a RKHS nor a subspace of $\mathcal{H}$. Recall that although $\mathcal{H}_s$ is not a RKHS, it is a subspace of $\mathcal{H}$.
In the proposed algorithm, $\mathcal{H}_p$ is taken as the solution space of the learning problem. Below we explain the rationality of the definition of $\langle \cdot, \cdot \rangle_{\mathcal{H}_p}$:
(1) If $\langle \cdot, \cdot \rangle$ is an inner product on $\mathcal{H}_p$, then, by the linearity of the inner product, for all $f = \sum_{i} a_i k_{\sigma_i}(\cdot, x_i)$ and $g = \sum_{j} b_j k_{\sigma_j}(\cdot, x_j)$, we have $\langle f, g \rangle = \sum_{i} \sum_{j} a_i b_j \langle k_{\sigma_i}(\cdot, x_i), k_{\sigma_j}(\cdot, x_j) \rangle$. The inner product of a functional space is often defined as the integral of the product of functions; therefore $\langle k_{\sigma_i}(\cdot, x_i), k_{\sigma_j}(\cdot, x_j) \rangle = \int k_{\sigma_i}(x, x_i)\, k_{\sigma_j}(x, x_j)\, dx$. Approximating the integral by a summation over the samples gives $\langle f, g \rangle \approx \mathbf{a}^{T} M^{T} M \mathbf{b}$. However, the matrix $M^{T} M$ is only positive semidefinite and cannot be used to define an inner product. This problem can be easily solved by adding a regularization term $\epsilon I$, where $I$ is the unit matrix and $\epsilon > 0$ is the regularization parameter: $\widetilde{K} = M^{T} M + \epsilon I$. The matrix $\widetilde{K}$ is now symmetric and positive definite and can be used to define an inner product on $\mathcal{H}_p$. The regularization parameter $\epsilon$ also alleviates the ill-conditioning of the matrix and reduces the error stemming from the substitution of the integral with a summation.
(2) If the data probability distribution is uniform, that is, $p(x)$ is constant on its support, then $\sigma_i = \sigma$ for all $i$. In this case, all basis functions are generated from the same kernel function and $\mathcal{H}_p$ coincides with the single-kernel space $\mathcal{H}_s$. This means that if the parameter of the kernel function is not adjusted sample by sample, the proposed construction degenerates to the standard single-kernel one.
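A minimal numerical sketch of this construction follows, under our reading (reconstructed from the surrounding text) that $M_{ij} = k_{\sigma_j}(x_i, x_j)$ and $\widetilde{K} = M^{T} M + \epsilon I$; all values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 2))
sigmas = rng.uniform(0.5, 2.0, size=8)          # one bandwidth per sample

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
M = np.exp(-d2 / (2 * sigmas[None, :] ** 2))    # M_ij = k_{sigma_j}(x_i, x_j)

eps = 1e-3
K_tilde = M.T @ M + eps * np.eye(8)             # symmetric positive definite

not_symmetric = not np.allclose(M, M.T)         # differing sigmas break symmetry
min_eig = float(np.linalg.eigvalsh(K_tilde).min())
```

Since $M^{T} M$ is positive semidefinite, every eigenvalue of $\widetilde{K}$ is at least $\epsilon$, so the inner product is well defined.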
4.3. Analytic Solutions to Learning Problems
4.3.1. Analytic Solutions to Two-Class Learning Problems
In the proposed algorithm, the Hilbert space $\mathcal{H}_p$ is taken as the solution space of learning problems. Then for every $f \in \mathcal{H}_p$, $f = \sum_{i=1}^{n} \alpha_i k_{\sigma_i}(\cdot, x_i)$, we have $f(x_j) = \sum_{i=1}^{n} \alpha_i k_{\sigma_i}(x_j, x_i) = (M\boldsymbol{\alpha})_j$ and $\|f\|^2_{\mathcal{H}_p} = \boldsymbol{\alpha}^{T} \widetilde{K} \boldsymbol{\alpha}$. Based on these results, the learning problem shown in (19) reduces to a finite-dimensional optimization over the coefficient vector $\boldsymbol{\alpha}$. Furthermore, if the cost function is set to be the square-error function, that is, $C = \sum_{i=1}^{l} (f(x_i) - y_i)^2$, we have

$$\boldsymbol{\alpha}^{*} = \left( M^{T} J^{T} J M + \gamma_A \widetilde{K} + \gamma_I M^{T} L M \right)^{-1} M^{T} J^{T} \mathbf{y},$$

where $J$ is the selection matrix that picks out the $l$ labeled samples, $\gamma_A, \gamma_I > 0$ are regularization parameters, $L$ is the graph Laplacian, and $\mathbf{y} = (y_1, \dots, y_l)^{T}$. Note that the matrix $M^{T} J^{T} J M + \gamma_A \widetilde{K} + \gamma_I M^{T} L M$ is symmetric and positive definite.
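The closed-form solve can be sketched as a regularized least-squares problem; for brevity the manifold term is omitted below, and the selection matrix $J$, the bandwidths, and $\gamma$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, l = 10, 4                                     # n samples, first l of them labeled
X = rng.normal(size=(n, 2))
y = np.sign(rng.normal(size=l))                  # +/-1 labels for the labeled samples

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
M = np.exp(-d2 / 2.0)                            # basis-function evaluations (sigma_i = 1)
K_tilde = M.T @ M + 1e-3 * np.eye(n)             # SPD norm matrix

J = np.zeros((l, n)); J[np.arange(l), np.arange(l)] = 1.0   # selects labeled rows
gamma = 0.1

# alpha* = (M^T J^T J M + gamma * K_tilde)^{-1} M^T J^T y
A = M.T @ J.T @ J @ M + gamma * K_tilde
alpha = np.linalg.solve(A, M.T @ J.T @ y)
f_vals = M @ alpha                               # label-function values at the samples
```

Because the system matrix is symmetric positive definite, `np.linalg.solve` (or a Cholesky factorization) yields the unique minimizer.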
4.3.2. Analytic Solutions to Multiclass Learning Problems
In principle, the deduction shown above also applies to multiclass learning problems. In fact, for a data sample $x_i$, its label can take different values to indicate the different classes to which the sample may belong. However, in practice, these different label values may be too close to each other to facilitate the optimization. Therefore, for multiclass problems, we adopt another way to encode the data labels.
For the data sample $x_i$, let its label $\mathbf{y}_i$ be a $c$-dimensional vector, where $c$ is the number of classes. If the sample $x_i$ belongs to the $m$th class, then the $m$th component of $\mathbf{y}_i$ is set to 1 while the other components are set to zero, $1 \le m \le c$. Furthermore, a label function $f_m$ is set up to describe the probability that the data belongs to the $m$th class. Based on these notations, the multiclass problem can be expressed as follows:

$$\min_{f_1, \dots, f_c} \sum_{m=1}^{c} \left[ \sum_{i=1}^{l} \bigl(f_m(x_i) - y_{i,m}\bigr)^2 + \gamma_A \|f_m\|^2_{\mathcal{H}_p} + \gamma_I \mathbf{f}_m^{T} L \mathbf{f}_m \right],$$

where $\mathbf{f}_m = (f_m(x_1), \dots, f_m(x_n))^{T}$.
We first calculate the regularization term $\sum_{m=1}^{c} \|f_m\|^2_{\mathcal{H}_p}$. Since $f_m = \sum_{i=1}^{n} \alpha_{i,m} k_{\sigma_i}(\cdot, x_i)$, then $\|f_m\|^2_{\mathcal{H}_p} = \boldsymbol{\alpha}_m^{T} \widetilde{K} \boldsymbol{\alpha}_m$ and $\sum_{m=1}^{c} \|f_m\|^2_{\mathcal{H}_p} = \operatorname{tr}(A^{T} \widetilde{K} A)$, where $\boldsymbol{\alpha}_m = (\alpha_{1,m}, \dots, \alpha_{n,m})^{T}$ and $A = (\boldsymbol{\alpha}_1, \dots, \boldsymbol{\alpha}_c)$.
Secondly, we calculate the manifold term $\sum_{m=1}^{c} \mathbf{f}_m^{T} L \mathbf{f}_m$. Since $\mathbf{f}_m = M \boldsymbol{\alpha}_m$, then $\sum_{m=1}^{c} \mathbf{f}_m^{T} L \mathbf{f}_m = \operatorname{tr}(A^{T} M^{T} L M A)$.
Thirdly, we calculate the fitting term $\sum_{m=1}^{c} \sum_{i=1}^{l} (f_m(x_i) - y_{i,m})^2$. Let $Y = (\mathbf{y}_1, \dots, \mathbf{y}_n)^{T}$ and let $J$ be the selection matrix such that $J(MA - Y)$ retains only the rows corresponding to the labeled samples; then the fitting term equals $\operatorname{tr}\bigl[ (MA - Y)^{T} J^{T} J (MA - Y) \bigr]$.
At last, substituting (32), (33), and (34) into (30) gives the following result: $A^{*} = \left( M^{T} J^{T} J M + \gamma_A \widetilde{K} + \gamma_I M^{T} L M \right)^{-1} M^{T} J^{T} J Y$.
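A sketch of the multiclass closed form with one-hot labels follows; here the manifold term is omitted for brevity, $J$ is written as an $n \times n$ diagonal selection matrix (so that $J^{T} J = J$), and all names and values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n, l, c = 12, 6, 3                       # samples, labeled count, classes
X = rng.normal(size=(n, 2))
classes = rng.integers(0, c, size=l)

Y = np.zeros((n, c))
Y[np.arange(l), classes] = 1.0           # one-hot rows for the labeled samples

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
M = np.exp(-d2 / 2.0)                    # basis-function evaluations
K_tilde = M.T @ M + 1e-3 * np.eye(n)
J = np.diag((np.arange(n) < l).astype(float))   # diagonal selection: J^T J = J

gamma = 0.1
A = np.linalg.solve(M.T @ J @ M + gamma * K_tilde, M.T @ J @ Y)  # n x c coefficients
pred = np.argmax(M @ A, axis=1)          # predicted class = largest label function
```

The same symmetric positive definite system matrix is reused for all $c$ right-hand sides, so the multiclass solve costs little more than the two-class one.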
4.4. The Framework of Multikernel-Like Learning Algorithms
In the algorithm proposed in this paper, the label function is set to be $f(x) = \sum_{i} \alpha_i k_{\sigma_i}(x, x_i)$, where the parameter $\sigma_i$ is adjusted according to the data probability distribution at the sample $x_i$. In general, the data probability distribution is not uniform and therefore the parameters $\sigma_i$ differ sample by sample. As a result, the functions $k_{\sigma_i}$ are different kernel functions and produce different RKHS. In this sense, the algorithm proposed in this paper can be regarded as a kind of multikernel learning algorithm, but it is quite different from the commonly used multikernel learning algorithms.
In the commonly used multikernel learning algorithms, the multikernel function is a linear combination of multiple basic kernel functions: $k(x, y) = \sum_{m} \beta_m k_m(x, y)$, where the functions $k_m$ are called basic kernel functions, while the function $k$ is called the multikernel function. Since the basic kernel functions are symmetric and positive definite, it can easily be proven that the multikernel function is also symmetric and positive definite. Therefore the label function based on the multikernel function can be expressed as $f(x) = \sum_{i} \alpha_i \sum_{m} \beta_m k_m(x, x_i)$ (36), where the coefficients $\alpha_i$ and $\beta_m$ are determined through machine learning.
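The conventional combined kernel can be sketched as follows; it illustrates only that a nonnegative combination of Gaussian Gram matrices remains symmetric and positive definite (the base bandwidths and weights are arbitrary choices):

```python
import numpy as np

def gaussian_gram(X, sigma):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(5)
X = rng.normal(size=(6, 2))

base_sigmas = [0.5, 1.0, 2.0]
beta = [0.2, 0.5, 0.3]                           # nonnegative combination weights
K_multi = sum(b * gaussian_gram(X, s) for b, s in zip(beta, base_sigmas))

symmetric = bool(np.allclose(K_multi, K_multi.T))
min_eig = float(np.linalg.eigvalsh(K_multi).min())
```

In MKL proper, the weights `beta` would be learned jointly with the classifier rather than fixed in advance.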
If we follow the ideas of the commonly used multikernel learning algorithms and regard the functions $k_{\sigma_i}$ as the basic kernel functions, then the multikernel function becomes $k(x, y) = \sum_{i} \beta_i k_{\sigma_i}(x, y)$. Thus, according to (36), the label function based on the multikernel function becomes $f(x) = \sum_{j} \alpha_j \sum_{i} \beta_i k_{\sigma_i}(x, x_j)$ (37). It can be seen from (37) that, no matter how the coefficients $\alpha_j$ and $\beta_i$ are adjusted, it is impossible to make the label function equal $\sum_{i} \alpha_i k_{\sigma_i}(x, x_i)$. From the perspective of solution spaces, in the solution space of (37), there are $n$ basis functions around each data sample $x_j$, while in the solution space of the proposed algorithm, there is only one basis function around each data sample $x_i$. Therefore the algorithm proposed in this paper is quite different from the commonly used multikernel learning algorithms.
Nevertheless, the algorithm proposed in this paper still belongs to the realm of multikernel learning. As stated above, the functions $k_{\sigma_i}$ are different kernel functions and can produce different RKHS $\mathcal{H}^{(i)}$. Now let $\mathcal{H}^{(i)}_s = \mathrm{span}\{k_{\sigma_i}(\cdot, x_i)\}$; as stated in Section 3.1, $\mathcal{H}^{(i)}_s$ is then a 1-dimensional subspace of $\mathcal{H}^{(i)}$. Furthermore, the direct sum of these subspaces turns out to be $\mathcal{H}^{(1)}_s \oplus \cdots \oplus \mathcal{H}^{(n)}_s = \mathrm{span}\{k_{\sigma_i}(\cdot, x_i)\}_{i=1}^{n}$. Obviously, this direct sum is exactly the solution space $\mathcal{H}_p$.
Because our algorithm differs from the commonly used multikernel learning algorithms but still involves multiple kernel functions, we call it a multikernel-like algorithm.
5. An MKDPD-Based Algorithm for Labeling New Coming Data
5.1. Problems
How to label new coming data $x_{n+1}, \dots, x_{n+t}$ has been a hot topic in machine learning, where $t$ represents the number of new coming data. Generally speaking, there are two extreme methods for labeling new coming data: relearning methods and unlearning methods.
The relearning methods regard the new coming data as unlabeled samples and mix them with the original samples to form a new sample set $x_1, \dots, x_n, x_{n+1}, \dots, x_{n+t}$. Based on this enlarged sample set, these methods relearn the coefficients of the label function. The labels of the new coming data are then given by the retrained label function.
The unlearning methods make use of the original coefficients of the label function and directly evaluate it on the new coming data.
Obviously, in terms of accuracy, the relearning methods perform best, while the unlearning methods perform worst. In terms of efficiency, the unlearning methods perform best, while the relearning methods perform worst. For years, researchers have been hovering between these two extremes, trying to find tradeoffs between accuracy and efficiency.
5.2. An MKDPD-Based Algorithm for Labeling New Coming Data
As stated in Section 4, there are two learning stages in the MKDPD algorithm. In the first stage, the MKDPD algorithm adjusts the parameters of the kernel functions according to the data probability distribution. In the second stage, it determines the coefficients of the label function. Therefore, in the framework of the MKDPD algorithm, there are at least three ways to label new coming data:
(1) The MKDPD-based relearning method, which recalculates both the parameters $\sigma_i$ and the coefficients $\alpha_i$ on the mixed sample set. Obviously, the relearning method achieves the best accuracy but performs worst in efficiency.
(2) The MKDPD-based unlearning method, which recalculates neither the parameters nor the coefficients. Obviously, the unlearning method achieves the best efficiency but performs worst in accuracy.
(3) The MKDPD-based semi-learning method, which regards the new coming data as unlabeled samples, mixes them with the original samples, and retrains only the parameters of the kernel functions. The coefficients remain unchanged and are combined with the retrained kernel functions to label the new coming data. The semi-learning method takes full advantage of the two learning stages of the MKDPD algorithm and achieves a better tradeoff between computational accuracy and efficiency.
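The semi-learning strategy can be sketched as follows; the neighbor-count density estimate, the fixed radius, and the rule $\sigma_i = c / p(x_i)$ are one assumed concrete instantiation of "retrain only the kernel parameters":

```python
import numpy as np

def neighbor_density(X, radius=1.0):
    """Crude probability estimate: neighbor counts within `radius` (an assumption)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return (d2 < radius ** 2).sum(1).astype(float)

rng = np.random.default_rng(6)
X_old = rng.normal(size=(10, 2))
alpha = rng.normal(size=10)              # coefficients from the first training, kept fixed

X_new = rng.normal(size=(4, 2)) + 0.5    # new coming data
X_all = np.vstack([X_old, X_new])

# Semi-learning: re-estimate the per-sample bandwidths on old + new data,
# but reuse the original coefficients alpha unchanged.
c = 1.0
sig = (c / neighbor_density(X_all))[:10]           # sigma_i = c / p(x_i)
d2 = ((X_new[:, None, :] - X_old[None, :, :]) ** 2).sum(-1)
labels = np.sign(np.exp(-d2 / (2 * sig[None, :] ** 2)) @ alpha)
```

Only the cheap density estimate is recomputed when new data arrive; the expensive coefficient solve is skipped entirely.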
5.3. Error and Efficiency Analysis
5.3.1. Experimental Data and Experimental Settings
We test our algorithm in the framework of manifold regularization, and the experimental data are therefore downloaded from the website of manifold regularization (http://manifold.cs.uchicago.edu/manifold_regularization/manifold.html). There are a total of 400 data points collected from two half-moons: 200 points from one half-moon and 200 points from the other.
From each half-moon, we randomly take 1 point as a labeled sample and 99 points as unlabeled samples. The remaining 200 points are taken as new coming data for labeling.
In order to alleviate the effect of random sampling on the objectivity of the experimental results, the random sampling is repeated $N$ times, and each run produces an experimental result. The average of these results is taken as the final result. $N$ is set to 30, 50, 70, and 90, respectively (Table 1).

5.3.2. Error Analysis
Table 1 lists the error rates of the various algorithms for labeling new coming data. Not surprisingly, the relearning method yields the lowest error rate, the semi-learning method the next lowest, and the unlearning method the highest. This order coincides with the amount of information these algorithms exploit from the new coming data.
In addition, the error rate of each MKDPD-based method is smaller than that of its single-kernel counterpart. It can be seen from their formulae that the basis functions of the MKDPD-based methods exploit not only the locations of the samples but also the probabilities of the data at the samples, while the basis functions of the single-kernel methods exploit only the locations of the samples.
Figure 1(a) shows how the error rates of the various algorithms change with the number of new coming data. Again, the error rate of the semi-learning method lies between those of the relearning and unlearning methods.
(a)
(b)
5.3.3. Efficiency Analysis
Figure 1(b) shows the runtimes of the various algorithms when labeling the new coming data. It can be seen from Figure 1(b) that the runtime of the relearning method increases exponentially with the number of new coming data, while the runtimes of the other methods remain almost unchanged.
Since the unlearning methods make use of the original parameters and coefficients to label the new coming data, their runtimes do not change no matter how many new data arrive. In the proposed semi-learning algorithm, although the parameters are retrained according to the new coming data, its runtime also remains almost unchanged as the number of new coming data grows. This means that the semi-learning method gains a certain amount of accuracy without increasing its runtime.
6. Experiments
6.1. Adjustment of Parameters of Kernel Functions
In the proposed MKDPD algorithm, the basis functions of the label function are set to be $k_{\sigma_i}(x, x_i)$, where $\sigma_i = \sigma(p(x_i))$, $i = 1, \dots, n$. The schemes for adjusting the parameters $\sigma_i$ are open; various schemes can be adopted according to the specific application. No matter how the parameters are adjusted, the structure of the analytic solution shown in (19) does not change in the framework of the proposed MKDPD algorithm. In the experiments presented in this paper, the scheme for adjusting the parameters is based on the following considerations:
(1) The parameters should be adjusted so that the union of the supports of the basis functions covers the support of the data distribution $p$. In this way, for all $x$ with $p(x) > 0$, the label function can give the label of $x$.
(2) If the value of $p(x_i)$ is large, the number of samples gathering in the neighborhood of $x_i$ will be large too, because they are more likely to be collected. In order to prevent overfitting in this area, it is reasonable to adjust the parameter to reduce the scope of the support of $k_{\sigma_i}(\cdot, x_i)$. Conversely, if the value of $p(x_i)$ is small, the number of samples will be small too, because they are less likely to be collected. In order to prevent underfitting in this area, it is reasonable to adjust the parameter to expand the scope of the support. This means that the probability is inversely proportional to the scope of the support.
(3) In the following experiments, we adopt Gaussian kernel functions $k_\sigma(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$. The Gaussian kernel function can be regarded as an effectively compactly supported function: the sample $x_i$ is its center, and $\sigma_i$ is often regarded as its effective radius. Therefore, the parameter $\sigma_i$ is proportional to the scope of the support, or inversely proportional to the probability $p(x_i)$; that is, $\sigma_i = c / p(x_i)$, where $c$ is an adjustable parameter. In our experiments, the probability at the sample $x_i$ is estimated from the number $n_i$ of its neighbors.
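Consideration (3) can be sketched concretely; the k-NN-based probability estimate below is an assumed instantiation of the neighbor-based estimate described above, with arbitrary cluster shapes and constants:

```python
import numpy as np

rng = np.random.default_rng(7)
dense = rng.normal(0.0, 0.2, size=(30, 2))       # densely sampled region
sparse = rng.normal(5.0, 1.5, size=(10, 2))      # sparsely sampled region
X = np.vstack([dense, sparse])

def knn_probability(X, k=5):
    """p(x_i) taken inversely proportional to the mean k-NN distance (an assumption)."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    knn = np.sort(d, axis=1)[:, 1:k + 1]         # k nearest neighbors, excluding self
    p = 1.0 / knn.mean(1)
    return p / p.sum()

p = knn_probability(X)
c = 0.05
sigma = c / p                                    # sigma_i = c / p(x_i)

# Dense-region samples receive smaller bandwidths than sparse-region samples.
dense_mean, sparse_mean = sigma[:30].mean(), sigma[30:].mean()
```

The narrow bandwidths in the dense region guard against overfitting there, while the wide bandwidths in the sparse region keep its support covered.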
6.2. Experiments on Synthetic Dataset (Two Moons Dataset)
The synthetic dataset is the Two Moons Dataset, which has already been used in the experiments in Section 5.
The Two Moons dataset contains 400 data points unevenly collected from two half-moons: 200 points from one half-moon and 200 from the other. We randomly take 100 points from each half-moon as samples, of which 1 sample is labeled and the others are unlabeled. The remaining 200 points of the Two Moons dataset are taken as the test data (i.e., the new coming data of Section 5).
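The split described above can be reproduced on a synthetic stand-in. The names `two_moons` and `sample_split` are ours, and the moon parametrization and noise level are assumptions rather than the paper's exact generator:

```python
import numpy as np

def two_moons(n_per_moon=200, noise=0.1, seed=0):
    """Two interleaving noisy half-circles, a stand-in for the
    paper's Two Moons data."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, np.pi, n_per_moon)
    upper = np.c_[np.cos(t), np.sin(t)]
    lower = np.c_[1.0 - np.cos(t), 0.5 - np.sin(t)]
    X = np.vstack([upper, lower])
    X += rng.normal(scale=noise, size=X.shape)
    y = np.r_[np.zeros(n_per_moon), np.ones(n_per_moon)]
    return X, y

def sample_split(y, n_train=100, n_labeled=1, seed=0):
    """Per class: n_train training points (n_labeled of them labeled),
    the rest held out as test data (the 'new coming data')."""
    rng = np.random.default_rng(seed)
    labeled, unlabeled, test = [], [], []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        labeled += list(idx[:n_labeled])
        unlabeled += list(idx[n_labeled:n_train])
        test += list(idx[n_train:])
    return labeled, unlabeled, test
```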
Figure 2 shows the experimental results of the proposed MKDPD algorithm and the MR algorithm. In Figures 2(a) and 2(b), although the labeled sample in the upper half-moon is badly located, the proposed MKDPD algorithm can still separate the upper half-moon from the lower one, while the MR algorithm fails at one corner of the upper half-moon. In Figures 2(c) and 2(d), the labeled sample in the upper half-moon lies much nearer its center, and accordingly the performance of the MR algorithm improves considerably. However, the proposed MKDPD algorithm still outperforms the MR algorithm in this setting.
The labeled samples are randomly taken several times, and each draw produces an experimental result. The average of these experimental results is listed in Table 2. It can be seen from Table 2 that the proposed MKDPD algorithm outperforms the MR algorithm under all circumstances.

Both the proposed MKDPD algorithm and the MR algorithm contain two regularization terms: a manifold regularization term and a function regularization term. We keep the coefficient of one term unchanged while letting the other vary from 0 to 0.2 and compare the performances (see Figure 3). As shown in Figure 3, the error rates of both algorithms decrease as the parameter decreases, but the error rate of the proposed MKDPD algorithm decreases much faster than that of the MR algorithm.
6.3. Recognition of Handwritten Digits
6.3.1. USPS and MNIST Datasets
The USPS (United States Postal Service) dataset (http://www.cs.nyu.edu/~roweis/data.html) is a very popular dataset of handwritten digits, containing the 10 digits from “0” to “9”; each digit has 1100 images, and each 16 × 16 image can be converted into a 256-dimensional vector. We take the first 400 images of each digit as the samples and the remaining images as the test data (new coming data).
MNIST (http://yann.lecun.com/exdb/mnist/) is another popular handwritten-digit dataset, containing a training set of 60,000 images and a test set of 10,000 images. Each 28 × 28 image in MNIST can be converted into a 784-dimensional vector. For each digit, we select 400 images from the training set as the samples and take all data in the test set as the test data (new coming data).
6.3.2. The Two-Class Experiments
We randomly select two different digits from the ten digits to construct a binary classification problem, giving a total of 45 classification problems. For each binary classification problem, one sample of each digit is taken as the labeled sample and the remaining samples are taken as unlabeled samples. To reduce the effect of randomness, the labeled samples are randomly selected ten times, and each selection produces an experimental result. The average of the ten experimental results is reported as the final result.
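The protocol above (45 digit pairs, randomly drawn labeled samples, repeated runs) can be sketched as follows; the function names are ours:

```python
from itertools import combinations
import random

def binary_problems(n_classes=10):
    """All unordered digit pairs: C(10, 2) = 45 binary problems."""
    return list(combinations(range(n_classes), 2))

def pick_labeled(sample_ids, n_labeled=1, seed=0):
    """Randomly choose which samples of one digit carry labels;
    repeating with different seeds averages over ten runs."""
    rng = random.Random(seed)
    return rng.sample(sample_ids, n_labeled)
```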
The experimental results of the proposed MKDPD algorithm and the MR algorithm are presented in Figure 4. In Figures 4(a)–4(d), the horizontal axis indexes the 45 binary classification problems and the vertical axis represents the error rates. The error rates on the unlabeled samples are shown in Figures 4(a) and 4(c), and the error rates on the test data are shown in Figures 4(b) and 4(d). As can be seen, the error rates of the proposed MKDPD algorithm are lower than those of the MR algorithm. Furthermore, the averages over the 45 binary classification problems are listed in Table 3, from which it can be seen that the proposed MKDPD algorithm outperforms the MR algorithm.

In Figures 4(e)–4(h), the horizontal axis represents the error rate on the unlabeled samples and the vertical axis the error rate on the test data. For a good learning algorithm, these two error rates should be close to each other; that is, the scatter points in Figures 4(e)–4(h) should lie close to the diagonal line. Again, in this respect, the proposed MKDPD algorithm performs better than the MR algorithm.
6.3.3. The Multiclass Experiments
In the multiclass experiments there are 10 classes, one per digit. The labeled samples of each digit are randomly selected from its samples 10 times, and each selection produces an experimental result. The average of the 10 experimental results is taken as the final result and listed in the first and second columns of Table 5, where the number of labeled samples is set to 1, 3, and 5, respectively. It can be seen from Table 5 that the proposed MKDPD algorithm outperforms the MR algorithm.
6.4. Recognition of Spoken Letters
6.4.1. ISOLET Dataset
ISOLET is a dataset of spoken letters that can be downloaded from the UCI Machine Learning Repository. It contains the utterances of 150 speakers, each of whom spoke the 26 English letters twice, so each speaker has 52 utterances. In the experiment, two subsets of ISOLET, named ISOLET1 and ISOLET5, are downloaded directly from http://manifold.cs.uchicago.edu/manifold_regularization/manifold.html. Each subset contains the utterances of 30 speakers. We take the data in ISOLET1 as the samples and the data in ISOLET5 as the test data (new coming data).
6.4.2. The Two-Class Experiments
In the two-class experiments, the utterances of the first 13 English letters form one class and the utterances of the last 13 English letters form the other class. We take the 52 utterances of one speaker from ISOLET1 as the labeled samples and the utterances of the other speakers in ISOLET1 as the unlabeled samples. Since there are 30 speakers in ISOLET1, we can construct 30 two-class experiments this way; the 30 experimental results are presented in Figure 5. As can be seen from Figure 5, the proposed MKDPD algorithm outperforms the MR algorithm. The averages of the 30 experimental results are listed in Table 4, which also shows that the proposed MKDPD algorithm performs better than the MR algorithm.
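A minimal sketch of this split, assuming each utterance carries a letter index (0–25) and a speaker index; the function name and index encoding are our assumptions:

```python
def isolet_two_class(letter_ids, speaker_ids, labeled_speaker):
    """First 13 letters (indices 0-12) -> class 0, last 13 -> class 1;
    the chosen speaker's utterances serve as the labeled samples."""
    y = [0 if letter < 13 else 1 for letter in letter_ids]
    labeled = [i for i, s in enumerate(speaker_ids)
               if s == labeled_speaker]
    return y, labeled
```

Looping `labeled_speaker` over all 30 speakers of ISOLET1 yields the 30 experiments described above.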


6.4.3. The Multiclass Experiments
In the multiclass experiments, the utterances of each English letter form a class, giving 26 classes. We randomly select a number of speakers from ISOLET1 and take their utterances as the labeled samples; the utterances of the other speakers in ISOLET1 are taken as the unlabeled samples. To alleviate the effect of randomness, the selection of speakers is performed 10 times, and each selection produces an experimental result. The averages of the 10 experimental results are taken as the final results and listed in the third column of Table 5. It can be observed from Table 5 that the proposed MKDPD algorithm achieves improvements of about 1%–3% over the MR algorithm.
6.5. Face Recognition
6.5.1. Yale B and CMU-PIE Datasets
Yale B (http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html) is a dataset of face images of 10 persons, each photographed under 9 poses and 64 illuminations. In the experiment, we select the face images of 8 persons under the 64 illuminations as the experimental dataset; each image is cropped and resized. For each person, 50% of the face images are taken as the samples and the rest as the test data (new coming data). CMU-PIE [38] is another face-image dataset, consisting of more than 40,000 images of 68 persons, each photographed under 4 expressions, 24 illuminations, and 13 poses. In the experiment, we select the face images of 8 persons under 3 poses and 24 illuminations as the experimental dataset; each image is cropped and resized. Again, for each person, 50% of the face images are taken as the samples and the rest as the test data (new coming data).
6.5.2. The Two-Class Experiments
We select the face images of two persons to construct a binary classification problem, giving a total of 28 binary classification problems. For each problem, a number of samples of each person are taken as the labeled samples and the remaining samples as the unlabeled samples. To reduce the effect of randomness, the labeled samples are randomly drawn 10 times, and each draw produces an experimental result. The average of the 10 experimental results is reported as the final result (see Figures 6 and 7), where the number of labeled samples per person is varied up to 7. The average over the 28 binary classification problems is listed in Table 4. It can be seen that, compared with the MR algorithm, the proposed MKDPD algorithm achieves improvements of about 1%–5%.
6.5.3. The Multiclass Experiments
In the multiclass experiments, the face images of each person form one class, giving 8 classes. A number of samples of each person are taken as the labeled samples and the remaining samples as the unlabeled samples. Again, the labeled samples are randomly drawn 10 times, and each draw produces an experimental result. The average of the 10 experimental results is reported as the final result and listed in the last four columns of Table 5, where the number of labeled samples per person is varied up to 7. As can be seen, the proposed MKDPD algorithm achieves improvements of about 1%–12% over the MR algorithm.
7. Conclusion
In machine learning, one variable of a kernel function is often anchored on each given sample, thereby deriving a number of basic functions for the label function. The weights of the basic functions in the label function are then trained by exploiting the labels of the labeled samples. Basic functions derived this way are identical in shape and differ only in their positions in the data space. Such basic functions are too simple to adapt to changes in the data distribution. For example, if the given samples are distributed unevenly, then in an area with too many samples there will be too many basic functions, which may overlap too much, while in an area with too few samples there will be too few basic functions, which may overlap too little or not at all.
In the MKDPD algorithm proposed in this paper, we adjust the basic functions according to the data probabilities at the given samples. If the probability at a sample is large, the number of samples in its vicinity will also be large, and we can accordingly reduce the support of the basic function located on that sample to avoid overlapping too much with other basic functions. Likewise, if the probability at a sample is small, the number of samples in its vicinity will be small, and we can accordingly expand the support of the basic function located on that sample to avoid overlapping too little with other basic functions. The experimental results validate the proposed MKDPD algorithm.
From the perspective of data classification applications, the aim of machine learning is to label the new coming data. Usually, there are three approaches: unlearning, relearning, and semi-learning. The MKDPD algorithm proposed in this paper involves two learning processes: learning the basic functions and learning their weights. Based on the MKDPD algorithm, we propose a semi-learning method that regards the new coming data as unlabeled samples and mixes them with the original samples to relearn the basic functions, but still uses the original weights to combine the new basic functions when labeling the new coming data. The proposed MKDPD-based method for labeling new coming data achieves a better tradeoff between accuracy and computational efficiency.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This work is supported in part by the Guangdong Provincial Science and Technology Major Projects of China under Grant 6700042020009.
References
 E. Parzen, “On estimation of a probability density function and mode,” Annals of Mathematical Statistics, vol. 33, pp. 1065–1076, 1962.
 S. Kay, “Model-based probability density function estimation,” IEEE Signal Processing Letters, vol. 5, no. 12, pp. 318–320, 1998.
 A. G. Bors and N. Nasios, “Kernel bandwidth estimation in methods based on probability density function modelling,” in Proceedings of the 19th International Conference on Pattern Recognition (ICPR '08), Tampa, Fla, USA, December 2008.
 J. A. K. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.
 L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, New York, NY, USA, 2004.
 B. Schölkopf, A. Smola, and K.-R. Müller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.
 S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, “Fisher discriminant analysis with kernels,” in Proceedings of the IEEE Signal Processing Society Workshop on Neural Networks for Signal Processing IX, pp. 41–48, Madison, Wis, USA, August 1999.
 M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: a geometric framework for learning from labeled and unlabeled examples,” Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.
 S. Melacci and M. Belkin, “Laplacian support vector machines trained in the primal,” Journal of Machine Learning Research, vol. 12, pp. 1149–1184, 2011.
 B. Geng, D. Tao, C. Xu, L. Yang, and X.-S. Hua, “Ensemble manifold regularization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 6, pp. 1227–1233, 2012.
 Y. Luo, D. Tao, B. Geng, C. Xu, and S. J. Maybank, “Manifold regularized multitask learning for semi-supervised multilabel image classification,” IEEE Transactions on Image Processing, vol. 22, no. 2, pp. 523–536, 2013.
 M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003.
 S. S. Bucak, R. Jin, and A. K. Jain, “Multiple kernel learning for visual object recognition: a review,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1354–1369, 2014.
 M. Gönen and E. Alpaydın, “Multiple kernel learning algorithms,” Journal of Machine Learning Research, vol. 12, pp. 2211–2268, 2011.
 F. A. Tobar, S.-Y. Kung, and D. P. Mandic, “Multikernel least mean square algorithm,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 2, pp. 265–277, 2014.
 Y.-L. Xu, D.-R. Chen, H.-X. Li, and L. Liu, “Least square regularized regression in sum space,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 4, pp. 635–646, 2013.
 J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, UK, 2004.
 X. Zhu, “Semi-supervised learning literature survey,” Computer Sciences TR 1530, 2008.
 P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. Fisherfaces: recognition using class specific linear projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.
 V. Sindhwani, P. Niyogi, and M. Belkin, “Beyond the point cloud: from transductive to semi-supervised learning,” in Proceedings of the 22nd International Conference on Machine Learning (ICML '05), pp. 825–832, August 2005.
 X. Zhu, Z. Ghahramani, and J. Lafferty, “Semi-supervised learning using Gaussian fields and harmonic functions,” in Proceedings of the 20th International Conference on Machine Learning (ICML '03), Washington, DC, USA, 2003.
 D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in Advances in Neural Information Processing Systems 16, 2004.
 F. Nie, D. Xu, I. W. Tsang, and C. Zhang, “Flexible manifold embedding: a framework for semi-supervised and unsupervised dimension reduction,” IEEE Transactions on Image Processing, vol. 19, no. 7, pp. 1921–1932, 2010.
 M. Fan, N. Gu, H. Qiao, and B. Zhang, “Sparse regularization for semi-supervised classification,” Pattern Recognition, vol. 44, no. 8, pp. 1777–1784, 2011.
 J.-W. Xu, A. R. C. Paiva, I. Park, and J. C. Principe, “A reproducing kernel Hilbert space framework for information-theoretic learning,” IEEE Transactions on Signal Processing, vol. 56, no. 12, pp. 5891–5902, 2008.
 S. Liwicki, S. Zafeiriou, G. Tzimiropoulos, and M. Pantic, “Efficient online subspace learning with an indefinite kernel for visual tracking and recognition,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 10, pp. 1624–1636, 2012.
 K. Slavakis, P. Bouboulis, and S. Theodoridis, “Adaptive multiregression in reproducing kernel Hilbert spaces: the multiaccess MIMO channel case,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 2, pp. 260–276, 2012.
 G. Li, C. Wen, Z. Li, A. Zhang, F. Yang, and K. Mao, “Model-based online learning with kernels,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 3, pp. 356–369, 2013.
 P. Gurram and H. Kwon, “Contextual SVM using Hilbert space embedding for hyperspectral classification,” IEEE Geoscience and Remote Sensing Letters, vol. 10, no. 5, pp. 1031–1035, 2013.
 Y. Gu, S. Wang, and X. Jia, “Spectral unmixing in multiple-kernel Hilbert space for hyperspectral imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 7, pp. 3968–3981, 2013.
 B. Schölkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2001.
 S. Wu and S.-I. Amari, “Conformal transformation of kernel functions: a data-dependent way to improve support vector machine classifiers,” Neural Processing Letters, vol. 15, no. 1, pp. 59–67, 2002.
 P. Gurram and H. Kwon, “Sparse kernel-based ensemble learning with fully optimized kernel parameters for hyperspectral classification problems,” IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 2, pp. 787–802, 2013.
 J. Huang, P. C. Yuen, W.-S. Chen, and J. H. Lai, “Choosing parameters of kernel subspace LDA for recognition of face images under pose and illumination variations,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 37, no. 4, pp. 847–862, 2007.
 G. R. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan, “Learning the kernel matrix with semidefinite programming,” Journal of Machine Learning Research, vol. 5, pp. 27–72, 2004.
 A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet, “SimpleMKL,” Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.
 J. J. Thiagarajan, K. N. Ramamurthy, and A. Spanias, “Multiple kernel sparse representations for supervised and unsupervised learning,” IEEE Transactions on Image Processing, vol. 23, no. 7, pp. 2905–2915, 2014.
 T. Sim, S. Baker, and M. Bsat, “The CMU pose, illumination, and expression database,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 12, pp. 1615–1618, 2003.
Copyright
Copyright © 2016 Guo Niu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.