Abstract

Image classifiers based on multikernel learning (MKL) currently rely mostly on batch approaches, which are slow and difficult to scale to large datasets. Moreover, the standard MKL model neglects the correlations among the examples associated with a specific kernel, so these correlations cannot be exploited to adjust the kernel combination coefficients. To address these issues, a new and efficient multikernel multiclass algorithm called TripleReg-MKL is proposed in this work. Guided by the principle of strongly convex optimization, we propose a new triple-norm regularizer (TripleReg) to constrain the empirical loss objective function, which exploits the correlations among examples to tune the kernel weights. The algorithm combines a multivariate hinge loss with a conservative updating strategy to filter noisy samples, thereby reducing the model complexity. This novel MKL formulation is then solved in an online mode using a primal-dual framework. A theoretical analysis of the complexity and convergence of TripleReg-MKL is presented. It shows that the new algorithm has a complexity of and achieves a fast convergence rate of . Extensive experiments on four benchmark datasets demonstrate the effectiveness and robustness of this new approach.

1. Introduction

Image semantic classification is a challenging task in the field of computer vision. Researchers are constantly searching for efficient learning methods with good scalability to categorize large and complex image datasets [16]. Among the current approaches used for image categorization, multikernel learning (MKL) [7–11] has been the subject of many recent studies, and it delivers state-of-the-art performance by solving a joint optimization problem that comprises the sample coefficients of the base kernel classifiers and the optimal weights for combining the multiple kernels associated with multiple clues [5].

However, most MKL methods use batch learning [6, 12–16], which is slow and does not scale well to large training datasets. To this end, various online methods have been proposed to facilitate efficient learning and real-time application [17–23]. These online MKL methods involve different regularization techniques and updating rules. For example, Hoi et al. [17] used a Perceptron algorithm [23] to learn a base classifier for a given kernel before applying the Hedge algorithm [24] to combine multiple classifiers linearly. However, regularization was not considered in this formulation. Cavallanti et al. [22] proposed an -norm multiview Perceptron algorithm, in which differences among the clue-related kernel spaces were neglected.

Given the complex parameter structures of multikernel models, more researchers are using mixed-norm regularization terms to integrate sophisticated prior knowledge and to handle the parameter mutation caused by data noise. The -norm was first proposed as a regularizer for multiclass MKL [9]. This approach induces absolute sparsity in the domain of the kernels, but it may weaken the convexity of the optimization problem or lead to poor performance [25]. Thus, a general type of group norm, the -norm , was proposed in [19, 26] to provide greater flexibility when tuning the level of sparsity required for a task. However, the algorithm has difficulty achieving convergence when is close to 1. In addition, an elastic-net form of regularization is available for MKL, which allows the solution to contain exact zeros and is effective for filtering out invalid kernels [20, 27]. In summary, the aforementioned norms impose constraints on kernels or classes but neglect the correlations among examples.

In this paper, a new algorithm called TripleReg-MKL is proposed. It defines a triple-norm regularizer (abbreviated as TripleReg) with strong convexity to constrain the empirical risk on the current incoming samples. An online solution is derived for this new MKL formulation using a primal-dual framework. Because the correlations among examples are considered in TripleReg-MKL to tune the sparsity of the model, the updating of the kernel weights involves the historical cumulative effects of the overall online training procedure [19, 20]. The algorithm also combines a multivariate hinge loss with a conservative updating strategy to filter noisy samples. A theoretical derivation is presented, together with an analysis of complexity and convergence. Extensive experiments on four benchmark datasets verify these claims.

2. TripleReg-MKL Algorithm

2.1. Multiclass Multikernel Problem

Suppose that we are given a set of training samples , where each sample is represented as an instance-label pair . Here, denotes the label of a sample. The instance denotes the corresponding measurements of features, and each is a multivariate feature vector that describes one visual characteristic of the th sample. The multiclass classifier is then defined by (1), where is the value of the score function when the instance is assigned to the class and is the predicted class for which the score function achieves the highest value. Based on a consideration of multikernel integration, the score function is defined as (2), where are the model parameters and is the nonlinear mapping function that transforms the features into an arbitrary high-dimensional reproducing kernel Hilbert space (RKHS). is the inner product operation between the matrices and , which is defined by (3).

In the multiclass setup, the concept of class is introduced into the definition of the mapping function. This differs from traditional kernel machines, which ignore the class label information in the kernel definition. Specifically, we define (4), where is a label-free feature map [19]. Correspondingly, the model parameter comprises blocks in each feature space; that is, . That is to say, both and carry class information and feature clues. Therefore, we obtain (5). Now, the goal is to learn the multiclass score function parameterized by parameter matrices, each of which is denoted by .
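To make this setup concrete, the following minimal Python sketch computes the per-class score of an instance by summing, over kernels, the inner products between class- and kernel-specific weight blocks and the corresponding feature maps; the array names and shapes are our own illustrative choices rather than the paper's notation.

import numpy as np

def multikernel_scores(x_feats, W):
    # x_feats: list of K feature vectors (one per clue); x_feats[k] has length d_k
    # W: list of K weight matrices; W[k] has shape (n_classes, d_k), i.e., one
    #    weight block per class in each feature space
    n_classes = W[0].shape[0]
    scores = np.zeros(n_classes)
    for phi_k, W_k in zip(x_feats, W):
        scores += W_k @ phi_k              # accumulate the contribution of each kernel
    return scores

def predict(x_feats, W):
    # the predicted class is the one whose score function attains the highest value
    return int(np.argmax(multikernel_scores(x_feats, W)))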

To obtain the solution of the optimization problem, the primal objective function that needs to be minimized is defined as (6), where is the triple-norm regularizer (TripleReg) used to measure the complexity of and to constrain the problem in a low-complexity domain. The second term is the global loss that accumulates the hinge losses over all possible samples in the training set. is the instantaneous loss function that measures the discrepancy between the predicted answer and the correct answer; specifically, it can be denoted by at the th iteration. Here, a convex multivariate hinge-loss function for multiclass categorization is defined as (7). is a parameter that trades off the empirical loss against the regularization term. From (6), the essence of this optimization problem is to learn the optimal weight that minimizes the cumulative loss incurred over the sequence of observations under regularization.
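For concreteness, one common form of such a multivariate hinge loss compares the correct-class score with the strongest interference class; the unit margin below is an illustrative choice, not necessarily the constant used in (7).

import numpy as np

def multiclass_hinge(scores, y, margin=1.0):
    # scores: (n_classes,) array holding the score of the instance for every class
    # y: index of the correct class
    competing = np.delete(scores, y)       # scores of all classes except the correct one
    worst_violation = np.max(competing)    # strongest interference class
    return max(0.0, margin - (scores[y] - worst_violation))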

2.2. Triple-Norm Regularizer

According to Section 2.1, each component of associated with a specific class and kernel, that is, , is a coefficient vector. This inherently implies the triple structure of the model parameter. Thus, the triple-norm-based regularization is designed as in (8), where is given by (9) and . From (8), the regularizer (TripleReg) is strongly convex owing to the squared form of the triple norm and the value range of . In (9), is applied to each , which carries the sample information, to obtain a real-valued matrix . Then, is applied to each column of to obtain a vector in , upon which the norm is applied to yield the value of . TripleReg is therefore a combination of three norms, imposed on the sample-, class-, and kernel-related coefficients, respectively. Selecting the values of the parameters and allows the sparsity level of the solution to be tuned in a flexible manner. The strong convexity of TripleReg with respect to is proved, and its strong convexity argument is derived, in Appendix A.
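The nesting described above can be sketched as follows; the exponents p and q, the one-half scaling, and the axis ordering (classes, kernels, samples) are illustrative assumptions rather than the paper's exact choices in (8) and (9).

import numpy as np

def triple_norm_reg(W, p=2.0, q=2.0):
    # W: coefficient tensor of shape (n_classes, n_kernels, n_samples)
    per_class_kernel = np.linalg.norm(W, ord=2, axis=2)              # L2 over sample coefficients
    per_kernel = np.sum(per_class_kernel ** q, axis=0) ** (1.0 / q)  # Lq over classes, per kernel
    triple = np.sum(per_kernel ** p) ** (1.0 / p)                    # Lp over kernels
    return 0.5 * triple ** 2                                         # squared form of the triple norm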

2.3. Online Solution Using a Primal-Dual Framework

A primal-dual algorithmic framework [2830] is adopted to derive the optimal solution of (6).

Suppose that denotes the optimal fixed solution to the minimization problem (6); this solution is both objective and imaginary. It may be considered objective because it can be selected retrospectively, from a class of hypotheses based on complex and varied concepts of progress toward an acceptable competing hypothesis, using the entire sequence of training data pairs [28]. It may be considered imaginary because a long training period may be required to attain it. is the actual model parameter at the th iteration, and the algorithm is expected to satisfy at each iteration. Therefore, the primal problem (6) can be rewritten with constraints as (10), where is the set of all possible hypotheses. Introducing the Lagrangian multiplier yields (11). Based on the definition of the conjugate function in (12) and the equation , the dual objective function (13) can be obtained.

Up to now, the constrained quadratic programming problem in the primal domain, as in (10), has been converted into a dual objective function, as in (13).
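For reference, a generic form of this primal-dual conversion under Fenchel duality, written in our own notation (the constants and signs in (10)–(13) may differ), is the following LaTeX sketch.

% Generic constrained primal, its Lagrangian, the Fenchel conjugate, and the dual.
\begin{aligned}
&\text{Primal:} && \min_{\omega,\{\omega_t\}} \; c\,F(\omega) + \sum_{t=1}^{T} g_t(\omega_t)
   \quad \text{s.t. } \omega_t = \omega,\ t = 1,\dots,T,\\
&\text{Lagrangian:} && \mathcal{L} = c\,F(\omega) + \sum_{t} \bigl( g_t(\omega_t)
   + \langle \lambda_t,\, \omega - \omega_t \rangle \bigr),\\
&\text{Conjugate:} && F^{*}(\theta) = \sup_{\omega}\; \langle \theta, \omega \rangle - F(\omega),\\
&\text{Dual:} && \mathcal{D}(\{\lambda_t\}) = -\,c\,F^{*}\!\Bigl(-\tfrac{1}{c}\textstyle\sum_{t}\lambda_t\Bigr)
   - \sum_{t} g_t^{*}(\lambda_t).
\end{aligned}

Minimizing the Lagrangian over the primal variables and collecting terms yields the dual; the sum of the multipliers plays the role of the accumulated dual variable introduced below.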

Since (14) holds, we obtain the following equation by differentiating both sides of (14) with respect to . Similarly, since , then . We denote the sum of the current Lagrangian multipliers by ; that is, , where is the dual variable with the same triple structure as the primal variable . Thus, the solution of the primal objective is obtained in the dual domain and is formulated as . Specifically, the dual norm of (8) is formulated as follows, where .

A summary of the algorithmic framework of TripleReg-MKL is shown in Algorithm 1.

TripleReg-MKL
(1) Input ;
(2) Initialize
(3) For do
(4)  Receive a new instance randomly from
(5)  Make a prediction
(6)  Obtain the correct label
(7)  Compute inference label, that is,
(8)  if
(9)  
(10)  ;
(11)  ;
(12)  end if
(13) end for

Algorithm 1 shows that the TripleReg-MKL algorithm updates the weight through line (11). Here, is denoted by . Thus, the updating rule can be simplified to , which indicates that is determined by the kernel coefficient and the dual parameter . The relationship between the parameters during online learning is shown in Figure 1.

In Figure 1, the kernel weight is updated using information from both the newly arriving sample and all previous samples, which allows each sample to make a different contribution to the model. In other words, the correlations among samples are introduced to tune the level of sparsity in the domain of the kernels, exploiting the close relationship and high similarity among samples of the same class.

In deriving the TripleReg-MKL algorithm, it is essential to apply the kernel trick to avoid an explicit definition of and the expensive computation of inner products in a high-dimensional transformed space [32]. By setting and according to the relationship , the inner product between and can be calculated as follows.
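In practice, the kernel trick means that the scores are computed from stored kernel evaluations rather than explicit feature maps; the following sketch assumes per-instance dual coefficients and kernel combination weights (the names alpha and beta are ours, not the paper's notation).

import numpy as np

def kernelized_scores(x_new, stored_X, alpha, beta, kernel_funcs):
    # stored_X: instances retained from previous (non-conservative) updates
    # alpha: (n_stored, n_classes) dual coefficients attached to the stored instances
    # beta: (n_kernels,) kernel combination weights
    # kernel_funcs: list of functions k(x, x') -> float, one per feature clue
    n_classes = alpha.shape[1]
    scores = np.zeros(n_classes)
    for m, k in enumerate(kernel_funcs):
        k_vec = np.array([k(x_old, x_new) for x_old in stored_X])  # kernel evaluations only
        scores += beta[m] * (alpha.T @ k_vec)                      # no explicit feature map needed
    return scores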

During the training process, the TripleReg-MKL algorithm applies a conservative updating strategy. That is, an update is performed only on the current-class and interference-class models, and only when the loss function is greater than zero, as shown in line (8) of Algorithm 1. The "one positive and one negative" approach is also used to widen the gap between the correct model and the maximum-interference model.
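Putting these pieces together, the overall online loop can be sketched as below. This is a simplified rendering of Algorithm 1 that reuses the kernelized_scores helper above; the kernel-weight update is abstracted into a user-supplied function, and names such as update_kernel_weights, the unit coefficients, and the unit margin are our own illustrative choices rather than the paper's exact update rule.

import numpy as np

def train_online(stream, n_classes, n_kernels, kernel_funcs, update_kernel_weights, margin=1.0):
    # stream yields (x, y) pairs; x is any object usable by the kernel functions
    stored_X, alpha_rows = [], []                  # support instances and their coefficient rows
    beta = np.ones(n_kernels) / n_kernels          # uniform initial kernel weights
    for x, y in stream:
        if stored_X:
            alpha = np.vstack(alpha_rows)          # (n_stored, n_classes)
            scores = kernelized_scores(x, stored_X, alpha, beta, kernel_funcs)
        else:
            scores = np.zeros(n_classes)
        competing = np.delete(scores, y)
        y_bar = int(np.argmax(competing))
        y_bar = y_bar + 1 if y_bar >= y else y_bar # index of the interference class
        loss = max(0.0, margin - (scores[y] - scores[y_bar]))
        if loss > 0:                               # conservative update: skip well-classified samples
            coeff = np.zeros(n_classes)
            coeff[y], coeff[y_bar] = 1.0, -1.0     # "one positive and one negative"
            stored_X.append(x)
            alpha_rows.append(coeff)
            beta = update_kernel_weights(beta, stored_X, alpha_rows, kernel_funcs)
    alpha = np.vstack(alpha_rows) if alpha_rows else np.zeros((0, n_classes))
    return stored_X, alpha, beta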

Algorithm 1 shows that the time required by TripleReg-MKL in each iteration is dominated by line (5), which has a complexity of in the worst case, where , , and are the numbers of classes, kernels, and previously seen samples, respectively. This complexity is common to other state-of-the-art online MKL algorithms, such as OM-2 and UFO-MKL.

3. Convergence Analysis

In this section, we provide a theoretical guarantee of the convergence rate of the TripleReg-MKL algorithm. Theorem 1 is derived from the regret bound for primal-dual optimization given in Theorem 2 of [29]; the proof of Theorem 1 is given in Appendix B.

Theorem 1. represents the sequence of the function. For all , and is the -strongly convex function of . is the dual norm of . is the optimal solution to the model. If we set and , then

In the TripleReg-MKL algorithm, in (8) is a -strongly convex function with respect to the norm (see Appendix A). Suppose that ; then the gradient of the multiclass hinge-loss function defined in (7) satisfies the following bound, where . This means that the upper bound of is .

The Markov inequality [33] is introduced to account for the random choice of the sample sequence during the online learning procedure. Let ; then inequality (23) is satisfied with probability at least over the choice of a random sample sequence after iterations of the TripleReg-MKL algorithm, where and is the optimal hypothetical solution to (6) (see Appendix C). Inequality (23) shows that the upper bound on the convergence gap decreases gradually as increases. That is, the model parameter becomes increasingly close to the optimal hypothetical solution as the number of iterations grows.
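The use of the Markov inequality here follows a standard pattern: if a nonnegative random variable has expectation at most B, then it exceeds B/δ with probability at most δ. In generic notation (which may differ from that of Appendix C):

% Markov's inequality applied to an expected regret bound.
\Pr\left[ X \ge \frac{B}{\delta} \right] \;\le\; \frac{\mathbb{E}[X]}{B/\delta} \;\le\; \delta,
\qquad \text{so } X \le \frac{B}{\delta} \text{ with probability at least } 1-\delta .

Taking X to be the averaged excess objective over the random sample sequence and B the expected bound provided by Theorem 1 yields an inequality of the form of (23).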

4. Experiment

Experimental evaluation of TripleReg-MKL is presented in terms of classification performance and capacity to combine features. A comparison with four state-of-the-art online MKL algorithms, that is, OM-2 [19], UFO-MKL [20], OMCL [21], and Perceptron [23], is performed on the benchmark Caltech-101 [34], Caltech-256 [35], Oxford Flowers (102) [36], and MNIST [37] datasets. Caltech-101 [34] is a collection of 9144 images from 102 object categories; the number of images in each category varies from 40 to 800, and most categories contain 50 images. Caltech-256 [35] is an extension of Caltech-101 containing 29781 images from 256 object categories; the minimum, average, and maximum numbers of images per category are 80, 119, and 827, respectively. Oxford Flowers (102) [36] contains 8189 images covering 102 flower categories, with 40 to 258 images per class. MNIST [37] is a large dataset of 60000 training examples and 10000 test examples from 10 handwritten digit categories. The digits have been size-normalized to 28 × 28 gray-scale images and centered in fixed-size images. This dataset is well suited to testing learning techniques on real-world data since it requires minimal preprocessing and formatting effort. These four datasets are characterized by high image diversity, large sample volumes and numbers of categories, or great vagueness among classes, all of which present great challenges for classification. The code for the three comparison algorithms is obtained from DOGMA [31].

Complex but effective features, including self-similarity (SSIM) [38], geometric blur (GB) [39], CSIFT [40–43], and Oriented-PDF [44], were applied to the medium and large class-count datasets Caltech-101 and Caltech-256. For the same reasons stated in a previous study [45], SPHOG [46], local binary patterns (LBP) [47], and GIST [48] were used to describe the handwritten digits of MNIST. For Oxford Flowers (102), a -distance matrix [36, 49] was used to measure the similarity associated with four different flower features, that is, "D_SIFTint," "D_SIFTbdy," "D_HSV histogram," and "D_HOG." The corresponding kernel matrix was computed using , where is the distance and is the kernel parameter determined by cross-validation.
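The distance-to-kernel conversion described above is typically an exponential map of the distance matrix; a sketch under that assumption (kernel = exp(-mu * distance), with mu selected by cross-validation) follows.

import numpy as np

def distance_to_kernel(D, mu):
    # D: (n, n) pairwise distance matrix for one feature channel
    # mu: positive kernel parameter, selected here by cross-validation
    return np.exp(-mu * D)

def mean_distance_mu(D):
    # one common heuristic starting point: the inverse of the mean pairwise distance
    return 1.0 / np.mean(D)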

4.1. Experimental Setup

Thirty images per category were selected randomly for training from Caltech-101 and Caltech-256, and the rest were used for testing. For the Oxford Flowers (102) dataset, the predefined training and testing splits recommended in previous studies [36, 49] were used; that is, only 10 images of each class from Oxford Flowers (102) are used for training. Unless stated otherwise, the experimental process was replicated 10 times using a different random test set or sample sequence, and the averages and standard deviations are reported. To obtain better experimental results, model parameters such as and were determined using a fivefold cross-validation procedure. Given that online learning has a relatively slow convergence rate, we enlarged the training stream by cycling the training examples through multiple epochs [45].
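The epoch-cycling mentioned above simply re-presents the training set several times in shuffled order so that the online learner sees a longer sample stream; a minimal sketch (the epoch count and shuffling scheme are our choices) follows.

import random

def multipass_stream(samples, n_epochs, seed=0):
    # samples: list of (x, y) training pairs
    # yields the training set n_epochs times, reshuffled each epoch,
    # simulating a much larger online stream
    rng = random.Random(seed)
    for _ in range(n_epochs):
        order = list(samples)
        rng.shuffle(order)
        for pair in order:
            yield pair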

4.2. Experiment Results
4.2.1. Comparing the Effect of Using Single Kernels or Combining-All

In this experiment, the performance of TripleReg-MKL with all kernels combined ("combining-all") is compared with that of each single kernel. The results are shown in Table 1.

Table 1 shows that "combining-all" with the TripleReg-MKL algorithm yields a significant improvement in classification performance on every dataset compared with any single kernel. For example, the test accuracy increased by approximately 9.38% and 10.0% on the two large object-class datasets, Caltech-101 and Caltech-256, compared with the best single kernel. On the Oxford Flowers (102) dataset, "combining-all" outperformed the best single kernel by approximately 35.57%. For the largest dataset, MNIST, the error rate is used as the performance index to clarify the numerical comparison. The single SPHOG kernel achieved a low error rate of 0.72% on MNIST, but "combining-all" reduced this further to 0.65%; that is, a 9.72% relative reduction in the error rate is achieved by using the TripleReg-MKL algorithm to fuse the SPHOG, GIST, and LBP features.

4.2.2. Comparing with the State-of-the-Art Online MKL Methods

We compared the performance of TripleReg-MKL (the source code is available at https://github.com/huangshuangping/TripleReg-MKL) with three state-of-the-art online multiclass MKL algorithms, that is, OM-2 [19], UFO-MKL [20], and Perceptron [23]. Figures 2(a)–2(d) show the training sample size versus average training error rate, training sample size versus test accuracy, and training sample size versus training time curves for the four benchmark datasets. In the "training sample size versus average training error rate" plots, the online training error rate of all four algorithms follows the same trend as the number of iterations increases: the average training error rate drops sharply during the early part of the online training period and gradually stabilizes around zero as learning continues. This implies that no new prediction errors appear on the subsequent training examples from the epoch repetitions once the model reaches the steady state. In contrast, the "training sample size versus test accuracy" curves tend to rise with the number of iterations during online updating, after some early variability, until they reach a steady state. This trend agrees with the gradual optimization behavior of online learning.

A comparison of the four approaches shows that TripleReg-MKL achieves the best classification performance after reaching the steady state, as indicated by the red curve lying above all the others. The UFO-MKL algorithm ranks second in classification performance on Caltech-101, Caltech-256, and Oxford Flowers (102), but it exhibits poor stability, as demonstrated by the fluctuations in the blue curve. This occurs because UFO-MKL adopts an elastic-net form of regularization, which allows kernels to be reset directly. Such a sparsity mechanism can simplify the model significantly, but it may also cause large fluctuations in performance and slow convergence because of its aggressive reset policy. OM-2 and Perceptron deliver far lower accuracy than TripleReg-MKL on the three datasets with more than 100 classes, whereas TripleReg-MKL achieves only a slightly better performance than OM-2 on MNIST, which has relatively simple examples and few classes. This indicates the high adaptability of TripleReg-MKL to problems with relatively large numbers of classes.

The "training sample size versus training time" curves show the runtime as a function of the number of training samples. OM-2, UFO-MKL, and Perceptron are faster than TripleReg-MKL on all four datasets. This is because TripleReg-MKL uses information not only from the newly arriving sample but also from all previous samples to update the kernel weights; it therefore sacrifices runtime to achieve superior classification performance. In addition, the shape of the training-time curve differs across datasets, because the quality and distribution of the training samples vary, which determines the number of noisy samples to be filtered and thus the runtime. Nevertheless, Figure 2 also shows that the TripleReg-MKL algorithm is well suited to learning with more than one million samples. Note that the million-sample data size is simulated by means of the multipass strategy in these experiments.

Because some of the results cannot be seen clearly in Figure 2, the test accuracy of the four algorithms at the final iteration is presented in Table 2. A test accuracy close to 50% is achieved using only about 25% of the images from Caltech-256 for training, which is, to the best of our knowledge, the best performance achieved by MKL methods on the Caltech-256 dataset. The test accuracy of 88.75% on Caltech-101 is also the highest result obtained with an online algorithm. Oxford Flowers (102) is characterized by great variation within classes and vague gaps between classes; with only 12.5% of its images used for training, TripleReg-MKL obtained an accuracy of 62.56% on this benchmark.

4.2.3. Comparing with the State-of-the-Art Batch MKL Methods

We compared the performance of TripleReg-MKL with several recent batch MKL methods, including MKL-SRC [14], MKSR [15], and Soft Margin Multiple Kernel Learning [16] (abbreviated as SM1MKL and SM2MKL, corresponding to the hinge-loss and squared-hinge-loss setups, resp.). The comparison experiments are conducted on Caltech-101 and Caltech-256 because [14, 15] report results for different training conditions on these two benchmark datasets. We downloaded the SMMKL implementation code (https://sites.google.com/site/xinxingxu666/) and used a one-versus-all strategy for its multiclass extension. The optimal SVM regularization parameters for SM1MKL and SM2MKL are searched within using the median method; specifically, they are set to 10 and 2.5, respectively. Table 3 shows the test accuracy of TripleReg-MKL and the three batch MKL baselines. The gap between TripleReg-MKL and MKL-SRC or MKSR is relatively striking on both datasets, demonstrating the generalization performance of our algorithm. A similar observation holds for the comparison between TripleReg-MKL and SM1MKL on the larger-class Caltech-256 dataset. In contrast, the superiority is not obvious when comparing the test accuracy of TripleReg-MKL with that of SM1MKL and SM2MKL on Caltech-101. In summary, the proposed TripleReg-MKL compares well with state-of-the-art batch MKL methods in terms of test accuracy.
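The one-versus-all extension used for SMMKL trains one binary classifier per class and predicts the class whose classifier returns the largest decision value; the sketch below abstracts the binary trainer as a callable and is not the actual SMMKL code.

import numpy as np

def train_one_vs_all(X, y, n_classes, train_binary):
    # train_binary(X, labels) must return a decision function f(x) -> float,
    # trained with labels +1 for the target class and -1 for all others
    models = []
    for c in range(n_classes):
        labels = np.where(np.asarray(y) == c, 1.0, -1.0)
        models.append(train_binary(X, labels))
    return models

def predict_one_vs_all(models, x):
    # pick the class whose binary decision value is largest
    return int(np.argmax([f(x) for f in models]))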

5. Conclusions

In this paper, a new TripleReg-MKL algorithm was proposed. In this approach, a novel triple-norm regularizer (TripleReg) was designed for MKL, and an efficient online solution was derived. TripleReg introduces a constraint among correlated examples and allows the kernel weights to be updated using the information of all previous samples, yielding greater flexibility for tuning the level of sparsity in the domain of kernels. The online solution allows the algorithm to be readily applied, in an efficient manner, to learning problems with millions of samples. To examine the empirical performance of the proposed TripleReg-MKL algorithm, extensive experiments were conducted on a testbed of four diverse real datasets. The experimental results verified its high capacity for heterogeneous feature fusion, which is particularly important for recognizing large numbers of classes with crowded feature spaces and vague gaps between classes. It also achieved the highest classification performance compared with four state-of-the-art online MKL methods, that is, OM-2, UFO-MKL, OMCL, and Perceptron.

Appendices

A. Proof of TripleReg’s Strong Convexity

Let be a three-dimensional real matrix comprising pages, where the th page is a matrix and each column is . Denoting by , the triple norm is defined as . Based on this definition of the triple norm, we define the triple-norm regularizer, TripleReg, as (A.1), where . The Fenchel-conjugate function of (A.1) is (A.2), where .

Next, we analyze the strong convexity of (A.1) and derive its strong convexity argument. The proof begins with some mathematical definitions and tools, followed by the derivation of the strong convexity argument.

Definition A.1. A function is -strongly smooth with respect to a norm , if is differentiable everywhere and if, for all and , one has

The following theorem states that strong convexity and strong smoothness are dual properties [50].

Theorem A.2 (strong convexity/strong smoothness duality). Assume that is a closed and convex function. The Fenchel conjugate of is denoted by . Then, is -strongly convex with respect to a norm if and only if is -strongly smooth with respect to the dual norm .

The following lemma relates to the strong convexity of the vector norm. Its proof is standard and can be found, for example, in a study by Kakade et al. [50].

Lemma A.3. Let . A function defined as is -strongly convex with respect to .
Let be absolutely symmetric norms on , respectively. Their dual norms are , where . Using the duality properties between strong convexity and smoothness stated in Theorem A.2, are -strongly smooth with respect to , where their constants are .
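For completeness, the standard statements behind Definition A.1, Theorem A.2, and Lemma A.3, written in the notation of Kakade et al. [50] (which may differ from the symbols used in this appendix), are as follows.

% beta-strong smoothness (Definition A.1):
f(x + y) \;\le\; f(x) + \langle \nabla f(x),\, y \rangle + \tfrac{\beta}{2}\,\|y\|^{2}
\quad \text{for all } x, y.
% Duality (Theorem A.2): f is beta-strongly convex w.r.t. \|\cdot\| if and only if
% f^{*} is (1/beta)-strongly smooth w.r.t. the dual norm \|\cdot\|_{*}.
% Lemma A.3: for q \in (1, 2], the function
f(w) = \tfrac{1}{2}\,\|w\|_{q}^{2}
\quad \text{is } (q-1)\text{-strongly convex with respect to } \|\cdot\|_{q}.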
Next, we analyze the smoothness of the conjugate function (A.2) and obtain the corresponding argument; thus, we need to prove (A.4).

According to Theorem 13 in [50], the bi-norm regularizer function is -strongly smooth with respect to . Following the equivalent definition of strong smoothness in (A.3), we obtain

Combining the smoothness of , that is, the left side of (A.4) can be rewritten as

Next, for any and , we have . Thus, we can obtain

Combining (A.7) and (A.8) proves (A.4). Thus, our dual triple-norm regularizer (A.2) is a -strongly smooth function. Based on the duality of strong smoothness and strong convexity stated in Theorem A.2, the strong convexity argument of the triple-norm regularization in (A.1) is

In particular, the strong convexity argument of the TripleReg, as in (8), is the same as (A.9). As and , (A.9) is equivalent to .

B. Proof of Theorem 1

To complete the proof, Theorem 2 in [29] is restated as follows.

Let be a sequence of functions such that, for all , , where is -strongly convex with respect to a norm and is a convex and closed function. Then, any algorithm that can be derived from the "template algorithm for online strongly convex optimization" [29] satisfies (B.1), where and is the norm dual to .

If we let and suppose that , then the right side of (B.1) yields the following conclusion, (B.2):

Combining (B.1) and (B.2) yields

Theorem 1 is proved by dividing the left and right sides of (B.3) by .

C. Proof of the Convergence Rate

Assume that and is the optimal hypothetical solution to (6). Theorem 1 can be rewritten in a probabilistic sense as

The Markov inequality states that P(X ≥ a) ≤ E(X)/a, where X is a nonnegative random variable and a is a positive constant.

In this case, we denote by a random variable and we set . Then, the upper bound of is obtained as

Since , then . Thus, the following inequality holds with a probability of at least over the choice of random samples:

Plugging into (C.4) yields the specific conclusion for the TripleReg-MKL algorithm; that is,

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research is supported in part by the Guangdong Natural Science Funds (Grant no. S2013010014240), NSFC (nos. 31101087, 31201129, 61075021, and 61302120), National Science and Technology Support Plan (2013BAH65F01, 2013BAH65F04, and 2013BAJ13B05-01), GDSTP (no. 2012A010701001), and Research Fund for the Doctoral Program of Higher Education of China (nos. 20120172110023 and 20124404120005).