Abstract

Benefiting from the kernel trick and the sparsity-promoting prior, the relevance vector machine (RVM) acquires a sparse solution with generalization ability equivalent to that of the support vector machine (SVM). The sparse solution requires much less prediction time, making RVM attractive for classifying large-scale hyperspectral images. However, RVM is not widely used because of its slow training procedure. To solve this problem, this paper accelerates RVM classification of hyperspectral images with parallel computing techniques. The parallelization is examined from three aspects: the multiclass strategy, the ensemble of multiple weak classifiers, and the matrix operations. The parallel RVMs are implemented in the C language with the parallel functions of linear algebra packages and the message passing interface (MPI) library. The proposed methods are evaluated on the AVIRIS Indian Pines data set on a Beowulf cluster and on multicore platforms. The results show that the parallel RVMs accelerate the training procedure considerably.

1. Introduction

Along with the rich spectral information collected by the hyperspectral imager, a huge amount of training samples is required to train a classifier precisely. This problem is well known as the Hughes phenomenon [1]. Collecting enough training samples is burdensome. Therefore, designing hyperspectral image classifiers that can deal with small training sets has been a central theme over the past decade. The solutions [2–10] can be divided into four categories: (1) regularization of the sample covariance matrix; (2) feature extraction or feature selection; (3) enlarging the training set by semisupervised learning; and (4) low-complexity classifiers, such as the support vector machine (SVM). Benefiting from the kernel trick, SVM is less affected by the Hughes phenomenon. Maximizing the margins of the class pairs guarantees SVM a low training error and good generalization ability. It has been proven that SVM is superior to most supervised classifiers in classifying hyperspectral images [7–10].

Sparse Bayesian learning-based classifiers, represented by the relevance vector machine (RVM), have emerged in the remote sensing community since 2006 [11–13]. To avoid overfitting, RVM constrains the predictive model with the automatic relevance determination (ARD) framework, which promotes a sparse model. Compared with SVM, RVM can acquire a much sparser model with equivalent generalization ability. The sparse property is competitive in classifying large-scale hyperspectral images, as it requires far less prediction time.

However, RVM is not widespread, due to its slow training procedure. In each iterative step, RVM carries out transpose, multiplication, and inversion operations on a Hessian matrix whose size is determined by the number of training samples N. These operations are time consuming when N is large, making RVM inefficient for large-scale data sets. To solve the problem, Tipping and Faul proposed a fast marginal likelihood maximization method that updates one coefficient at a time [14]. However, the fast method performs a greedy search and easily gets stuck in suboptimal solutions. Lei et al. [15] avoided the expensive inversion of the Hessian matrix by substituting the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method for the iteratively reweighted least squares (IRLS) procedure. Seeger and Ribeiro applied RVM to large text sets with ensemble, boosting, and incremental methods, which adopt a divide-conquer-merge strategy and enable RVM to process more than ten thousand training samples [16]. The divide-conquer-merge strategy decreases the number of training samples each weak RVM processes, which also helps to accelerate the training procedure. Yang et al. proposed a recursive Cholesky decomposition for RVM and implemented it on GPUs. The GPU-based RVM was shown to be much faster for both single and double precision [17].

The recent development of multicore platforms and low-cost clusters has popularized parallel computing in the hyperspectral remote sensing community. Many classification, unmixing, and anomaly detection algorithms have been parallelized successfully [18–20]. These cases motivate us to accelerate RVM with parallel computing. In this paper, we design three parallel implementations of RVM. The parallelization is examined from the aspects of the matrix operations, the multiclass strategy, and the divide-conquer-merge strategy. The parallel RVMs are tested on a multicore platform and a cluster, acquiring an obvious gain in efficiency.

2. Relevance Vector Machine Classifier

For the training samples \{\mathbf{x}_n\}_{n=1}^{N} and the class labels \{t_n\}_{n=1}^{N}, t_n \in \{0, 1\}, RVM uses a linear combination of kernel functions to describe the input-to-output relationship,

y(\mathbf{x}; \mathbf{w}) = \sum_{n=1}^{N} w_n K(\mathbf{x}, \mathbf{x}_n) + w_0, \quad (2.1)

and the Bernoulli distribution to construct the probability density function

P(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} \sigma\{y(\mathbf{x}_n; \mathbf{w})\}^{t_n}\,[1 - \sigma\{y(\mathbf{x}_n; \mathbf{w})\}]^{1 - t_n}. \quad (2.2)

The symbols in (2.1) ~ (2.2) are the kernel function K(\cdot,\cdot), the weight vector \mathbf{w} = (w_0, w_1, \ldots, w_N)^{T}, and the label vector \mathbf{t} = (t_1, \ldots, t_N)^{T}. \sigma(y) = 1/(1 + e^{-y}) is the sigmoid function mapping y into (0, 1). To ensure the generalization ability, the weights are constrained by the zero-mean Gaussian prior

p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i=0}^{N} \mathcal{N}(w_i \mid 0, \alpha_i^{-1}). \quad (2.3)

Then, the posterior probability density function can be obtained by Bayes' rule,

p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}) \propto P(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w} \mid \boldsymbol{\alpha}). \quad (2.5)

Maximizing (2.5), the optimized \mathbf{w} and \boldsymbol{\alpha} can be found as follows.

Serial Binary RVM Classification
(1) Initialize the weights \mathbf{w} and the hyperparameters \boldsymbol{\alpha}.
(2) Fix \boldsymbol{\alpha} and update \mathbf{w} with the iteratively reweighted least squares (IRLS) step

\mathbf{w} = \boldsymbol{\Sigma}\, \boldsymbol{\Phi}^{T} \mathbf{B}\, \hat{\mathbf{t}}, \quad (2.6)

\boldsymbol{\Sigma} = (\boldsymbol{\Phi}^{T} \mathbf{B}\, \boldsymbol{\Phi} + \mathbf{A})^{-1} = \mathbf{H}^{-1}, \quad (2.7)

where \mathbf{A} = \mathrm{diag}(\alpha_0, \ldots, \alpha_N). The details of \boldsymbol{\Phi}, \mathbf{B}, and \hat{\mathbf{t}} could be found in [11].
(3) Fix \mathbf{w} and update \boldsymbol{\alpha} with

\alpha_i = \gamma_i / w_i^{2}, \quad (2.8)

where \gamma_i = 1 - \alpha_i \Sigma_{ii}.
(4) Repeat steps (2) ~ (3) until convergence.
(5) Classify the test samples with the estimated model.
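To make the iteration concrete, the following C sketch implements one cycle of steps (2)-(3) with the standard CBLAS and LAPACKE interfaces. It is a minimal illustration rather than the original implementation: the variable names are ours, error checks are omitted, and the weight update is written in the Newton form w <- w + Sigma (Phi'(t - y) - A w), which is algebraically equivalent to (2.6).

/* One training cycle of the serial binary RVM (steps (2)-(3) above).
 * PHI is the N x M design matrix (M = N + 1: kernel columns plus bias),
 * stored in row-major order; t holds the 0/1 labels; w and alpha have
 * length M.  Names and memory handling are illustrative only. */
#include <math.h>
#include <stdlib.h>
#include <cblas.h>
#include <lapacke.h>

static void rvm_train_cycle(int N, int M, const double *PHI,
                            const double *t, double *w, double *alpha)
{
    double *y    = malloc(N * sizeof *y);              /* sigma(PHI w)             */
    double *r    = malloc(N * sizeof *r);              /* residual t - y           */
    double *g    = malloc(M * sizeof *g);              /* gradient PHI'(t-y) - A w */
    double *BPhi = malloc((size_t)N * M * sizeof *BPhi);
    double *H    = malloc((size_t)M * M * sizeof *H);  /* Hessian, then its inverse */

    /* y = sigmoid(PHI * w) */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, N, M, 1.0, PHI, M, w, 1, 0.0, y, 1);
    for (int n = 0; n < N; ++n) y[n] = 1.0 / (1.0 + exp(-y[n]));

    /* gradient g = PHI' * (t - y) - A w, with A = diag(alpha) */
    for (int n = 0; n < N; ++n) r[n] = t[n] - y[n];
    cblas_dgemv(CblasRowMajor, CblasTrans, N, M, 1.0, PHI, M, r, 1, 0.0, g, 1);
    for (int i = 0; i < M; ++i) g[i] -= alpha[i] * w[i];

    /* Hessian H = PHI' B PHI + A, with B = diag(y (1 - y)) */
    for (int n = 0; n < N; ++n) {
        double b = y[n] * (1.0 - y[n]);
        for (int i = 0; i < M; ++i)
            BPhi[(size_t)n * M + i] = b * PHI[(size_t)n * M + i];
    }
    cblas_dgemm(CblasRowMajor, CblasTrans, CblasNoTrans, M, M, N,
                1.0, PHI, M, BPhi, M, 0.0, H, M);
    for (int i = 0; i < M; ++i) H[(size_t)i * M + i] += alpha[i];

    /* Sigma = H^{-1} via Cholesky: the costly step of (2.7) */
    LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'U', M, H, M);
    LAPACKE_dpotri(LAPACK_ROW_MAJOR, 'U', M, H, M);

    /* IRLS / Newton step, equivalent to (2.6): w <- w + Sigma * g */
    cblas_dsymv(CblasRowMajor, CblasUpper, M, 1.0, H, M, g, 1, 1.0, w, 1);

    /* alpha update (2.8): alpha_i = gamma_i / w_i^2, gamma_i = 1 - alpha_i Sigma_ii */
    for (int i = 0; i < M; ++i) {
        double gamma = 1.0 - alpha[i] * H[(size_t)i * M + i];
        alpha[i] = gamma / (w[i] * w[i] + 1e-12);   /* guard against w_i ~ 0 */
    }

    free(y); free(r); free(g); free(BPhi); free(H);
}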

3. Parallel Optimization

RVM is a binary classifier. To deal with the multiclass problem, either the One Against One (OAO) or the One Against All (OAA) strategy should be used. The multiclass RVM consists of multiple independent binary classifiers, which can be processed by multiple processing units simultaneously. In [16], Seeger and Ribeiro applied RVM to large-scale text sets with ensemble, boosting, and incremental methods. They divided the standard RVM into multiple independent local RVMs. We adopt this idea for parallelization in this paper. RVM can also be accelerated by parallelizing the expensive matrix operations in (2.6) ~ (2.8). More complicated parallelization can be realized by combining the aforementioned strategies.

The parallel RVMs in this paper focus on the training phase. The test phase is not emphasized for two reasons. First, with an equivalent amount of training and test samples, the optimization of \mathbf{w} and \boldsymbol{\alpha} is far slower than the prediction of the test samples. Second, even if tens of thousands of test samples are involved, the predictions of the test samples are always independent of each other. This is a typical data-parallel problem. It can be easily solved by dividing the test set into multiple subsets, which are predicted by multiple processing units simultaneously, as sketched below.
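As an illustration of this data-parallel scheme, the following sketch splits the test set into contiguous slices by MPI rank and gathers the predicted labels on the root process. It assumes every process already holds the trained model; predict_one() is a hypothetical helper that evaluates the model on a single sample.

/* Data-parallel prediction sketch: each MPI rank classifies its own slice
 * of the test set.  The receive buffer on rank 0 must hold size*chunk ints;
 * the padded tail of the last slice is simply ignored afterwards. */
#include <mpi.h>
#include <stdlib.h>

extern int predict_one(const double *x);   /* hypothetical single-sample predictor */

void parallel_predict(int n_test, int dim, const double *X_test, int *labels)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = (n_test + size - 1) / size;               /* samples per process */
    int begin = rank * chunk;
    int end   = begin + chunk < n_test ? begin + chunk : n_test;

    int *local = calloc(chunk, sizeof *local);            /* local label slice */
    for (int i = begin; i < end; ++i)
        local[i - begin] = predict_one(&X_test[(size_t)i * dim]);

    /* collect the equal-sized label slices on the root process */
    MPI_Gather(local, chunk, MPI_INT, labels, chunk, MPI_INT, 0, MPI_COMM_WORLD);
    free(local);
}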

3.1. Parallelizing the Multiclass Strategy

RVM deals with the multiclass problem by OAO or OAA. OAO is preferred, as it processes fewer training samples in each class pair [12]. Suppose there are C classes; the multiclass RVM by OAO consists of C(C − 1)/2 uncorrelated binary RVMs. This is a typical task-parallel problem, and we name the parallel implementation pRVM-MultiClass for short. Load balance must be handled carefully for pRVM-MultiClass. With OAO, the class pairs may differ greatly in the number of training samples. These differences cause a large variation in the CPU time consumed by the class pairs. Load balance cannot be guaranteed if the class pairs are simply distributed evenly among the processing units. To solve the problem, pRVM-MultiClass is organized in the master-slave model. All the class pairs reside in the master and wait to be sent to an idle slave. Each slave is a binary RVM. The master continuously sends the unprocessed class pairs to the idle slaves until all the class pairs have been sent out. A slave requests a new class pair from the master as soon as it becomes idle. The results of the slaves are collected and synthesized by the master into the classification map. The details of pRVM-MultiClass are given in Figure 1 and sketched in the code below.
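A minimal MPI sketch of the master and slave loops is given here. The tag values, the sentinel, and the routine train_pair() are illustrative stand-ins, and the collection of the slave results by the master is omitted for brevity.

/* Master-slave sketch of pRVM-MultiClass.  The master (rank 0) hands out
 * class-pair indices on demand; each slave trains one binary RVM per request. */
#include <mpi.h>

#define TAG_WORK 1      /* master -> slave: class-pair index       */
#define TAG_IDLE 2      /* slave  -> master: request for more work */
#define STOP    -1      /* sentinel: no class pairs left           */

extern void train_pair(int pair_id);   /* hypothetical binary-RVM trainer */

static void run_master(int n_pairs, int n_slaves)
{
    int next = 0, active = n_slaves, dummy;
    MPI_Status st;
    while (active > 0) {
        /* wait for any slave to report that it is idle */
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_IDLE,
                 MPI_COMM_WORLD, &st);
        int msg = (next < n_pairs) ? next++ : STOP;
        MPI_Send(&msg, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
        if (msg == STOP) --active;     /* that slave will terminate */
    }
}

static void run_slave(void)
{
    int ready = 0, pair;
    for (;;) {
        MPI_Send(&ready, 1, MPI_INT, 0, TAG_IDLE, MPI_COMM_WORLD);
        MPI_Recv(&pair, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (pair == STOP) break;
        train_pair(pair);              /* train the binary RVM for this pair */
    }
}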

3.2. Parallelizing Multiple Weak RVMs

RVM is extended to process large-scale text data sets with the incremental, boosting, and ensemble methods [16]. All these methods are based on the divide-conquer-merge strategy. The entire training set is split into multiple small subsets. Each subset is used to train a weak RVM, which then classifies the test set. The weak RVMs cause a decrease in precision compared with the serial RVM. The loss can be compensated by integration methods such as majority voting. The ensemble method has been proven to be superior to the other two [16]. Thus, it is used to construct the weak classifiers in this paper, and we name the parallel implementation pRVM-Ensemble. Figure 2 shows the flow of pRVM-Ensemble. Each process randomly extracts p% of the training samples from each class to train a local RVM. The class labels of the test set are inferred by the weak RVMs separately and then fused by majority voting, as sketched below.
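The fusion step can be written compactly. The sketch below, with illustrative names, counts the votes of the K weak classifiers for every test sample and keeps the majority label.

/* Majority voting over the predictions of K weak RVMs (sketch).
 * pred[k][i] is the label that weak classifier k assigns to test sample i;
 * labels are assumed to lie in 0 .. n_class - 1. */
#include <stdlib.h>
#include <string.h>

void majority_vote(int K, int n_test, int n_class, int **pred, int *out)
{
    int *votes = malloc(n_class * sizeof *votes);
    for (int i = 0; i < n_test; ++i) {
        memset(votes, 0, n_class * sizeof *votes);
        for (int k = 0; k < K; ++k)
            ++votes[pred[k][i]];                /* tally the K weak labels */
        int best = 0;
        for (int c = 1; c < n_class; ++c)
            if (votes[c] > votes[best]) best = c;
        out[i] = best;                          /* fused ensemble label */
    }
    free(votes);
}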

pRVM-Ensemble is influenced by two parameters: the sampling rate p and the number of weak classifiers K. Suppose that the training set consists of N samples and the serial RVM consumes T seconds. With large N, T is dominated by the inversion of the Hessian matrix H in (2.7), whose complexity is O(N^3). Ignoring the minor parts, the cost of one weak RVM can be approximated by p^3 T (with the sampling rate p expressed as a fraction). Therefore, the speedup ratio of pRVM-Ensemble with respect to the serial RVM is approximately S ≈ m / (K p^3), where m is the number of processes. Small p and K guarantee a high speedup ratio at the cost of losing accuracy; conversely, large p and K decrease the misclassifications but at a relatively low efficiency. Suppose that the K weak RVMs each hold a copy of the entire training set (p = 100%) and are distributed to K processes. In this extreme case, neither efficiency nor precision is improved by pRVM-Ensemble. Fine tuning of the parameters is necessary to balance the trade-off between efficiency and precision.
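As a rough worked example, using the values that appear later in Section 4.1 (T = 3402.9 s for the serial RVM, K = 32 weak RVMs, p = 20%, and m = 16 processes), the approximation gives

K\,p^{3}\,T = 32 \times 0.2^{3} \times 3402.9\ \mathrm{s} \approx 871.1\ \mathrm{s},
\qquad
S \approx \frac{m}{K\,p^{3}} = \frac{16}{32 \times 0.008} = 62.5,

that is, the 32 weak RVMs together cost roughly a quarter of the serial training time, and distributing them over 16 processes yields a superlinear speedup with respect to the serial RVM.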

3.3. Parallelizing the Matrix Operations

The training time of RVM is dominated by the matrix operations in (2.6) ~ (2.8), especially the expensive inversion of the Hessian matrix H. It can easily be accelerated by substituting parallel matrix functions for the serial ones. We name this parallelization pRVM-MatOp. The parallel functions are available in many parallel linear algebra packages, such as Intel's Math Kernel Library (MKL) and the Automatically Tuned Linear Algebra Software (ATLAS). These packages provide optimized matrix multiplication and inversion functions for large-scale matrices. Although the parallel functions are not developed by us, pRVM-MatOp is emphasized for two reasons. First, the parallel matrix functions are implicitly controlled by the well-developed packages. Researchers can easily implement pRVM-MatOp on multicore platforms, even if they are not familiar with parallel computing. Second, pRVM-MatOp can be combined with pRVM-MultiClass and pRVM-Ensemble for better efficiency. This will be discussed in Section 3.4.
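As an illustration of how little explicit parallel code this requires, the sketch below inverts the symmetric positive definite Hessian through the standard LAPACKE interface. When the program is linked against a threaded BLAS/LAPACK such as MKL or ATLAS, the factorization and inversion run on all available cores without further changes; the function name and error handling are ours.

/* Sketch: inverting the symmetric positive definite Hessian H (row-major,
 * M x M, overwritten in place) with LAPACKE.  Linked against a threaded
 * BLAS/LAPACK, these two calls exploit the cores of the multicore platform. */
#include <lapacke.h>

int invert_hessian(int M, double *H)
{
    /* Cholesky factorization H = U' U, then inversion from the factor */
    int info = LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'U', M, H, M);
    if (info != 0) return info;          /* H not positive definite */
    return LAPACKE_dpotri(LAPACK_ROW_MAJOR, 'U', M, H, M);
    /* only the upper triangle now holds the inverse; mirror it if the
     * full matrix is required */
}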

3.4. Hybrid Parallel RVM

The parallel implementations can be further optimized by combination. pRVM-Ensemble uses OAO to deal with the multiclass problem in each weak RVM, so combining the multiclass parallel strategy with the local weak RVMs can accelerate pRVM-Ensemble. In addition, pRVM-MultiClass and pRVM-Ensemble consist of multiple standard binary RVMs, which are also burdened by the expensive matrix operations in (2.6) ~ (2.8). Therefore, pRVM-MatOp can be combined with pRVM-MultiClass or pRVM-Ensemble to optimize the binary RVMs. A more complex case is the combination of all three parallel implementations: the ensemble strategy decomposes the serial RVM into multiple independent subtasks, the multiclass parallel strategy optimizes the local weak RVMs, and the binary RVMs are accelerated by the parallel matrix functions. However, an overly complex parallel strategy may decrease the efficiency. The ideal case is the combination of pRVM-MatOp with pRVM-MultiClass or pRVM-Ensemble: RVM is first globally separated into multiple uncorrelated subtasks by the multiclass or ensemble strategy, and then the parallel matrix functions optimize the local subtasks, as sketched below. This hybrid structure is popular in parallel programming and is well suited to clusters of multicore computers.
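A minimal skeleton of such a hybrid layout is sketched here: MPI distributes the coarse-grained subtasks (class pairs or weak RVMs) over the nodes, while the threaded BLAS/LAPACK inside train_subtask(), a hypothetical routine, uses the cores within a node. For brevity it uses a static round-robin assignment instead of the master-slave model of Section 3.1, and the number of BLAS threads per process is assumed to be set externally (e.g., via OMP_NUM_THREADS or MKL_NUM_THREADS).

/* Hybrid sketch: MPI across nodes, threaded BLAS/LAPACK within each node. */
#include <mpi.h>

extern void train_subtask(int id);    /* one weak RVM or one class pair */

int main(int argc, char **argv)
{
    int rank, size;
    int n_subtasks = 36;              /* e.g. 9*(9-1)/2 class pairs for nine classes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* static round-robin assignment; each call uses the threaded BLAS locally */
    for (int id = rank; id < n_subtasks; id += size)
        train_subtask(id);

    MPI_Finalize();
    return 0;
}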

4. Experiments

To evaluate the proposed methods, we carried out several experiments on the data acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over the Indian Pines test site in 1992 [21]. The scene covers a 145 × 145 pixel region with 220 bands ranging from 0.4 to 2.5 μm. It contains sixteen different land covers, some of which are hard to separate. Seven classes were discarded because of insufficient labeled samples. The remaining nine classes contain 8489 samples, which were divided into the training and test sets. The details of the data set are given in Figure 3 and Table 1.

The algorithms were implemented in the C language, based on the Matlab code of Tipping. The RBF kernel was adopted. The width of the RBF kernel was optimized by 5-fold cross-validation and grid search for the serial RVM, and the best width was used directly in the parallel RVMs. We focus on the efficiency of the RVMs with the best width and ignore the cost of optimizing this parameter.

4.1. Evaluating pRVMs on the Beowulf Cluster

First, we evaluated the parallel RVMs on a sixteen-node Beowulf cluster running the CentOS Linux operating system. The nodes are connected by Gigabit Ethernet. Each node contains one 2 GHz AMD processor and 1 GB of memory. pRVM-MultiClass and pRVM-Ensemble were explicitly parallelized with the message passing interface (MPI). The matrix functions come from the Basic Linear Algebra Subprograms (BLAS) and the Linear Algebra Package (LAPACK). The Scalable LAPACK (ScaLAPACK) library provided pRVM-MatOp with parallel matrix functions in the cluster environment. For better performance, we compiled the parallel RVMs with the optimized BLAS package ATLAS.

Table 2 shows the efficiency of the parallel RVMs. The serial RVM takes 3402.9 seconds. The time costs of pRVM-MultiClass and pRVM-MatOp are equal to that of the serial RVM when one CPU is involved. pRVM-Ensemble was tested with three different sampling rates, ranging from 20% to 60%. K was set to 32 for load balance purposes. The CPU time of pRVM-Ensemble can be estimated by K p^3 T, which is 871.1 seconds for p = 20%, 6969.1 seconds for p = 40%, and 23520.8 seconds for p = 60%. The approximations are on the same order of magnitude as the measured values in Table 2. The time costs of the parallel RVMs decrease rapidly when more CPUs are involved.

Figure 4 plots the speedup ratio curves of the parallel RVMs. Benefiting from the small sampling rate, pRVM-Ensemble acquires superlinear speedup ratios when p = 20%. The ratios drop beneath the ideal cases for the larger sampling rates. The curve of pRVM-MultiClass flattens as the number of CPUs increases. With fewer than 16 CPUs, pRVM-MultiClass is faster than pRVM-Ensemble (p = 40% and p = 60%). However, it falls beneath pRVM-Ensemble (p = 40%) with 16 CPUs. This phenomenon is caused by the unbalanced class pairs of the training set. Table 3 shows the time cost of training each class pair with the binary RVM. Classes 3 and 5 have the smallest numbers of samples, and it takes only 6.3 seconds to train their binary RVM. The features of class 7 and class 9 are quite similar, and they contain almost the most samples; it takes 575.3 seconds to train their binary RVM. Because of this huge difference, the computing nodes cannot acquire the same amount of work. The load imbalance is eased by the master-slave model in Figure 1, but it can never be totally eliminated, especially when more CPUs are involved. pRVM-MatOp is more effective than pRVM-MultiClass and pRVM-Ensemble (p = 40% and p = 60%). To implement the parallel matrix functions on the cluster, ScaLAPACK introduces extra network overhead for data exchange, which causes a slight downtrend in the speedup ratio curve of pRVM-MatOp. The downtrend would become more pronounced with more CPUs.

Table 4 lists the overall classification accuracies of the parallel RVMs. pRVM-MultiClass and pRVM-MatOp acquire the same accuracy as the serial RVM, because they do not alter the logic of the standard RVM. The training subsets of the weak RVMs are randomly sampled in pRVM-Ensemble, which causes differences among the accuracies of the local classifiers. When p equals 20%, the minimum accuracy of the weak RVMs is 81.67% and the maximum is 84.43%, a gap of 2.76%. The gap gradually decreases as the sampling rate increases. The ensemble accuracy after majority voting is better than the minimum, maximum, and average accuracies of the weak RVMs, and it exceeds that of the serial RVM once the sampling rate is large enough. The accuracy does not increase linearly with the sampling rate for pRVM-Ensemble. The increment is 2.85% when p varies from 20% to 40%; it drops to 0.73% for the 40%–60% case and further decreases to 0.19% when p reaches 80%. However, such a large p produces a huge growth in the time cost. Taking both efficiency and accuracy into consideration, the moderate sampling rates are preferred.

We also measured the performance of the parallel RVMs for different data sizes. Table 5 shows the speedup ratios of the parallel implementations when 16 CPUs are involved. The second column of the table gives the number of samples used to train the classifiers; the number in parentheses is the percentage of the used training samples relative to the entire training set in Table 1. The cost of the serial RVM does not scale linearly with the number of training samples. Therefore, the speedup ratio of pRVM-Ensemble declines as the training set shrinks. pRVM-MultiClass is stable when the data size varies. The speedup ratio of pRVM-MatOp decreases markedly when fewer training samples are used; it drops to 2.3 for the smallest data set, DS-A. The reduction can be explained from three aspects. First, the parallel matrix functions used in pRVM-MatOp are designed for large matrices and are not suitable for data sets that are too small. Second, the extra overhead caused by the parallelization has a prominent negative impact for small training sets. Third, pRVM-MatOp only parallelizes the matrix operations in (2.6) ~ (2.8); for small data sets, the unparallelized parts cannot be neglected. These factors greatly reduce the efficiency of pRVM-MatOp once the data size decreases. The experiment indicates that pRVM-MultiClass is basically immune to the data size, whereas pRVM-Ensemble and pRVM-MatOp are not suitable for data sets that are too small.

4.2. Evaluating pRVM-MatOp on the Multicore Platforms

pRVM-MatOp was also tested on two multicore platforms. One is a server with a two-core Intel E5500 processor and 2 GB of memory. The other is a workstation with a four-core Intel Q9400 CPU and 4 GB of memory. pRVM-MatOp was compiled and linked with Intel's compiler under Visual Studio 2008. The parallel matrix functions come from the MKL package. The results are given in Table 6.

As in the experiment of Table 5, we extracted different amounts of samples from the entire training set to assess the parallel matrix functions. The time costs with one core are equal to those of the serial RVM on each platform. As the data size decreases, the reduction of the speedup ratio is not significant on the two-core platform, because only two cores are involved. On the four-core platform, however, more processing units are involved, and the speedup ratio decreases dramatically when the training set is small. This can also be explained by the three factors mentioned above. For the large training set, the cost is decreased significantly. Learning with the entire training set, RVM is improved from 1354.9 seconds to 1028.3 seconds on the two-core E5500 server, a speedup ratio of 1.32. The time is further decreased from 1413.5 s to 559.3 s on the four-core Q9400 workstation, a speedup ratio of 2.53. The multicore platforms exchange data through the internal bus, which is much faster than the network of the cluster. Therefore, the parallel efficiencies (speedup ratio divided by the number of CPUs) of pRVM-MatOp on the multicore platforms are slightly better than their counterparts on the cluster. Considering the limited computing resources of the multicore platforms, pRVM-MultiClass and pRVM-Ensemble are not included in this part.

4.3. Further Discussion

The parallel implementations are summarized and compared in Table 7. pRVM-MatOp and pRVM-MultiClass have the same logic as the serial RVM; therefore, they cannot handle very large training sets. pRVM-Ensemble solves this problem by splitting the large training set into small subsets. pRVM-MatOp is controlled by the linear algebra packages; the parallelization is implicit for the researcher, making pRVM-MatOp easy to use. pRVM-MultiClass and pRVM-Ensemble are explicitly parallelized by the programmer with the send and receive functions of the MPI library, and designing this kind of parallel algorithm is rather difficult. Affected by the unbalanced class pairs, pRVM-MultiClass cannot achieve load balance, which further causes poor scalability and low parallel efficiency. pRVM-Ensemble is load balanced as long as p and K are set properly; it is scalable and can achieve a high parallel efficiency. The load balance of pRVM-MatOp is controlled by the linear algebra packages. Its scalability and efficiency are also unsatisfactory, because of the extra overhead and the unparallelized nonmatrix operations.

Considering its ease of use, pRVM-MatOp is preferred when a multicore platform is available; it can be realized easily and achieves a satisfactory speedup. For large-scale training sets, pRVM-Ensemble on the cluster is highly recommended because of its good scalability and high parallel efficiency.

5. Conclusion

Parallel computing is used in this paper to accelerate RVM classification of hyperspectral images. The parallelization is discussed from the aspects of the matrix operations, the multiclass strategy, and the ensemble strategy. Evaluated on the AVIRIS data, the parallel RVMs are proved to be effective: the training procedure is accelerated when more cores or CPUs are involved. Future improvements could be carried out by designing and evaluating the hybrid structure, which is not included in the experiments due to the lack of a testing platform. RVM could be parallelized by the multiclass or the ensemble strategy in the global view and by the parallel matrix functions in the local view. The hybrid structure is suitable for clusters of multicore platforms, which is the trend of the supercomputer.

Acknowledgments

This work is supported by the Fundamental Research Funds for the Central Universities (no. 2012ZM0100), the China Postdoctoral Science Foundation funded project (no. 20100480750), and the Key Laboratory of Autonomous Systems and Network Control, Ministry of Education.