#### Abstract

Multiview active learning (MVAL) can shrink the version space far more than traditional active learning and has great potential in large-scale data analysis. This paper studies MVAL-based scene classification, which helps a computer understand diverse and complex environments at a macroscopic level and is widely used in fields such as image retrieval and autonomous driving. The main contribution of this paper is that different high-level image semantics replace the traditional low-level features to generate more independent and diverse hypotheses in MVAL. First, our algorithm uses different object detectors to obtain local object responses in the scenes. Furthermore, we design a cascaded online LDA model for mining the theme semantics of an image. The experimental results demonstrate that the proposed theme modeling strategy is suited to large-scale data learning, and that our MVAL algorithm with both high-level semantic views achieves a significant improvement in scene classification over traditional active learning-based algorithms.

#### 1. Introduction

Scene classification is the task of having a computer recognize the class of an image scene. Related research can be roughly divided into two branches: some studies focus on fast holistic scene perception based on visual psychology and physiology [1, 2], while others build statistical models through local image analysis to understand the scene, which is also the main development trend [3–5]. Image representation is a key step in scene classification, and many methods for it have been proposed in the past two decades. Low-level features such as color, texture, and edge have been widely used to represent the local regions of an image. Some researchers trained object detectors to obtain high-level semantics such as an object's class, size, and shape for more accurate image representation [6, 7]. The prevailing statistical models are bag-of-words (BoW) and the related theme models, such as pLSA [8] and LDA [9], which reduce the gap between low-level features and high-level semantics by mining the hidden themes from local image regions. Other scene statistical models [10–12] were proposed for more accurate object recognition in the scene. However, the models mentioned above mainly focus on the occurrence of image semantics and usually ignore the spatial semantic correlations between different image regions.

To mine the spatial context information of an image, some researchers considered the information interaction between different spatial pyramid levels [13–15], and well-designed attention mechanisms can also significantly improve scene classification. These methods use deep neural networks, whose large-scale parameter estimation usually leads to much higher computational complexity than methods that are not based on deep learning.

Active learning ranks the unlabeled samples iteratively and selects only the samples that have high uncertainty or cause great ambiguity for the classifier. In PAC learning theory, compared with traditional passive learning, it can exponentially reduce the sample complexity needed to learn a classifier with a given expected classification error [16–18], which gives it good potential for wide application in large-scale data learning. However, most traditional active learning algorithms generate their hypotheses from low-level image features, and the resulting lack of hypothesis diversity limits their performance. This paper proposes an MVAL-based scene classification algorithm, which uses different high-level semantics as its views and can shrink the version space by more than half, and which is more efficient than both single-hypothesis-based and committee-based active learning [19].

#### 2. Materials and Methods

##### 2.1. Proposed Algorithm

The flowchart of our proposed algorithm is illustrated in Figure 1. Our algorithm uses different high-level semantics as its views to generate the corresponding hypotheses. First, object detectors are trained to obtain the responses of different object classes in image regions. Furthermore, we design a cascaded online LDA (CO-LDA) as a second view for more accurate image representation. Finally, a fine-tuned MVAL algorithm uses the two high-level image semantics as its views to classify the scene of an image.

##### 2.2. Object Semantic-Based Image Representation

Our object semantic-based image representation is illustrated in Figure 2.

First, multiple object detectors are used to obtain the local object response maps. Second, these maps are decomposed into three spatial pyramid levels, and the maximal object response is computed for the image blocks in each spatial level (marked as red blocks in Figure 2). Finally, an object response histogram is computed, which effectively reduces the influence of object response errors over the whole image. To generate the object responses, a latent SVM-based detector [7] is applied to recognize bulk-type object classes such as car and pedestrian, and a geometric context-based detector [6] is used to recognize object classes distinguished by texture, such as tree, sky, and building.
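The pooling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the grid sizes 1×1, 2×2, and 4×4 for the three pyramid levels and the function name `pyramid_max_pool` are assumptions, and the response map is taken to be a plain 2D list of detector scores.

```python
from typing import List, Sequence


def pyramid_max_pool(response: List[List[float]],
                     grids: Sequence[int] = (1, 2, 4)) -> List[float]:
    """Max-pool one object response map over a spatial pyramid.

    `grids` lists the number of blocks per side at each level
    (assumed here to be 1x1, 2x2, 4x4 for the three levels).
    The map must be at least max(grids) pixels on each side.
    """
    height, width = len(response), len(response[0])
    pooled = []
    for g in grids:
        for by in range(g):
            for bx in range(g):
                # Integer block boundaries that tile the whole map.
                y0, y1 = by * height // g, (by + 1) * height // g
                x0, x1 = bx * width // g, (bx + 1) * width // g
                pooled.append(max(response[y][x]
                                  for y in range(y0, y1)
                                  for x in range(x0, x1)))
    return pooled


# Toy 4x4 response map for a single (hypothetical) detector.
toy_map = [[0.1, 0.2, 0.3, 0.4],
           [0.5, 0.6, 0.7, 0.8],
           [0.9, 1.0, 1.1, 1.2],
           [1.3, 1.4, 1.5, 1.6]]
toy_feature = pyramid_max_pool(toy_map)
```

With these assumed grids, each detector contributes 1 + 4 + 16 = 21 pooled responses; concatenating the vectors of all detectors would give the object-semantic descriptor of the image.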

##### 2.3. Theme Semantic-Based Image Representation

To accommodate the dynamic updates of an active learning training set, an online LDA model [20] based on stochastic gradient descent is used. It adds new samples sequentially and no longer stores old samples, which allows efficient and accurate parameter estimation in large-scale data training.

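Online LDA computes the posterior probability distribution of the hidden nodes based on observed samples. Following [20], it uses variational inference to bound the log-likelihood $\log p(w \mid \alpha, \eta)$ of the observed words $w$ given the hyperparameters $\alpha$ and $\eta$. Three variational parameters $\phi$, $\gamma$, and $\lambda$ define the distributions $q(z_{di} = k) = \phi_{d w_{di} k}$, $q(\theta_d) = \mathrm{Dirichlet}(\theta_d; \gamma_d)$, and $q(\beta_k) = \mathrm{Dirichlet}(\beta_k; \lambda_k)$. The variational distribution follows

$$q(z, \theta, \beta) = \prod_{d} \Big[ q(\theta_d) \prod_{i} q(z_{di}) \Big] \prod_{k} q(\beta_k). \tag{1}$$

The optimal $q$ is solved by maximizing the lower bound in the following equation:

$$\log p(w \mid \alpha, \eta) \ge \mathcal{L}(w, \phi, \gamma, \lambda) \triangleq \mathbb{E}_q[\log p(w, z, \theta, \beta \mid \alpha, \eta)] - \mathbb{E}_q[\log q(z, \theta, \beta)], \tag{2}$$

where $\mathbb{E}_q$ denotes the mathematical expectation under $q$. Maximizing the lower bound is equivalent to minimizing the KL divergence between $q(z, \theta, \beta)$ and $p(z, \theta, \beta \mid w, \alpha, \eta)$:

$$\log p(w \mid \alpha, \eta) = \mathcal{L}(w, \phi, \gamma, \lambda) + \mathrm{KL}\big(q(z, \theta, \beta) \,\|\, p(z, \theta, \beta \mid w, \alpha, \eta)\big), \tag{3}$$

where $\mathcal{L}$ is factorized as follows:

$$\mathcal{L} = \sum_{d} \Big\{ \mathbb{E}_q[\log p(w_d \mid \theta_d, z_d, \beta)] + \mathbb{E}_q[\log p(z_d \mid \theta_d)] - \mathbb{E}_q[\log q(z_d)] + \mathbb{E}_q[\log p(\theta_d \mid \alpha)] - \mathbb{E}_q[\log q(\theta_d)] + \big(\mathbb{E}_q[\log p(\beta \mid \eta)] - \mathbb{E}_q[\log q(\beta)]\big)/D \Big\}. \tag{4}$$

Equation (4) can be transformed into formula (5):

$$\mathcal{L} = \sum_{d} \ell(n_d, \phi_d, \gamma_d, \lambda). \tag{5}$$

In equation (5), $n_{dw}$ denotes the frequency with which word $w$ occurs in text $d$. $\ell(n_d, \phi_d, \gamma_d, \lambda)$ reflects the contribution of text $d$ to the lower bound, which is iteratively optimized by a coordinate ascent algorithm:

$$\phi_{dwk} \propto \exp\{\mathbb{E}_q[\log \theta_{dk}] + \mathbb{E}_q[\log \beta_{kw}]\}, \quad \gamma_{dk} = \alpha + \sum_{w} n_{dw} \phi_{dwk}, \quad \lambda_{kw} = \eta + \sum_{d} n_{dw} \phi_{dwk}. \tag{6}$$

The expectations $\mathbb{E}_q[\log \theta_{dk}]$ and $\mathbb{E}_q[\log \beta_{kw}]$ are iteratively solved in the following way: $\mathbb{E}_q[\log \theta_{dk}] = \Psi(\gamma_{dk}) - \Psi\big(\sum_{i} \gamma_{di}\big)$ and $\mathbb{E}_q[\log \beta_{kw}] = \Psi(\lambda_{kw}) - \Psi\big(\sum_{i} \lambda_{ki}\big)$, where the digamma function $\Psi$ is the first-order derivative of the function $\log \Gamma$.

When the $t$-th vector of word frequencies $n_t$ is observed, we keep $\lambda$ unchanged and update the locally optimal solution of $\gamma_t$ and $\phi_t$ in the *E* step. In the *M* step, $\lambda$ from the last iteration and $\tilde{\lambda}$ are both used to update $\lambda$:

$$\lambda \leftarrow (1 - \rho_t)\lambda + \rho_t \tilde{\lambda}, \quad \rho_t = (\tau_0 + t)^{-\kappa}. \tag{7}$$

$\tilde{\lambda}$ in formula (7) is solved as follows:

$$\tilde{\lambda}_{kw} = \eta + \frac{D}{S} \sum_{s} n_{tsw} \phi_{tswk},$$

where $n_{ts}$ is the $s$-th text in each batch text set, $D$ is the number of texts in the training set, and $S$ is the size of each batch text set. Hyperparameters $\alpha$ and $\eta$ are updated by the Newton–Raphson method: $\alpha \leftarrow \alpha - \rho_t \tilde{\alpha}(\gamma_t)$ and $\eta \leftarrow \eta - \rho_t \tilde{\eta}(\lambda)$. Here, $\tilde{\alpha}(\gamma_t)$ is the product of the inverse Hessian matrix and the gradient of the objective function $\ell(n_t, \phi_t, \gamma_t, \lambda)$ with respect to $\alpha$, and $\tilde{\eta}(\lambda)$ is the product of the inverse Hessian matrix and the gradient of the objective function $\mathcal{L}$ with respect to $\eta$.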
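The *M* step above blends the previous topic parameters with an estimate computed from the newest batch. A minimal sketch of that blend is given below, assuming the standard online LDA form from [20]: a step size $\rho_t = (\tau_0 + t)^{-\kappa}$ and the update $\lambda \leftarrow (1 - \rho_t)\lambda + \rho_t\tilde{\lambda}$. The function names and the default values `tau0=1.0` and `kappa=0.7` are illustrative assumptions, not the settings used in this paper.

```python
from typing import List

Matrix = List[List[float]]


def learning_rate(t: int, tau0: float = 1.0, kappa: float = 0.7) -> float:
    """Step size rho_t = (tau0 + t)^(-kappa).

    tau0 >= 0 slows down the early iterations; kappa in (0.5, 1]
    controls how quickly old information is forgotten.
    """
    return (tau0 + t) ** (-kappa)


def blend_topics(lam: Matrix, lam_tilde: Matrix, rho: float) -> Matrix:
    """M step: weighted average of the old topics and the batch estimate."""
    return [[(1.0 - rho) * old + rho * new
             for old, new in zip(old_row, new_row)]
            for old_row, new_row in zip(lam, lam_tilde)]


# One topic over a two-word vocabulary (toy numbers).
lam = [[1.0, 2.0]]
lam_tilde = [[5.0, 6.0]]
updated = blend_topics(lam, lam_tilde, rho=0.25)
```

Because the step size decays with `t`, later batches move the topics less and less, which is what lets the model discard old samples after they have been processed.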

Based on online LDA, we propose the CO-LDA theme model, whose structure is similar to that of the classic SP-pLSA model and which enhances the spatial correlation between different image regions. The framework of CO-LDA is illustrated in Figure 3. The main difference between CO-LDA and SP-pLSA is that separate online LDA models (LDA1, LDA2, and LDA3) are applied at different spatial levels to jointly mine the theme of an image. The main advantage of CO-LDA is that it integrates the spatial correlation of objects at different image resolutions, which further improves holistic scene understanding. The visual histogram in online LDA is computed in the same way as the object response histogram in Section 2.2, and the theme feature of each spatial block is represented by a variational parameter of the online LDA model.

Finally, the theme feature of the whole image is achieved by concatenating the theme features of different blocks of different spatial pyramid levels:where denotes the theme feature of the corresponding block in ^{th} pyramid level, denotes the linear concatenation between feature vectors, and the weights of different spatial levels are configured as follows: .
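The concatenation described above can be sketched as follows. This is only an illustration: the level weights passed in the example are hypothetical placeholders, and `concat_theme_features` is a name introduced here.

```python
from typing import List, Sequence


def concat_theme_features(levels: Sequence[Sequence[Sequence[float]]],
                          weights: Sequence[float]) -> List[float]:
    """Concatenate per-block theme features across pyramid levels.

    levels[l][b] is the theme feature vector of block b at level l;
    every vector at level l is scaled by weights[l] before being appended.
    """
    if len(levels) != len(weights):
        raise ValueError("one weight is required per pyramid level")
    feature: List[float] = []
    for blocks, weight in zip(levels, weights):
        for block in blocks:
            feature.extend(weight * value for value in block)
    return feature


# Two toy levels with two themes per block; the weights are hypothetical.
toy_levels = [
    [[0.5, 0.5]],                # level 0: one block
    [[1.0, 0.0], [0.0, 1.0]],    # level 1: two blocks (toy layout)
]
image_feature = concat_theme_features(toy_levels, weights=[0.5, 1.0])
```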

##### 2.4. Multiview Active Learning

The MVAL used in this paper comes from our previous work [21], which improves both hypothesis generation and selective sampling. First, a boosting-like technique is integrated into MVAL: following the iterative optimization of weak classifiers, the current hypothesis is boosted by weighted voting over all the hypotheses from past queries. Second, an adaptive hierarchical competition sampling strategy is presented. In this strategy, if the number of contention samples is large, unsupervised spectral clustering is activated to obtain the coarse spatial distribution of these samples in the high-dimensional feature space; then, a multiview batch-mode selective sampling is run, which solves a quadratic program over two measures, sample uncertainty and redundancy, to determine the queried samples in each cluster.

###### 2.4.1. Hypothesis Generation

If an active learner can select enough contention samples to improve the hypothesis in each query, the number of incorrectly classified unlabeled samples will decrease. This is quite similar to weak classifier optimization in boosting. The MVAL therefore incorporates the AdaBoost algorithm into its framework to boost the hypothesis generated in each query, and the main flowchart is described in Figure 4.

In Figure 4, a support vector machine (SVM) is used as the base classifier to construct a multiview classifier, which replaces the single-view classifier in AdaBoost; the multiview classifier of each query plays the role of the weak classifier of each AdaBoost iteration. The hypothesis of the multiview classifier is computed by weighted voting of the SVM base classifiers. Unlike traditional query by boosting, we update the weight of each base classifier in each query and obtain the boosted hypothesis by weighting all the hypotheses from past queries, not only the hypothesis from the current query.

The detailed process of the MVAL's hypothesis generation based on AdaBoost is as follows:

(a) In iteration , weighted voting is used to generate the initial multiview-based hypothesis: where is the classification confidence of sample by view , and denotes the contribution of view for classification which is determined by the soft classification error rate , which defines how correctly a sample is classified: where and denote the sum of classification confidence of unlabeled samples, which are labeled as and , respectively. For a “positive/negative” sample, its distance to the decision boundary on the “positive/negative” side reflects how correctly it is classified, and this information is used here to calculate the error degree instead of the traditional classification error calculated by the decision hypothesis in AdaBoost. Also, is updated through the following way: , where is the normalized weight. Then, the classification confidence of the multiview classifier can be computed by the following equation:

(b) After iteration , the size of the labeled sample set is increased as follows: . denotes the labeled sample set in iteration , and denotes the newly added samples after query. The labeled sample set grows with each iteration of active learning. Thus, if the size of the initial labeled training set is small, the influence of should be considered when updating the weight of the multiview classifier, which is illustrated by the following equation: Then, the weight of each sample is updated through the following way: , where , if is correctly classified, , otherwise, .

(c) The final boosted hypothesis of the queried sample is equivalent to the weighted sum of all the hypotheses from the past queries, which is defined by
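The voting structure of steps (a) and (c) can be sketched as follows. This is a simplified stand-in rather than the method of [21]: it uses the classic AdaBoost weight $\tfrac{1}{2}\ln\frac{1-\varepsilon}{\varepsilon}$ for each view, whereas the paper derives the view weights from a soft, confidence-based error rate. All function names are hypothetical.

```python
import math
from typing import Sequence


def view_weight(error: float, eps: float = 1e-10) -> float:
    """Classic AdaBoost weight 0.5 * ln((1 - e) / e) (assumed form)."""
    error = min(max(error, eps), 1.0 - eps)  # avoid log(0) and division by 0
    return 0.5 * math.log((1.0 - error) / error)


def multiview_confidence(confidences: Sequence[float],
                         errors: Sequence[float]) -> float:
    """Step (a): weighted vote of the per-view confidences for one sample."""
    return sum(view_weight(e) * c for c, e in zip(confidences, errors))


def boosted_label(query_confidences: Sequence[float],
                  query_weights: Sequence[float]) -> int:
    """Step (c): sign of the weighted sum over the hypotheses of all queries."""
    score = sum(w * c for w, c in zip(query_weights, query_confidences))
    return 1 if score >= 0 else -1


# Two views disagree; the view with the lower error rate dominates the vote.
vote = multiview_confidence(confidences=[1.0, -1.0], errors=[0.1, 0.4])
label = boosted_label(query_confidences=[0.8, -0.2, 0.5],
                      query_weights=[1.0, 1.0, 1.0])
```

A view whose error rate is 0.5 receives weight zero under this formula, so a view that is no better than chance does not influence the vote.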

###### 2.4.2. Sampling Strategy

The MVAL uses a new hierarchical competition-based sampling strategy in order to query the contention samples with high probability in different sample distributions, which is illustrated in Figure 5.

(1) *Intercluster Sampling Competition.* In the MVAL, a fast approximate spectral clustering algorithm is designed to reduce the computational complexity significantly to , where is the iteration number of k-means clustering, and is the total number of contention samples. The detailed process is as follows: (a) perform traditional k-means clustering on the contention unlabeled samples , compute the centroid of each cluster as representative points, and build a correspondence table to associate each with the nearest cluster centroid ; (b) run the normalized cut algorithm on to obtain an m-way cluster membership for each of ; and (c) recover the cluster membership for each by looking up the cluster membership of the corresponding centroid in the correspondence table.

After fast spectral clustering, two intercluster sampling measures are defined: the number of samples in the cluster and its information entropy. Both measures are weighted to obtain the number of selected samples in cluster in the following equation:where is proportional to the total number of samples in cluster , and computing is equivalent to kernel density estimation of in cluster . Weight reflects the impact of both measures in intercluster sampling competition, is the normalized factor, is the total number of selected samples in the current query, and is rounding operation.
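A possible reading of this allocation rule is sketched below. The exact equation is not reproduced here, so the code assumes a convex combination of each cluster's share of samples and its share of entropy, normalized and scaled by the query budget; the `weight` parameter, the function name, and the example numbers are all hypothetical.

```python
from typing import List, Sequence


def allocate_queries(cluster_sizes: Sequence[int],
                     cluster_entropies: Sequence[float],
                     total_queries: int,
                     weight: float = 0.5) -> List[int]:
    """Split a query budget across clusters by size and entropy.

    `weight` balances the two measures (1.0 = size only, 0.0 = entropy only).
    """
    size_total = sum(cluster_sizes)
    entropy_total = sum(cluster_entropies)
    scores = []
    for size, entropy in zip(cluster_sizes, cluster_entropies):
        size_share = size / size_total if size_total else 0.0
        entropy_share = entropy / entropy_total if entropy_total else 0.0
        scores.append(weight * size_share + (1.0 - weight) * entropy_share)
    norm = sum(scores)  # normalizing factor
    if norm == 0:
        raise ValueError("all cluster scores are zero")
    # Rounding means the counts may not sum exactly to total_queries.
    return [round(total_queries * s / norm) for s in scores]


# Two clusters with equal entropy; the larger one receives more queries.
counts = allocate_queries([60, 40], [0.5, 0.5], total_queries=20)
```

Since each count is rounded independently, a practical implementation would likely adjust the largest cluster afterwards so that the counts add up to the budget.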

(2) *Intracluster Sampling Competition.* In the MVAL, an efficient quadratic programming-based method [22] is used, which dynamically estimates the weights of the redundancy and uncertainty of an unlabeled sample in each query. It is used for intracluster selective sampling and solved by minimizing the following objective function:

Equation (16) aims to estimate the normalized parameter , which reflects how probable it is that the unlabeled sample is selected. is the classification confidence of sample in ^{th} view. are the queried unlabeled samples, is a unit vector, and is the number of unlabeled samples in batch mode. The first part denotes the sample uncertainty in ^{th} view, and the sampling strategy tends to select the contention sample near the classification hyperplane of ^{th} view by minimizing . The second part denotes the sample redundancy in ^{th} view, and minimizing discourages the selection of similar samples. The sampling probability is calculated by convex quadratic programming, and finally, , which corresponds to in ^{th} view, is obtained. For selective sampling in each cluster, the conservative sampling strategy of the classic co-testing algorithm [23] is used.
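The conservative strategy mentioned above can be illustrated for two views as follows. The sketch assumes the usual co-testing description: contention samples are those on which the views predict different labels, and the conservative rule queries the contention sample on which the views' confidences are closest. The signed-confidence encoding (sign = label, magnitude = confidence) and the function names are assumptions made for this example.

```python
from typing import List, Optional, Sequence


def contention_points(conf_a: Sequence[float],
                      conf_b: Sequence[float]) -> List[int]:
    """Indices of samples on which the two views predict different labels."""
    return [i for i, (a, b) in enumerate(zip(conf_a, conf_b))
            if (a >= 0) != (b >= 0)]


def conservative_query(conf_a: Sequence[float],
                       conf_b: Sequence[float]) -> Optional[int]:
    """Pick the contention point whose two confidences are closest."""
    candidates = contention_points(conf_a, conf_b)
    if not candidates:
        return None  # the views agree everywhere; nothing to query
    return min(candidates,
               key=lambda i: abs(abs(conf_a[i]) - abs(conf_b[i])))


# Signed confidences of four unlabeled samples in two views (toy values).
view_a = [0.9, 0.4, -0.3, 0.8]
view_b = [0.7, -0.5, 0.9, -0.1]
picked = conservative_query(view_a, view_b)
```

In this toy example, samples 1, 2, and 3 are contention points, and sample 1 is queried because both views are about equally confident in their conflicting predictions.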

#### 3. Results and Discussion

In our experiment, two classic image sets (OT image set from MIT [9] and UIUC sports event image set from UIUC [24]) are used for algorithm comparison. Average classification precision (ACP) and mean of average classification precision (MACP) are both used for evaluating the performance of both CO-LDA models and multiview active learning algorithms.

##### 3.1. Evaluation of Theme Semantic

The first experiment is designed for evaluating the performance of our proposed theme semantic. In OT and UIUC Sports datasets, the parameter configuration of the CO-LDA model is as follows: (1) and in formula (7). (2) The batch sizes of sampled images in MVAL are .

We observe the MACP variation of the CO-LDA model by changing the numbers of both theme and visual word: and , and a total of twenty groups of (T, W) are obtained. In Figure 6, the (T, W) curves for both datasets show a similar trend: MACP first increases and then decreases. Thus, in our CO-LDA model, we set and .

Figures 7(a)-7(b) and 8(a)-8(b) show the probability distributions of different themes by CO-LDA in OT and UIUC Sports image datasets.

In the OT image set, we can see that there are significant differences between four scene classes “Highway,” “Forest,” “Mountain,” and “Tall building” in theme probability distributions, and multiview SVM classifier works well in scene classification. In the UIUC Sports image set, the theme probability distributions are very similar in four scene classes “Bocce,” “Croquet,” “Polo,” and “Snowboarding,” which significantly increases the difficulty of scene classification.

Furthermore, we compare the CO-LDA model with traditional LDA [9] and SP-pLSA [8] models, and the performance comparison of three theme models is shown in Table 1. N1 ∼ N8 denote the following eight natural scene classes: “Coast,” “Forest,” “Mountain,” “Open country,” “Highway,” “Inside city,” “Tall building,” and “Street.” S1 ∼ S8 denote the following eight event scene classes: “Badminton,” “Bocce,” “Croquet,” “Polo,” “Rock Climbing,” “Rowing,” “Sailing,” and “Snowboarding.”

In the LDA model, each image is divided into blocks, and 5 pixels are overlapped between neighbored blocks. For feature representation, gray-scale SIFT descriptors are sparsely sampled, and means of three color channels are calculated. The numbers of the theme and visual word are and by cross validation. In the SP-pLSA model, the ways of image division and feature representation are the same as the LDA model. The numbers of the theme and visual word are and by cross validation.

In the OT image set, CO-LDA achieves higher ACP and MACP than SP-pLSA in six scene classes, the exceptions being “Mountain” and “Inside city.” LDA performs the worst in all scene classes except “Street.” We conclude that CO-LDA captures more accurate scene semantics than the other two classic methods. In the UIUC Sports image set, CO-LDA achieves the highest ACP in three event classes (“Croquet,” “Polo,” and “Rowing”), and SP-pLSA achieves the highest ACP in three others (“Bocce,” “Rock Climbing,” and “Snowboarding”). In the event classes “Badminton” and “Sailing,” where LDA has the highest ACP, CO-LDA still performs better than SP-pLSA. Thus, we conclude that the proposed CO-LDA also has slightly better performance in theme mining than the two classic image representation methods.

##### 3.2. Evaluation of MVAL

In the second experiment, we compare our algorithm with single-view active learning algorithms, using both high-level semantics and low-level features for scene classification. In our initial labeled training set, label size = 150, batch size = 20, and iteration = 10.

Our proposed algorithm MVAL^{HS} (MVAL and HS denote MVAL [21] and two proposed high-level semantics, respectively) is compared with the following four algorithms: (1) MVAL^{LS} (LS denotes low-level image features): in MVAL, both means of three color channels and densely sampled color-SIFT descriptors are concatenated as a feature vector for image representation. (2) AL^{QP} [22]: a single-view SVM active learning by QP-based selective sampling, which relies on the sample uncertainty and redundancy. (3) Diff^{WS} [25–30]: a disagreement-based active learning from weak and strong labelers. (4) Graph^{GP} [23]: a graphical model-based active learning with robust Gaussian process. The feature representations of AL^{QP}, Diff^{WS}, and Graph^{GP} are the same as MVAL^{LS}. The performance comparison of the five active learning algorithms is shown in Table 2.

From Table 2, our algorithm MVAL^{HS} achieves the highest MACP among the five algorithms in almost all scene classes on both image sets, which demonstrates that high-level semantics improve holistic scene understanding more than traditional low-level image features. Furthermore, MVAL^{LS} performs better in most cases than the three single-view algorithms, which suggests that the multiview setting, thanks to its independent and diverse views, shrinks the version space more than traditional single-view active learning.

#### 4. Conclusion

This paper proposed an MVAL-based scene classification algorithm, which applies two different high-level image semantics to generate the corresponding hypotheses. Different object detectors are first trained to obtain the responses of different object classes as the object semantic. Furthermore, a CO-LDA model is proposed to obtain a more accurate theme semantic by integrating the spatial correlation of objects at different image resolutions, which improves holistic scene understanding. With the help of the two independent views, our MVAL algorithm has the potential not only to handle large-scale data training but also to improve the performance of scene classification.

#### Data Availability

All data utilized in our research can be accessed from the following website: https://archive.ics.uci.edu/ml/datasets/Corel+Image+Features and http://vision.stanford.edu/lijiali/event_dataset/.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

This work was funded by the Natural Science Foundation of China (Grant nos. 41571299 and 11601339), Key Research and Development Plan of Zhejiang Province (Grant no. 2018C01086), Open Research Project of the State Key Laboratory of Industrial Control Technology, Zhejiang University, China (no. ICT20047), Zhejiang Provincial Natural Science Foundation of China (Grant no. LY18F020025), and National Thousand Talents Program (Grant no. Y474161).