Scientific Programming, Special Issue: Towards a Smart World 2020
Research Article | Open Access
Tuozhong Yao, Wenfeng Wang, Yuhong Gu, Qiuguo Zhu, "Multiview Active Learning for Scene Classification with High-Level Semantic-Based Hypothesis Generation", Scientific Programming, vol. 2020, Article ID 3878153, 13 pages, 2020. https://doi.org/10.1155/2020/3878153
Multiview Active Learning for Scene Classification with High-Level Semantic-Based Hypothesis Generation
Abstract
Multiview active learning (MVAL) is a technique that can shrink the version space much more than traditional active learning and has great potential in large-scale data analysis. This paper studies MVAL-based scene classification, which helps a computer understand diverse and complex environments at a macroscopic level and is widely used in fields such as image retrieval and autonomous driving. The main contribution of this paper is that different high-level image semantics replace the traditional low-level features to generate more independent and diverse hypotheses in MVAL. First, our algorithm uses different object detectors to obtain local object responses in the scenes. Furthermore, we design a cascaded online LDA model for mining the theme semantics of an image. The experimental results demonstrate that the proposed theme modeling strategy suits large-scale data learning and that our MVAL algorithm with both high-level semantic views achieves a significant improvement in scene classification over traditional active learning-based algorithms.
1. Introduction
Scene classification is defined as using a computer to recognize the class of an image scene. The related research can be roughly divided into two branches: some studies focus on fast holistic scene perception based on visual psychology and physiology [1, 2], while others build statistical models through local image analysis to understand the scene, which is also the main development trend [3–5]. Many methods for image representation, a key step in scene classification, have been proposed in the past two decades. Low-level features such as color, texture, and edge have been widely used to represent the local regions of an image. Some researchers trained object detectors to obtain high-level semantics such as an object's class, size, and shape for more accurate image representation [6, 7]. The prevailing statistical models are bag-of-words (BoW) and related theme models such as pLSA and LDA, which reduce the gap between low-level features and high-level semantics by mining the hidden themes from local image regions. Other new scene statistical models [10–12] were proposed for more accurate object recognition in the scene. However, the models mentioned above mainly focus on the occurrence of image semantics and usually ignore the spatial semantic correlations between different image regions.
For mining the spatial context information from an image, some researchers considered the information interaction between different spatial pyramid levels [13–15], and building reasonable attention mechanisms can also lead to significant improvement in scene classification. These methods use deep neural networks, and their large-scale network parameter estimation usually leads to much higher computational complexity than methods that are not based on deep learning.
Active learning ranks the unlabeled samples iteratively and selects only the samples with high uncertainty, that is, those that cause great ambiguity for the classifier. In PAC learning theory, compared with traditional passive learning, it can exponentially reduce the sample complexity needed to learn a classifier with a given expected classification error [16–18], which gives it good potential for wide application in large-scale data learning. However, most traditional active learning algorithms generate their hypotheses from low-level image features, and the resulting lack of hypothesis diversity limits their performance. This paper proposes an MVAL-based scene classification algorithm that uses different high-level semantics as its views. It can shrink the version space by more than half, and it is more efficient than both single-hypothesis-based and committee-based active learning.
2. Materials and Methods
2.1. Proposed Algorithm
The flowchart of our proposed algorithm is illustrated in Figure 1. Our algorithm uses different high-level semantics as its views to generate the corresponding hypotheses. First, object detectors are trained to obtain the responses of different object classes in image regions. Furthermore, we design a cascaded online LDA (CO-LDA) as a secondary view for more accurate image representation. Finally, a fine-tuned MVAL algorithm uses both high-level image semantics as its views to classify the scene of an image.
2.2. Object Semantic-Based Image Representation
Our object semantic-based image representation is illustrated in Figure 2.
First, multiple object detectors are used to obtain the local object response maps. Second, these maps are decomposed into three spatial pyramid levels, and the maximal object response is computed in each image block at each spatial level, which is annotated with red blocks in Figure 2. Finally, an object response histogram is computed, which effectively reduces the influence of object response errors over the whole image. To generate the object responses, a latent SVM-based detector is applied to recognize compact object classes such as car and pedestrian. Another geometric context-based detector is used to recognize object classes with different textures such as tree, sky, and building.
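The pooling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each detector yields a 2D response map, that pyramid level l splits the map into 2^l × 2^l blocks, and that the per-block maxima are simply concatenated. The function names `pyramid_max_pool` and `object_response_feature` are hypothetical.

```python
def pyramid_max_pool(response_map, levels=3):
    """Max-pool one object response map over a spatial pyramid.

    response_map: 2D list (rows x cols) of detector responses for one class.
    Level l is assumed to split the map into 2**l x 2**l blocks, so the map
    needs at least 2**(levels - 1) rows and columns.
    Returns the block maxima of all levels, ordered coarse to fine.
    """
    rows, cols = len(response_map), len(response_map[0])
    features = []
    for level in range(levels):
        cells = 2 ** level
        for by in range(cells):
            r0, r1 = by * rows // cells, (by + 1) * rows // cells
            for bx in range(cells):
                c0, c1 = bx * cols // cells, (bx + 1) * cols // cells
                block = [response_map[r][c]
                         for r in range(r0, r1)
                         for c in range(c0, c1)]
                features.append(max(block))
    return features


def object_response_feature(response_maps, levels=3):
    """Concatenate the pooled responses of every object class."""
    feature = []
    for response_map in response_maps:
        feature.extend(pyramid_max_pool(response_map, levels))
    return feature
```

With three levels, each object class contributes 1 + 4 + 16 = 21 values, so taking the maximum per block keeps the strongest detection in every region while discarding weaker, possibly spurious responses.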
2.3. Theme Semantic-Based Image Representation
To accommodate the dynamic update of an active learning training set, an online LDA model based on a stochastic gradient descent strategy is used. It adds new samples sequentially and no longer stores old samples, which allows efficient and accurate parameter estimation in large-scale data training.
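Online LDA computes the posterior probability distribution of the hidden nodes based on the observed samples. It uses variational inference to approximate the maximum likelihood of the observed words $w$ given the hyperparameters $\alpha$ and $\eta$. Three variational parameters $\phi$, $\gamma$, and $\lambda$ define the distributions $q(z_{di}=k)=\phi_{d w_{di} k}$, $q(\theta_d)=\mathrm{Dirichlet}(\theta_d;\gamma_d)$, and $q(\beta_k)=\mathrm{Dirichlet}(\beta_k;\lambda_k)$. The variational distribution follows
$$q(z,\theta,\beta)=\prod_{k} q(\beta_k)\prod_{d}\Big[q(\theta_d)\prod_{i} q(z_{di})\Big]. \tag{1}$$
The optimal $q$ is solved by maximizing the lower bound in the following equation:
$$\log p(w\mid\alpha,\eta)\ \ge\ \mathcal{L}(w,\phi,\gamma,\lambda)=\mathbb{E}_q[\log p(w,z,\theta,\beta\mid\alpha,\eta)]-\mathbb{E}_q[\log q(z,\theta,\beta)], \tag{2}$$
where $\mathbb{E}_q[\cdot]$ denotes the conditional mathematical expectation. Maximizing the lower bound is equivalent to minimizing the KL divergence of $q(z,\theta,\beta)$ and $p(z,\theta,\beta\mid w,\alpha,\eta)$:
$$\log p(w\mid\alpha,\eta)=\mathcal{L}(w,\phi,\gamma,\lambda)+\mathrm{KL}\big(q(z,\theta,\beta)\,\|\,p(z,\theta,\beta\mid w,\alpha,\eta)\big), \tag{3}$$
where $\mathcal{L}$ is factorized as follows:
$$\mathcal{L}=\sum_{d}\Big\{\mathbb{E}_q[\log p(w_d\mid\theta_d,z_d,\beta)]+\mathbb{E}_q[\log p(z_d\mid\theta_d)]-\mathbb{E}_q[\log q(z_d)]+\mathbb{E}_q[\log p(\theta_d\mid\alpha)]-\mathbb{E}_q[\log q(\theta_d)]+\big(\mathbb{E}_q[\log p(\beta\mid\eta)]-\mathbb{E}_q[\log q(\beta)]\big)/D\Big\}. \tag{4}$$
Equation (4) can be transformed into formula (5):
$$\mathcal{L}=\sum_{d}\ell(n_d,\gamma_d,\phi_d,\lambda). \tag{5}$$
In equation (5), $n_{dw}$, the $w$th entry of $n_d$, denotes the frequency with which word $w$ occurs in text $d$. $\ell(n_d,\gamma_d,\phi_d,\lambda)$ reflects the contribution of text $d$ to the lower bound, which is iteratively optimized by a coordinate ascent algorithm. $\phi$ in equation (5) is iteratively solved:
$$\phi_{dwk}\propto\exp\big\{\mathbb{E}_q[\log\theta_{dk}]+\mathbb{E}_q[\log\beta_{kw}]\big\},\quad \mathbb{E}_q[\log\theta_{dk}]=\Psi(\gamma_{dk})-\Psi\Big(\sum_{i}\gamma_{di}\Big),\quad \mathbb{E}_q[\log\beta_{kw}]=\Psi(\lambda_{kw})-\Psi\Big(\sum_{i}\lambda_{ki}\Big), \tag{6}$$
where the digamma function $\Psi$ is the first-order derivative of the function $\log\Gamma$. $\gamma$ and $\lambda$ are iteratively solved in the following way: $\gamma_{dk}=\alpha+\sum_{w} n_{dw}\phi_{dwk}$ and $\lambda_{kw}=\eta+\sum_{d} n_{dw}\phi_{dwk}$.
When the $t$th vector of word frequencies $n_t$ is observed, we keep $\lambda$ unchanged and update the local optimal solution of $\gamma_t$ and $\phi_t$ in the E step. In the M step, $\tilde{\lambda}$ and the $\lambda$ from the last iteration are both used to update $\lambda$:
$$\lambda\leftarrow(1-\rho_t)\lambda+\rho_t\tilde{\lambda},\qquad \rho_t=(\tau_0+t)^{-\kappa}. \tag{7}$$
$\tilde{\lambda}$ in formula (7) is solved as follows:
$$\tilde{\lambda}_{kw}=\eta+\frac{D}{S}\sum_{s} n_{tsw}\,\phi_{tswk}, \tag{8}$$
where $n_{ts}$ is the $s$th text in each batch text set, $D$ is the number of texts in the training text set, and $S$ is the size of each batch text set. Hyperparameters $\alpha$ and $\eta$ are updated by the Newton–Raphson method: $\alpha\leftarrow\alpha-\rho_t\tilde{\alpha}(\gamma_t)$ and $\eta\leftarrow\eta-\rho_t\tilde{\eta}(\lambda)$. Here, $\tilde{\alpha}(\gamma_t)$ is the product of the inverse Hessian matrix and the gradient of the objective function $\ell(n_t,\gamma_t,\phi_t,\lambda)$ with respect to $\alpha$, and $\tilde{\eta}(\lambda)$ is the product of the inverse Hessian matrix and the gradient of the objective function $\mathcal{L}$ with respect to $\eta$.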
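The M step above can be sketched in a few lines. The sketch follows the standard online LDA update of Hoffman et al., on which this section is based: a mini-batch estimate (called `lam_tilde` here) is blended into the current topic-word parameters with a decaying step size (τ0 + t)^(−κ). The function names, the plain-list data layout, and the default values of `tau0` and `kappa` are assumptions made for illustration, and the E step that produces the responsibilities is taken as given.

```python
def learning_rate(t, tau0, kappa):
    """Step size rho_t = (tau0 + t) ** (-kappa) for mini-batch number t."""
    return (tau0 + t) ** (-kappa)


def online_lambda_update(lam, batch_counts, batch_phi, eta, num_docs, t,
                         tau0=1.0, kappa=0.7):
    """One M step: blend lambda with the mini-batch estimate lambda_tilde.

    lam:          K x W topic-word variational parameters (list of lists).
    batch_counts: S x W word counts for the S documents in the mini-batch.
    batch_phi:    S x W x K responsibilities from the E step.
    eta:          topic-word Dirichlet hyperparameter.
    num_docs:     total number of training documents D.
    """
    S = len(batch_counts)
    K, W = len(lam), len(lam[0])
    rho = learning_rate(t, tau0, kappa)
    new_lam = []
    for k in range(K):
        row = []
        for w in range(W):
            # Sufficient statistic of the mini-batch for topic k and word w.
            stat = sum(batch_counts[s][w] * batch_phi[s][w][k] for s in range(S))
            # Estimate of lambda as if the whole corpus looked like this batch.
            lam_tilde = eta + (num_docs / S) * stat
            row.append((1.0 - rho) * lam[k][w] + rho * lam_tilde)
        new_lam.append(row)
    return new_lam
```

Because old documents enter only through the running value of `lam`, nothing from earlier batches has to be stored, which is what makes the update suitable for a training set that grows with every active learning query.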
Based on online LDA, we propose the CO-LDA theme model, whose structure is similar to that of the classic SP-pLSA model and which enhances the spatial correlation between different image regions. The framework of CO-LDA is illustrated in Figure 3. The main difference between CO-LDA and SP-pLSA is that different online LDAs (LDA1, LDA2, and LDA3) are applied at different spatial levels to jointly mine the theme of an image. The main advantage of CO-LDA is that it integrates the spatial correlation of objects at different image resolutions, which further improves holistic scene understanding. The visual histogram in online LDA is computed in the same way as the object response histogram in Section 2.2, and the theme feature of each spatial block is represented by the variational parameter $\gamma$ of the online LDA model.
Finally, the theme feature of the whole image is obtained by linearly concatenating the theme features of the blocks at all spatial pyramid levels, with the features of each level scaled by a level-specific weight.
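As a rough sketch of this concatenation, the snippet below scales each block's theme vector by the weight of its pyramid level and joins the results into one feature vector. The function name `concat_theme_features` is hypothetical, and the level weights are left as an input because their values are not given here.

```python
def concat_theme_features(level_block_features, level_weights):
    """Concatenate per-block theme vectors across pyramid levels.

    level_block_features[l] is the list of theme vectors (one per block)
    at pyramid level l; level_weights[l] is that level's weight, supplied
    by the caller (illustrative values only).
    """
    feature = []
    for blocks, weight in zip(level_block_features, level_weights):
        for block in blocks:
            feature.extend(weight * value for value in block)
    return feature
```

For a three-level pyramid with 1, 4, and 16 blocks and T themes per block, the resulting vector has 21·T entries.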
2.4. Multiview Active Learning
The MVAL used in this paper comes from our previous work, which makes two improvements, one in hypothesis generation and one in selective sampling. First, a boosting-like technique is integrated into MVAL: it follows a similar iterative weak classifier optimization, and the current hypothesis is boosted by weighted voting over all the hypotheses from past queries. Second, an adaptive hierarchical competition sampling strategy is presented. In this strategy, if the number of contention samples is large, unsupervised spectral clustering is activated to obtain the coarse spatial distribution of these contention samples in the high-dimensional feature space. A multiview batch-mode selective sampling is then run on two measures, sample uncertainty and redundancy, by solving a quadratic program to determine the queried samples in each cluster.
2.4.1. Hypothesis Generation
If active learning selects enough contention samples to improve the hypothesis in each query, the number of incorrectly classified unlabeled samples decreases. This is quite similar to how boosting optimizes weak classifiers. The MVAL incorporates the AdaBoost algorithm into our framework to boost the hypothesis generated in each query, and the main flowchart is shown in Figure 4.
In Figure 4, a support vector machine (SVM) is used as the base classifier to construct a multiview classifier, which replaces the single-view classifier in AdaBoost, and the multiview classifier in each query can be regarded as the weak classifier of one AdaBoost iteration. The hypothesis of the multiview classifier is computed by weighted voting of the SVM base classifiers. Unlike traditional query by boosting, we update the weight of each base classifier in each query and obtain the boosted hypothesis by weighting all the hypotheses from the past queries, not only the hypothesis from the current query.
The detailed process of the MVAL's hypothesis generation based on AdaBoost is as follows:
(a) In each iteration, weighted voting is used to generate the initial multiview-based hypothesis from the classification confidence of each sample under each view and the contribution of each view to the classification. The contribution of a view is determined by its soft classification error rate, which measures how correctly a sample is classified and is computed from the sums of the classification confidences of the unlabeled samples assigned to each of the two classes. For a positive (or negative) sample, its distance to the decision boundary on the positive (or negative) side reflects how correctly it is classified, and this information is used to calculate the degree of error here instead of the traditional classification error calculated from the decision hypothesis in AdaBoost. The view weights are then updated and normalized, and the classification confidence of the multiview classifier is computed from them.
(b) After each iteration, the labeled sample set grows by the samples newly added after the query. Because the labeled sample set keeps growing during the iterations of active learning, the influence of this growth should be considered when updating the weight of the multiview classifier if the initial labeled training set is small. The weight of each sample is then updated according to whether the sample is correctly classified.
(c) The final boosted hypothesis of the queried sample is the weighted sum of all the hypotheses from the past queries.
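The two levels of voting described in steps (a) to (c) can be illustrated with a short sketch. It is only an approximation under stated assumptions: the view weights use the standard AdaBoost rule 0.5·ln((1 − ε)/ε) rather than the paper's soft-error variant, each error rate is assumed to lie strictly between 0 and 0.5, and all function names are hypothetical.

```python
import math


def view_weights(error_rates):
    """Normalized view weights from per-view error rates.

    Uses the standard AdaBoost weight 0.5 * ln((1 - e) / e) as a stand-in;
    every error rate is assumed to lie strictly between 0 and 0.5.
    """
    raw = [0.5 * math.log((1.0 - e) / e) for e in error_rates]
    total = sum(raw)
    return [r / total for r in raw]


def multiview_confidence(view_confidences, weights):
    """Weighted vote of the signed per-view confidences for one sample."""
    return sum(w * f for w, f in zip(weights, view_confidences))


def boosted_label(query_confidences, query_weights):
    """Final label: sign of the weighted sum over all past queries."""
    score = sum(b * f for b, f in zip(query_weights, query_confidences))
    return 1 if score >= 0 else -1
```

The inner vote combines the views within one query, and the outer vote combines the multiview classifiers of all past queries, so a single poor query cannot overturn a label that earlier hypotheses agreed on with high confidence.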
2.4.2. Sampling Strategy
The MVAL uses a new hierarchical competition-based sampling strategy in order to query the contention samples with high probability in different sample distributions, which is illustrated in Figure 5.
(1) Intercluster Sampling Competition. In the MVAL, a fast approximate spectral clustering algorithm is designed that significantly reduces the computational complexity, which then depends on the number of K-means iterations and the total number of contention samples. The detailed process is as follows: (a) perform traditional K-means clustering on the contention unlabeled samples, take the centroid of each cluster as a representative point, and build a correspondence table that associates each sample with its nearest cluster centroid; (b) run the normalized cut algorithm on the centroids to obtain an m-way cluster membership for each centroid; and (c) recover the cluster membership of each sample by looking up the cluster membership of its corresponding centroid in the correspondence table.
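Steps (a) and (c) amount to a table lookup, which the sketch below illustrates. It assumes that the K-means centroids and their normalized-cut labels from step (b) have already been computed elsewhere, so no clustering is performed here. The names `nearest_centroid` and `recover_membership` are hypothetical.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    best, best_dist = 0, float("inf")
    for j, centroid in enumerate(centroids):
        dist = sum((p - q) ** 2 for p, q in zip(point, centroid))
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def recover_membership(samples, centroids, centroid_labels):
    """Map every sample to the cluster label of its nearest centroid.

    centroid_labels[j] is the cluster assigned to centroid j by whatever
    algorithm grouped the centroids (normalized cut in the text above).
    """
    # Correspondence table: sample index -> index of its nearest centroid.
    table = [nearest_centroid(x, centroids) for x in samples]
    return [centroid_labels[j] for j in table]
```

The expensive spectral step therefore runs only on the centroids, whose number is much smaller than the number of contention samples.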
After fast spectral clustering, two intercluster sampling measures are defined for each cluster: the number of samples in the cluster and its information entropy. The two measures are weighted to obtain the number of samples selected from each cluster. The first measure is proportional to the total number of samples in the cluster, and computing the second is equivalent to kernel density estimation of the samples in the cluster. A weight reflects the relative impact of the two measures in the intercluster sampling competition. The weighted sum is divided by a normalization factor, multiplied by the total number of samples selected in the current query, and rounded.
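One plausible reading of this allocation rule is sketched below. The exact formula is not reproduced here, so the convex combination of a size share and an entropy share, the function name `allocate_queries`, and the default weight are all assumptions. Cluster sizes and entropies are taken as given and assumed to have positive totals.

```python
def allocate_queries(cluster_sizes, cluster_entropies, total_queries, weight=0.5):
    """Split a query budget across clusters by size and entropy.

    weight balances the two measures: 1.0 uses only cluster size and
    0.0 uses only entropy. Both input lists must have positive sums.
    """
    size_total = sum(cluster_sizes)
    entropy_total = sum(cluster_entropies)
    scores = [
        weight * size / size_total + (1.0 - weight) * entropy / entropy_total
        for size, entropy in zip(cluster_sizes, cluster_entropies)
    ]
    norm = sum(scores)  # normalization factor
    # Python's round() rounds halves to the nearest even integer, and the
    # rounded counts are not guaranteed to sum exactly to total_queries.
    return [round(total_queries * score / norm) for score in scores]
```

Large clusters and clusters whose samples are spread out (high entropy) both receive more queries, so the batch is not dominated by one dense region.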
(2) Intracluster Sampling Competition. In the MVAL, an efficient quadratic programming-based method is used, which dynamically estimates the weights of the redundancy and uncertainty of an unlabeled sample in each query. It is used for intracluster selective sampling and is solved by minimizing the objective function given as equation (16).
Equation (16) estimates a normalized parameter for each unlabeled sample, which reflects how likely the sample is to be selected. Its inputs are the classification confidence of each sample in each view, the queried unlabeled samples, a unit vector, and the number of unlabeled samples in batch mode. The first part of the objective denotes the sample uncertainty in a given view, and minimizing it makes the sampling strategy prefer contention samples near the classification hyperplane of that view. The second part denotes the sample redundancy in the view, and minimizing it discourages the selection of mutually similar samples. The sampling probability is obtained by convex quadratic programming, which yields the selection parameter of each sample in each view. For selective sampling in each cluster, the conservative sampling strategy of the classic co-testing algorithm is used.
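The trade-off between uncertainty and redundancy can be illustrated with a simple greedy heuristic. This is deliberately not the quadratic program of equation (16): it is a single-view stand-in that picks, one at a time, the sample with the smallest absolute confidence plus a penalty for similarity to samples already chosen. The function name, the `trade_off` parameter, and the kernel-matrix input are assumptions.

```python
def greedy_batch_select(confidences, kernel, batch_size, trade_off=1.0):
    """Greedy stand-in for QP-based batch selection in one view.

    confidences: signed classifier outputs; small |value| means uncertain.
    kernel:      n x n similarity matrix between the unlabeled samples.
    Returns the indices of the selected samples in selection order.
    """
    selected = []
    remaining = set(range(len(confidences)))
    while remaining and len(selected) < batch_size:
        best = min(
            sorted(remaining),  # sorted so ties go to the lowest index
            key=lambda i: abs(confidences[i])
            + trade_off * sum(kernel[i][j] for j in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `trade_off=0` the rule reduces to plain uncertainty sampling. A positive value steers the batch away from near-duplicate samples, which is the behavior the redundancy term in the objective is meant to produce.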
3. Results and Discussion
In our experiments, two classic image sets (the OT image set from MIT and the UIUC sports event image set) are used for algorithm comparison. Average classification precision (ACP) and mean of average classification precision (MACP) are used to evaluate the performance of both the CO-LDA model and the multiview active learning algorithms.
3.1. Evaluation of Theme Semantic
The first experiment evaluates the performance of our proposed theme semantic. For both the OT and UIUC Sports datasets, the CO-LDA configuration fixes the parameters of formula (7) and the batch size of the images sampled in MVAL.
We observe how the MACP of the CO-LDA model varies with the number of themes T and the number of visual words W, and a total of twenty (T, W) groups are evaluated. In Figure 6, the (T, W) curves for both datasets show a similar trend: MACP first increases and then decreases. We set T and W in our CO-LDA model accordingly.
In the OT image set, the theme probability distributions differ significantly between the four scene classes “Highway,” “Forest,” “Mountain,” and “Tall building,” and the multiview SVM classifier works well in scene classification. In the UIUC Sports image set, the theme probability distributions of the four scene classes “Bocce,” “Croquet,” “Polo,” and “Snowboarding” are very similar, which significantly increases the difficulty of scene classification.
Furthermore, we compare the CO-LDA model with the traditional LDA and SP-pLSA models, and the performance comparison of the three theme models is shown in Table 1. N1 ∼ N8 denote the following eight natural scene classes: “Coast,” “Forest,” “Mountain,” “Open country,” “Highway,” “Inside city,” “Tall building,” and “Street.” S1 ∼ S8 denote the following eight event scene classes: “Badminton,” “Bocce,” “Croquet,” “Polo,” “Rock Climbing,” “Rowing,” “Sailing,” and “Snowboarding.”
Note. Bold values represent the best performance of the algorithms corresponding to each class.
In the LDA model, each image is divided into blocks, with 5 pixels of overlap between neighboring blocks. For feature representation, gray-scale SIFT descriptors are sparsely sampled, and the means of the three color channels are calculated. The numbers of themes and visual words are selected by cross validation. In the SP-pLSA model, the image division and feature representation are the same as in the LDA model, and the numbers of themes and visual words are again selected by cross validation.
In the OT image set, CO-LDA achieves higher ACP than SP-pLSA in six scene classes, the exceptions being “Mountain” and “Inside city,” and it also achieves higher MACP. LDA performs the worst in all scene classes except “Street.” We conclude that CO-LDA obtains more accurate scene semantics than the other two classic methods. In the UIUC Sports image set, CO-LDA achieves the highest ACP in three event classes (“Croquet,” “Polo,” and “Rowing”), and SP-pLSA achieves the highest ACP in three event classes (“Bocce,” “Rock Climbing,” and “Snowboarding”). In the event classes “Badminton” and “Sailing,” where LDA has the highest ACP, CO-LDA still performs better than SP-pLSA. Thus, we conclude that our proposed CO-LDA also has slightly better performance in theme mining than the two classic image representation methods.
3.2. Evaluation of MVAL
In the second experiment, we compare our algorithm with other single-view active learning algorithms, using both high-level semantics and low-level features, for scene classification. The initial labeled training set contains 150 samples, the batch size is 20, and the number of iterations is 10.
Our proposed algorithm MVALHS (MVAL and HS denote MVAL and the two proposed high-level semantics, respectively) is compared with the following four algorithms: (1) MVALLS (LS denotes low-level image features): in MVAL, the means of the three color channels and densely sampled color-SIFT descriptors are concatenated into a feature vector for image representation. (2) ALQP: a single-view SVM active learning algorithm with QP-based selective sampling, which relies on sample uncertainty and redundancy. (3) DiffWS [25–30]: a disagreement-based active learning algorithm that learns from weak and strong labelers. (4) GraphGP: a graphical model-based active learning algorithm with a robust Gaussian process. The feature representations of ALQP, DiffWS, and GraphGP are the same as that of MVALLS. The performance comparison of the five active learning algorithms is shown in Table 2.
Note. Bold values represent the best performance of the algorithms corresponding to each class.
From Table 2, our algorithm MVALHS achieves the highest MACP among the five algorithms in almost all scene classes in both image sets, which demonstrates that high-level semantics improve holistic scene understanding more than traditional low-level image features. Furthermore, MVALLS performs better in most cases than the other three single-view algorithms, which suggests that the multiview setting shrinks the version space more than traditional single-view active learning because of its independent and diverse views.
4. Conclusions
This paper proposed an MVAL-based scene classification algorithm, which applies two different high-level image semantics to generate the corresponding hypotheses. Different object detectors are first trained to obtain the responses of different object classes as the object semantic. Furthermore, a CO-LDA model is proposed to obtain a more accurate theme semantic by integrating the spatial correlation of objects at different image resolutions, which improves holistic scene understanding. With the help of these two independent views, our MVAL algorithm has the potential not only to handle large-scale data training but also to improve the performance of scene classification.
Data Availability
All data used in our research can be accessed from the following websites: https://archive.ics.uci.edu/ml/datasets/Corel+Image+Features and http://vision.stanford.edu/lijiali/event_dataset/.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was funded by the Natural Science Foundation of China (Grant nos. 41571299 and 11601339), Key Research and Development Plan of Zhejiang Province (Grant no. 2018C01086), Open Research Project of the State Key Laboratory of Industrial Control Technology, Zhejiang University, China (no. ICT20047), Zhejiang Provincial Natural Science Foundation of China (Grant no. LY18F020025), and National Thousand Talents Program (Grant no. Y474161).
References
- A. Oliva and A. Torralba, “Modeling the shape of the scene: a holistic representation of the spatial envelope,” International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.
- A. Oliva and A. Torralba, “Chapter 2 Building the gist of a scene: the role of global image features in recognition,” Progress in Brain Research, vol. 155, pp. 23–36, 2006.
- T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
- J. Redmon and A. Farhadi, “YOLO9000: better, faster, stronger,” in Proceedings of the conference of computer vision and pattern recognition, Honolulu, HI, USA, 2017.
- H. Guo, K. Zhang, X. C. Fan, H. K. Yu, and S. Wang, “Visual attention consistency under image transforms for multi-label image classification,” in Proceedings of the conference of computer vision and pattern recognition, Long Beach, CA, USA, 2019.
- D. Hoiem, A. N. Stein, A. A. Efros, and M. Hebert, “Recovering occlusion boundaries from a single image,” in IEEE 11th International conference on computer vision, pp. 1–8, Rio de Janeiro, Brazil, 2007.
- P. Felzenszwalb, R. B. Girshick, D. Mcallester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, pp. 1627–1645, 2010.
- A. Bosch, A. Zisserman, and X. Munoz, “Scene classification using a hybrid generative/discriminative approach,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 712–727, 2008.
- L. Fei-Fei and P. Perona, “A Bayesian hierarchical model for learning natural scene categories,” in IEEE computer society conference on computer vision and pattern recognition, pp. 524–531, San Diego, CA, USA, 2005.
- Y. Tokozume, Y. Ushiku, and T. Harada, “Between-class learning for image classification,” in Proceedings of the conference of computer vision and pattern recognition, Salt Lake City, UT, USA, 2018.
- Z. M. Chen, X. S. Wei, P. Wang, and Y. W. Guo, “Multi-label image recognition with graph convolutional networks,” in Proceedings of the conference of computer vision and pattern recognition, Long Beach, CA, USA, 2019.
- X. Y. Zhang, S. H. Du, and Y. Zhang, “Semantic and spatial co-occurrence analysis on object pairs for urban scene classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 11, pp. 2630–2643, 2018.
- S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, 2015.
- W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed, “SSD: single shot multibox detector,” in European conference on computer vision, Amsterdam, Netherlands, 2016.
- T. Y. Lin, P. Dollar, and R. Girshick, “Feature pyramid networks for object detection,” in Proceedings of the conference of computer vision and pattern recognition, Honolulu, HI, USA, 2017.
- P. Liu, H. Zhang, and K. B. Eom, “Active deep learning for classification of hyperspectral images,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, pp. 712–724, 2017.
- K. Wang, D. Y. Zhang, Y. Li, R. M. Zhang, and L. Lin, “Cost-effective active learning for deep image classification,” IEEE Transactions on Circuits and Systems for Video Technology, 2017.
- L. Yang, Y. Z. Zhang, J. X. Chen, S. Y. Zhang, and D. Z. Chen, “Suggestive annotation: a deep active learning framework for biomedical image segmentation,” in Proceedings of the conference of computer vision and pattern recognition, Honolulu, HI, USA, 2017.
- W. H. Yang, G. Q. Liu, L. Zhang, and E. H. Chen, “Multi-view learning with batch mode active selection for image retrieval,” in Proceedings of the 21st international conference on pattern recognition, pp. 979–982, Tsukuba, Japan, 2012.
- M. D. Hoffman, D. M. Blei, and F. Bach, “Online learning for latent dirichlet allocation,” Proceedings of Neural Information Processing Systems, pp. 1–9, 2010.
- T. Z. Yao, P. An, and J. T. Song, “Multi-view active learning based on weighted hypothesis boosting and hierarchical competition sampling,” Acta Electronica Sinica, vol. 45, no. 1, pp. 46–53, 2017.
- S. C. H. Hoi, R. Jin, J. K. Zhu, and M. R. Lyu, “Semi-supervised svm batch mode active learning for image retrieval,” in IEEE conference on computer vision and pattern recognition, pp. 1–7, Anchorage, AK, USA, 2008.
- C. C. Long and G. Hua, “Multi-class multi-annotator active learning with robust Gaussian process for visual recognition,” in IEEE international conference on computer vision, pp. 2839–2847, Santiago, Chile, 2015.
- L. J. Li and L. Fei-Fei, “What, where and who? classifying events by scene and object recognition,” in IEEE 11th International conference on computer vision, pp. 1–8, Rio de Janeiro, Brazil, 2007.
- C. C. Zhang and K. Chaudhuri, “Active learning from weak and strong labelers,” Advances in Neural Information Processing Systems, 2015.
- X. M. Zhang, T. T. Wang, J. Q. Qi, H. C. Lu, and G. Wang, “Progressive attention guided recurrent network for salient object detection,” in Proceedings of the conference of computer vision and pattern recognition, Salt Lake City, UT, USA, 2018.
- T. Zhao and X. Q. Wu, “Pyramid feature attention network for saliency detection,” in Proceedings of the conference of computer vision and pattern recognition, Long Beach, CA, USA, 2019.
- H. L. Zheng, J. L. Fu, Z. J. Zha, and J. B. Luo, “Looking for the devil in the details: learning trilinear attention sampling network for fine-grained image recognition,” in Proceedings of the conference of computer vision and pattern recognition, Long Beach, CA, USA, 2019.
- L. J. Li, H. Su, E. P. Xing, and L. Fei-Fei, “Object bank: a high-level image representation for scene classification & semantic feature sparsification,” Proceedings of Neural Information Processing Systems, pp. 1–9, 2010.
- I. Muslea, S. Minton, and C. A. Knoblock, “Active learning with multiple views,” Journal of Artificial Intelligence Research, vol. 27, no. 1, pp. 203–233, 2006.
Copyright © 2020 Tuozhong Yao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.