Mathematical Problems in Engineering
Volume 2018 (2018), Article ID 1090565, 11 pages
https://doi.org/10.1155/2018/1090565
Research Article

Label Distribution Learning by Regularized Sample Self-Representation

Lab of Granular Computing, Minnan Normal University, Zhangzhou, Fujian 363000, China

Correspondence should be addressed to Wenyuan Yang; yangwy@xmu.edu.cn

Received 19 October 2017; Revised 1 January 2018; Accepted 12 March 2018; Published 23 April 2018

Academic Editor: Wanquan Liu

Copyright © 2018 Wenyuan Yang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Multilabel learning, which records only whether each label is related or unrelated to an instance, can solve many ambiguity problems. Label distribution learning (LDL) additionally reflects the importance of each related label to an instance and offers a more general learning framework than multilabel learning. However, current LDL algorithms ignore the linear relationship between the label distribution and the features. In this paper, we propose a regularized sample self-representation (RSSR) approach for LDL. First, the label distribution problem is formalized by sample self-representation, whereby each label distribution can be represented as a linear combination of its relevant features. Second, the LDL problem is solved by $\ell_2$-norm least-squares and $\ell_{2,1}$-norm least-squares methods to reduce the effects of outliers and overfitting. The corresponding algorithms are named RSSR-LDL2 and RSSR-LDL21. Third, the proposed algorithms are compared with four state-of-the-art LDL algorithms using 12 public datasets and five evaluation metrics. The results demonstrate that the proposed algorithms can effectively identify the predictive label distribution and exhibit good performance in terms of distance and similarity evaluations.

1. Introduction

Multilabel learning allows more than one label to be associated with each instance [1]. In many practical applications, such as text categorization, ticket sales, and torch relays [2], objects have more than one semantic label, which is often described as the ambiguity of the object. As an effective learning paradigm, multilabel learning is applied in a variety of fields [3, 4], but it mainly considers whether each label is related or unrelated to an instance.

Though multilabel learning can solve many ambiguity problems, it is not well-suited to some practical problems [5, 6]. For example, consider the image recognition problem of a natural scene that is annotated with mostly water, lots of sky, some cloud, a little land, and a few trees. As can be seen from Figure 1, each label in Figure 1(a) should be assigned a different importance. Multilabel learning only records whether each label is related or unrelated to an instance; it does not capture the difference in importance among the labels [7]. This leads to the question of how to determine the importance of the different labels of an instance. Label distribution learning (LDL) can reflect the importance of each label of an instance in a way similar to a probability distribution. Figure 1(b) shows an example of a label distribution. This scenario is encountered in many types of multilabel tasks, such as age estimation [7], expression recognition [8], and the prediction of crowd opinions [9].

Figure 1: A natural scene image which has been annotated with water, sky, cloud, land, and trees.

In contrast to the multilabel learning output of a set of labels, the output of LDL is a probability distribution [10]. In recent years, LDL has become a popular topic of research as a new paradigm in machine learning. For instance, Geng et al. proposed the IIS-LDL and CPNN algorithms to estimate the ages of different faces [11]. Their approach achieves better results than previous age estimation algorithms, because they use more information in the training process. Thereafter, Geng developed a complete framework for LDL [10]. This framework not only defines LDL but also generalizes LDL algorithms and gives corresponding metrics to measure their performance.

At present, the parameter model for LDL is mainly based on Kullback-Leibler divergence [12]. Different models can be used to train the parameters, such as maximum entropy [13] or logistic regression [14], although there is no particular evidence to support their use. To some extent, the LDL process ignores the linear relationship between the features and the label distribution. Unlike other applications, LDL aims to predict the label distribution rather than the category. Thus, the overall label distribution can be effectively reconstructed from the corresponding samples.

In this paper, we propose an LDL method that uses the property of sample self-representation to reconstruct the labels. As the labels are similar to a probability distribution, but not actually a probability distribution, in LDL we can represent the labels through the feature matrix instead of the distance between two probability distributions. With the above considerations, we use a least-squares model to establish the objective function. That is, as far as possible, each label distribution is represented as a linear combination of its relevant features. The goal of this optimization model is to minimize the residuals. We combine LDL with sparsity regularization to optimize the model and then introduce regularization terms to solve the model. To solve the objective function efficiently, we use the $\ell_2$-norm and the $\ell_{2,1}$-norm as the regularization terms. The corresponding regularized sample self-representation algorithms are named RSSR-LDL2 and RSSR-LDL21. The proposed algorithms not only have strong interpretability but also avoid the problem of overfitting. In a series of experiments, we show that the proposed algorithms outperform four state-of-the-art algorithms on a variety of distance and similarity metrics. The results of the experimental analysis on public datasets show that the proposed method can effectively predict the labels.

The remainder of this paper is organized as follows. A brief review of related work on LDL and sparsity regularization is presented in Section 2. In Section 3, we introduce the LDL task and evaluation metrics. We describe the RSSR-LDL method and develop two algorithms in Section 4. Section 5 presents and analyzes the experimental results. Finally, we conclude this paper and present some ideas for future work in Section 6.

2. Related Work

The continued efforts of researchers have led to various LDL algorithms being proposed [9, 13, 15]. There are three main design strategies in the literature [10]: problem transformation, algorithm adaptation, and specialized algorithm design. Problem transformation (PT) takes the label distribution instances and transforms them into multilabel instances or single-label instances; PT-SVM and PT-Bayes are the representative algorithms in this class. Algorithm adaptation (AA) extends some existing supervised learning algorithms to deal with the problem of label distribution, such as the AA-kNN and AA-BP algorithms [10]. Unlike problem transformation and algorithm adaptation, specialized algorithm (SA) design sets up a direct model for the label distribution data. Typical algorithms include LDLogitBoost, based on logistic regression [16], SA-BFGS, based on maximum entropy [10], and DLDL, which combines LDL with deep learning [17, 18].

Unlike traditional clustering [19] or classification learning [20], the labels in LDL have similar patterns to probability distributions. According to the definition of the label distribution, we assume that there may be a function that matches the feature to the label. We find that each label can be well approximated by a linear combination of its relevant features.

Linear reconstruction is expressed, approximately rather than exactly, as $D \approx XW$, where $X$ is the feature matrix, $D$ is the label distribution matrix, and $W$ is a coefficient matrix. We then transform this into a minimum-residual problem and optimize it by the least-squares method. To avoid overfitting and to make the problem solvable, an $\ell_2$-norm regularization term is often added. In this way, the label distribution is constructed directly from the sample information of the data through the coefficient matrix. To find the corresponding relevant features while avoiding the effects of noise in high-dimensional data [21], we introduce sparse reconstruction [22, 23]. Sparse reconstruction adds a sparse regularization term, the $\ell_0$-norm or the $\ell_1$-norm, to the linear reconstruction. More recently, the sparse regularization terms $\ell_{2,0}$-norm and $\ell_{2,1}$-norm have also been proposed [24]. $\ell_{2,1}$-norm regularization is performed to select features across all of the data points with joint sparsity [25]. For a matrix $W = (w_{ij}) \in \mathbb{R}^{m \times c}$,

$$\|W\|_{2,1} = \sum_{i=1}^{m} \sqrt{\sum_{j=1}^{c} w_{ij}^{2}} = \sum_{i=1}^{m} \|w^{i}\|_{2}, \tag{1}$$

where $w^{i}$ denotes the $i$th row of $W$.
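As a small illustration of the row-wise structure of the $\ell_{2,1}$-norm, the sketch below computes it with NumPy, assuming the usual definition from [25] (the sum of the $\ell_2$-norms of the rows of a matrix). The function name `l21_norm` and the example matrix are illustrative and are not taken from the paper.

```python
import numpy as np

def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W (assumed definition, as in [25])."""
    W = np.asarray(W, dtype=float)
    return float(np.sum(np.sqrt(np.sum(W ** 2, axis=1))))

# Illustrative matrix: one row per feature, one column per label.
W = np.array([[3.0, 4.0],
              [0.0, 0.0],   # an all-zero row contributes nothing to the norm
              [1.0, 0.0]])
value = l21_norm(W)  # the row norms are 5, 0 and 1
```

Because the penalty is a sum over rows, it can be lowered by driving entire rows of the matrix to zero, which is why this norm is associated with joint sparsity across all data points.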

Sparsity reconstruction is widely used in machine learning, especially for data dimensionality reduction [26, 27]. For example, Cai proposed the MCFS algorithm by using $\ell_1$-regularized least-squares to deal with multicluster data [28], and Zhu et al. proposed the RMR algorithm based on regularized self-representation for feature selection [29]. Nie developed the RFS and JELSR algorithms using $\ell_{2,1}$-regularized least-squares to optimize the objective function [25, 30]. Furthermore, $\ell_1$-SVM and sparse logistic regression [31] have been shown to be effective. In general, linear reconstruction and sparsity reconstruction produce good performance in feature selection and classification [32, 33]. Next, we propose a label distribution learning method based on linear reconstruction and sparsity reconstruction.

3. The Proposed Model

The goal of LDL is to obtain a set of probability distributions. Therefore, LDL is different from previous approaches in terms of the problem statement. In this section, the problem statement is briefly reviewed and the proposed model is introduced.

LDL is the process of describing an instance more naturally using labels [10]. We assign a value $d_{x}^{y}$ to each of the possible labels $y$. This value represents the extent to which the label $y$ describes instance $x$, that is, the description degree. Taking account of all the labels gives a complete description of the sample. Therefore, it is assumed that $d_{x}^{y} \in [0, 1]$ and $\sum_{y} d_{x}^{y} = 1$. That is, the data for $d_{x}^{y}$ have a form that is similar to a probability distribution for an instance. The learning process based on such data is called label distribution learning.

We use $X = [x_1; x_2; \ldots; x_n]$ to represent the instances, $x_i \in \mathbb{R}^{m}$, and $X \in \mathbb{R}^{n \times m}$. Let $Y = \{y_1, y_2, \ldots, y_c\}$ denote the complete class labels, where $c$ is the number of labels and $y_j$ represents the $j$th label. The corresponding label distribution is $D = [D_1; D_2; \ldots; D_n] \in \mathbb{R}^{n \times c}$, in which $D_i = (d_{x_i}^{y_1}, d_{x_i}^{y_2}, \ldots, d_{x_i}^{y_c})$ represents the label distribution of instance $x_i$. Concretely, $d_{x_i}^{y_j}$ represents the extent to which the label $y_j$ describes the instance $x_i$. Therefore, we can represent the training set of label distribution learning in this paper as $S = \{(x_1, D_1), (x_2, D_2), \ldots, (x_n, D_n)\}$. In addition, the test data is defined as $\hat{X}$ and the corresponding predicted label distribution is defined as $\hat{D}$.
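To make the data layout concrete, the following sketch stores a toy training set as two NumPy arrays and checks the two conditions that the LDL framework of [10] places on description degrees (each value lies in [0, 1] and each row sums to 1). The array names, the helper `is_label_distribution`, and all numeric values are made up for illustration.

```python
import numpy as np

def is_label_distribution(D, atol=1e-8):
    """Check that every row of D has entries in [0, 1] that sum to 1."""
    D = np.asarray(D, dtype=float)
    in_range = np.all((D >= -atol) & (D <= 1.0 + atol))
    rows_sum_to_one = np.allclose(D.sum(axis=1), 1.0, atol=atol)
    return bool(in_range and rows_sum_to_one)

# Toy training set (made-up values): 3 instances, 2 features, 3 labels.
X = np.array([[0.2, 1.0],
              [0.5, 0.1],
              [0.9, 0.4]])          # one row per instance
D = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.2, 0.6],
              [0.1, 0.8, 0.1]])     # one label distribution per instance

valid = is_label_distribution(D)
```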

We combine LDL with regularized sample self-representation to give the RSSR-LDL model. For RSSR-LDL, each sample $x_i$ and the corresponding description degree $D_i$ have the following relationship:

$$D_i = x_i W, \tag{2}$$

where $W$ is the transformation matrix from the sample to the description degree and $W \in \mathbb{R}^{m \times c}$. According to the definition, this is equivalent to

$$D = XW. \tag{3}$$

In general, $X \in \mathbb{R}^{n \times m}$ is not a square matrix ($n \neq m$), and (3) cannot be solved exactly [34].

In order to solve for the optimal $W$, we introduce the residual sum function $Q(W)$:

$$Q(W) = \|XW - D\|_F^2. \tag{4}$$

The optimal $W$ is the one at which $Q(W)$ attains its minimum value, so the objective function of the model is

$$\min_{W} \|XW - D\|_F^2. \tag{5}$$

Setting the derivative of (4) with respect to $W$ to zero, we obtain

$$W = (X^{T}X)^{-1}X^{T}D. \tag{6}$$

When $X$ is not full rank, or there is a significant linear correlation between its columns, the determinant $|X^{T}X|$ will be close to 0, which makes the calculation of $(X^{T}X)^{-1}$ an ill-posed problem. This will introduce a large error into the calculation of $W$, resulting in a lack of stability and reliability in (5). Therefore, we introduce a regularization term $R(W)$ with parameter $\lambda$ to optimize the objective function.
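The instability described above can be observed numerically. The sketch below builds a feature matrix in which one column is almost a copy of another and compares the condition number of the Gram matrix with and without a small multiple of the identity added. The synthetic data, the random seed, and the value 0.1 for the regularization parameter are arbitrary choices for illustration, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 4
X = rng.normal(size=(n, m))
# Make the last column nearly identical to the first (strong linear correlation).
X[:, 3] = X[:, 0] + 1e-6 * rng.normal(size=n)

gram = X.T @ X                      # the matrix that has to be inverted
lam = 0.1                           # illustrative regularization parameter

cond_plain = np.linalg.cond(gram)                        # very large
cond_regularized = np.linalg.cond(gram + lam * np.eye(m))  # much smaller
```

A large condition number means that small perturbations of the data can cause large changes in the solution, which is the lack of stability that the regularization term is meant to remove.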

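In other words,

$$\min_{W} \|XW - D\|_F^2 + \lambda R(W). \tag{7}$$

There are several possible regularizations [25]:

$$R_1(W) = \|W\|_1, \quad R_2(W) = \|W\|_F^2, \quad R_3(W) = \|W\|_{2,1}. \tag{8}$$

$R_1(W)$ is the LASSO regularization. $R_2(W)$, the sum of the squared entries of $W$, is ridge regression, also known as Tikhonov regularization; it is the most frequently used regularization method for ill-posed problems. $R_3(W)$ is a newer joint regularization [25, 29].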

4. Regularized Model and Algorithm

To solve the objective function efficiently, we use the $\ell_2$-norm and the $\ell_{2,1}$-norm to regularize the RSSR-LDL model, resulting in RSSR-LDL2 and RSSR-LDL21. The RSSR-LDL2 and RSSR-LDL21 algorithms are presented in this section.

4.1. Regularized Sample Self-Representation by $\ell_2$-Norm

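We use the $\ell_2$-norm of $W$ to solve the RSSR-LDL problem. For convenience, we use the regularization term $R(W) = \|W\|_F^2$ in (7). Then, (7) is as follows:

$$\min_{W} \|XW - D\|_F^2 + \lambda \|W\|_F^2. \tag{9}$$

Because (9) is smooth, it can be solved by setting its derivative with respect to $W$ to zero, as in (10):

$$2X^{T}(XW - D) + 2\lambda W = 0, \tag{10}$$

or equivalently,

$$(X^{T}X + \lambda I)W = X^{T}D. \tag{11}$$

For $\lambda > 0$, $X^{T}X + \lambda I$ is always a nonsingular matrix; then

$$W = (X^{T}X + \lambda I)^{-1}X^{T}D. \tag{12}$$

We can predict the label distribution of the test dataset $\hat{X}$ using the learned matrix $W$. Specifically, the predictive label distribution is defined as follows:

$$\hat{D} = \hat{X}W. \tag{13}$$

Based on the above theoretical analysis, we summarize the procedure in Algorithm 1.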

Algorithm 1: Regularized sample self-representation by $\ell_2$-norm (RSSR-LDL2).

Algorithm 1 does not include an iterative process. In other words, the matrix $W$ can be solved directly, which makes the algorithm faster. Although this approach is efficient and easy to understand, it is not very accurate. Thus, in the next section, we use the $\ell_{2,1}$-norm of $W$ to solve the RSSR-LDL problem.
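The listing of Algorithm 1 is not reproduced in this text, so the following is only a minimal sketch of a non-iterative solver of this kind. It assumes that the closed-form solution takes the standard ridge-regression form $W = (X^{T}X + \lambda I)^{-1}X^{T}D$ and that prediction is a single matrix product with the test features. The function names `rssr_ldl2_fit` and `rssr_ldl_predict`, the default value of `lam`, and the synthetic data are hypothetical.

```python
import numpy as np

def rssr_ldl2_fit(X, D, lam=1.0):
    """Closed-form ridge solution W = (X^T X + lam*I)^{-1} X^T D (assumed form)."""
    m = X.shape[1]
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ D)

def rssr_ldl_predict(X_test, W):
    """Predicted label distributions as a linear map of the test features."""
    return X_test @ W

# Synthetic example: 20 instances, 5 features, 3 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
D = rng.dirichlet(np.ones(3), size=20)   # each row is a valid label distribution

W = rssr_ldl2_fit(X, D, lam=0.5)
D_hat = rssr_ldl_predict(X, W)
```

Note that a purely linear prediction is not guaranteed to be nonnegative or to sum to one; whether and how the outputs are renormalized is not stated in the text above, so no such step is included here.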

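4.2. Regularized Sample Self-Representation by $\ell_{2,1}$-Norm

Combining the characteristics of the $\ell_1$-norm and the $\ell_2$-norm, we choose the $\ell_{2,1}$-norm of $W$, that is, $R(W) = \|W\|_{2,1}$, as the regularization term. This gives the RSSR-LDL21 algorithm. With this choice, and with the residual also measured in the $\ell_{2,1}$-norm as in [25], the objective optimization function of (7) is shown in the following expression:

$$\min_{W} \|XW - D\|_{2,1} + \lambda \|W\|_{2,1}, \tag{14}$$

which can be transformed into

$$\min_{W} \frac{1}{\lambda}\|XW - D\|_{2,1} + \|W\|_{2,1}. \tag{15}$$

According to [25], this can be further transformed into

$$\min_{U} \|U\|_{2,1} \quad \text{s.t.} \quad AU = D, \tag{16}$$

where $E = (D - XW)/\lambda$, $A = [X \;\; \lambda I] \in \mathbb{R}^{n \times (m+n)}$, $U = [W; E] \in \mathbb{R}^{(m+n) \times c}$, and $I \in \mathbb{R}^{n \times n}$ is the identity matrix. The problem in (16) becomes one of solving a Lagrangian function; that is,

$$L(U) = \mathrm{Tr}(U^{T}GU) - \mathrm{Tr}(\Lambda^{T}(AU - D)), \tag{17}$$

where $\Lambda$ is the matrix of Lagrange multipliers and $G$ is a diagonal matrix with $g_{ii} = 1/(2\|u^{i}\|_2)$, in which $u^{i}$ is the $i$th row of $U$. The solution in (17) is convergent [25], so the iteration is viable. The RSSR-LDL21 algorithm is shown in Algorithm 2.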

Algorithm 2: Regularized sample self-representation by $\ell_{2,1}$-norm (RSSR-LDL21).

In Algorithm 2, the iteration is repeated until the convergence criterion is met. In each iteration, $U$ is calculated with the previous $G$ and $G$ is calculated with the current $U$.
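The listing of Algorithm 2 is not reproduced in this text. The sketch below is therefore an assumption-laden illustration of the alternating scheme, written after the iteratively reweighted method of [25] rather than copied from the paper. It assumes the constrained form "minimize $\|U\|_{2,1}$ subject to $AU = D$" with $A = [X \;\; \lambda I]$ and $U = [W; E]$, the update $U = G^{-1}A^{T}(AG^{-1}A^{T})^{-1}D$, and diagonal weights $g_{ii} = 1/(2\|u^{i}\|_2)$. The function name, the stopping rule, and the small constant `eps` that guards against zero rows are hypothetical additions.

```python
import numpy as np

def l21_norm(M):
    """Sum of the Euclidean norms of the rows of M."""
    return float(np.sum(np.linalg.norm(M, axis=1)))

def rssr_ldl21_fit(X, D, lam=1.0, n_iter=50, tol=1e-6, eps=1e-8):
    """Reweighted solver sketch for min ||U||_{2,1} s.t. [X, lam*I] U = D."""
    n, m = X.shape
    A = np.hstack([X, lam * np.eye(n)])      # shape n x (m + n)
    g_inv = np.ones(m + n)                   # diagonal of G^{-1}; start from G = I
    history = []
    U = np.zeros((m + n, D.shape[1]))
    for _ in range(n_iter):
        AG = A * g_inv                       # A @ diag(g_inv) via broadcasting
        # U = G^{-1} A^T (A G^{-1} A^T)^{-1} D, computed with the previous G
        U = AG.T @ np.linalg.solve(AG @ A.T, D)
        history.append(l21_norm(U))
        # 1 / g_ii = 2 * ||u^i||_2, smoothed by eps; computed with the current U
        g_inv = 2.0 * np.sqrt(np.sum(U ** 2, axis=1) + eps)
        if len(history) > 1 and abs(history[-2] - history[-1]) < tol:
            break
    W = U[:m]                                # first m rows hold the transformation
    return W, U, A, history

# Synthetic example: 30 instances, 5 features, 4 labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
D = rng.dirichlet(np.ones(4), size=30)
W, U, A, history = rssr_ldl21_fit(X, D, lam=1.0, n_iter=30)
```

Every iterate satisfies the constraint $AU = D$ by construction, so the quantity to watch across iterations is the $\ell_{2,1}$-norm of $U$, which is recorded in `history`.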

5. Experiments

To demonstrate the performance of the proposed RSSR-LDL2 and RSSR-LDL21 algorithms, we apply them to gene expression levels, facial expression, and movie score problems. In this section, we use five evaluation metrics to test the proposed algorithms on 12 publicly available datasets (http://cse.seu.edu.cn/PersonalPage/xgeng/LDL/index.htm). The proposed algorithms are also compared with four state-of-the-art LDL algorithms.

5.1. Evaluation Metrics

In LDL, there are multiple labels associated with each instance, and these reflect the importance of each label for the instance. As a result, performance evaluation is different from that of both single- and multilabel learning. Because the label distribution is similar to a probability distribution, we use the similarity and distance between the original distribution and the predicted distribution to evaluate the effectiveness of LDL algorithms. There are many measures of the distance and similarity between probability distributions. In [35], 41 kinds of distance and similarity evaluation metrics were identified across eight classes. The various distance/similarity measures offer different performances in terms of comparing two probability distributions.

According to the agglomerative single linkage with average clustering method [36], screening rules [10], and experimental conditions, we selected five measures: the Chebyshev distance [37], Clark distance [38], Canberra distance [39], intersection similarity [36], and cosine similarity [39]. The related names and expressions are listed in Table 1. A “↓” after a distance measure indicates that smaller values are better, whereas a “↑” after a similarity measure indicates that larger values are better.

Table 1: Evaluation metrics description.
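The expressions of Table 1 are not reproduced in this text. As a hedged stand-in, the sketch below implements the five measures using their commonly cited forms for two discrete distributions, as surveyed in [35]; these forms are assumed to match the table. The function names and the handling of zero denominators (a term is counted as zero when its denominator is zero) are illustrative choices.

```python
import numpy as np

def _pair(d, p):
    return np.asarray(d, dtype=float), np.asarray(p, dtype=float)

def _safe_ratio(num, den):
    """Element-wise num / den, with terms whose denominator is 0 set to 0."""
    out = np.zeros_like(num)
    np.divide(num, den, out=out, where=den != 0)
    return out

def chebyshev(d, p):        # distance, smaller is better
    d, p = _pair(d, p)
    return float(np.max(np.abs(d - p)))

def clark(d, p):            # distance, smaller is better
    d, p = _pair(d, p)
    return float(np.sqrt(np.sum(_safe_ratio((d - p) ** 2, (d + p) ** 2))))

def canberra(d, p):         # distance, smaller is better
    d, p = _pair(d, p)
    return float(np.sum(_safe_ratio(np.abs(d - p), d + p)))

def intersection(d, p):     # similarity, larger is better
    d, p = _pair(d, p)
    return float(np.sum(np.minimum(d, p)))

def cosine(d, p):           # similarity, larger is better
    d, p = _pair(d, p)
    return float(d @ p / (np.linalg.norm(d) * np.linalg.norm(p)))

# Made-up real and predicted distributions over three labels.
real = [0.5, 0.3, 0.2]
pred = [0.4, 0.4, 0.2]
scores = {f.__name__: f(real, pred)
          for f in (chebyshev, clark, canberra, intersection, cosine)}
```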
5.2. Experimental Setting

Experiments are conducted on 12 public datasets. In the movie dataset, each instance represents the characteristics of a movie together with the scores of the categories to which the movie may belong. The SBU-3DFE and SJAFFE datasets consist of facial expression images; each instance represents a facial expression together with the scores of the possible expression classes. The Yeast family contains nine datasets of yeast gene expression levels; each instance represents the expression level of a gene at a certain time. These datasets are described in Table 2.

Table 2: Data description.

To verify the effectiveness and performance of our LDL method, we compared the RSSR-LDL2 and RSSR-LDL21 algorithms with four existing LDL algorithms. According to [10], we selected comparative algorithms that use different strategies.

PT-SVM. PT-SVM is applied to training sets in which label distribution is obtained by the problem resampling method [40]. PT-SVM uses pairwise coupling to solve the multiclassification problem [41]. This algorithm calculates the posterior probability of each class as the description degree of a label.

AA-BP. AA-BP is a three-layer backpropagation neural network. This algorithm has one input unit per feature and one output unit per label, which receive the instance $x$ and output its description degrees, respectively.

SA-IIS. SA-IIS uses the maximum entropy to solve the LDL problem. The optimization strategy of this algorithm is similar to that of the scaling-based IIS [42].

SA-BFGS. SA-BFGS is an improved algorithm based on SA-IIS. This improved algorithm employs an effective quasi-Newton method and is more efficient than the standard line search approach.

For the parameter settings in these four algorithms, we refer to [10]. For our algorithms, we tuned the regularization parameter $\lambda$ over a set of candidate values, following [29], and report the best results. The performance of the above LDL algorithms was evaluated by considering the distance and similarity between the original label distribution and the predicted label distribution.

5.3. Results Analysis of Experiments

In this section, the performance of the proposed algorithms is compared with that of four existing state-of-the-art LDL algorithms in terms of five evaluation metrics. We also present the predicted label distribution given by the six algorithms and the real label distribution.

5.3.1. Distance and Similarity Comparison

To verify the advantages of the proposed RSSR-LDL2 and RSSR-LDL21, experiments were conducted on 12 public datasets. Each experiment used tenfold cross-validation [43, 44], and the mean value and standard deviation of each evaluation metric were recorded. Because many results were close to zero, they are reported as (mean ± std) × $10^{3}$. Distance mainly measures the size of individual differences, whereas similarity reflects the trend and direction of the vectors. Therefore, we use both distance and similarity to demonstrate the superiority of the proposed algorithms.
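The evaluation protocol described above can be sketched as follows. The code runs a tenfold cross-validation and reports the mean and standard deviation of one metric over the folds. It is only an illustration: it uses a plain ridge-regression fit as a stand-in learner, the Chebyshev distance as the metric, and synthetic data, and the function names, the fold shuffling, and the seed are assumptions rather than the authors' setup.

```python
import numpy as np

def ridge_fit(X, D, lam):
    """Stand-in learner: closed-form ridge regression from features to labels."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ D)

def chebyshev_rows(D_true, D_pred):
    """Chebyshev distance between matching rows of two matrices."""
    return np.max(np.abs(D_true - D_pred), axis=1)

def tenfold_cv(X, D, lam=1.0, n_folds=10, seed=0):
    """Return (mean, std) of the per-fold average Chebyshev distance."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        W = ridge_fit(X[train], D[train], lam)
        scores.append(chebyshev_rows(D[test], X[test] @ W).mean())
    return float(np.mean(scores)), float(np.std(scores))

# Synthetic data: 100 instances, 6 features, 4 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
D = rng.dirichlet(np.ones(4), size=100)
mean_score, std_score = tenfold_cv(X, D, lam=1.0)
```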

The results for the Chebyshev distance, Clark distance, Canberra metric, intersection similarity, and cosine coefficient are presented in Tables 3–7, respectively. In each table, the best results are given in bold and the second-best results are italicized (if the mean is the same, the algorithm with the smaller standard deviation is considered to be better). The first three evaluation metrics measure distance, and so smaller values are better; the latter two measure similarity, and so larger values are better. From these results, we can see that RSSR-LDL21 achieves the best performance of all the algorithms and RSSR-LDL2 is better than the others.

Table 3: Chebyshev distance ↓ (mean ± std) × $10^{3}$ of different algorithms on the twelve datasets. The best results are highlighted in bold and the second-best results are italicized.
Table 4: Clark distance ↓ (mean ± std) × $10^{3}$ of different algorithms on the twelve datasets. The best results are highlighted in bold and the second-best results are italicized.
Table 5: Canberra metric ↓ (mean ± std) × $10^{3}$ of different algorithms on the twelve datasets. The best results are highlighted in bold and the second-best results are italicized.
Table 6: Intersection ↑ (mean ± std) × $10^{3}$ of different algorithms on the twelve datasets. The best results are highlighted in bold and the second-best results are italicized.

From the results in Tables 4–6, our algorithms have obvious advantages. In particular, the RSSR-LDL21 algorithm offers better performance than the other algorithms on almost every dataset. The SA-BFGS algorithm achieves equivalent performance in terms of the Chebyshev distance (Table 3) and cosine coefficient (Table 7) on some datasets, mainly those for the yeast genes. In addition, our algorithms not only produce good results, but they are also very stable, especially RSSR-LDL21.

Table 7: Cosine ↑ (mean ± std) × $10^{3}$ of different algorithms on the twelve datasets. The best results are highlighted in bold and the second-best results are italicized.

The proposed algorithms perform differently on the different datasets. The results show that the RSSR-LDL approach has a clear advantage over the other algorithms on the movie dataset. This is because the characteristics of sparse representation offer obvious advantages when there are a large number of features. The proposed algorithms continue to offer some advantages over the other algorithms on the facial expression datasets, although some results are similar to those given by the SA-BFGS algorithm. As the number of features in the yeast datasets is small, our algorithms do not show the best performance on all evaluation metrics but still achieve performance similar to that of the SA-BFGS algorithm. Moreover, the proposed algorithms perform better than the other comparative algorithms, and the advantage is most obvious on high-dimensional data.

5.3.2. Visualization of Label Distributions

Unlike classification learning and clustering, LDL reflects the importance of each label for an instance. Hence, our ultimate goal is no longer categorization but a sort of probability distribution. Two typical examples of the original label distribution and that predicted by the six LDL algorithms are presented in Table 8. We select one sample from each dataset as a demonstration.

Table 8: The real and predictive distribution of two typical examples on six algorithms.

In Table 8, the second and third columns represent the real label distribution and the predicted label distributions given by the six different algorithms for the movie and SBU-3DFE datasets, respectively. Each point in a subgraph of Table 8 represents the value of the corresponding label, and the curve shows the trend in the label distribution. According to the pattern of the points in each graph, the movie distribution was fitted using a Gaussian function and the SBU-3DFE distribution was fitted with a smooth spline.

Table 8 indicates that the proposed algorithms perform well. On the one hand, the RSSR-LDL21 algorithm has a clear advantage, with the values and trend being almost consistent with the real label distribution. On the other hand, the RSSR-LDL2 algorithm is not as good as RSSR-LDL21 but achieves performance comparable to that of SA-BFGS, which is obviously better than the other three comparative algorithms in terms of distance and similarity.

5.3.3. Parameter Sensitivity

Like many other learning algorithms, RSSR-LDL has a parameter that must be tuned in advance. We tuned $\lambda$ and recorded the best results, as given in Tables 3–8. For RSSR-LDL2, the Clark distance obtained with different values of $\lambda$ on three representative datasets, one from each of the three data types, is shown in Figure 2. We observe that RSSR-LDL2 is relatively insensitive to $\lambda$ for the facial expression and gene expression datasets, whereas it is slightly more sensitive for the movie score dataset. Interestingly, Figure 3 shows that the sensitivity of RSSR-LDL21 to $\lambda$ is similar to that of RSSR-LDL2.
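A sensitivity study of this kind can be sketched as a sweep over candidate values of the regularization parameter, recording the Clark distance on held-out data for each value. In the sketch below the grid of powers of ten, the train/test split, the ridge-regression stand-in learner, and the synthetic data are all hypothetical; the candidate values actually used by the authors are not given in this text.

```python
import numpy as np

def ridge_fit(X, D, lam):
    """Stand-in learner: closed-form ridge regression from features to labels."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ D)

def mean_clark(D_true, D_pred):
    """Average Clark distance over rows; terms with a zero denominator count as 0."""
    num = (D_true - D_pred) ** 2
    den = (D_true + D_pred) ** 2
    ratio = np.zeros_like(num)
    np.divide(num, den, out=ratio, where=den != 0)
    return float(np.mean(np.sqrt(ratio.sum(axis=1))))

def lambda_sweep(X_train, D_train, X_test, D_test, lambdas):
    """Map each candidate parameter value to its held-out Clark distance."""
    results = {}
    for lam in lambdas:
        W = ridge_fit(X_train, D_train, lam)
        results[lam] = mean_clark(D_test, X_test @ W)
    return results

# Hypothetical grid and synthetic data (60 instances, 5 features, 4 labels).
lambdas = [10.0 ** k for k in range(-3, 4)]
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
D = rng.dirichlet(np.ones(4), size=60)
results = lambda_sweep(X[:40], D[:40], X[40:], D[40:], lambdas)
best_lam = min(results, key=results.get)
```

Plotting `results` against the logarithm of the parameter gives a curve of the same kind as the sensitivity figures described in this section.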

Figure 2: Clark distance of RSSR-LDL2 with respect to $\lambda$.
Figure 3: Clark distance of RSSR-LDL21 with respect to $\lambda$.

6. Conclusion and Future Work

LDL not only deals with instances associated with multiple labels but also reflects the degree of importance of each label for the instance. In this paper, we proposed a new criterion for LDL using regularized sample self-representation. We reconstructed the labels with features and a transformation matrix and described each label as a linear combination of features. Then, we used the $\ell_2$-norm and the $\ell_{2,1}$-norm as regularization terms to optimize the transformation matrix. We conducted experiments on 12 real datasets and compared the proposed algorithms with four existing LDL algorithms using five evaluation metrics. The experimental results show that the proposed algorithms are efficient and accurate. In future work, we will use a least-angle regression model to develop a better generalization model for solving practical problems.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is in part supported by National Science Foundation of China (under Grant nos. 61379049, 61379089, and 61703196).

References

  1. M.-L. Zhang and K. Zhang, “Multi-label learning by exploiting label dependency,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD-2010, pp. 999–1007, USA, July 2010.
  2. M.-L. Zhang and Z.-H. Zhou, “A review on multi-label learning algorithms,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 8, pp. 1819–1837, 2014.
  3. Z.-H. Zhou and M.-L. Zhang, “Multi-instance multi-label learning with application to scene classification,” in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), pp. 1609–1616, December 2006.
  4. M.-L. Zhang and L. Wu, “LIFT: multi-label learning with label-specific features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 1, pp. 107–120, 2015.
  5. Y.-K. Li, M.-L. Zhang, and X. Geng, “Leveraging implicit relative labeling-importance information for effective multi-label learning,” in Proceedings of the 15th IEEE International Conference on Data Mining, ICDM 2015, pp. 251–260, USA, November 2015.
  6. W. Zhu, “Relationship between generalized rough sets based on binary relation and covering,” Information Sciences, vol. 179, no. 3, pp. 210–225, 2009.
  7. Z. He, X. Li, Z. Zhang et al., “Data-dependent label distribution learning for age estimation,” IEEE Transactions on Image Processing, vol. 26, no. 8, pp. 3846–3858, 2017.
  8. Y. Zhou, H. Xue, and X. Geng, “Emotion distribution recognition from facial expressions,” in Proceedings of the 23rd ACM International Conference on Multimedia, MM 2015, pp. 1247–1250, Australia, October 2015.
  9. X. Geng and P. Hou, “Pre-release prediction of crowd opinion on movies by label distribution learning,” in Proceedings of the 24th International Joint Conference on Artificial Intelligence, IJCAI 2015, pp. 3511–3517, Argentina, July 2015.
  10. X. Geng, “Label Distribution Learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 7, pp. 1734–1748, 2016.
  11. X. Geng, C. Yin, and Z.-H. Zhou, “Facial age estimation by learning from label distributions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 10, pp. 2401–2412, 2013.
  12. T. van Erven and P. Harremoës, “Rényi divergence and Kullback-Leibler divergence,” IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, 2014.
  13. X. Geng, Q. Wang, and Y. Xia, “Facial age estimation by adaptive label distribution learning,” in Proceedings of the 22nd International Conference on Pattern Recognition, ICPR 2014, pp. 4465–4470, Sweden, August 2014.
  14. J. Pearce and S. Ferrier, “Evaluating the predictive performance of habitat models developed using logistic regression,” Ecological Modelling, vol. 133, no. 3, pp. 225–245, 2000.
  15. Z. Zhang, M. Wang, and X. Geng, “Crowd counting in public video surveillance by label distribution learning,” Neurocomputing, vol. 166, pp. 151–163, 2015.
  16. C. Xing, X. Geng, and H. Xue, “Logistic boosting regression for label distribution learning,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '16), pp. 4489–4497, July 2016.
  17. X. Yang, B.-B. Gao, C. Xing et al., “Deep Label Distribution Learning for Apparent Age Estimation,” in Proceedings of the 15th IEEE International Conference on Computer Vision Workshops, ICCVW 2015, pp. 344–350, Chile, December 2015.
  18. B.-B. Gao, C. Xing, C.-W. Xie, J. Wu, and X. Geng, “Deep label distribution learning with label ambiguity,” IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 2825–2838, 2017.
  19. E. Elhamifar and R. Vidal, “Sparse subspace clustering: algorithm, theory, and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2765–2781, 2013.
  20. H. Zhao, P. Zhu, P. Wang, and Q. Hu, “Hierarchical Feature Selection with Recursive Regularization,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pp. 3483–3489, Melbourne, Australia, August 2017.
  21. X. Luo, X. Chang, and X. Ban, “Regression and classification using extreme learning machine based on L1-norm and L2-norm,” Neurocomputing, vol. 174, pp. 179–186, 2016.
  22. C. Hou, F. Nie, X. Li, D. Yi, and Y. Wu, “Joint embedding learning and sparse regression: A framework for unsupervised feature selection,” IEEE Transactions on Cybernetics, vol. 44, no. 6, pp. 793–804, 2014.
  23. B. Zhang, A. Perina, V. Murino, and A. Del Bue, “Sparse representation classification with manifold constraints transfer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, pp. 4557–4565, USA, June 2015.
  24. D. Luo, C. Ding, and H. Huang, “Towards structural sparsity: An explicit ℓ2/ℓ0 approach,” in Proceedings of the 10th IEEE International Conference on Data Mining, ICDM 2010, pp. 344–353, Australia, December 2010.
  25. F. Nie, H. Huang, X. Cai, and C. H. Ding, “Efficient and robust feature selection via joint ℓ2,1-norms minimization,” in Advances in Neural Information Processing Systems, pp. 1813–1821, MIT Press, 2010.
  26. S. Mosci, L. Rosasco, M. Santoro, A. Verri, and S. Villa, “Solving structured sparsity regularization with proximal methods,” Lecture Notes in Computer Science, vol. 6322, no. 2, pp. 418–433, 2010.
  27. F. Wu, Y. Han, Q. Tian, and Y. Zhuang, “Multi-label boosting for image annotation by structural grouping sparsity,” in Proceedings of the 18th ACM International Conference on Multimedia (MM '10), pp. 15–24, Italy, October 2010.
  28. D. Cai, C. Zhang, and X. He, “Unsupervised feature selection for multi-cluster data,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '10), pp. 333–342, ACM, Washington, DC, USA, July 2010.
  29. P. Zhu, W. Zuo, L. Zhang, Q. Hu, and S. C. K. Shiu, “Unsupervised feature selection by regularized self-representation,” Pattern Recognition, vol. 48, no. 2, pp. 438–446, 2015.
  30. C. Hou, F. Nie, and D. Yi, “Feature selection via joint embedding learning and sparse regression,” in Proceedings of the 22nd International Joint Conference on Artificial Intelligence, vol. 22, pp. 1324–1329, 2011.
  31. S. K. Shevade and S. S. Keerthi, “A simple and efficient algorithm for gene selection using sparse logistic regression,” Bioinformatics, vol. 19, no. 17, pp. 2246–2253, 2003.
  32. J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, “1-norm support vector machines,” in Conference on Neural Information Processing Systems, vol. 15, pp. 49–56, 2003.
  33. C.-N. Li, Y.-H. Shao, and N.-Y. Deng, “Robust L1-norm two-dimensional linear discriminant analysis,” Neural Networks, vol. 65, pp. 92–104, 2015.
  34. W. J. McCalla, Linear equation solution, vol. 37, Springer, USA, 1988.
  35. S. Cha, “Comprehensive survey on distance/similarity measures between probability density functions,” International Journal of Mathematical Models and Methods in Applied Sciences, vol. 1, no. 2, pp. 300–307, 2007.
  36. R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, New York, NY, USA, 2nd edition, 2001.
  37. F. Fahroo and I. M. Ross, “Direct trajectory optimization by a Chebyshev pseudospectral method,” in Proceedings of the American Control Conference, vol. 6, pp. 3860–3864, 2000.
  38. K. X. Pearson, “On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling,” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 50, no. 302, pp. 157–175, 1900. View at Google Scholar
  39. E. Deza and M.-M. Deza, Dictionary of distances, Elsevier, 2006. View at Publisher · View at Google Scholar · View at Scopus
  40. H.-T. Lin, C.-J. Lin, and R. C. Weng, “A note on Platt's probabilistic outputs for support vector machines,” Machine Learning, vol. 68, no. 3, pp. 267–276, 2007. View at Publisher · View at Google Scholar · View at Scopus
  41. T.-F. Wu, C.-J. Lin, and R. C. Weng, “Probability estimates for multi-class classification by pairwise coupling,” Journal of Machine Learning Research (JMLR), vol. 5, pp. 975–1005, 2004. View at Google Scholar · View at MathSciNet
  42. R. Malouf, “A comparison of algorithms for maximum entropy parameter estimation,” in Proceedings of the proceeding of the 6th conference, pp. 1–7, Not Known, August 2002. View at Publisher · View at Google Scholar
  43. A. M. Chekroud, R. J. Zotti, Z. Shehzad et al., “Cross-trial prediction of treatment outcome in depression: A machine learning approach,” The Lancet Psychiatry, vol. 3, no. 3, pp. 243–250, 2016. View at Publisher · View at Google Scholar · View at Scopus
  44. C. Xu, T. Liu, D. Tao, and C. Xu, “Local Rademacher complexity for multi-label learning,” IEEE Transactions on Image Processing, vol. 25, no. 3, pp. 1495–1507, 2016. View at Publisher · View at Google Scholar · View at MathSciNet · View at Scopus