Abstract

As a specific case of image recognition, zero-shot image classification is difficult to solve because its training set cannot cover all the categories in the testing set. From the viewpoint of human visual recognition, objects can be recognized through visible and nameable descriptions of their properties. As semantic descriptions of object properties, attributes can serve as a bridge between seen and unseen categories and can therefore be used for zero-shot image classification. Two kinds of attributes are mainly used for zero-shot classification, binary attributes and relative attributes, where relative attributes can capture more general semantic relationships than binary ones. However, relative attributes do not always work in zero-shot classification for categories whose attributes have similar relative strengths. To address this defect of relative attributes in describing similar categories, we propose to construct Hybrid Relative Attributes based on Sparse Coding (SC-HRA). First, sparse coding is applied to low-level features to obtain nonsemantic relative attributes, which are a necessary complement to the existing relative attributes. These are then integrated with the relative attributes to form the hybrid relative attributes (HRA). HRA ranking functions are subsequently learned by relative attribute learning. Finally, the class label is obtained according to the predicted ranking results of HRA and the ranking relations of HRA among the categories. To verify the effectiveness of SC-HRA, extensive experiments are conducted on datasets of faces and natural scenes. The results show that SC-HRA achieves higher classification accuracy and AUC values.

1. Introduction

Image recognition has attracted the attention of many researchers and has made considerable progress in recent years. In general, the methods applied to image recognition include supervised, semisupervised, and unsupervised learning. However, because labeling images is costly, the labeled images often cannot cover all the categories to be discriminated, and for some categories no labeled image can be obtained at all. Zero-shot image classification is therefore proposed to solve the problem in which categories unseen in the training set appear in the testing set. To solve zero-shot classification problems, traditional image recognition approaches build relations between low-level features and object category labels, whereas recent approaches prefer to use semantic attributes.

As is well known, an attribute [1] is a property that can be described semantically or tagged by people (e.g., "with wings", "black hair"). Attributes and features thus describe image contents at different levels. A feature is a low-level description of the image content that can be processed by machines but cannot be understood by humans, whereas an attribute is a high-level description that can be understood by humans as well as machines. Attribute learning has shown an excellent ability to describe objects and has proved useful in face recognition [2], image retrieval [3, 4], image description [5, 6], and image classification [7, 8]. An object usually has various properties; for example, hawks have wings, Asians have black hair, a road is open, and high-heeled shoes have high heels. Through these attributes, people can recognize objects' appearance and describe them semantically. Besides, different categories may share the same attributes, which can be regarded as a middle-level description of objects shared between categories in a classifier. In this way, known knowledge can be transferred from one category to new categories, so that classifiers can identify categories even without any training samples.

The relative attribute [9], also named the semantic relative attribute, indicates the strength of an attribute in an image compared with that in other images. A relative attribute takes a real-valued ranking score whose absolute value has no meaning in itself; only the comparison between different categories matters. In this way, relative attributes support many visual applications that help machines understand image contents, since humans prefer to understand objects in a comparative way. Compared with the binary attribute, the relative attribute gives a richer language description of image contents. Therefore, the relative attribute has a strong ability in image description and human-computer interaction and can effectively bridge the semantic gap between low-level image features and high-level semantics.

Owing to this advantage in comparative description, relative attributes have been applied to zero-shot image classification [10–13]. Parikh and Grauman [9] demonstrated that relative attribute learning achieves higher accuracy in zero-shot image classification than binary attributes. In relative attributes-based zero-shot classification, the relative strengths of the attributes are considered, which describe the samples comparatively in terms such as "greater than", "less than", or "equal to". In this way, the sample has a distinguishable location in the attribute space, and its class label can then be determined. However, for categories whose attributes have similar relative strengths (that is, whose attribute locations are close to each other in the attribute space), it is difficult to distinguish them by relative attributes alone. To solve this problem, a hybrid relative attribute model is constructed using sparse coding. First, the model uses sparse coding to reconstruct the low-level feature of the image [14] and thereby obtains a low-dimensional and compact representation of the image. The relative relations among the reconstructed features are then regarded as a new kind of relative attribute. Since the reconstructed feature carries no semantic meaning, it is named the nonsemantic relative attribute; it serves as a complement to the existing relative attributes and further increases their distinguishability, making it easier to separate categories whose attributes have similar relative strengths [15]. Finally, the constructed hybrid relative attributes are used to improve the performance of zero-shot image classification.

The rest of the paper is organized as follows. The related works are reviewed in Section 2. Section 3 describes the hybrid relative attribute learned by sparse coding in detail. In Section 4, the experiments are implemented on the zero-shot image classification. Finally, the conclusion is given in Section 5.

2. Related Works

Among image recognition problems, zero-shot classification is comparatively difficult to deal with. Since the training set and the testing set have different label spaces, some researchers regard it as a special case of transfer learning [16] and find a sharable low-level feature subspace by minimizing the discrepancy between the training set and the testing set. On the other hand, images naturally carry rich linguistic meaning. For cases without labeled training samples, visual attributes bridge the gap between low-level features and labels. Attribute learning has shown excellent performance in object description [12] and knowledge transfer [17] and can thus be successfully applied to zero-shot image classification.

Attribute learning mainly comprises binary attribute learning and relative attribute learning. Binary attribute learning focuses on attributes with Boolean values 0 and 1, where 0 denotes the absence of the attribute in an object and 1 denotes its presence. Commonly, a support vector machine [18] is used to train the binary classifier. For images with multiple attributes, a binary classifier is trained on the extracted low-level features for each attribute, and the binary attributes assist the classifier in understanding image contents semantically. However, humans usually learn about objects in a relative way. For example, "Karen and Mark are smiling" and "Karen is smiling more than Mark" convey different information. Therefore, even a simple relative statement about an attribute can convey distinguishable meaning and rich information about the image content [19]. Generally, the relative attribute focuses on describing images comparatively through the relative strengths of an attribute across different samples [9]. Relative attributes have been used to search images through a feedback mechanism: once the image search starts, the relevance function is updated and the image pool is reranked based on the accumulated constraints [20]. Moreover, attribute feedback has been used in active learning, which further improves classification ability [21]. Sandeep et al. [22] proposed using local parts instead of a global description to jointly represent pairs of images, where a part is a block around a point detected by a domain-specific method. Taking "smiling" as an example, the mouth area should be located as the significant part carrying the information that distinguishes images, since it is highly correlated with the image categories; but for "eyes-open", the best local part is clearly different from that of "smiling". However, neither binary attributes nor relative attributes make sense all the time, so they can be used together to learn the relationship between attributes and statements in a more natural way [19].

In general, relative attributes describe objects more naturally and conform to human language. However, when attributes have similar relative strengths, relative attribute learning does not work well. Inspired by the landmark points of local parts, it is feasible to apply sparse coding to the low-level features to extract additional informative, nonsemantic attributes for zero-shot image classification. We therefore propose the hybrid relative attribute model to obtain a compact representation of images. By combining the semantic and nonsemantic attributes, zero-shot classification achieves higher performance.

3. SC-HRA Model Based Zero-Shot Image Classification

The SC-HRA model for zero-shot image classification is divided into four stages. Stage I constructs the hybrid relative attributes: the nonsemantic attributes of the training and testing samples are obtained by sparse coding on the training and testing samples, respectively, and the relationship between the nonsemantic relative attributes and the class labels is obtained simultaneously; the semantic relative attributes and the nonsemantic relative attributes are then integrated as the hybrid relative attributes. Stage II learns a ranking function for each hybrid relative attribute: a ranking SVM is trained on the ordered pairs and the unordered pairs of each attribute, so every hybrid relative attribute has its own ranking function. Stage III builds the Gaussian class models: for a training category, the ranking functions learned in Stage II are used to build its Gaussian class model, whereas for a testing category the Gaussian class model is built from the relationship between the hybrid relative attributes of the training categories and those of the testing category; with these Gaussian class models, all categories can be located in the attribute space, which helps distinguish the relative strengths of the attributes across categories. In Stage IV, zero-shot image classification is performed: the hybrid relative attributes of the testing samples are predicted by the attribute ranking functions of Stage II, and Maximum Likelihood Estimation (MLE) is used to measure the similarity between the hybrid relative attributes of a testing sample and the Gaussian model of each category to predict its class label.

To implement SC-HRA-based zero-shot image classification, the key points of the above four stages are to construct the hybrid relative attributes by learning the nonsemantic relative attributes and to learn the ranking functions for the hybrid relative attributes. The detailed realization is described in this section.

3.1. Hybrid Relative Attributes

In this section, we take the Outdoor Scene Recognition (OSR) [23] dataset as an example to explain the construction of the hybrid relative attributes. As shown in Figure 1, five semantic relative attributes are given to describe 3 categories: "Tall building", "Street", and "Coast". "Tall building" and "Street" have similar ranks on all the semantic relative attributes except "Perspective", so the generative model [9] of "Tall building" built on the semantic attributes is extremely similar to that of "Street". Therefore, when the generative model is used for zero-shot classification, it is difficult to distinguish "Tall building" from "Street" using the semantic relative attributes alone. For the category "Coast", however, the semantic relative attributes are very different from those describing "Tall building" and "Street", so its generative model is clearly distinguished from theirs and the classifier can easily identify "Coast". Considering this limitation of the semantic relative attributes, the nonsemantic relative attributes acquired by sparse coding are taken as a supplement to the semantic attributes to construct the hybrid relative attributes, which aim at classifying categories with high similarity. On the other hand, the auxiliary nonsemantic relative attributes do not mislead the classifier for categories, such as "Coast", for which the semantic relative attributes already yield correct results.
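To make this construction concrete, the following minimal Python sketch (with random placeholder values) assumes that a sample's hybrid relative attribute representation is simply its semantic relative attribute values extended by the first N dimensions of its sparse code obtained later in Section 3.2; the array names and sizes are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
n_images = 100
semantic = rng.random((n_images, 6))        # e.g., scores of the 6 OSR semantic relative attributes
sparse_codes = rng.random((n_images, 14))   # non-semantic attributes from sparse coding (Section 3.2)

N = 10                                      # number of non-semantic dimensions kept
hybrid = np.hstack([semantic, sparse_codes[:, :N]])
print(hybrid.shape)                         # (100, 16): hybrid relative attributes per image
```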

3.2. Nonsemantic Relative Attribute Learning Based on Sparse Coding

This part describes the main idea of nonsemantic attribute learning by sparse coding. First, the features of the unlabeled training samples are used to learn a set of basis vectors. Then, the features of all the training samples can be linearly represented by the learned basis vector set. Likewise, the features of the testing samples can be encoded by the learned basis vectors.

Let $X = \{x_1, x_2, \ldots, x_m\}$ be a set of input vectors, where $x_j \in \mathbb{R}^n$ denotes the $j$th input vector and $n$ is its dimension. A set of basis vectors $B = \{b_1, b_2, \ldots, b_k\}$ is learned by sparse coding and later used to reconstruct the initial input features. The activation value $a_j$ of a sample is what we call its nonsemantic attribute in this paper. Generally, sparse coding has two steps: training and coding.

In the training stage, a set of basis vectors is generated from the training samples, and the linear combination of these basis vectors is used to express the input vectors. The generation of the basis vectors can thus be regarded as an optimization problem described by the following objective function [24]:

$$\min_{B,\,A}\ \sum_{j=1}^{m}\Big\| x_j - \sum_{i=1}^{k} a_{j,i}\, b_i \Big\|_2^2 + \lambda \sum_{j=1}^{m}\|a_j\|_1, \tag{1}$$

where $a_j$ is the sparse coding representation of the $j$th training sample $x_j$ and $\lambda$ denotes the weight attenuation coefficient. The term $\lambda\sum_j\|a_j\|_1$ can be taken as a sparse regularization term, which drives the objective toward a unique solution and ensures that only a few distinct basis vectors are used to represent each sample. By solving the optimization problem in (1), we obtain the activation values $a_j$ and the basis vectors $B$ for the input feature vectors. When $a_j$ and $B$ vary simultaneously, the objective function is not a convex optimization problem; however, with the basis vectors $B$ fixed it becomes a convex problem in the activation values $a_j$, and with $a_j$ fixed it becomes a convex problem in the basis vectors $B$. Therefore, sparse coding can be implemented as an alternately iterative optimization, which fixes one variable to solve for the other [25, 26]. Each iteration has two steps: first, fix the basis vectors and adjust the activation values to minimize (1); second, fix the activation values and adjust the basis vectors to minimize the objective function. When the iteration converges, the set of basis vectors is obtained.

The coding stage calculates the sparse representation of a new sample. Since the basis vectors have been obtained in the training stage, obtaining the sparse representation of a new sample $x$ only requires adjusting the activation value $a$ to minimize

$$\Big\| x - \sum_{i=1}^{k} a_{i}\, b_i \Big\|_2^2 + \lambda \|a\|_1. \tag{2}$$

The resulting activation vector $a$ is the sparse representation of the sample, which also serves as its nonsemantic relative attributes.

After that, the relative relations of the nonsemantic attributes among different samples are determined by comparing their nonsemantic relative attribute values. Algorithm 1 shows the specific procedure for obtaining the nonsemantic relative attributes by sparse coding.

Input: the low-level features of the training samples and the testing samples, the weight attenuation coefficient λ.
Output: the nonsemantic relative attributes of the training samples and the testing samples.
Step 1: bring the low-level features of the training samples into Equation (1), and initialize the basis vectors B;
Repeat:
  Fix the basis vectors from the last step and update the activation values to minimize Equation (1).
  Fix the activation values from the last step and update the basis vectors to minimize Equation (1).
Until convergence is reached; the obtained basis vectors B represent the sample features well.
Step 2: bring the basis vectors B and the low-level features of the testing samples into Equation (2).
Step 3: adjust the activation values to minimize Equation (2); the adjusted activation values are the nonsemantic relative attributes of the testing samples.
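As one possible realization of Algorithm 1, the sketch below uses scikit-learn's DictionaryLearning and SparseCoder in place of a hand-written alternating optimization: the dictionary atoms play the role of the basis vectors and the activation values play the role of the nonsemantic relative attributes. The feature matrices and parameter values (e.g., alpha) are placeholders, not the settings used in the paper.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, SparseCoder

rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 512))     # placeholder gist features of training images
X_test = rng.standard_normal((50, 512))       # placeholder gist features of testing images

# Training stage: alternately optimize the basis vectors (dictionary) and the
# activation values under an L1 sparsity penalty (alpha ~ the weight attenuation coefficient).
dico = DictionaryLearning(n_components=14, alpha=0.5, max_iter=100, random_state=0)
A_train = dico.fit_transform(X_train)         # activation values of the training samples
B = dico.components_                          # learned basis vectors

# Coding stage: with the basis vectors fixed, only the activation values of the
# new (testing) samples are adjusted to minimize the reconstruction-plus-sparsity objective.
coder = SparseCoder(dictionary=B, transform_algorithm='lasso_lars', transform_alpha=0.5)
A_test = coder.transform(X_test)              # non-semantic relative attributes of the testing samples

# The relative relation of a non-semantic attribute between two samples is read off
# by comparing their activation values on that dimension.
i, j, f = 0, 1, 3
if np.isclose(A_train[i, f], A_train[j, f]):
    relation = 'similar to'
elif A_train[i, f] > A_train[j, f]:
    relation = 'stronger than'
else:
    relation = 'weaker than'
print(f'on non-semantic attribute {f}, sample {i} is {relation} sample {j}')
```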
3.3. Ranking Function Learning Based on Hybrid Relative Attributes

The ranking functions are learned based on the hybrid relative attributes. For each hybrid relative attribute $a_m$ of the training samples, a set of ordered pairs $O_m = \{(i,j)\}$ and a set of unordered pairs $S_m = \{(i,j)\}$ are given, where $(i,j) \in O_m$ means that image $i$ has a stronger presence of attribute $a_m$ than image $j$, and $(i,j) \in S_m$ means that images $i$ and $j$ have similar relative strengths of $a_m$. Taking the attribute "open" shown in Figure 2 as an example, the image pair above is an ordered pair, where the left image looks more open than the right, while for the unordered image pair below it is hard to say which image is more open.

To obtain the relative relations of the attributes, we use the Ranking SVM to learn the ranking function of each attribute $a_m$ ($m = 1, \ldots, M$) [27]:

$$r_m(x_i) = w_m^{T} x_i, \tag{3}$$

where $w_m$ is the projection vector. For each attribute $a_m$, the conditions shown in (4) and (5) should be satisfied:

$$\forall (i,j) \in O_m: \quad w_m^{T} x_i > w_m^{T} x_j, \tag{4}$$

$$\forall (i,j) \in S_m: \quad w_m^{T} x_i = w_m^{T} x_j. \tag{5}$$

Learning the attribute ranking functions amounts to finding the optimal projection direction in the low-level feature space so that all the samples are ordered correctly along this direction. Consequently, (4) and (5) are transformed into

$$\forall (i,j) \in O_m: \ w_m^{T}(x_i - x_j) > 0, \qquad \forall (i,j) \in S_m: \ w_m^{T}(x_i - x_j) = 0. \tag{6}$$

As shown in (6), the ranking relation between two samples can be represented through the difference $w_m^{T}(x_i - x_j)$. The ordered relation between any two samples in the sharable feature space can therefore be represented by a new vector and its corresponding new label:

$$\tilde{x} = x_i - x_j, \qquad \tilde{y} = \begin{cases} +1, & (i,j) \in O_m, \\ -1, & (j,i) \in O_m. \end{cases} \tag{7}$$

Therefore, we can correctly rank the attributes by solving a standard binary classification problem on the vectors in (7), whose optimization objective is shown in

$$\begin{aligned} \min_{w_m,\,\xi,\,\gamma} \quad & \frac{1}{2}\|w_m\|_2^2 + C\Big(\sum_{(i,j)\in O_m}\xi_{ij}^2 + \sum_{(i,j)\in S_m}\gamma_{ij}^2\Big) \\ \text{s.t.}\quad & w_m^{T}(x_i - x_j) \ge 1 - \xi_{ij}, \quad \forall (i,j) \in O_m, \\ & \big|w_m^{T}(x_i - x_j)\big| \le \gamma_{ij}, \quad \forall (i,j) \in S_m, \\ & \xi_{ij} \ge 0, \quad \gamma_{ij} \ge 0. \end{aligned} \tag{8}$$

Herein, $\xi_{ij}$ is the nonnegative relaxation factor of the ordered pairs $O_m$; $\gamma_{ij}$ is the nonnegative relaxation factor of the unordered pairs $S_m$; and $C$ is a trade-off constant balancing the maximization of the ranking margin against the satisfaction of the relative relations of the attribute pairs. By maximizing the ranking margin and minimizing the relaxation factors $\xi_{ij}$ and $\gamma_{ij}$ simultaneously, we obtain the optimal projection vector $w_m$. The attribute ranking function is then represented as

$$r_m(x) = w_m^{T} x. \tag{9}$$

From the standpoint of machine learning, a binary classifier aims to separate positive and negative samples perfectly, and its margin constraint maximizes the distance between the two nearest samples from different categories. Learning the ranking functions, by contrast, aims to rank the training samples in terms of attribute strength, so its margin constraint maximizes the distance between the two nearest samples in the ranking [28, 29]. Therefore, the ranking functions better reflect the relations of relative attribute strength between different categories.
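The pairwise transformation described above can be sketched as follows; a linear SVM trained on difference vectors stands in for the full Ranking SVM of [27], the similarity constraints on unordered pairs are omitted for brevity, and the features and pairs are synthetic placeholders rather than the paper's data.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 512))            # placeholder low-level (e.g., gist) features
ordered_pairs = [(0, 1), (2, 5), (7, 3)]       # (i, j): image i shows the attribute more strongly than image j

# Pairwise transformation of (7): each ordered pair (i, j) yields x_i - x_j labeled +1
# and x_j - x_i labeled -1, turning the ranking problem into binary classification.
diffs, labels = [], []
for i, j in ordered_pairs:
    diffs.append(X[i] - X[j]); labels.append(+1)
    diffs.append(X[j] - X[i]); labels.append(-1)

svm = LinearSVC(C=1.0, fit_intercept=False, max_iter=10000)
svm.fit(np.asarray(diffs), np.asarray(labels))
w = svm.coef_.ravel()                          # projection vector of the ranking function

rank_scores = X @ w                            # r(x) = w^T x for every image
print(rank_scores[:5])
```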

3.4. Zero-Shot Image Classification Based on SC-HRA

Since zero-shot learning has no labeled samples for the unseen categories, the classification model is built from the relative relations of the attributes between the training and testing categories, following the method in [9]. Specifically, images are given from T categories, of which t categories are the "seen" categories whose samples are used to learn the ranking functions of the hybrid relative attributes, and the remaining T − t categories are the "unseen" categories, whose samples are not involved in learning the ranking functions. In the modelling procedure, Gaussian models are built for the training (seen) categories and the testing (unseen) categories, respectively.

The procedure of zero-shot image classification is explained in the following.

Building the model of a training category: first, the attribute ranking scores of the training samples are predicted by the ranking functions of the hybrid relative attributes, where $r_f(x_i)$ denotes the ranking score of the $f$th hybrid relative attribute for the $i$th training sample. Then, based on the ranking scores of the hybrid relative attributes of its training samples, the mean vector $\mu_s$ and the covariance matrix $\Sigma_s$ are estimated for each training category $s$. Finally, the Gaussian model of the training category is obtained as $\mathcal{N}(\mu_s, \Sigma_s)$.

Building the model of a testing category: for the $f$th hybrid relative attribute, the model of the testing (unseen) category $u$ is set according to its relative description with respect to the seen categories, which falls into three cases:

① if the unseen category $u$ is described as lying between two seen categories $s_1$ and $s_2$ on attribute $f$ ($s_1 \prec u \prec s_2$), the mean of the model is $\mu_u^f = (\mu_{s_1}^f + \mu_{s_2}^f)/2$, and the covariance of the model is the average covariance of the seen categories; ② if $u$ is described as stronger than all the seen categories on attribute $f$, the mean of the model is $\mu_u^f = \max_s \mu_s^f + d_f$, and the covariance of the model is the average covariance of the seen categories; ③ if $u$ is described as weaker than all the seen categories on attribute $f$, the mean of the model is $\mu_u^f = \min_s \mu_s^f - d_f$, and the covariance of the model is the average covariance of the seen categories.

In the above, $\mu_s^f$ represents the mean value of the $f$th hybrid relative attribute of seen category $s$, and $d_f$ represents the average difference between the ranking-score means of the $f$th hybrid relative attribute over the training categories. It is reasonable to assume that the distributions of the hybrid relative attributes are the same for all the categories.

In testing, we first compute the attribute ranking scores $r_i = \big(r_1(x_i), \ldots, r_F(x_i)\big)$ of the $i$th testing sample with the ranking functions of the hybrid relative attributes. The class label is then given by the category whose model is most similar to $r_i$:

$$c^{*} = \arg\max_{c}\ \mathcal{N}\big(r_i \mid \mu_c, \Sigma_c\big). \tag{10}$$

We also summarize the implementation of zero-shot image classification based on SC-HRA in Algorithm 2.

Input: the attribute ranking scores of the training and testing samples.
Output: the label of the testing sample.
Building the training models:
  Calculate the mean vector $\mu_s$ and covariance matrix $\Sigma_s$ of each training category $s$ from the ranking scores of its training samples, and obtain the Gaussian model $\mathcal{N}(\mu_s, \Sigma_s)$.
Building the testing models:
  If the unseen category $u$ lies between two seen categories $s_1$ and $s_2$ on attribute $f$, then the mean of the model is $\mu_u^f = (\mu_{s_1}^f + \mu_{s_2}^f)/2$, and the covariance matrix is the average covariance of the seen categories.
  If the unseen category $u$ is stronger than all the seen categories on attribute $f$, then the mean of the model is $\mu_u^f = \max_s \mu_s^f + d_f$, and the covariance matrix is the average covariance of the seen categories.
  If the unseen category $u$ is weaker than all the seen categories on attribute $f$, then the mean of the model is $\mu_u^f = \min_s \mu_s^f - d_f$, and the covariance matrix is the average covariance of the seen categories.
Testing:
  Compute the attribute ranking scores $r_i$ of the testing sample; the class label is determined by (10).
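A minimal sketch of the classification step, assuming (as in [9]) that each category model is a Gaussian over the hybrid relative attribute scores, that unseen means are interpolated or extrapolated from the seen means, and that the covariance is shared; all values below are random placeholders, and the function and variable names are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
n_attr = 8                                      # number of hybrid relative attributes (illustrative)

# Hypothetical Gaussian category models over hybrid relative attribute scores.
# Seen means would be estimated from training ranking scores; here they are random.
means = {
    'TallBuilding': rng.standard_normal(n_attr),
    'Street':       rng.standard_normal(n_attr),
    'Coast':        rng.standard_normal(n_attr),
}
shared_cov = np.eye(n_attr)                     # stand-in for the average covariance of the seen categories

# Placing an unseen category's mean on one attribute (index 0), following the three cases above.
seen_f = np.array([m[0] for m in means.values()])
d_f = np.mean(np.diff(np.sort(seen_f)))         # average gap between sorted seen means
mu_between = (seen_f[0] + seen_f[1]) / 2        # case 1: unseen lies between two seen categories
mu_above = seen_f.max() + d_f                   # case 2: unseen is stronger than all seen categories
mu_below = seen_f.min() - d_f                   # case 3: unseen is weaker than all seen categories

def classify(r, models, cov):
    """Return the label whose Gaussian gives the ranking-score vector r the highest likelihood (MLE)."""
    return max(models, key=lambda c: multivariate_normal.pdf(r, mean=models[c], cov=cov))

r_test = rng.standard_normal(n_attr)            # predicted ranking scores of one testing sample
print(classify(r_test, means, shared_cov))
```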

4. Experiments

4.1. Experimental Datasets

In this section, the proposed method is evaluated on the OSR dataset and the Public Figure Face (Pub Fig) dataset [9, 30]. The two datasets come from the distinct domains of natural scenes and human faces, which can sufficiently validate the effectiveness of the proposed method. The OSR dataset contains 2688 images, 6 attributes, and 8 scene categories (tall building (T), inside-city (I), street (S), highway (H), coast (C), open-country (O), mountain (M), and forest (F)), and a 512-dimensional gist descriptor [31] is used to represent each image. In this paper, 2648 of its images are selected as the testing set. The Pub Fig dataset consists of 772 images with 11 semantic attributes from 8 identities (Alex (A), Clive (C), Hugh (H), Jared (J), Miley (M), Scarlett (S), Viggo (V), and Zac (Z)); a 512-dimensional gist descriptor and a 30-dimensional global color feature are extracted from each image [32]. 560 images from the Pub Fig dataset are selected as the testing set. According to empirical values, the two parameters are set to 0.5 and 1, respectively.

Tables 1 and 2 show the relative attribute rankings and descriptions of the OSR and Pub Fig datasets, respectively. The relative attributes can be ranked by their relative strengths.

4.2. Attribute Ranking Prediction

To verify the effect of the hybrid attributes on ranking performance, we predict the ranking scores of the 2648 testing images of the OSR dataset and the 560 testing images of the Pub Fig dataset. For each image pair and each attribute, we compute the difference of the predicted ranking scores to determine the relative attribute strengths: a positive difference $r_m(x_i) - r_m(x_j) > 0$ means that the strength of attribute $a_m$ in $x_i$ is higher than that in $x_j$, while a negative difference means it is lower. In this section, 6 learning methods are compared with the proposed method: B-SVM [33], RFUA [34], HAP [35], CSHAPH, CSHAPG [35], and RA [9]. The attribute rankings predicted by the different methods are compared with the true relative relations of the samples, and the attribute ranking accuracy is calculated from these comparisons.
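Assuming the ranking accuracy reported in Tables 3 and 4 is the fraction of annotated pairs whose predicted relation (stronger, weaker, or similar) matches the ground truth, it can be computed with a helper like the one below; the function name, tolerance, and toy data are hypothetical.

```python
import numpy as np

def ranking_accuracy(scores, pairs, relations, tol=1e-6):
    """Fraction of annotated pairs whose predicted relation matches the ground truth.

    scores    : predicted ranking scores of one attribute, one value per image
    pairs     : list of (i, j) image index pairs
    relations : ground-truth relation per pair: +1 (i stronger), -1 (i weaker), 0 (similar)
    """
    correct = 0
    for (i, j), rel in zip(pairs, relations):
        diff = scores[i] - scores[j]
        pred = 0 if abs(diff) < tol else int(np.sign(diff))
        correct += (pred == rel)
    return correct / len(pairs)

scores = np.array([0.9, 0.1, 0.5, 0.5])
pairs = [(0, 1), (2, 3), (1, 2)]
relations = [+1, 0, -1]
print(ranking_accuracy(scores, pairs, relations))   # 1.0 on this toy example
```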

Tables 3 and 4 illustrate the performance of attribute ranking prediction by the different methods. Table 3 compares the attribute ranking accuracy on the OSR dataset, where SC-HRA obtains the best ranking accuracy for all the attributes. Table 4 compares the attribute ranking accuracy on the Pub Fig dataset, where SC-HRA is higher than the other methods for all attributes except "young" and "Visible Forehead", on which it is not as good as RA but still better than the others. The average ranking prediction accuracy reaches 89.15% on OSR and 82.29% on Pub Fig. Compared with the other methods, SC-HRA achieves the best overall performance on attribute ranking prediction, which confirms the advantage of HRA in attribute ranking.

4.3. Zero-Shot Image Classification Results

In this part, the effect of SC-HRA on zero-shot image classification is examined. On each dataset, a series of zero-shot classification experiments is carried out, and for each experiment different training ("seen") and testing ("unseen") categories are selected. To eliminate the influence of experimental randomness, we adopt cross-validation over the category splits, where T is the total number of categories in the dataset and t is the number of training categories. The results based on this cross-validation are reliable since almost all the samples are considered during training. Specifically, five cross-validation settings are used on the benchmark datasets, corresponding to t = 6, 5, 4, 3, and 2 training categories.
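The seen/unseen category splits behind this cross-validation can be enumerated as in the short sketch below; it only illustrates how many t-of-T splits exist for the 8 OSR categories and is not a claim about exactly which splits were used in the experiments.

```python
from itertools import combinations

categories = ['T', 'I', 'S', 'H', 'C', 'O', 'M', 'F']   # the 8 OSR scene categories
T = len(categories)

for t in (6, 5, 4, 3, 2):                               # numbers of "seen" (training) categories
    splits = list(combinations(categories, t))
    seen = splits[0]
    unseen = tuple(c for c in categories if c not in seen)
    print(f't = {t}: {len(splits)} possible splits, e.g. seen={seen}, unseen={unseen}')
```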

Figure 3 shows the results for different numbers of training categories as the nonsemantic relative attribute dimension N increases from 0 to 14. In general, the classification accuracy declines as the number of training categories decreases. The reason is that, with fewer training categories, fewer ordered pairs and unordered pairs are available to learn the ranking functions of the hybrid relative attributes, which harms their ranking accuracy. Thus, too low a proportion of training categories results in worse zero-shot classification performance. Besides, as the number of training categories decreases, the proportion of categories available to build the "unseen" models also decreases, which is another factor in the degraded performance of the classifier. Without nonsemantic relative attributes (that is, N = 0), HRA and RA are equivalent.

An additional phenomenon in Figure 3 is that all the curves first rise and then decline. For t = 6, the proposed method performs better on both the OSR and Pub Fig datasets than in the other cases (t = 2, 3, 4, 5). To analyze this concretely, we compute the average zero-shot classification accuracy shown in Tables 5 and 6. As the nonsemantic relative attribute dimension N increases, the average classification accuracy on the OSR dataset first grows, reaching 48.02% at N = 10, and then declines. Similarly, the average classification accuracy on the Pub Fig dataset reaches 47.33% at N = 6 and then decreases. The reason is that the nonsemantic relative attributes add discriminative differences to the relative attribute space, which improves the classification accuracy; however, when N grows too large, the extra nonsemantic relative attributes make the relative attributes redundant, which reduces the classification accuracy. Therefore, according to this analysis, parameter N is set to 10 for the OSR dataset and to 6 for the Pub Fig dataset when t equals 6, which yields the highest average accuracy.

Although classification accuracy reflects the proportion of correctly classified samples in the whole testing set, it neglects the relation between the false positive rate (the probability of negative samples wrongly categorized as positive) and the true positive rate (the probability of positive samples correctly categorized as positive). Therefore, we further use the AUC (Area Under Curve) [36, 37] to evaluate classification performance. The AUC is the area under the receiver operating characteristic (ROC) curve and provides another way to evaluate the proposed model: an ideal model has an AUC of 1, while a random model has an AUC of 0.5. Regarding the results in Tables 5 and 6, SC-HRA acquires the highest classification accuracy on the OSR dataset at N = 10 and the highest average classification accuracy on the Pub Fig dataset at N = 6. When N rises to 14, the classification accuracy of SC-HRA on both OSR and Pub Fig is lower than with the other 13 values of N, and even lower than that of RA. Thus, we choose N = 6, N = 10, and N = 14 as representative settings of SC-HRA, together with RA. Figure 4 shows the AUC values of zero-shot classification on the OSR and Pub Fig datasets. In general, all three settings of SC-HRA (N = 6, N = 10, and N = 14) in Figure 4 have higher AUC values than the random baseline (0.5) on both datasets. Specifically, for N = 6 and N = 10, the AUC values of SC-HRA are higher than those of RA; for N = 14, the AUC value decreases because the redundant nonsemantic relative attributes deteriorate the performance of zero-shot image classification.
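For reference, a macro-averaged one-vs-rest AUC of the kind reported in Figure 4 can be computed with scikit-learn as sketched below; the ground-truth labels and per-class scores are mock data standing in for the Gaussian likelihoods of Section 3.4.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
classes = np.arange(8)                                   # 8 categories
y_true = np.repeat(classes, 25)                          # mock ground-truth labels, 25 samples per class
scores = rng.random((y_true.size, classes.size))         # mock per-class likelihood scores
scores /= scores.sum(axis=1, keepdims=True)              # rows must behave like probability distributions

# Macro-averaged one-vs-rest AUC: each category is treated as positive in turn.
auc = roc_auc_score(y_true, scores, multi_class='ovr', average='macro')
print(f'macro-averaged AUC: {auc:.3f}')                  # close to 0.5 for random scores
```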

Next, we examine the proposed method using confusion matrices. Figures 5 and 6 show the classification performance of RA and SC-HRA on OSR and Pub Fig, illustrated by confusion matrices. In a confusion matrix, the blocks on the diagonal denote the numbers of correctly classified samples: the darker the block, the more samples are correctly classified. As shown in Figure 5, because of the high similarity between the semantic relative attributes of Tall building and Street, it is difficult for RA to distinguish them; for example, 260 images of Tall building are mistaken for the Street category. When the nonsemantic relative attributes are introduced into zero-shot classification, SC-HRA can partially distinguish Tall building from Street, which is better than RA. As concluded from Figures 5 and 6, for N = 6 and N = 10 the numbers of samples correctly classified by SC-HRA on the diagonals of the OSR and Pub Fig confusion matrices are larger than those of RA, because a certain amount of nonsemantic relative attributes improves the performance of the classifier. For N = 14, the number of samples correctly classified by SC-HRA decreases, which is caused by the redundant nonsemantic relative attributes. Therefore, a certain amount of nonsemantic relative attributes benefits zero-shot classification, yet too many of them bring redundant information that can confuse the classification procedure.

4.4. Comparisons of Zero-Shot Image Classification Methods

To further verify the classification performance of the proposed method, we compare its zero-shot image classification accuracy with that of 6 other methods: DAP [38], RFUA-SRA, HAP-SRA, CSHAPH-SRA, CSHAPG-SRA, and MLE-RA. The baseline DAP is a direct attribute prediction model. Based on the RA approach, RFUA-SRA (RFUA score-based relative attributes) replaces the rank values with the RFUA classifier output scores; HAP-SRA (HAP score-based relative attributes) replaces the rank values with the HAP classifier output scores; CSHAPH-SRA (CSHAPH score-based relative attributes) replaces the rank values with the CSHAPH classifier output scores; and CSHAPG-SRA (CSHAPG score-based relative attributes) replaces the rank values with the CSHAPG classifier output scores. MLE-RA utilizes maximum likelihood estimation to predict the labels of the testing samples.

To avoid randomness in the simulations, cross-validation is adopted to compare the proposed model with the baseline models, and the comparisons are listed in Tables 7 and 8. As the proportion of training categories decreases, the classification accuracies of all the models decline; accordingly, each model obtains its highest accuracy when the number of training categories is 6. Furthermore, as expected, our method significantly outperforms DAP, RFUA-SRA, HAP-SRA, CSHAPH-SRA, and CSHAPG-SRA and slightly exceeds the baseline model MLE-RA. This validates the power of the hybrid relative attributes, which jointly combine semantic relative attributes and nonsemantic attributes, for zero-shot image classification.

5. Conclusions

Zero-shot image classification primarily addresses the challenge of accurately identifying "unseen" categories when the labeled samples cannot cover all the target categories. In this situation, semantic relative attributes become the key to solving zero-shot image classification. However, existing zero-shot image classifiers based on semantic relative attributes ignore the low-level features; in other words, the class labels of the samples in the testing categories are decided only by the semantic relative attributes of the samples, and categories with similar semantic relative attributes are therefore difficult to distinguish. Considering this restriction of the semantic relative attributes, nonsemantic relative attributes are obtained by sparse coding the low-level features of the samples and treated as a supplement to the semantic relative attributes to construct the hybrid relative attributes, which are then applied to zero-shot image classification. Thus, a zero-shot image classification method based on hybrid relative attributes is proposed. The proposed SC-HRA is demonstrated and compared with the baseline methods through experiments on the OSR and Pub Fig datasets. The experimental results validate that SC-HRA has powerful ranking and classification ability in zero-shot learning, owing to the added distinguishability of the nonsemantic relative attributes.

Data Availability

All the experiments of this paper are based on the datasets of Outdoor Scene Recognition (OSR) and a subset of the Public Figure Face Database (Pub Fig), which can be downloaded from “https://www.cc.gatech.edu/~parikh/relative.html#data”.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by Chinese Universities Scientific Fund (Grant no. 2015QNB20).