Abstract

In visual search systems, an important issue is how to leverage rich contextual information in a visual computational model to build more robust visual search systems that better satisfy users' needs and intentions. In this paper, we introduce a ranking model that captures the complex relations between product visual and textual information in visual search systems. To understand these relations, we focus on graph-based paradigms that model the relations among product images, product category labels, and product names and descriptions. We develop a unified probabilistic hypergraph ranking algorithm that models the correlations among product visual features and textual features and thereby substantially enriches the description of each image. We evaluate the proposed ranking algorithm on a dataset collected from a real e-commerce website. The comparison results demonstrate that the proposed algorithm substantially improves retrieval performance over visual-distance-based ranking.

1. Introduction

Ranking plays an essential role in a product search system. Given a query, candidate products should be ranked according to their distance to the query. The effectiveness of a product search system is evaluated by its ranked search results, for example, in terms of precision or recall, while its efficiency is evaluated by the running time of a query. Ideally, the system returns the relevant products at the top of the retrieved results. However, in certain cases, even if a system finds the particular relevant product, it is still considered ineffective because the retrieved product does not appear in the top list but is buried among a number of irrelevant results. To compensate for this rank inversion issue, automated learning techniques and user input are utilized to improve the representation of the query product.

A natural extension of such a value-adding process is to ask users to label the returned results as relevant or irrelevant, which is called relevance feedback (RF). In reality, however, users are neither willing to initialize a query by labeling retrieval metadata and samples nor willing to give feedback on the retrieved results, since these steps make the retrieval procedure inconvenient. The resulting scarcity of user-labeled images undermines the prospects of supervised learning methods in the content-based image retrieval (CBIR) field. A promising and relatively unexplored research direction is to exploit transductive or semisupervised learning, among which graph-based methods [1–4] have demonstrated their effectiveness in image retrieval and have therefore received increasing attention. In graph-based methods, a graph is built on the image dataset and each image is considered a vertex. An edge and its weight are defined between two images according to a certain relationship; for example, the edge can be defined by the images' visual similarity, with the weight given by the visual distance between the two image vertices. The ranking can then be formulated as a random walk on the graph [1, 5] or as an optimization problem [2]. However, such pairwise graphs cannot sufficiently capture the relations among images, which motivates the introduction of hypergraphs to the CBIR field. A hypergraph is a generalization of a simple graph: an edge, called a hyperedge, is a nonempty set of vertices and can therefore connect any number of vertices. A probabilistic hypergraph is proposed in [3] for image retrieval ranking, where each hyperedge is formed by a centroid image and its $k$-nearest neighbors based on visual similarity. Gao et al. [4] propose a hypergraph learning algorithm for social image search, where the weights of hyperedges, representing the impact of different tags and visual words, are automatically learned with a set of pseudopositive images. These studies [3, 4] prove the effectiveness of hypergraph learning in solving ranking problems. However, they do not establish correlations between the images' visual content and textual content; both build the hypergraph model on visual content or textual content independently. As a result, the search must start with a user-assigned keyword. To address this issue and minimize the user's effort in formulating a query, we incorporate the products' rich textual information into the visual model of image retrieval and propose a novel hypergraph-based transductive ranking algorithm for visual product search. We design a unified probabilistic hypergraph to model multiple types of product features and to explore the implicit relations among the various visual and textual features.

The contributions of this paper are threefold. First, a hypergraph is proposed to represent a commercial product image dataset, and the relations between the visual and textual features of these images are explored. Second, a new retrieval framework for product search is designed. Third, a novel strategy for initiating a query is created: since relations between visual features and textual features have been established in a unified probabilistic hypergraph, queries that lack user-labeled keywords can be handled via transductive inference on the hypergraph.

The remainder of this paper is organized as follows. Section 2 reviews recent ranking techniques. Section 3 presents the design of the proposed unified hypergraph ranking algorithm. In Section 4, several retrieval experiments are conducted on an apparel dataset and compared with conventional CBIR ranking methods. Finally, Section 5 concludes the proposed ranking scheme and discusses future work.

2. Related Work

Ranking and hypergraph learning are the two research fields related to our work, and both have received intensive attention in information retrieval and machine learning. Conventional image ranking developed from textual retrieval, where the ranking model is defined on a bag of words, for example, Best Match 25 [6], the Vector Space Model [7], and Language Modeling for Information Retrieval [8]. Another type of ranking model is based on hyperlink analysis, such as Hyperlink-Induced Topic Search [9], PageRank [10], and their variations [11–13].

In CBIR systems, the ranking is commonly obtained from a similarity measure over the adopted visual features. One family of similarity measures is based on the Minkowski distance, the City-block distance, the infinity distance, and the cosine distance; these are usually called Minkowski and standard measures. Statistical measures, for example, the Pearson correlation coefficient and Chi-square dissimilarity, form another family of similarity or dissimilarity measures. A third family consists of divergence measures, which include the Kullback-Leibler divergence, Jeffrey divergence, Kolmogorov-Smirnov divergence, and Cramer-von Mises divergence. There are also other measures, such as the Earth Mover's distance and the diffusion distance [14, 15].

The learning-to-rank model has gained increasing attention in recent years; it utilizes machine learning algorithms to optimize the ranking function by tuning parameters and incorporating relevance features [16, 17]. Manifold ranking [18], a graph-based semisupervised learning method, ranks the data by exploiting their intrinsic manifold structure. Manifold ranking was first applied to CBIR in [19] and significantly improved image retrieval performance. Liu et al. [20] proposed a graph-based approach for tag ranking, in which a tag graph was built to mine the correlations among tags and the relevance scores were obtained through a random walk over the similarity graph. These studies demonstrate the effectiveness of graph-based semisupervised learning techniques in solving different ranking problems. However, pairwise graphs alone are inadequate to capture the relations among images; it is of great benefit to consider relationships among three or more vertices. A model capturing such higher-order relations is called a hypergraph. In a hypergraph, a nonempty set of vertices is defined as a weighted or unweighted hyperedge, and the magnitude of the weight represents the degree to which the vertices in the hyperedge belong to the same cluster. Agarwal et al. [21] first introduced hypergraphs to computer vision and proposed a clique-averaging graph approximation scheme to solve clustering problems. The work in [22] formulated probabilistic-interpretation-based image matching as a hypergraph convex optimization and achieved a global optimum of the matching criteria; however, it imposes three restrictions: all hyperedges must have the same degree, the two graphs must have the same number of vertices, and the match must be complete. Sun et al. [23] employed a hypergraph to capture the correlations among different labels for multilabel classification. Their hypergraph learning formulation was shown to be effective on large-scale benchmark datasets, and its approximate least-squares formulation maintained efficiency as well as competitive classification performance. One shortcoming of their work is that the target applications were limited to linear models, so there is no general performance evaluation on other multilabel applications, such as kernel-induced multilabel settings. In [24], the spatiotemporal relationship among different patches is captured by a hypergraph structure, and video object segmentation is modeled as hypergraph partitioning; important hyperedges are further assigned larger weights. The experimental results show good segmentation performance on natural scenes. When there are several different types of vertices or hyperedges, the hypergraph is called a unified hypergraph. L. Li and T. Li [25] proposed a unified hypergraph model for personalized news recommendation, where users and multiple news entities are involved as different types of vertices and their implicit correlations are captured; the recommendation is modeled as a hypergraph ranking problem. Hypergraph learning algorithms have thus demonstrated their capability of capturing complex high-order relations, and their applications in image retrieval are also promising [3, 4].

3. Ranking on Unified Probabilistic Hypergraph

In this research, we employ a unified probabilistic hypergraph to represent the relations among commercial product images, their textual descriptions, and their categorization labels, and we propose a model for searching and ranking images based on hypergraph learning. Conventional visual search systems sort and retrieve images based on the similarity of their visual content. The idea of our model is to learn the relevance of the different product features, namely, the image visual feature, the textual feature, and the hybrid visual-textual feature, and then combine them with the results of visual-similarity-based retrieval.

3.1. Notation and Problem Definition

Let $V$ represent a finite set of vertices and let $E$ represent a family of hyperedges on $V$; each hyperedge $e \in E$ is a nonempty subset of $V$. The hypergraph is denoted as $G = (V, E, w)$ with a weight function $w$. The degree of a hyperedge $e$ is defined by $\delta(e) = |e|$, that is, the number of vertices in $e$. The degree of a vertex $v \in V$ is defined by $d(v) = \sum_{e \in E,\, v \in e} w(e)$, where $w(e)$ is the weight of the hyperedge $e$. The hypergraph can be represented by a vertex-hyperedge incidence matrix $H \in \mathbb{R}^{|V| \times |E|}$, where each entry is defined as
$$h(v, e) = \begin{cases} 1, & \text{if } v \in e, \\ 0, & \text{otherwise.} \end{cases}$$
Then we have $d(v) = \sum_{e \in E} w(e)\, h(v, e)$ and $\delta(e) = \sum_{v \in V} h(v, e)$. Let $D_v$ and $D_e$ denote the diagonal matrices containing the vertex and hyperedge degrees, respectively, and let $W$ denote the diagonal matrix containing the weights of the hyperedges.
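To make this notation concrete, the following minimal sketch (our own illustration in Python with NumPy, not the paper's implementation) builds a binary incidence matrix $H$ for a toy hypergraph with unit hyperedge weights and derives the degree matrices $D_v$ and $D_e$ exactly as defined above.

```python
import numpy as np

# Toy hypergraph: 4 vertices and 2 hyperedges (sets of vertex indices).
# The vertices and hyperedges here are illustrative only, not Figure 1.
n_vertices = 4
hyperedges = [{0, 1, 2}, {2, 3}]
weights = np.array([1.0, 1.0])        # w(e); the paper assigns unit weights

# Binary vertex-hyperedge incidence matrix: h(v, e) = 1 if v in e, else 0.
H = np.zeros((n_vertices, len(hyperedges)))
for j, e in enumerate(hyperedges):
    for v in e:
        H[v, j] = 1.0

W = np.diag(weights)                  # diagonal hyperedge weight matrix
d_v = H @ weights                     # d(v) = sum_e w(e) h(v, e)
delta_e = H.sum(axis=0)               # delta(e) = sum_v h(v, e)
D_v = np.diag(d_v)                    # diagonal vertex degree matrix
D_e = np.diag(delta_e)                # diagonal hyperedge degree matrix
```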

Consider the simple example hypergraph $G$ built as shown in Figure 1, with its vertex set $V$ and hyperedge set $E$. Its incidence matrix $H$ follows directly from the definition above.

The problem of ranking on the hypergraph is formulated as follows: given a query vector $y$ that marks a subset of vertices in the hypergraph as the query, a ranking score vector $f$ is produced according to the relevance among the vertices in the hypergraph and the query. Following [4], we define the cost function of $f$ as
$$\Phi(f) = \frac{1}{2} \sum_{e \in E} \sum_{u, v \in e} \frac{w(e)\, h(u, e)\, h(v, e)}{\delta(e)} \left( \frac{f(u)}{\sqrt{d(u)}} - \frac{f(v)}{\sqrt{d(v)}} \right)^{2} + \mu\, \| f - y \|^{2},$$
where $\mu$ is the regularization factor. The first term, known as the normalized hypergraph Laplacian, is a constraint that vertices sharing many incidental hyperedges should obtain similar ranking scores. The second term constrains the variation between the final ranking scores and the initial scores.

In order to obtain the optimal solution of the ranking problem, we seek to minimize the cost function:
$$f^{*} = \arg\min_{f} \Phi(f).$$
With the derivations in [4], we can rewrite the cost function as
$$\Phi(f) = f^{T} (I - \Theta) f + \mu\, \| f - y \|^{2},$$
where $\Theta = D_v^{-1/2} H W D_e^{-1} H^{T} D_v^{-1/2}$. Then the optimal $f^{*}$ can be obtained by differentiating $\Phi(f)$ with respect to $f$:
$$f^{*} = \frac{\mu}{1 + \mu} \left( I - \frac{1}{1 + \mu} \Theta \right)^{-1} y.$$
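As a sketch of how this closed-form solution can be evaluated numerically (our own NumPy illustration; $H$, $W$, $D_v$, $D_e$, and the query vector $y$ are assumed to be built as in the notation above):

```python
import numpy as np

def hypergraph_rank(H, W, D_v, D_e, y, mu=0.001):
    """Closed-form transductive ranking on a hypergraph.

    Computes Theta = D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} and returns
    f* = (mu / (1 + mu)) * (I - Theta / (1 + mu))^{-1} y.
    """
    Dv_isqrt = np.diag(1.0 / np.sqrt(np.diag(D_v)))
    De_inv = np.diag(1.0 / np.diag(D_e))
    Theta = Dv_isqrt @ H @ W @ De_inv @ H.T @ Dv_isqrt
    n = H.shape[0]
    A = np.linalg.inv(np.eye(n) - Theta / (1.0 + mu))
    return (mu / (1.0 + mu)) * (A @ y)
```

Only the relative order of the scores matters for the final ranking, so the constant factor $\mu/(1+\mu)$ could be dropped in practice.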

3.2. Unified Probabilistic Hypergraph Ranking Model

In the following we explain our unified hypergraph formulation for product retrieval and ranking. In a typical online shopping system there are three different types of information representing a product: the product image, the product name and description, and the product labels, which are discussed in detail in Section 4. With these three types of information we design 6 types of hyperedges. Each image in the product image dataset is considered a vertex in the unified hypergraph. Let $I$ denote the product image pool and $i \in I$ a particular product image. Let $F$ denote the visual feature words (visual words) describing the images, let $S$ denote the set of product styles, and let $N$ denote the names and descriptions of the products. The unified hypergraph contains 6 types of hyperedges representing the following implicit relations (a construction sketch is given after this list):

(1) $E_{FSN}$ (the set of feature-style-name hyperedges): the products share the same product name, product style, and visual feature word.
(2) $E_{FS}$ (the set of feature-style hyperedges): the products belonging to a certain product style contain the same visual feature word.
(3) $E_{FN}$ (the set of feature-name hyperedges): the products containing the same visual feature word share a common keyword in their names.
(4) $E_{F}$ (the set of visual feature hyperedges): the product images contain the same visual feature word.
(5) $E_{S}$ (the set of style hyperedges): the products belong to the same product style.
(6) $E_{N}$ (the set of name hyperedges): the products have similar keywords in their names and descriptions.
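As referenced above, here is a small sketch of how the label-driven hyperedge types could be assembled from product metadata (our own simplified Python; the field names style, name_tokens, and visual_words are hypothetical stand-ins for the dataset described in Section 4):

```python
from collections import defaultdict
import numpy as np

# Hypothetical product records standing in for the real e-commerce metadata.
products = [
    {"style": "dress",    "name_tokens": {"lace", "midi"}, "visual_words": {3, 17}},
    {"style": "dress",    "name_tokens": {"lace", "maxi"}, "visual_words": {3, 42}},
    {"style": "jumpsuit", "name_tokens": {"tuxedo"},       "visual_words": {17, 8}},
]

def incidence_from_groups(groups, n_products):
    """Turn {group key: set of product indices} into a binary incidence block."""
    H = np.zeros((n_products, len(groups)))
    for j, members in enumerate(groups.values()):
        for i in members:
            H[i, j] = 1.0
    return H

def group_by(products, key_fn):
    """Collect product indices that share the same grouping key."""
    groups = defaultdict(set)
    for i, p in enumerate(products):
        for key in key_fn(p):
            groups[key].add(i)
    return groups

# E_S: one hyperedge per product style.
H_S = incidence_from_groups(group_by(products, lambda p: [p["style"]]), len(products))
# E_N: one hyperedge per shared name keyword.
H_N = incidence_from_groups(group_by(products, lambda p: p["name_tokens"]), len(products))
# E_FS: one hyperedge per (visual word, style) pair; E_FN and E_FSN are analogous.
H_FS = incidence_from_groups(
    group_by(products, lambda p: [(f, p["style"]) for f in p["visual_words"]]),
    len(products))
```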

Typically we assign 1 to the weights of these hyperedges. Rather than the traditional hypergraph structure, in which an image vertex is assigned to a hyperedge in a binary way (i.e., $h(v, e)$ is either 1 or 0), we propose a probabilistic hypergraph to describe the relation between a vertex and a hyperedge. For the visual feature hyperedges $E_{F}$, each image vertex is treated as a centroid, and the hyperedge is formed by the centroid image and its $k$-nearest neighbors. The incidence matrix of the probabilistic hypergraph is defined as
$$h(v, e) = \begin{cases} \operatorname{Sim}(v, v_c), & \text{if } v \in e, \\ 0, & \text{otherwise,} \end{cases}$$
where $v_c$ is the centroid of $e$ and $\operatorname{Sim}(\cdot, \cdot)$ is the visual similarity between two images. In the proposed formulation, a vertex $v$ is softly assigned to a hyperedge $e$ based on the similarity between $v$ and $v_c$, which overcomes the truncation loss of the binary assignment. In addition, we use a threshold parameter to set the required similarity.

Figure 2 demonstrates an example of constructing such hyperedges: each vertex and its top 3 most similar neighbors form a hyperedge, with the constraint that only vertex pairs with a similarity larger than 0.4 are connected into a hyperedge. The corresponding incidence matrix of the proposed probabilistic hypergraph follows from the definition above.
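The sketch below (our own NumPy illustration) builds such probabilistic visual-feature hyperedges from a pairwise visual similarity matrix, using each vertex with its top-3 neighbors and the 0.4 similarity threshold of the Figure 2 example:

```python
import numpy as np

def probabilistic_incidence(Sim, k=3, threshold=0.4):
    """One hyperedge per centroid vertex: the centroid plus its k most similar
    neighbors whose similarity to the centroid exceeds the threshold.
    Entries are the similarities Sim(v, centroid) instead of binary values."""
    n = Sim.shape[0]
    H = np.zeros((n, n))                 # column c is the hyperedge centered at vertex c
    for c in range(n):
        H[c, c] = 1.0                    # the centroid fully belongs to its own hyperedge
        order = np.argsort(-Sim[c])      # other vertices sorted by similarity to c
        neighbors = [v for v in order if v != c][:k]
        for v in neighbors:
            if Sim[c, v] > threshold:
                H[v, c] = Sim[c, v]      # soft assignment h(v, e_c) = Sim(v, centroid)
    return H
```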

With the hyperedges designed above, we can form the unified hyperedge weight matrix $W$ covering the 6 types and the vertex-hyperedge incidence matrix $H$. The size of both matrices depends on the cardinality of the product image dataset involved, and both are sparse, so the computation of the proposed hypergraph ranking algorithm is fast. The algorithm is implemented in two stages: offline training and online ranking. In the offline training stage, we construct the unified hypergraph with the matrices $H$ and $W$ derived above. Based on these matrices, we calculate the vertex degree matrix $D_v$ and the hyperedge degree matrix $D_e$. Finally, $A = \left( I - \frac{1}{1+\mu} \Theta \right)^{-1}$ can be computed, where $\Theta = D_v^{-1/2} H W D_e^{-1} H^{T} D_v^{-1/2}$. Note that this matrix is invertible, since the hyperedge construction ensures that it is full rank. The online ranking procedure is then as follows: first build the query vector $y$, in which the entries of the preranked relevant images are set to 1 and the others to 0, and second compute the ranking score vector $f^{*} = \frac{\mu}{1+\mu} A y$. The procedures are described in Algorithm 1.

Algorithm 1 (the unified probabilistic hypergraph ranking algorithm description).
Input. The inputs are the initial query vector $y$, the visual similarity matrix $\operatorname{Sim}$, the matrices of textual features (product styles and product names/descriptions), and the matrix of visual features for all product vertices.
Output. The output is the vector of optimal ranking scores $f^{*}$ over the product vertices:
(1) Construct the vertex-hyperedge incidence matrix of the hyperedges $E_{F}$ based on the probabilistic incidence definition in Section 3.2.
(2) Construct the vertex-hyperedge incidence matrices of the hyperedges $E_{FSN}$, $E_{FS}$, $E_{FN}$, $E_{S}$, and $E_{N}$ based on the binary incidence definition in Section 3.1.
(3) Form the unified incidence matrix $H$ by concatenating the matrices from steps (1) and (2).
(4) Calculate the vertex degree matrix $D_v$ and the hyperedge degree matrix $D_e$ using $H$ and $W$.
(5) Compute $A = \left( I - \frac{1}{1+\mu} \Theta \right)^{-1}$, where $\Theta = D_v^{-1/2} H W D_e^{-1} H^{T} D_v^{-1/2}$.
(6) Compute the optimal $f^{*} = \frac{\mu}{1+\mu} A y$.
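Putting the pieces together, a compact sketch of Algorithm 1 under the assumptions of the earlier snippets (our own illustration; the incidence blocks H_F, H_FSN, and so on are assumed to have been built as sketched above, and every vertex is assumed to appear in at least one hyperedge):

```python
import numpy as np

def unified_hypergraph_rank(H_blocks, y, mu=0.001):
    """Steps (3)-(6) of Algorithm 1: concatenate the per-type incidence blocks,
    derive the degree matrices, and evaluate the closed-form ranking scores."""
    H = np.hstack(H_blocks)                       # step (3): unified incidence matrix
    w = np.ones(H.shape[1])                       # unit hyperedge weights, as in the paper
    d_v = H @ w                                   # step (4): vertex degrees
    delta_e = H.sum(axis=0)                       # step (4): hyperedge degrees
    Theta = (np.diag(1.0 / np.sqrt(d_v)) @ H @ np.diag(w)
             @ np.diag(1.0 / delta_e) @ H.T @ np.diag(1.0 / np.sqrt(d_v)))
    A = np.linalg.inv(np.eye(len(y)) - Theta / (1.0 + mu))   # step (5), computed offline
    return (mu / (1.0 + mu)) * (A @ y)                       # step (6): optimal scores f*

# Online usage sketch: y marks the preranked visually similar images with 1,
# all other entries with 0, and the highest scores give the final ranking.
# scores = unified_hypergraph_rank([H_F, H_FSN, H_FS, H_FN, H_S, H_N], y)
```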

4. Experimental Results

In the experiments, we build the unified image hypergraph using different combinations of hyperedges to test the effect of each factor on the ranking performance, and we then investigate the performance of the resulting hypergraphs. The superiority of transductive inference is demonstrated in handling queries that lack user labels. We use visual-similarity-based ranking as a baseline and compare the different hypergraph-based ranking models against it. We also use the visual similarity ranking scores to derive the preranked scores in the hypergraph ranking.

For an online shopping system, a product is represented by three types of information, as shown in Table 1: images, which present the product visually (usually several photos taken from different viewpoints); the name, which names the product or gives a brief description of it; and labels, which are the textual tags that classify the product into different categories according to the sorting rules. For example, apparel products have categories such as style, length, sleeve length, and occasion.

The product image dataset used in the experiments is obtained from a list of prominent women's apparel brands. It contains 3 product categories, 58 brands, and 4210 images. We use the dress categories type, length, and sleeve length to form the set of product styles, which contains 7 types, 3 lengths, and 6 sleeve lengths. The product name consists of the product brand, its style name, and a short description; we generate a bag of words to represent it. For the visual features, we first extract a color-boosted SIFT feature [26], which captures the product color and its local patterns, and then quantize the visual feature descriptors into 65 visual words. For the neighborhood parameter $k$ and the regularization factor $\mu$, we follow the settings in [4], where they are empirically set to 100 and 0.001, respectively. We choose the Normalized Discounted Cumulative Gain (NDCG) [27] to evaluate the ranking performance. NDCG is widely adopted in machine learning approaches to ranking and is designed for nonbinary notions of relevance. In our research, an experiment participant is asked to judge the relevance of each retrieval result to the query. Each returned image is judged on a scale of 0-3, with rel = 0 meaning irrelevant, rel = 3 meaning completely relevant, and rel = 1 and rel = 2 meaning "somewhere in between." NDCG at position $p$ is defined as
$$\mathrm{NDCG}@p = Z_p \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2 (i + 1)},$$
where $rel_i$ is the relevance grade of the result at position $i$ and $Z_p$ is a normalization constant chosen so that the ideal ranking has an NDCG of 1. In our experiment, the relevance is assessed over a subset of the collection formed from the top results returned by the different ranking methods. Human judgments are subjective, idiosyncratic, and variable; thus, we employ the kappa statistic [28] to evaluate the degree of agreement between judges:
$$\kappa = \frac{P(A) - P(E)}{1 - P(E)},$$
where $P(A)$ is the proportion of the times the judges agreed and $P(E)$ is the proportion of the times they would be expected to agree by chance.
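For reference, a minimal sketch of the two evaluation measures as defined above (our own Python; the relevance grades and judge labels in the example calls are hypothetical):

```python
import numpy as np

def ndcg_at_p(relevances, p):
    """NDCG@p with graded relevance (0-3): the DCG of the returned ranking
    normalized by the DCG of the ideal (descending) ordering."""
    rel = np.asarray(relevances[:p], dtype=float)
    discounts = np.log2(np.arange(2, len(rel) + 2))          # log2(i + 1) for i = 1..p
    dcg = np.sum((2.0 ** rel - 1.0) / discounts)
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:p]
    idcg = np.sum((2.0 ** ideal - 1.0) / discounts[:len(ideal)])
    return dcg / idcg if idcg > 0 else 0.0

def kappa(judge_a, judge_b):
    """Kappa statistic for two judges' binary relevant/irrelevant labels."""
    a, b = np.asarray(judge_a), np.asarray(judge_b)
    p_agree = np.mean(a == b)                                # P(A): observed agreement
    p_chance = (np.mean(a == 1) * np.mean(b == 1)            # P(E): agreement expected
                + np.mean(a == 0) * np.mean(b == 0))         #       purely by chance
    return (p_agree - p_chance) / (1.0 - p_chance)

# Hypothetical example: grades for the top-10 results of one query,
# and two judges' binary relevance labels for six results.
print(ndcg_at_p([3, 2, 3, 0, 1, 2, 3, 0, 0, 1], p=10))
print(kappa([1, 1, 0, 1, 0, 1], [1, 0, 0, 1, 0, 1]))
```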

In our proposed method, we integrate the different relations and hyperedge types defined above into the construction of the product hypergraph so that it effectively represents the product image dataset. The hypergraph also encodes multiple correlations among the different visual words and textual features. To evaluate the effectiveness of this representation in product search, we consider hypergraph constructions with different hyperedge combinations. Figure 3 illustrates the ranking performance in terms of average NDCG at depths of 10, 20, and 30. It is evident that the hybrid hypergraphs (FSN, FN, and FS) outperform the simple hypergraph construction (F) and the visual-similarity-based ranking (NN), and the proposed unified hybrid hypergraph FSN achieves the best performance. The reason is straightforward: high-order correlations among product visual features and textual labels are well captured in our unified hypergraph model, so the representation and description of each product in the database are substantially enhanced.

Figure 4 demonstrates an example query for which the system cannot find the best match within the top 10 results. With similarity ranking, a black tuxedo jumpsuit is matched to dresses, pants, and coats, whereas with the proposed unified probabilistic hypergraph learning ranking, the system returns a series of products with similar styles, which is meaningful for online shoppers. The reason is that we not only capture the visual and textual features separately but also model the correlations between them, which produces improved search results.

Then, we use the kappa statistic to assess the agreement of the relevance judgments. Here rel = 1 and rel = 2 are considered as agreement. If two judges agree on the relevance of all results, kappa = 1; if they agree only at the rate expected by chance, kappa = 0; and kappa < 0 if they agree at a rate below chance. Table 2 gives an example of the relevance ratings given by two judges on our experimental results. The kappa value between judge 1 and judge 2, calculated with the definition above, is 0.7740. The kappa values for all pairs of judges are evaluated in the same way, and their average pairwise kappa value is 0.7483. The level of agreement in our experiments falls in the range considered "fair" (0.67-0.8). Therefore, the evaluation results of our experiments can be considered reliable.

5. Conclusion

In this paper, we address the problem of ranking in product search by image, focusing on the integration of various types of product textual information with visual images. We introduce a hypergraph learning approach to visual product search and propose a more comprehensive and robust ranking model, in which supervised classification and unsupervised visual search are well balanced. Specifically, we construct the hypergraph by combining three types of product information, embedding the relevance among textual features and visual images. Experimental results show that the proposed hypergraph learning framework is a promising ranking scheme for product search. In future work we will explore adaptive feature weighting and other hypergraph learning operators.

Competing Interests

The authors declare that they have no competing interests.