Abstract

Image retrieval remains a significant problem in computer vision, and its recent resurgence relies on postprocessing methods to improve accuracy rather than on good feature representation alone. Our method addresses shape retrieval of binary images. This paper proposes a new integration scheme that combines feature representation with contextual information. For feature representation we use the articulation invariant representation; dynamic programming is then used for shape matching, followed by a manifold learning based postprocessing step, the modified mutual NN graph, to further improve the similarity scores. We conducted extensive experiments on the widely used MPEG-7 database of shape images, measuring the so-called bulls-eye score with and without the normalization step of the modified mutual NN graph; the results clearly indicate the importance of normalization. Our method achieved better results than the other methods compared. We also compared computational time against another graph transduction method, which shows that our method is far faster. Furthermore, to show the consistency of the postprocessing method, we performed experiments on the challenging ORL and YALE face datasets and improved the baseline results.

1. Introduction

Content based image retrieval (CBIR) has remained an important and challenging problem in computer vision for decades [1]. A CBIR system takes a query, such as an image, a region, or a sketch, and returns the most similar images from a database based on the similarity of low level features instead of textual annotations or media metadata [24]. Generally, image retrieval is divided into two steps. The first step extracts shape, texture, and color features of an image or object; the second step measures the similarity of the query image to all database images and ranks the images accordingly. Detailed discussions can be found in [14]. Our method focuses on shape based feature extraction and retrieval of binary images. Although color images are now abundant, binary images cannot be neglected, as they are widely used in trademark images, patent images, and technical drawings and in applications such as medical imaging [5], botanical collections [6], and road signs [7]. Shape is a primary source of information for feature extraction in binary as well as color images. Shape images are taken directly from binary (i.e., black and white) images or obtained by thresholding gray images, so that one side of the threshold is white and the other is black; they carry little or no texture information and no color information. Traditionally, feature extraction or feature representation has been the crucial step for improving the retrieval rate. Lately, some similarity measures have also improved retrieval accuracy. More recently, postprocessing methods such as graph transduction, cotransduction, spectral matching, and metasimilarity have contributed significantly to retrieval accuracy [8–11].

In this paper we propose integrating the articulation invariant representation (AIR) feature extraction method with a postprocessing manifold learning method, the modified mutual NN graph, to improve the shape retrieval score using context information. First, we extract the features of each shape using the AIR method [12]. Dynamic programming (DP) is then used to measure pairwise shape similarity. To the distances obtained by DP we apply the modified mutual NN graph [13]. To demonstrate the efficiency and effectiveness of our method we conducted experiments on the popular MPEG-7 database, measuring accuracy with the so-called bulls-eye score. Our method improved the baseline results of the AIR method and compared favorably with rival methods. We also measured computational time, which is another challenging aspect of content based image retrieval. To assess the computational efficiency of our method we integrated the AIR representation with a graph transduction method [14]; this yielded almost the same retrieval accuracy but at a far higher computational cost, which shows that our method is promising and efficient.

To the best of our knowledge, AIR is to date the baseline method for feature representation of planar shapes, especially on the challenging MPEG-7 database. Although the histogram based shape context (SC) [15] remains a popular feature representation for shape images, it lacks articulation invariance and its accuracy is lower than that of AIR. For shape matching or similarity measurement, DP is robust to outliers and noise and is invariant to scale, rotation, and translation. It has also proved to be an accurate matching algorithm. Since similarity measures are not ideal for separating intraclass and extraclass distances, we use a postprocessing method, the modified mutual NN graph, to obtain similarities with respect to context information. It yields affinity-wise similarities, and in our method these similarities give very good retrieval results compared to other methods. Moreover, the modified mutual NN graph is computationally inexpensive.

We also reproduced the results of inner distance shape context (IDSC) with the modified mutual NN graph [13], which were clearly worse than those of our method. To further demonstrate the consistency of the modified mutual NN graph, we integrated it with different feature representations on the ORL and YALE face databases and improved the respective baseline results.

There are three subproblems in shape based image retrieval. The first is feature extraction for a good representation of shape. The second is similarity measurement or shape matching, which calculates the distance between the query image and the other images in the database and ranks those images with respect to the query. The third is postprocessing to obtain similarity with respect to context information. Our discussion is organized around these three problems.

Feature extraction and shape matching are very active research areas in the literature [16]. Shape similarity has also been well analyzed in the field of psychology [17]. Many descriptors and similarity methods have been reported, but we focus on methods for silhouette images [18]. Based on shape contours, curvature scale space (CSS) and several variants are used for simple binary images [19–21]. The CSS method removes irrelevant inflections and then examines the characteristics of the object. Recently, many efforts have been dedicated to overcoming the drawbacks of CSS, such as computational cost [21] and inaccurate similarity measures [20, 21].

Shape context (SC), proposed by Belongie et al., describes the relative distribution of contour points in terms of distance and orientation [15]. SC combined with thin plate splines (TPS) is considered to be discriminative. Generally, SC achieves good accuracy at the cost of high computational time in feature extraction and image matching. SC was extended by adding statistics of tangent vectors at landmark points [22]. Ling and Jacobs [23] extended SC by using the geodesic distance between contour points instead of the Euclidean distance to measure the spatial relations between points on shapes. On the one hand, deformation, pose, and self-occlusion cause large variation within a shape class; on the other hand, different shape examples may share identical components, and the variation can be addressed by employing different configurations of these components [24, 25]. Height Function (HF), a shape descriptor invariant to geometric transformations, noise, and occlusion, is used for shape feature extraction [26]. In HF, a height function is calculated at each predefined sample point of the object contour and is then smoothed for robustness. The contour flexibility method was developed for extracting both local and global features [27]. Bag of contour fragments (BCF), an extension of Bag-of-Words (BoW) to shape images, was used in [28]. In BCF, a shape is divided into a set of contour fragments, and each fragment is described by SC and then encoded into a shape code. BCF worked well with a trainable classifier but performed poorly in shape matching.

For shape matching, Latecki and Lakämper [29] utilized visual parts made up of simplified polygons of contours. A feature-driven generative model for probabilistic shape matching was proposed by Tu and Yuille [30]. Siddiqi et al. [31] performed shape matching using a graph based method. A dynamic and hierarchical curve matching method was introduced by Felzenszwalb and Schwartz [24] to avoid the problems associated with purely local or purely global methods. A multiscale representation of triangle areas, which captures both partial and global shape information, was introduced for shape matching by Alajlan et al. [32]. Based on SC, a symbolic descriptor was defined by Daliri and Torre, and an edit distance was used for the final matching to overcome the problems caused by occlusion and deformation [33]. Following the BoW format, a shape vocabulary was used as a shape descriptor [34]. It addresses the issue of time consuming shape matching and speeds up the matching process, because distances between global descriptors are compared instead of time consuming local features. The edge orientation autocorrelogram (EOAC) [35] is another method for binary images. In EOAC, image edge orientations are quantized and used as input to a two-dimensional histogram. EOAC was used in the PATSEEK search engine, developed by the US patent office, in an effort to analyze the US patent image database efficiently [36].

As content based image retrieval resurges, graph based methods are being developed alongside feature extraction methods and similarity measures. After a good feature representation has been obtained and a matching algorithm has been applied to produce a similarity matrix, manifold learning has been used as a postprocessing method in the context of image retrieval. Zhou et al. [8] proposed improving the ranking of retrieved shapes by exploiting the data manifold structure. In other words, manifold learning improves the ranking results by preserving the context information of the closest objects in the database. Bai et al. [9] proposed graph transduction, a label propagation technique originally developed for semisupervised learning, to exploit context information in silhouette retrieval. It combines various similarity and dissimilarity measures to better establish the relationships among objects. Furthermore, in the work of Kontschieder et al. [13], performance was improved by exploiting context information on top of the inner distance shape context representation. Our work is similar to that of Kontschieder et al., but it uses the articulation invariant representation instead of inner distance shape context, and it outperforms the aforementioned method.

3. Proposed Work

Our proposed work consists of two parts. First, we obtain a feature representation using the articulation invariant representation (AIR) method [12]. Second, after computing the pairwise distance matrix by dynamic programming (DP), we apply the modified mutual NN graph to improve the similarities. Our method is depicted in Figure 1.

3.1. Feature Representation

For feature representation we used AIR, which is invariant to geometric transformations such as scale, translation, and rotation. In addition, it is invariant to articulation. Another reason for using AIR is that it outperformed other feature representation methods on shapes, especially on the MPEG-7 database with its large intraclass variation. Consider a shape $S$ such that $S = \{\bigcup_{i} P_i\} \cup \{\bigcup_{i \neq j} Q_{ij}\}$, where $P_i$ is a part of shape $S$ and $Q_{ij}$ is the junction between parts $P_i$ and $P_j$. Given $M$ conditions of the 2D shape, $\{S_k\}_{k=1}^{M}$, a feature representation $R$ can be found which meets the criterion of (1). The notation is as follows:

$S$: 2D projection
$M$: fixed number of conditions
$P$: parts of 2D shape $S$
$Q$: junction of shape $S$
$u_k$, for all $k = 1$ to $M$: set of points that constitute $S_k$
$R$: the shape representation of $S$
$d$: distance
$\mathrm{ID}$: inner distance
$D$: shape distance
$W$: affinity distance
$\epsilon_k$: total error
$\epsilon_{P_k}$: error due to the weak perspective of the real world
$\epsilon_{D_k}$: distance error
$\epsilon_{S_k}$: projection error
$T$: part-wise affine normalization to perform the transformation
$h$: histogram of a point
$N_h$: total number of bins in histogram $h$
$V$: vertices of the graph
$E$: edges of the graph
$G$: modified mutual NN graph with vertices $V$ and edges $E$

Consider

$$R(S_i) = R(S_j), \quad \forall\, 1 \le i, j \le M. \tag{1}$$

If $u_{1k}, u_{2k} \in S_k$ are two points, then we seek a distance $d$ such that

$$d(u_{1k}, u_{2k}) = c, \quad \forall\, k = 1, \ldots, M. \tag{2}$$

Here, $c$ is a constant. Using such a distance we obtain a feature representation $R$ that satisfies (1). Because of variations in viewpoint, part-wise affine normalization is performed to cope with such changes in shape $S_k$. This normalization essentially finds a transformation $T$, so that every part $P_{ik}$ is transformed as $T(P_{ik}) \to P'_{ik}$. To handle articulation changes in $S_k$ we compute the distance between two points using the inner distance (ID), which is articulation invariant:

$$d(u_{1k}, u_{2k}) = \mathrm{ID}(u'_{1k}, u'_{2k}), \tag{3}$$

where $u'_{1k}$ and $u'_{2k}$ are the points after normalization. Ideally, the distance in (3) would yield a representation satisfying (1), but in practice there is an error:

$$\mathrm{ID}(u'_{1k}, u'_{2k}) = c + \epsilon_k. \tag{4}$$

In (4), $\epsilon_k$ is an error with several sources:

$$\epsilon_k = \epsilon_{P_k} + \epsilon_{D_k} + \epsilon_{S_k}. \tag{5}$$

In (5), $\epsilon_{P_k}$ arises from the weak perspective approximation of the complex real world, $\epsilon_{D_k}$ arises in the inner distance when the path connecting two points passes over a junction, and $\epsilon_{S_k}$ denotes the error of projecting the 3D object onto the 2D image. Because of varying viewpoints, affine normalization is performed:

$$T \colon \Pi(P_{ik}) \to [0, 1]^2,$$

where $\Pi(P_{ik})$ denotes the minimum enclosing parallelogram of part $P_{ik}$, which $T$ maps onto the unit square. $T$ is thus an affine transformation consisting of scale, translation, rotation, and shear. After the transformation, the relation between two points is computed using the inner distance (ID) and the inner angle (IA). Finally, a shape context descriptor, the histogram $h$, is built at each point $u'_i$:

$$h_i(t) = \#\{\, u'_j : j \neq i,\ (\mathrm{ID}(u'_i, u'_j), \mathrm{IA}(u'_i, u'_j)) \in \mathrm{bin}(t) \,\}, \quad 1 \le t \le N_h.$$

Here, $N_h$ denotes the number of bins. A total of 60 bins were used: 12 distance bins times 5 angular bins. The feature representation satisfying (1) can now be constructed as $R(S_k) = [h_1, h_2, \ldots, h_m]$, where $m$ is the number of contour points of $S_k$.
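The binning step described above can be illustrated with a short Python sketch. It assumes that the inner distances and inner angles from one contour point to all other points have already been computed, and it assumes log-spaced distance bins over a hypothetical range `[r_min, r_max]`, a common choice for shape context descriptors that the text does not specify. The function name and all parameters are illustrative and are not taken from the AIR implementation.

```python
import math

def shape_context_histogram(distances, angles, n_dist_bins=12, n_angle_bins=5,
                            r_min=0.125, r_max=2.0):
    """Histogram of one contour point over (distance, angle) bins.

    distances, angles: inner distance and inner angle (radians) from the
    reference point to every other contour point. r_min and r_max are
    hypothetical bounds for the log-spaced distance bins (an assumption).
    """
    hist = [0] * (n_dist_bins * n_angle_bins)
    log_min, log_max = math.log(r_min), math.log(r_max)
    for d, theta in zip(distances, angles):
        if d <= 0:
            continue  # skip the reference point itself
        d = min(max(d, r_min), r_max)  # clamp into the binned range
        frac = (math.log(d) - log_min) / (log_max - log_min)
        d_bin = min(int(frac * n_dist_bins), n_dist_bins - 1)
        a_frac = (theta % (2 * math.pi)) / (2 * math.pi)
        a_bin = min(int(a_frac * n_angle_bins), n_angle_bins - 1)
        hist[d_bin * n_angle_bins + a_bin] += 1
    return hist

# Example: two valid neighbours and one zero-distance entry that is ignored.
example_hist = shape_context_histogram([0.125, 2.0, 0.0], [0.0, 6.2, 1.0])
```

With the default arguments the histogram has 12 × 5 = 60 entries, matching the bin counts stated in the text.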

3.2. Similarity Measure

For shape retrieval, the similarity or dissimilarity of shapes, also known as the pairwise shape distance, is usually computed by finding the best possible correspondence between contour points. The resulting distances are used to rank the shapes in the database. We computed the shape distance by DP and obtained a distance matrix $D$ for the subsequent postprocessing method. The DP method is invariant to translation, rotation, and scale and is robust against outliers and noise. This and other similarity measures are nonmetric and violate the triangle inequality. DP matching proceeds as follows. Consider two shapes $A$ and $B$ with contour point sequences $p_1, p_2, \ldots, p_n$ for $A$ and $q_1, q_2, \ldots, q_m$ for $B$. A matching $\pi$ between $A$ and $B$ is a mapping from $\{1, \ldots, n\}$ to $\{0, 1, \ldots, m\}$, where $p_i$ is matched to $q_{\pi(i)}$ if $\pi(i) \neq 0$ and is left unmatched otherwise. The matching $\pi$ is chosen to minimize the matching cost

$$H(\pi) = \sum_{1 \le i \le n} c(i, \pi(i)).$$

Here, $c(i, j)$ is the cost of matching $p_i$ to $q_j$, calculated as

$$c(i, j) = \frac{1}{2} \sum_{1 \le t \le N_h} \frac{[h_{A,i}(t) - h_{B,j}(t)]^2}{h_{A,i}(t) + h_{B,j}(t)}.$$

Here, $h_{A,i}$ and $h_{B,j}$ are the shape context histograms of $p_i$ and $q_j$, and $N_h$ is the number of histogram bins.
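A minimal Python sketch of this matching step follows. It assumes a chi-square comparison of the shape context histograms, an order-preserving alignment with a fixed start point, and a constant penalty for each point of the first shape that is left unmatched; the penalty value of 0.3 and the function names are hypothetical, and the search over start points that a full implementation would need for rotation invariance is omitted.

```python
def chi2_cost(h_a, h_b):
    """Chi-square cost between two histograms (empty bins are skipped)."""
    total = 0.0
    for x, y in zip(h_a, h_b):
        if x + y > 0:
            total += (x - y) ** 2 / (x + y)
    return 0.5 * total

def dp_match_cost(hists_a, hists_b, penalty=0.3):
    """Minimal order-preserving matching cost between two point sequences.

    hists_a, hists_b: one histogram per contour point, in contour order.
    penalty: assumed cost of leaving a point of the first shape unmatched.
    """
    n, m = len(hists_a), len(hists_b)
    # best[i][j]: minimal cost of handling the first i points of A
    # using only the first j points of B.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = best[i - 1][j - 1] + chi2_cost(hists_a[i - 1], hists_b[j - 1])
            skip_a = best[i - 1][j] + penalty  # p_i left unmatched
            skip_b = best[i][j - 1]            # q_j not used, no cost
            best[i][j] = min(match, skip_a, skip_b)
    return best[n][m]

# Example: identical sequences match at zero cost.
seq = [[1.0, 0.0], [0.0, 1.0]]
same_cost = dp_match_cost(seq, seq)
```

Running this for every pair of shapes would fill the pairwise distance matrix that the postprocessing step takes as input.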

Ideally, dissimilarities should be high between shapes of different classes and low between shapes of the same class. The similarity measure does not fully satisfy this requirement, which is why the more thorough analysis performed by the modified mutual NN graph improves accuracy.

3.3. Modified Mutual NN-Graph

To improve the retrieval rate and to obtain the affinity matrix $W$, we used the modified mutual NN graph, a manifold learning method, as a postprocessing step. It behaves consistently on shapes as well as on general images, and it is computationally very cheap. Such postprocessing methods improve a similarity matrix whose entries have been calculated among all images of the database. Two steps improve these precalculated matrices. The first is the normalization of the distance matrix, which transforms the distance matrix $D$ into an affinity matrix $W$; the second is a more thorough analysis of the pattern of objects, such as finding shortest paths in nearest neighbor graphs.

Given a distance matrix $D$ with entries $d_{ij}$, we apply element-wise normalization to obtain the affinity matrix $W$ with entries $w_{ij}$:

$$w_{ij} = \exp\left(-\frac{d_{ij}^2}{\sigma_{ij}}\right).$$

Here, $\sigma_{ij}$ is an element-wise normalization parameter, $D$ is the distance matrix, and $W$ is the affinity matrix. The normalization parameter is defined as

$$\sigma_{ij} = \sigma_i \sigma_j, \quad \sigma_i = d_{iK(i)}. \tag{12}$$

In (12), $K(i)$ denotes the $K$th nearest neighbor of object $i$. Distances between image features are not necessarily metric and are not confined to the range 0 to 1. The importance of normalization for improving retrieval has been discussed in the literature [37]. After normalization, the distance matrix has been converted into an affinity matrix, and the underlying structure of the data can be captured by defining neighborhoods in a graph. The graph is then used to match similar objects according to this underlying structure.
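The normalization step can be sketched in Python as follows. The sketch assumes a local-scaling form in which each object is assigned a scale equal to the distance to its k-th nearest neighbor and the affinity is exp(-d² / (scale_i · scale_j)); this specific form, the default `k`, and the function name are assumptions made for illustration.

```python
import math

def affinity_from_distances(dist, k=10):
    """Convert a square distance matrix into an affinity matrix.

    dist: list of lists of pairwise distances.
    k: which nearest neighbour sets the local scale (hypothetical default).
    """
    n = len(dist)
    if n < 2:
        raise ValueError("need at least two objects")
    # sigma[i]: distance from object i to its k-th nearest neighbour,
    # excluding the object itself.
    sigma = []
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        sigma.append(others[min(k, n - 1) - 1])
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            scale = sigma[i] * sigma[j]
            if scale > 0:
                affinity[i][j] = math.exp(-dist[i][j] ** 2 / scale)
            else:
                affinity[i][j] = 1.0 if i == j else 0.0
    return affinity

# Example with three objects and the nearest neighbour as local scale.
example_affinity = affinity_from_distances(
    [[0, 1, 4], [1, 0, 3], [4, 3, 0]], k=1)
```

Because each entry is scaled by the neighborhoods of both objects, the resulting affinities lie between 0 and 1 regardless of the range of the input distances.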

Ideally, normalization would yield clearly separated, embedded submanifolds of matching shapes, but normalization methods are not optimal; thus we apply an analysis step after normalization. Graph based methods are used for this analysis to find the local underlying structure. Interdependence between objects can be found from connected regions in these graphs, for example, by detecting shorter paths among similar shapes than among nonmatching shapes. We define the neighborhood graph as a modified mutual NN graph. The graph $G = (V, E)$ consists of vertices $V$ and edges $E$. Each shape is represented by a vertex $v_i \in V$. Edges are defined by the nonnegative affinity matrix $W$, where $w_{ij} = 0$ indicates that vertices $v_i$ and $v_j$ are not connected. In the modified mutual NN graph, $v_i$ and $v_j$ are connected when

$$v_i \in \mathrm{NN}_{k}(v_j) \quad \text{and} \quad v_j \in \mathrm{NN}_{\kappa k}(v_i),$$

where $k$ is the number of nearest neighbors of a vertex, $\mathrm{NN}_{k}(v)$ is the set of the $k$ nearest neighbors of $v$, and $\kappa$ is the asymmetry coefficient. In addition, the edges are described by

$$e_{ij} = \begin{cases} w_{ij}, & \text{if } v_i \text{ and } v_j \text{ are connected}, \\ 0, & \text{otherwise}. \end{cases}$$

For shape query retrieval, the path lengths between connected objects in the graph are used.
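The graph construction and the path-length ranking can be sketched in Python as follows. This is an illustration under stated assumptions, not the reference implementation: two objects are linked when one is among the `k` most similar objects of the other and the other is among the `kappa * k` most similar objects of the first, and an affinity `w` is converted to a path cost `1 - w` so that similar objects are close. Both the exact connection rule and the cost conversion are assumptions, and all function names are hypothetical.

```python
import heapq

def knn_sets(aff, size):
    """For every object, the set of its `size` most similar other objects."""
    n = len(aff)
    result = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: -aff[i][j])
        result.append(set(order[:size]))
    return result

def modified_mutual_knn_graph(aff, k, kappa=1.0):
    """Adjacency dict {i: {j: cost}} of the assumed modified mutual NN graph."""
    n = len(aff)
    tight = knn_sets(aff, k)
    loose = knn_sets(aff, int(round(kappa * k)))
    edges = {i: {} for i in range(n)}
    for i in range(n):
        for j in tight[i]:
            if i in loose[j]:
                cost = 1.0 - aff[i][j]  # assumed affinity-to-cost conversion
                edges[i][j] = cost
                edges[j][i] = cost
    return edges

def path_lengths(edges, source):
    """Dijkstra shortest path lengths from `source` to all reachable objects."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in edges[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Example: two tight pairs, {0, 1} and {2, 3}, that are not linked to each other.
example_aff = [[1.0, 0.9, 0.1, 0.2],
               [0.9, 1.0, 0.2, 0.1],
               [0.1, 0.2, 1.0, 0.8],
               [0.2, 0.1, 0.8, 1.0]]
example_graph = modified_mutual_knn_graph(example_aff, k=1)
example_paths = path_lengths(example_graph, 0)
```

Ranking the database for a query then amounts to sorting the objects by their path length from the query vertex; objects that are unreachable in the graph fall to the end of the ranking.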

4. Results and Discussion

Our main experiments integrate AIR with the modified mutual NN graph. We performed experiments on the MPEG-7 shape database, which is described in Section 4.1. Apart from accuracy, we also compared computational time against another postprocessing method, graph transduction [14]. To show the contribution of each of our two parts, AIR feature extraction and the modified mutual NN graph, we performed separate experiments with and without integrating representation learning and manifold learning. To further show the effectiveness of the modified mutual NN graph, we also conducted experiments on the ORL and YALE face databases, described in Sections 4.2 and 4.3, respectively. In each case the baseline results were improved by the postprocessing manifold learning step.

4.1. Retrieval on MPEG-7 Database

We conducted experiments on the popular MPEG-7 database, which consists of 1400 silhouette images, to evaluate retrieval performance. The 1400 images are divided into 70 shape classes of 20 related images each. Intraclass deformations make this database very challenging. Sample images from the MPEG-7 database are shown in Figure 2. Accuracy on this database is measured by the so-called bulls-eye score: for each query image, the 40 nearest matches are retrieved and the number of images among them that belong to the class of the query (at most 20) is counted; the score is the total count over all queries divided by the maximum possible count. Our method uses AIR features, which are more robust; the category-wise accuracy is shown in Figure 3. Our method improved the results of AIR and showed better results than other methods. A comparison of the results of different methods with our method is given in Table 1. Results are also reported before and after the normalization step of the modified mutual NN graph. We achieved a bulls-eye score of 99.47% without normalization and 99.89% with normalization, which clearly indicates the significance of normalization. IDSC is a robust, articulation invariant method, but its accuracy of 85.40% is far below that of our method. AIR achieved 99.55%, which our method improved to 99.89%. AIR with graph transduction also performed well, but its computational time is far too high. Figure 4 shows top 10 retrieved images of MPEG-7 database. The leftmost column contains the query images, while the remaining 19 columns contain images retrieved from the database. The retrieved images show that retrieval is invariant to rotation, translation, scale, and deformation. We also report the accuracy on each class in Figure 5, which shows an improvement in each category; the per-category accuracy is more consistent than in Figure 3.
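The bulls-eye computation can be written as a short Python function. This is a sketch that assumes a full pairwise distance matrix and one class label per image, and that the query itself is counted among its own retrieved matches, which is the usual convention for this score; the function and parameter names are illustrative.

```python
def bulls_eye_score(dist, labels, top=40, class_size=20):
    """Fraction of same-class images found among the `top` nearest matches.

    dist: square matrix of pairwise distances (list of lists).
    labels: class label of every image, in the same order as `dist`.
    """
    n = len(dist)
    hits = 0
    for q in range(n):
        # Indices of the `top` closest images to query q (q itself included).
        ranked = sorted(range(n), key=lambda j: dist[q][j])[:top]
        hits += sum(1 for j in ranked if labels[j] == labels[q])
    return hits / (n * class_size)

# Toy example: four images in two classes of two images each.
toy_labels = [0, 0, 1, 1]
perfect = bulls_eye_score(
    [[0, 1, 5, 6], [1, 0, 6, 5], [5, 6, 0, 1], [6, 5, 1, 0]],
    toy_labels, top=2, class_size=2)
mixed = bulls_eye_score(
    [[0, 5, 1, 6], [5, 0, 6, 1], [1, 6, 0, 5], [6, 1, 5, 0]],
    toy_labels, top=2, class_size=2)
```

For MPEG-7 the defaults apply (`top=40`, `class_size=20`), so the denominator is 1400 × 20 = 28,000; for the face databases the same function can be called with the neighborhood size and class size of the respective dataset.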

We also compared the computational time of our method against another postprocessing method, graph transduction. Computations were performed in MATLAB on a 2.40 GHz Core i5 processor with 4 GB of RAM. The comparison of computational time is given in Figure 6.

Graph transduction is another manifold learning postprocessing method. We applied it to the matrix of pairwise distances. Although its results are promising, its major drawback is computational time. Compared to graph transduction, the modified mutual NN graph is dramatically faster. On the AIR shape distances, the modified mutual NN graph took 10.83 seconds without normalization and 12.07 seconds with normalization. By contrast, graph transduction took 7468.89 seconds on the same AIR shape distances, more than 600 times longer.

4.2. Retrieval on ORL Face Database

The ORL database contains 400 grayscale face images of 40 people, with 10 images per person taken under different illumination, pose, and expression. After obtaining the 400 × 400 distance matrix from the feature representation of [38], we applied the postprocessing modified mutual NN graph. This integration improved the baseline results, which were measured by the bulls-eye score over the top 15 closest neighbors. Results are shown in Table 2.

4.3. Retrieval on Yale Face Database B

Yale face database B is a database of faces with different poses and illumination. It contains 165 face images of 15 subjects, with 11 images per subject taken under different conditions. The feature representation of [38] was used for obtaining the subset. After calculating the distances we applied the modified mutual NN graph. Performance was measured by the bulls-eye score over the top 15 closest matches. Results are shown in Table 3.

5. Conclusion

Our work tackled the problem of shape retrieval. The articulation invariant representation was used to extract features of binary images. After obtaining the similarity matrix by dynamic programming, we applied the postprocessing modified mutual NN graph to exploit context information and obtain similarities among the images. We conducted shape retrieval experiments on the widely used MPEG-7 database of shape images and measured accuracy by the so-called bulls-eye score. We also reported results before and after the normalization step of the modified mutual NN graph, which clearly indicate that normalization improves accuracy. Our method obtained better results than the other methods compared. We also demonstrated that our method is computationally much faster than a competing graph transduction method. To show the consistency of the postprocessing method, we performed experiments on the ORL and YALE face databases, where the baseline results were improved.

In the future we would like to extend this shape retrieval approach to cluttered images using deformation invariant shape matching and a diffusion process.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by National Natural Science Foundation of China (Grant nos. 60973059 and 81171407) and Program for New Century Excellent Talents in University of China (Grant no. NCET-10-0044).