Abstract

The earth mover's distance (EMD) is a measure of the distance between two distributions, and it has been widely used in multimedia information retrieval systems, in particular, in content-based image retrieval systems. When the EMD is applied to image problems based on color or texture, the EMD reflects the human perceptual similarities. However, its computations are too expensive to use in large-scale databases. In order to achieve efficient computation of the EMD during query processing, we have developed “fastEMD,” a library for high-speed feature-based similarity retrievals in large databases. This paper introduces techniques that are used in the implementation of the fastEMD and performs extensive experiments to demonstrate its efficiency.

1. Introduction

The earth mover’s distance (EMD) was introduced in computer vision as an improved distance measure between two distributions, and it has been widely used in multimedia databases. In particular, in the area of content-based image retrieval (CBIR), where color images are retrieved from multimedia databases, it is important to apply improved distance measures. Several CBIR models have been proposed based on the histogram approach, such as the query by image content (QBIC) method proposed by Faloutsos et al. [1], which maps each image to a vector based on encoded attributes such as color distribution, texture, and shape. However, the histogram approaches using Euclidean distance and quadratic forms apply bin-to-bin distance functions; hence, these approaches are sensitive to minor shifts in feature value distribution.

The similarity distance in the EMD is calculated as the amount of changes necessary to transform one image feature into another. Rubner and Tomasi [2] have proposed a CBIR model using the EMD, and its high quality of image-similarity searches has been demonstrated. Shishibori et al. [3] have proposed a music-retrieval model using the EMD, and a query-by-humming system presents a more robust approach than the conventional methods.

The EMD is defined as a linear programming problem that can be solved using the simplex method. However, computation using the EMD and the simplex method is too complex for its application in interactive multimedia database systems. In order to achieve efficient retrieval processing in large-scale multimedia databases based on the EMD, we propose a fast retrieval algorithm for the EMD.

In order to reduce the computation times of the original distance, the proposed method uses the lower bounding distance described in Cohen and Guibas [4] and Assent et al. [5]. The lower bounding distance of the EMD is a simple approximation in terms of computation. In the case of the 𝑘-nearest-neighbor search task, this method calculates the original EMD for the first 𝑘 images, where the largest distance is set to the threshold. This method calculates the lower bounding distance for 𝑘+1 images. This method does not calculate the original EMD if the lower bounding distance exceeds the threshold, because the true EMD cannot be lower than the first 𝑘 images. In all other cases, this method calculates the original EMD, and if the true EMD is lower than the first 𝑘 images, a new image is registered and the threshold is reset.

The remaining sections in this paper are organized as follows: Section 2 introduces the outline of the earth mover’s distance. Section 3 explains the lower bounds of the EMD and proposes the algorithm of the calculation skip using the lower bounds. Section 4 evaluates the effectiveness of the presented method by using experimental results. Section 5 describes conclusion and future works.

2. Earth Mover’s Distance

2.1. Conventional Dissimilarity Measures

The 𝐿𝑝 distance is a generalized measure of the similarity between two histograms 𝑥={𝑥𝑖} and 𝑦={𝑦𝑖}. The 𝐿𝑝 distance is defined as𝐿𝑝(𝑥,𝑦)=𝑛𝑖=0||𝑥𝑖𝑦𝑖||𝑝1/𝑝,(1) where 𝑝>0.

The Minkowski distance assumes different names depending on the value of 𝑝. If 𝑝=1, the distance is called a Manhattan distance. The Euclidean distance can be computed from (1) by selecting 𝑝=2. The Euclidean distance is one of the simplest and most popular distance measures.

The 𝐿𝑝 distance is a representative of bin-by-bin dissimilarity measures, and it has been used as the the dissimilarity measures of multimedia information retrieval systems. A major disadvantage of this method is that it accounts only for the correspondence between bins with the same index, and it does not use information across bins. This is illustrated in Figure 1, which shows two pairs of one-dimensional gray-scale histograms as described in Rubner and Tomasi [2].

Although the two histograms on the left are the same except for a shift by one bin, the 𝐿1 distance between them is larger than the 𝐿1 distance between the two histograms on the right. As shown in Figure 1(a), the 𝐿𝑝 distance does not match the perceptual dissimilarity. This can be fixed using the correspondences between the bins in the two histograms and the ground distance between them, as shown in Figure 1(b). As shown in Figure 1(b), the perceptual dissimilarity is based on the correspondence between the bins in the two histograms.

The quadratic-form distance is a representative of the cross-bin dissimilarity measures. Niblack et al. [6] suggested the use of this distance for color-based retrieval. The quadratic-form distance does not enforce a one-to-one correspondence between the mass elements in the two histograms. The same mass in a given bin of the first histogram is made to correspond to the masses contained in different bins of the other histogram. However, the image-retrieval results using the quadratic-form distance include some false positives, because the quadratic-form distance tends to overestimate the mutual similarity of color distributions, as mentioned in Stricker and Orengo [7].

2.2. Earth Mover’s Distance

The EMD is based on the following linear programming problem. Let 𝑃={(𝑝1,𝑤𝑝1),,(𝑝𝑚,𝑤𝑝𝑚)} be the first signature with 𝑚 clusters, where 𝑝𝑖 is the cluster representative and 𝑤𝑝𝑖 is the weight of the cluster. Let 𝑄={(𝑞1,𝑤𝑞1),,(𝑞𝑛,𝑤𝑞𝑛)} be the second signature with 𝑛 clusters. Let 𝐷=[𝑑𝑖𝑗] be the ground-distance matrix, where 𝑑𝑖𝑗=𝑑(𝑝𝑖,𝑞𝑗) is the ground distance between cluster 𝑝𝑖 and 𝑞𝑗.

We assume that the signature 𝑃 denotes “supply places,” the signature 𝑄 denotes “demand places,” and the ground distance 𝑑𝑖𝑗 denotes the transportation cost between each supply and demand place. Our aim is to find a flow 𝐹=[𝑓𝑖𝑗] such that 𝑓𝑖𝑗 lies between 𝑝𝑖 and 𝑞𝑗, and it minimizes the following overall cost:WORK(𝑃,𝑄,F)=𝑚𝑛𝑖=1𝑗=1𝑑𝑖𝑗𝑓𝑖𝑗.(2) Equation (2) is obtained with the following constraints:𝑓𝑖𝑗0,1𝑖𝑚,1𝑗𝑛,(3)𝑛𝑗=1𝑓𝑖𝑗𝑤𝑝𝑖,1𝑖𝑚,(4)𝑚𝑖=1𝑓𝑖𝑗𝑤𝑞𝑗,1𝑗𝑛,(5)𝑚𝑛𝑖=1𝑗=1𝑓𝑖𝑗=min𝑚𝑖=1𝑤𝑝𝑖,𝑛𝑗=1𝑤𝑞𝑗.(6)

Constraint (3) allows the movement of supplies from 𝑃 to 𝑄 and not vice versa. Constraint (4) limits the amount of supplies that can be sent by the clusters in 𝑃 to their weight. Constraint (5) limits the clusters in 𝑄 to so that the supplies do not exceed the weights Constraint (6) forces the maximum possible amount of supplies possible to be moved. This amount is called the total flow. Once the transportation problem is solved, and we have found the optimal flow 𝐹, the EMD is defined as the resulting work normalized by the total flowEMD(𝑃,𝑄)=WORK(𝑃,𝑄,F)𝑚𝑖=1𝑛𝑗=1𝑓𝑖𝑗.(7)

An example of a two-dimensional EMD is shown. The signature 𝐴 as supplies is shown in Table 1, and the signature 𝐵 as demands is shown in Table 2. We assume that the ground distance is the Euclidean distance. The transportation costs between signatures 𝐴 and 𝐵 are calculated as shown in Table 3. By solving the transportation problem under the above condition, we can obtain the optimized transportation flow and weight, as shown in Figure 2.

In Figure 2, the white circles indicate supply places, and the black circles indicate demand places. The numbers inside the circle denote the weight of the supplies or demands. The arrows show the transportation flow, and the numbers beside the arrow denote the transportation weight. The total flow cost of Figure 2 is obtained as follows: 2(0.31.4142+0.22.6926+0.14.1231)2.7502.(8) The computational complexity of the transportation problem is based on the simplex algorithm, which has an exponential worst case.

2.3. Image Retrieval Using the EMD

Color distribution data applied for CBIR models is extracted from color segmentation images. As shown in Figure 3, the distribution consists of the weight, flow cost, and image feature. The weight is the number of pixels in each region. The flow cost of each distribution can be obtained by calculating the Euclidean distance between the image features in each region. The image feature of each region consists of the center position (𝑥,𝑦) and the average color signature (RGB). Figure 4 shows a sample of retrieval results using the EMD, where this sample is retrieval results from 10,000 image dataset. On the other hand, Figure 5 shows a sample of retrieval results using the 𝐿2 from the same dataset. From these results, it can be found that CBIR systems using the EMD can reflect the human perceptual similarities.

3. EMD Lower Bound

3.1. Basic Definitions and Notations

We denote a signature 𝑥 as follows:𝑥𝑥=1,𝑤1,𝑥2,𝑤2𝑥,,𝑛,𝑤𝑛(𝑋,𝑤)𝐷𝑑,𝑛,(9) where𝑥𝑥=1𝑥𝑛𝑅𝑑×𝑛,𝑤0.(10) Here, 𝑑 is the dimension of the cluster 𝑥𝑖𝑅𝑑, and 𝑛 is the number of clusters. The total weight of the signature 𝑥 is𝑤Σ=𝑛𝑗=1𝑤𝑗.(11) In the case of the 𝑘-nearest neighbor search task, let 𝑟 be the distance between the query image and the the 𝑘th candidate data during the search process.

3.2. Centroid-Based Lower Bounds

The centroid 𝑥 of the signature 𝑥=(𝑋,𝑤)𝐷𝑑,𝑛 is defined as𝑥=𝑛𝑗=1𝑤𝑗𝑥𝑗𝑤Σ.(12)

Theorem 1. Suppose that 𝑥=(𝑋,𝑤)𝐷𝑑,𝑚 and 𝑦=(𝑌,𝑢)𝐷𝑑,𝑛 are signatures of the equal total weight 𝑤Σ=𝑢Σ. Then, EMDCB(𝑥,𝑦)=𝑥𝑦EMD(𝑥,𝑦).(13) Here, the ground distance is any 𝐿𝑝 norm used to measure 𝑑(𝑥𝑖,𝑦𝑗), as described in Cohen and Guibas [4].

3.3. Calculation Skip Using the 𝐸𝑀𝐷𝐶𝐵

In the calculation of EMDCB, the method described in Ajiok et al. [8] can be applied to skip its calculation steps, because EMDCB is the 𝐿2 distance between the centroid of each signature. This method regards the calculation of each centroid as each dimension in the sigunature vector.

An example is shown in which the signature 𝐴 (Table 1) is the query, the distance between the signature 𝐶 (Table 4) and the query is calculated, and 𝑟 is set to 1.5. Then, the following centroids are obtained by (12):𝐴=(2.5,2.5),𝐶=(4.5,3.25).(14) First, the 𝐿2 distance of the first dimensional centroid is calculated as below:𝑟2=2.25<(2.54.5)2.(15) The distance of the first dimension is larger than 𝑟2; hence, the calculation steps after the second dimension can be eliminated.

3.4. Independent Minimization Lower Bounds

EMDCB denotes the lower bounds based on the 𝐿𝑝 distance. On the other hand, Assent et al. [5] have proposed independent minimization lower bounds based on cross-bin dissimilarity measures.

Theorem 2. Suppose that 𝑥=(𝑋,𝑤)𝐷𝑑,𝑚 and 𝑦=(𝑌,𝑢)𝐷𝑑,𝑛 are signatures of the equal total weight 𝑤Σ=𝑢Σ. Then, EMDIM(𝑥,𝑦)=min𝑚𝑛𝑖=1𝑗=1𝑑𝑖𝑗𝑀𝑓𝑖𝑗EMD(𝑥,𝑦).(16) Equation (16) is obtained with the following constraints: 𝑓𝑖𝑗0,1𝑖𝑚,1𝑗𝑛,(17)𝑛𝑗=1𝑓𝑖𝑗=𝑤𝑖𝑓,1𝑖𝑚,(18)𝑖𝑗𝑢𝑗,(19)𝑚𝑛𝑖=1𝑗=1𝑓𝑖𝑗=𝑤Σ=𝑢Σ,(20) where 𝑓𝑖𝑗 is the flow from 𝑥𝑖 to 𝑦𝑗, 𝑑𝑖𝑗=𝑑(𝑥𝑖,𝑦𝑗) is the ground distance between cluster 𝑥𝑖 and 𝑦𝑗, and 𝑀=𝑚𝑖=1𝑛𝑗=1𝑓𝑖𝑗=𝑤Σ=𝑢Σ.

3.5. Calculation Skip Using 𝐸𝑀𝐷𝐼𝑀

The calculation cost of the EMDIM can be reduced by weakening the constraints of the transportation problem. In Figure 2, the total flow equals the total (supplies or demands) weight, because the constraints of (4) and (5) are satisfied. The difference between the EMDIM and the normal EMD is the constraint on the amount of flows that one cluster may receive. While the normal EMD requires the sum of flows to a certain cluster to be equal to the weight of the cluster (18), the EMDIM only ensures that for any single cluster, the incoming flows do not exceed the weight of the cluster (19); that is, the total flow to a cluster can exceed its weight. The final flows of EMDIM(𝐴,𝐵) are shown in Table 5 in which the total flow to 𝐵2 exceeds 𝑤𝐵2.

From this characteristic, the flows can be calculated for each supply cluster. To calculate EMDIM, the method described in Ajiok et al. [8] can be applied to skip the calculation step of each supply cluster. This method regards the flow calculation of each supply cluster as each dimension in the signature vector.

An example of the calculation skip of the EMDIM(𝐴,𝐵) is shown, where 𝑟 is set to 1.5. First, a minimum value of total transportation costs from 𝐴0 is obtained, and its value is compared with 𝑟, as shown below:0.32.6926=0.80778<𝑟.(21) The value of 𝐴0 is smaller than 𝑟; the total transportation costs from 𝐴1 are added to 𝐴0 as follows:0.80778+0.32.6926=1.61556>𝑟,(22) where the total value is larger than 𝑟; hence, the calculation of 𝐴2 and 𝐴3 can be eliminated.

4. Fast Retrieval Algorithm for the EMD

4.1. Algorithm Using the Lower Bounds

In the case of the 𝑘-nearest-neighbor search task, the retrieval algorithm for the first 𝑘 image data is shown in Figure 6, and the retrieval algorithm from the 𝑘+1 image data is shown in Figure 7. This method calculates the original EMD for the first 𝑘 images in the database, where the largest distance is set to the threshold 𝑟. This method calculates the lower bounding distance for succeeding data from 𝑘+1 image. This method does not calculate the original EMD if the lower bounding distance exceeds the threshold 𝑟, because the true EMD cannot be lower than the first 𝑘 images. In all other cases, this method calculates the original EMD, and if the true EMD is lower than the first 𝑘 images, a new image is registered, and the threshold 𝑟 is reset.

4.2. Algorithm Using the Priority Queue

In a nearest-neighbor search based on a linear search, 𝑘 search results are obtained when 𝑘 features close to a given query are found. When a newly found result has a distance to the search feature that is smaller than the largest of the current 𝑘 results, the results are updated. The data with the largest distance must be deleted from the candidate list, and the new result must be added. The efficient execution of this procedure is provided by using a data structure called a priority queue.

There are various implementations of the priority queue, one of them being heap. The heap structure is shown in Figure 8. The diagram shows an array as a binary tree, with a key specifying the two lower keys. Every key is larger than its two child keys. Such a data structure is called a complete binary tree; the heap is defined as a complete binary tree, which is represented as an array in which all the nodes satisfy the heap condition. Here, the heap condition is that the key in any node is larger than (or equal to) the keys in its child nodes. If this heap condition is satisfied, then the largest key is always in the root node. As shown in Figure 8, the root is numbered as 1, its children as 2 and 3, their children as 4, 5, 6, and 7, and so on. In such a representation, the parent of a node 𝑎[𝑗] is 𝑎[𝑗/2] (truncated at the decimal point if 𝑗 is odd), and its children are 𝑎[2𝑗] and 𝑎[2𝑗+1]; thus, the parents and children can be identified easily for every node.

When a new search result offers a distance smaller than the largest value among the current results, the heap root (which has the largest value in the candidate list) is replaced with this newly found value. When this procedure causes a violation of the heap condition, the root value is exchanged with the largest of its children to restore the condition. If the heap condition is violated at the next level, a similar operation is applied; the procedure is terminated if a parent’s value is smaller than that of both children, or when the heap bottom is reached. This processing ensures that the data structure can be repaired so that every node meets the heap condition.

5. Evaluations

5.1. Experimental Results

In order to evaluate this method, the number of color images in the database was increased from 5,000 to 50,000 (in increments of 5,000), and the retrieval time (CPU time) to search the top 10 images was evaluated. To evaluate the number of dimensions of the cluster representative 𝑥𝑖, we prepared 5- and 11-dimensional data as the image feature. These experiments were simulated using a PowerPC G4 running at 1.5 GHz with 1.25 GB of memory.

The experimental results are shown in Figures 912. The experimental results for the 5-dimensional data are shown in Figures 9 and 10, and the results for the 11-dimensional data are shown in Figures 11 and 12. In order to make the effect of the combination of the proposed methods comprehensible, Figures 10 and 12 are enlarged graphs of Figures 9 and 11 correspondingly. In these graphs, “none” indicates the normal EMD calculation, “cb” indicates the corresponding result obtained using the centroid-based lower bounding distance, “im” indicates the corresponding result using the independent minimization lower bounding distance, and “cb_im” indicates the corresponding result by combining the “cb” and “im” methods. In addition, “noskip_*” represents the value obtained by each method without the calculation-skipping algorithm.

From the experimental results, it can be seen that when the number of color images is 50,000 and has 5 dimensions, it takes approximately 2.5 seconds to retrieve the result for a query image by using the normal EMD calculation. The combination method of the lower bounds could obtain the same result as the normal EMD in approximately 0.15 second. On the other hand, the method using the calculation-skipping algorithm took only 0.1 second.

5.2. Related Works

Some derivation algorithms of the EMD have been proposed. Pele and Werman [9] proposed the fast algorithm of EMD, and Ling and Okada [10] proposed the fast algorithm of EMD_𝐿1. The EMD can be obtain by replacing the ground distance of the EMD with the thresholded distance. EMD_𝐿1 is a replacement of the ground distance with the 𝐿1 distance. Both of EMD and EMD_𝐿1 are different from the original EMD. And they only indicate high-speed calculation methods in the case that the thresholded distance and 𝐿1 distance were used as the ground distance. On the other hand, by using our algorithm, the original EMD can be calculated efficiently. The characteristic of our algorithm is that unnecessary calculations can be skipped by using the EMD lower bounds, and the viewpoint of speedup is different from their methods.

5.3. Future Works

The proposed method selects first 𝑘 image data in the database as the initial retrieval results (Figure 6). However, if there are some outliers in first 𝑘 image data, the effect of speedup will become less, since the threshold 𝑟 is much larger. Then, the smaller the distance between the query and first 𝑘 image data is, the more the retrieval efficiency improves. In order to solve this problem, we are planning to use the hierarchical clustering index to select the initial data. VP-tree [11] and M-tree [12], which are famous metric space indices, can be used as the hierarchical clustering index. By using these indices, appropriate initial data as first 𝑘 image data can be obtained from the leaf node which is reached for the first time on the retrieval process. To be sure, these data are not true correct, but the possibility including outliers becomes low considerably. And it seems that the distance between the query and first 𝑘 image data becomes small. The above improvement will be done as the future works.

In order to validate that the hierarchical indexing tree can help to reduce the effects of outliers in the first 𝑘 images, the following preliminary experiment was done. First, the max distance between the query and top 100 image data is calculated. This distance is called distancetop and set to the initial threshold 𝑟 on the proposed method. Next, we obtain the max distance between the query and about 100 image data in the leaf node, which is reached for the first time by using the VP-tree. This distance is called by distancevp, and set to the initial threshold 𝑟 on the improved method by using the VP-tree. These distances were calculated for 5 queries and the image database, in which was registered 10,000 image data. The preliminary experimental result is shown in Table 6, where the Euclid distance is used as the distance measure, and the number of dimensions is 48. From the preliminary experimental results, it is found that the initial threshold can be set to smaller value by using the VP-tree.

6. Conclusion

In this paper, a fast retrieval method for the earth mover’s distance in multimedia databases is proposed. This method uses the lower bounding distance of EMD and combines it with a calculation-skipping algorithm. Moreover, the validity of the proposed method is supported by empirical observations. For future works, various multimedia information retrieval systems using the fast EMD should be implemented.

Acknowledgment

This work was supported in part by a grant from the Grant-in-Aid for Scientific Research nos. 17300036 and 21500940 from the Ministry of Education, Science, and Culture, Japan.