Abstract

To address the low efficiency of image similarity detection based on local features, an algorithm called ScSIFT is proposed that accelerates image similarity detection by means of sparse coding. The algorithm speeds up image similarity matching by sparsely coding the extracted local features and indexing them. First, the SIFT features of the images are extracted and used as training samples to learn an overcomplete dictionary, which yields a set of overcomplete bases. The SIFT feature vectors of each image are then sparse-coded with the overcomplete dictionary, and the resulting sparse feature vectors are used to build an index. The image similarity detection result is obtained by comparing the sparse coefficients. The experimental results show that, compared with the traditional algorithm based on local features, the proposed algorithm significantly improves detection speed while maintaining detection accuracy.

1. Introduction

Image similarity detection is an active research topic in the field of multimedia information processing. Similar images are a set of images of the same scene or object captured under different conditions, such as different angles or lighting, or images obtained by applying different editing transformations to the same original image. Examples of similar images are shown in Figure 1.

Image similarity detection judges the similarity of visual content by matching images. According to the features adopted, image similarity detection methods fall into two categories: global-feature-based methods and local-feature-based methods. A global feature represents the whole image content with one or a few feature vectors. Common global features include color histograms, texture features, and block features. Because the number of feature vectors is small, similarity detection based on global features is usually very fast. However, because only a single kind of feature is used and the image is described coarsely, global features are very susceptible to edits and local transformations. For example, similarity detection that uses a color histogram as the global feature is very sensitive to the illumination of the image. Since similar images are usually created by editing transformations, the accuracy of similarity detection based on global features is generally relatively low.

In recent years, some scholars have proposed local features for image similarity detection. Compared with global features, local features are usually invariant, to some degree, to the illumination, rotation, and scaling of the image, and they have been widely applied in content-based image and video retrieval. Local feature points are usually local extrema within a region of the image and are more distinctive than the other pixels in that region. The descriptor of a local feature point generally combines the characteristics of the key point with information from its surrounding area, which ensures the local invariance of the feature.

The SIFT feature proposed in [1] is still the most accurate feature for image matching because of its good invariance to rotation and scaling in the local neighborhood of the image, but its 128-dimensional feature vectors and the large number of feature points detected usually bring a heavy computational burden and reduce the efficiency of the algorithm. In [2], the dimension of the SIFT feature is reduced by principal component analysis (PCA), and the PCA-SIFT feature is proposed. This feature reduces the SIFT feature vector from 128 dimensions to 32 dimensions or fewer, which improves the efficiency of the algorithm. Researchers at the City University of Hong Kong [3] propose a similar key frame detection algorithm based on PCA-SIFT that outperforms previous algorithms. Based on the timing information of the key frames, the algorithm divides the key frames into different time series groups, performs similar key frame detection with PCA-SIFT within each group, and then analyzes the correlation between different story units. Although PCA dimensionality reduction improves efficiency, the algorithm still struggles to meet the real-time requirements of large-scale data processing. In [4], a SIFT feature filtering algorithm is proposed, which uses a penalty mechanism to reduce the number of SIFT feature points extracted from an image, thereby reducing the computational complexity and improving the matching accuracy of the SIFT algorithm. In [5, 6], images are grouped into categories with a clustering algorithm. The distance between the image to be detected and each cluster center is calculated, the nearest few categories are selected, and exact SIFT feature matching is performed only on the images in those categories, which reduces the amount of SIFT feature matching and improves the detection speed.

In summary, the shortcomings of current image similarity detection are as follows. Global features are fast to compute, but they carry no local information, so they are usually sensitive to partial changes of the image and are not robust enough. Local features provide invariance to local deformation of the image, but their computational efficiency is very low because of the high dimension and large number of feature points. At present, research based on local features mainly focuses on accelerating local invariant features, and some studies focus on the combined use of global and local features [7]. How to detect similar image content quickly and effectively remains an open problem.

2. Principle of ScSIFT Algorithm

The Scale Invariant Feature Transform based on Sparse Coding (ScSIFT) algorithm focuses on making the SIFT feature vectors sparse. After the SIFT features are extracted, the algorithm sparse-codes each 128-dimensional SIFT feature vector and builds an index over the sparse codes to improve the matching speed. The quality of the sparse representation of the original signal depends largely on the selection and design of the overcomplete dictionary. When a dictionary learned from part of the feature vectors is used to sparsely represent the remaining feature vectors, the resulting error is smaller than that produced by a default fixed dictionary, and dictionaries learned by different dictionary learning algorithms yield essentially the same error. This paper uses the dictionary learning algorithm proposed in [8] to train the overcomplete dictionary.

The main idea of the ScSIFT algorithm is to use the SIFT features extracted from the key frame images of the query library as training samples to train the overcomplete dictionary, which yields a set of overcomplete bases. The SIFT feature vectors of the query library images are sparsely encoded with the overcomplete dictionary, and the sparse feature vectors are indexed. The SIFT features of the image to be detected are then sparse-coded with the same overcomplete dictionary and matched against the feature index of the query library to obtain a set of similar candidates. Finally, the sparse coefficients of the image to be detected are compared with the sparse coefficients of the candidate images to obtain the similar image detection results. The sparse coefficient vector obtained by sparse coding a SIFT feature is called the ScSIFT feature of the image.

When the SIFT features are used as training samples to learn the dictionary, they should first be normalized. The normalization is given by formula (1), which involves the SIFT feature matrix, a matrix holding the means of the columns of the feature vector group, and a matrix holding the moduli (norms) of those columns; its result is the normalized SIFT feature vector group.
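The normalization step can be illustrated with a minimal sketch. It assumes that each descriptor is centred on its own mean and then divided by the L2 norm of the centred vector; the paper's exact definition of the mean and modulus matrices may differ. The function name `normalize_features` and the list-of-lists layout are hypothetical.

```python
import math


def normalize_features(features):
    """Centre each feature vector on its mean and scale it to unit L2 norm.

    `features` is a list of equal-length lists of floats, one list per
    descriptor (assumed layout; the paper stores descriptors as matrix columns).
    """
    normalized = []
    for vec in features:
        mean = sum(vec) / len(vec)
        centred = [v - mean for v in vec]
        norm = math.sqrt(sum(c * c for c in centred))
        if norm == 0.0:
            # A constant descriptor has zero norm after centring; keep zeros.
            normalized.append([0.0] * len(vec))
        else:
            normalized.append([c / norm for c in centred])
    return normalized


# Toy 4-dimensional "descriptors" instead of real 128-dimensional SIFT vectors.
toy = [[1.0, 2.0, 3.0, 4.0], [5.0, 5.0, 5.0, 5.0]]
result = normalize_features(toy)
```

After this step every non-constant descriptor has zero mean and unit length, so distances between descriptors are not dominated by overall brightness or contrast.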

In general, the distance between the SIFT feature points of images is measured with the Euclidean distance or the absolute distance. The Euclidean distance is also known as the L2-norm distance, and its calculation is given in formula (2).
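For reference, the Euclidean (L2-norm) distance between two 128-dimensional SIFT descriptors can be written as below. The symbols x, y, and d are assumptions chosen for illustration and may differ from the paper's own notation.

```latex
d(x, y) = \lVert x - y \rVert_2 = \sqrt{\sum_{i=1}^{128} (x_i - y_i)^2}
```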

The sparse representation of a feature vector is computed with the overcomplete dictionary obtained by training, as shown in formula (3).
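To make the idea of sparse coding against an overcomplete dictionary concrete, the following sketch uses plain matching pursuit. This solver is an assumption swapped in for illustration, since the text does not name the sparse coding algorithm it uses. The function name `matching_pursuit`, the toy dictionary `atoms`, and the `{index: coefficient}` output format are all hypothetical.

```python
import math


def matching_pursuit(signal, dictionary, max_atoms):
    """Greedy sparse coding; returns {atom_index: coefficient}.

    `dictionary` is a list of unit-norm atoms, each a list of floats with
    the same length as `signal`.
    """
    residual = list(signal)
    coeffs = {}
    for _ in range(max_atoms):
        # Pick the atom most correlated with the current residual.
        best_idx, best_dot = None, 0.0
        for idx, atom in enumerate(dictionary):
            dot = sum(r * a for r, a in zip(residual, atom))
            if abs(dot) > abs(best_dot):
                best_idx, best_dot = idx, dot
        if best_idx is None or abs(best_dot) < 1e-12:
            break  # residual is (numerically) zero, nothing left to explain
        coeffs[best_idx] = coeffs.get(best_idx, 0.0) + best_dot
        atom = dictionary[best_idx]
        residual = [r - best_dot * a for r, a in zip(residual, atom)]
    return coeffs


# Overcomplete toy dictionary: three unit-norm atoms in a 2-D space.
s = 1.0 / math.sqrt(2.0)
atoms = [[1.0, 0.0], [0.0, 1.0], [s, s]]
code = matching_pursuit([3.0, 3.0], atoms, max_atoms=2)
```

In this toy case the signal lies along the third atom, so a single nonzero coefficient suffices, which is the kind of sparsity the index construction later relies on.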

The sparse feature vector is as shown in formula (4), where the two distances involved are the Euclidean distance between two SIFT feature vectors and the Euclidean distance between the corresponding sparse vectors.

From formula (4), we can see that the squared distance between two SIFT features is related to the squared distance between their sparse representations. Thus, the distance between the original SIFT features can be represented by the distance between the sparse feature vectors.

Since the dimension of the ScSIFT feature vector is much higher than that of the SIFT feature vector, computing the distance by formula (5) would be time-consuming. Because the vast majority of the elements of a sparse vector are zero, the ScSIFT feature distance calculation can be simplified as in formula (6), which sums over three sets of element indices: those where the first vector is nonzero and the second is zero, those where the first vector is zero and the second is nonzero, and those where both vectors are nonzero.
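The simplification can be sketched as follows, assuming each sparse vector is stored as a dictionary that maps the index of a nonzero element to its value. The storage format and the function names are hypothetical. Only the nonzero positions are visited, split into the three index sets described above.

```python
def sparse_sq_distance(a, b):
    """Squared Euclidean distance between two sparse vectors stored as
    {index: nonzero value} dictionaries."""
    only_a = a.keys() - b.keys()  # nonzero in a, zero in b
    only_b = b.keys() - a.keys()  # zero in a, nonzero in b
    both = a.keys() & b.keys()    # nonzero in both vectors
    return (sum(a[i] ** 2 for i in only_a)
            + sum(b[j] ** 2 for j in only_b)
            + sum((a[k] - b[k]) ** 2 for k in both))


def dense_sq_distance(x, y):
    """Reference implementation over every coordinate."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def to_dense(sparse, length):
    """Expand a sparse dictionary into a full list of the given length."""
    return [sparse.get(i, 0.0) for i in range(length)]


alpha = {1: 0.5, 4: -2.0, 7: 1.0}
beta = {4: 1.0, 6: 3.0}
fast = sparse_sq_distance(alpha, beta)
slow = dense_sq_distance(to_dense(alpha, 10), to_dense(beta, 10))
```

Both routes give the same value, but the sparse route touches four indices here instead of ten; with a dictionary of hundreds or thousands of atoms the saving grows accordingly.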

When the ScSIFT features of an image are compared with those of the candidate images, the ScSIFT features of the candidate images can be indexed to improve retrieval efficiency. The indexing process is as follows.

By sparse coding, the 128-dimensional SIFT feature vector is transformed into a ScSIFT feature vector whose dimension equals the number of atoms in the dictionary. Only a few of its elements are nonzero, so a secondary index table can be established.

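The ScSIFT feature vector is transformed into a binary vector according to formula (7), in which each element is 1 where the corresponding element of the ScSIFT feature vector is nonzero and 0 otherwise.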

The binary vector is used as the secondary index of the ScSIFT feature; it is a binary string with one bit per element of the ScSIFT feature vector. In the worst case, the full binary string is used as the index for every comparison. Since the binary string is long, using it alone as an index may still incur significant computational cost. Therefore, on the basis of the secondary index, a first-level index can be created.

Calculate the number num of nonzero elements in the secondary index. Because the vast majority of its elements are 0, num is small and varies only within a narrow range. Use num as the first-level index of the ScSIFT feature.

When a query ScSIFT feature vector is to be compared with the ScSIFT vectors in the dataset, it is first transformed into a binary vector according to formula (7), and the number of nonzero elements of the binary vector is calculated. This number is compared with the first-level index to locate the corresponding column, and the binary vector is then compared with the secondary index entries in that column to retrieve the corresponding ScSIFT feature vectors in the dataset. Finally, the query ScSIFT feature vector is matched against the retrieved vectors to find the nearest neighbor.
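The two-level lookup described above can be sketched as a nested dictionary. The sketch assumes that candidates are retrieved only when their nonzero pattern matches the query's pattern exactly, which is one possible reading of the comparison step. The names `support`, `build_index`, and `lookup` and the toy data are hypothetical.

```python
from collections import defaultdict


def support(sparse_vec):
    """Second-level key: the positions of the nonzero coefficients, which
    carries the same information as the binary vector of the text."""
    return frozenset(sparse_vec)


def build_index(features):
    """Build index[num][support] -> list of (image_id, sparse_vec).

    `features` is an iterable of (image_id, sparse_vec) pairs, where
    sparse_vec is an {index: value} dictionary of nonzero coefficients.
    """
    index = defaultdict(lambda: defaultdict(list))
    for image_id, vec in features:
        key = support(vec)
        # First level: number of nonzero elements; second level: pattern.
        index[len(key)][key].append((image_id, vec))
    return index


def lookup(index, query_vec):
    """Return the stored features that share the query's nonzero pattern."""
    key = support(query_vec)
    bucket = index.get(len(key))
    if bucket is None:
        return []
    return bucket.get(key, [])


library = [
    ("img1", {2: 0.4, 9: 1.1}),
    ("img2", {2: 0.5, 9: 1.0}),
    ("img3", {3: 0.7}),
    ("img4", {1: 0.2, 5: 0.3, 8: 0.9}),
]
index = build_index(library)
candidates = lookup(index, {2: 0.45, 9: 1.05})
```

The first-level key discards every feature with a different number of nonzero coefficients in one step, so the more expensive pattern comparison and distance calculation run only on a small bucket.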

3. Process of ScSIFT Algorithm

The implementation of the image similarity acceleration detection algorithm based on sparse coding is divided into three subprocesses: sparse dictionary learning, offline sparse coding of the query library images, and real-time matching of images.

3.1. Sparse Dictionary Learning

The basic process of the sparse dictionary learning algorithm is as follows:
(1) Select images from the query library, extract the SIFT features of these images, and form the training feature set.
(2) Normalize the training feature set according to formula (1).
(3) Train the overcomplete dictionary using the normalized training feature set.
(4) Output the dictionary and end.

3.2. Offline Sparse Coding

Offline sparse coding is carried out for the images in the query library; the basic process is as follows:
(1) Read the first image of the query library.
(2) Extract the SIFT features of the image to form the image feature set.
(3) Normalize the image feature set according to formula (1) to obtain the normalized feature set.
(4) Apply the sparse coding algorithm to the normalized feature set and keep the sparse coefficients, that is, the ScSIFT features.
(5) If there is an uncoded image in the query library, read the next image and repeat steps (2), (3), and (4). Otherwise, go to step (6).
(6) Create the index and save the ScSIFT features of all images in the query library.
(7) End.

3.3. Real-Time Matching of Images

The basic process of the real-time image matching algorithm is as follows:
(1) Read the image, extract its SIFT features to form the feature set, and normalize the feature set.
(2) Compute the sparse representation of the normalized feature set with the sparse coding algorithm to obtain the ScSIFT features.
(3) Set the ScSIFT feature distance threshold and the image similarity threshold, and read the first column of the ScSIFT features.
(4) Search the index for the columns of coefficients nearest to this ScSIFT feature.
(5) Calculate the distance between the query column and each retrieved column according to formula (6). If the distance is below the feature distance threshold, increase the feature match count of the image to which the retrieved column belongs by 1.
(6) Repeat steps (4) and (5) until all columns of the ScSIFT features have been processed once.
(7) For each image whose feature match count satisfies the matching condition, calculate the total number of sparse feature points and the similarity degree. If the similarity degree exceeds the image similarity threshold, the image is judged to be similar to the query image.
(8) End.
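The matching steps above can be sketched as follows. Several details are assumptions made only for illustration: the nearest neighbour is found by brute force rather than through the index, the threshold is applied to the squared distance, and the similarity degree is taken to be the number of matched features divided by the number of features stored for the candidate image. The function names and the toy data are hypothetical.

```python
def sq_distance(a, b):
    """Squared distance between sparse vectors stored as {index: value}."""
    keys = a.keys() | b.keys()
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)


def find_similar_images(query_features, library, dist_threshold, sim_threshold):
    """Return {image_id: similarity} for images judged similar to the query.

    `library` maps image_id -> list of sparse feature vectors.
    """
    match_counts = {image_id: 0 for image_id in library}
    for q in query_features:
        # Brute-force nearest neighbour over all stored features
        # (stand-in for the indexed search of step (4)).
        best_id, best_dist = None, float("inf")
        for image_id, feats in library.items():
            for f in feats:
                d = sq_distance(q, f)
                if d < best_dist:
                    best_id, best_dist = image_id, d
        if best_id is not None and best_dist < dist_threshold:
            match_counts[best_id] += 1
    results = {}
    for image_id, count in match_counts.items():
        total = len(library[image_id])
        if total == 0:
            continue
        similarity = count / total  # assumed definition of similarity degree
        if similarity >= sim_threshold:
            results[image_id] = similarity
    return results


library = {
    "a": [{0: 1.0}, {1: 1.0}],
    "b": [{2: 1.0}, {3: 1.0}],
}
query = [{0: 1.0}, {1: 0.9}, {5: 1.0}]
similar = find_similar_images(query, library, dist_threshold=0.05, sim_threshold=0.5)
```

Two of the three query features fall within the distance threshold of features of image "a", so only that image passes the similarity threshold; the third query feature has no close neighbour and contributes no match.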

4. Experimental Results and Analysis

In the experiment, a total of 10816 frames extracted from 1000 videos are used as the query library images. Five different images are selected from the library, and each image is subjected to Gaussian blur, mark (tag) adding, gray scale transformation, and scale cropping. The original images and the transformed images, altogether 50 images, are compared with the query library by similarity detection, and the experimental results are compared.

4.1. Sparse Coding Dictionary Learning Experiment

The error of sparse coding depends mainly on the choice of overcomplete dictionary. In this paper, an overcomplete dictionary obtained by feature learning is compared with an overcomplete DCT dictionary, and sparse coding experiments are carried out to test the influence of the dictionary on sparse coding. Two factors are examined: the time required for dictionary training and feature coding, and the sparse coding error of the dictionary.

The sparse coding error is calculated from the original data signal, the overcomplete dictionary, the sparse code of the signal, and the sparse coding regularization parameter.
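A common way to measure the error of an L1-regularized sparse code, given here as an assumption rather than as the paper's exact formula, combines the reconstruction residual with the regularization term. The symbols x (original signal), D (overcomplete dictionary), \alpha (sparse code), and \lambda (regularization parameter) are illustrative names for the four quantities listed above.

```latex
E(x, D, \alpha) = \frac{1}{2} \lVert x - D\alpha \rVert_2^2 + \lambda \lVert \alpha \rVert_1
```

Under this form, a larger \lambda favours sparser codes at the cost of a larger reconstruction residual.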

In this paper, the number of blocks, the number of cycles, and the dictionary dimension are used as control variables when training dictionaries, in order to compare the coding errors and the dictionary learning times obtained under different settings. In addition, the DCT dictionary is used as a benchmark, and the coding error and feature coding time of the feature learning dictionary are compared with those of the DCT dictionary at different dimensions. The experimental results are shown in Figure 2.

Figures 2(a) and 2(b) show the coding errors and the dictionary learning times for different numbers of blocks and different numbers of training cycles, respectively. Here, the number of blocks is varied in Figure 2(a) and the number of training cycles is varied in Figure 2(b). It can be seen from Figures 2(a) and 2(b) that changing the number of blocks or the number of cycles has no significant effect on the dictionary coding error, whose variation is less than 0.01, but has a much greater impact on the dictionary learning time. Therefore, the number of blocks and the number of cycles can be set to smaller values to save training time and improve the efficiency of dictionary learning.

Figures 2(c) and 2(e) show the dictionary coding errors and the dictionary learning time as the dictionary dimension changes. Figures 2(d) and 2(f) show the time required to sparse-code the features with a dictionary of the corresponding dimension. Comparing Figures 2(c) and 2(e), the coding errors of both the feature learning dictionary and the DCT dictionary decrease significantly as the dictionary dimension increases, but the DCT dictionary coding error levels off at about 0.43, and its dictionary construction time stays essentially the same. The error of the feature learning dictionary decreases from 0.28 to 0.15 as the dimension increases, but its dictionary learning time also increases significantly. Comparing Figures 2(d) and 2(f), the coding time of both dictionaries increases significantly as the dictionary dimension increases: the coding time of the feature learning dictionary eventually levels off at about 15 seconds, whereas the coding time of the DCT dictionary increases from 0 to 40 seconds.

By comparison, the feature learning dictionary has obvious advantages over the DCT dictionary: its coding error is lower than that of the DCT dictionary, staying below 28%, and its coding time is shorter. When the dictionary is trained on the features, changing the block number or the cycle number has little effect on the coding error of the dictionary but significantly increases the dictionary learning time. Therefore, these two parameters can be set low to improve the dictionary learning efficiency. In this paper, the block number is set to batch = 400 and the cycle number to iter = 100, with the dictionary dimension held fixed.

4.2. ScSIFT Algorithm Validity Test

The validity of the ScSIFT algorithm is verified by comparison with the SIFT algorithm and the SURF algorithm. In the experiment, five key frame images and 10 sets of key frame images are selected as query images to test the running time of the different algorithms. The sets contain different numbers of key frames.

The feature extraction time, the distance calculation time between feature points, the average matching time per frame, and the total running time of the three algorithms are shown in Table 1. The total running time is shown in Figure 3.

In Table 1, single-frame detection time is the average similarity detection time for a frame.

From Table 1 and Figure 3, we can see that SIFT feature extraction is the fastest, with an average of 0.7 seconds, and SURF feature extraction is the slowest, with an average of 2.76 seconds. For similarity detection on the same number of key frames, the detection time of the SIFT algorithm is 4 times that of the ScSIFT algorithm. The detection times of the ScSIFT algorithm and the SURF algorithm are very close when the number of key frames is small, and, as the number of key frames increases, ScSIFT becomes slower than SURF. This is mainly because the number of feature points extracted by the SIFT algorithm is larger than the number extracted by the SURF algorithm, so the increased computational burden becomes gradually evident when the number of key frames is large. However, because feature point detection in the SURF algorithm is slow, its total running time is longer than that of the ScSIFT algorithm and the SIFT algorithm when the number of key frames is small. Experiments show that the average running speed of the ScSIFT algorithm is 52% higher than that of the SIFT algorithm and 45% higher than that of the SURF algorithm.

Six frames transformed by gray scale transformation, scale cropping, tag adding, and Gaussian blur are selected. Similarity determination against another fifteen selected frames is performed with the ScSIFT algorithm and the SIFT algorithm, respectively, to test the robustness of the ScSIFT algorithm under different kinds of edits. The key frame similarity detection results of the ScSIFT algorithm and the SIFT algorithm for the different transformations are shown in Tables 2 and 3.

The comparison results of Tables 2 and 3 are shown in Figure 4.

In Figure 4, the two marker types denote the detection results of the ScSIFT algorithm and of the SIFT algorithm, respectively. The abscissa is the image frame number, and the ordinate is the similarity. The results in Tables 2 and 3 and Figure 4 together show that the similarity search results of the ScSIFT algorithm are basically the same as those of the SIFT algorithm and are slightly higher than those of SIFT for some videos. In summary, the experimental results show that the ScSIFT algorithm achieves detection results similar to those of the SIFT algorithm while running faster, with a speed 52% higher than that of the latter.

5. Conclusions

Image similarity detection is an active research topic in the field of multimedia information processing. Current image similarity detection methods are mainly based on global features or local features. Matching based on global features is fast but has poor robustness, whereas matching based on local features is robust but slow. To address the high computational complexity and low efficiency of detection methods based on local features, an algorithm for image similarity detection based on sparse coding is proposed. The algorithm improves the detection speed by sparsely coding the local features and indexing them. The experimental results show that, compared with the traditional image similarity detection algorithm based on local features, the algorithm improves the detection speed while maintaining the detection accuracy. In future work, we will further combine global and local features to achieve the best balance between detection accuracy and speed.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work has been supported by the National Natural Science Foundation of China under Contract no. 61571453, Science and Technology of Hunan Education Department (no. 15A020), and Science and Technology of Changsha (no. ZD1601014).