Abstract

A new document image retrieval algorithm is proposed to address the inefficient retrieval of information resources in digital libraries. First, in order to accurately characterize texture and enhance the ability to discriminate between images, this paper proposes a statistical feature method based on the double-tree complex wavelet. Second, according to this statistical feature method and the visual characteristics of the human eye, the edge information in the document image is extracted. On this basis, we construct meaningful texture features and use them to define the feature descriptors of document images. Taking the descriptor as the clue, the content characteristics of the document image are combined, and an appropriate similarity measurement criterion is used for efficient retrieval. Experimental results show that the algorithm not only has high retrieval efficiency but also reduces the complexity of the traditional document image retrieval algorithm.

1. Introduction

With the rapid development of science and technology, society has entered the information age, and multimedia technology has been widely applied [1]. At the same time, digital libraries keep growing in capacity, and their applications often involve real-time queries of text information. For a massive image library, answering such queries in real time is difficult. Traditional retrieval systems are often based on text information retrieval, which can no longer meet growing needs. Because image retrieval based on text information applies only to a narrow range of images, the development of content-based image retrieval technology becomes more important and urgent [2, 3].

An image differs from a text description: it often contains extremely rich texture and other information. At present, widely used image features mainly include colour, shape, texture, and the spatial relationships within the image. Retrieval based on image features overcomes the defects of text-based retrieval methods, greatly improves retrieval rate and efficiency, and has gradually become a hot spot in the field of image retrieval. In recent years, extensive research has been conducted on image retrieval based on texture features, and the methods can be grouped into statistical, model-based, signal processing, and structural methods [4]. Literature [5] puts forward the concept of the cooccurrence matrix in image space. Because the amount of information extracted from the cooccurrence matrix is small, it cannot fully express the texture information of the image. The wavelet transform has good localization ability in both the spatial and frequency domains and is an effective tool for statistical texture feature analysis [6]. However, the traditional real wavelet has poor directional resolution: it covers only the horizontal, vertical, and diagonal directions and lacks good translation invariance and directional selectivity. Choosing a reasonable function model to describe the distribution of wavelet coefficients, most of which lie near zero, is one effective approach. To this end, the generalized Gaussian density function was proposed in literature [7]. Literature [8] proposed the construction of a generalized gamma model, and literature [9] applied the Bayesian model to detail subband amplitude coefficients. These models do a good job of describing the distribution of coefficients around the zero mean. However, careful observation and analysis of wavelet histograms show that the coefficient density function in the wavelet domain does not completely conform to a symmetric distribution. This asymmetry is particularly prominent in the wavelet coefficient histograms of some texture images.

To solve these problems, this paper introduces the double generalized Gaussian mixture model. An effective statistical feature of texture analysis is obtained by fusing coefficients of wavelet subbands. Secondly, according to the characteristics of the document image in the digital library, this paper proposes an effective information retrieval method based on the texture feature of the document image. According to the statistical features of texture analysis, edge information in document images is extracted to construct texture features, and then, texture feature descriptors are extracted for similarity retrieval. Experiments show that this algorithm has high retrieval efficiency.

The breakdown of the paper is as follows: Section 2 provides related works. Section 3 discusses our approach. Results are presented in Section 4. Section 5 concludes the paper.

2. Related Work

Texture represents well the distribution of gray levels across pixel neighbourhoods. Image retrieval methods based on texture features can be roughly divided into spatial domain methods, frequency domain methods, methods that fuse the spatial and frequency domains, and fractal model methods. Wei et al. [10] proposed a new image retrieval algorithm. The algorithm classifies texture attributes into directionality, contrast, roughness, and regularity from the perspective of human visual perception. As a result, the image retrieval algorithm based on these texture features can achieve better results. Liu and Yang proposed the texture unit method [11]. Garg and Dhiman [12] made further improvements on the basis of the texture unit method and proposed the local binary pattern (LBP) algorithm. Within a given window, the algorithm compares the value of the centre pixel with the value of each pixel in the surrounding neighbourhood to determine the neighbourhood pixel values. If the compared pixel value is less than the centre pixel value, the neighbourhood pixel is set to 0; otherwise, it is set to 1. Then, the corresponding items of the processed pixel value template and the weight template are multiplied and summed. The image retrieval algorithm based on LBP can achieve better retrieval results, but its computational complexity is generally large and needs to be improved. Kugunavar and Prabhakar [13] proposed the gray level cooccurrence matrix method. The spatial texture of the image is described by texture features such as moment of inertia, inertial state, inertial correlation coefficient, contrast fraction, and angular second moment. The wedge ring texture features extracted by Fourier transform can achieve better retrieval results. However, these texture features belong to the frequency domain and do not contain spatial domain information, which is a limitation. Kayhan and Fekri-Ershad [14] made an in-depth study of fractal texture features of images.
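
The LBP comparison rule described above can be illustrated with a minimal sketch. It assumes a 3x3 window and a clockwise weight template of 1, 2, 4, ..., 128 starting at the top-left neighbour; the function names `lbp_code` and `lbp_image` and the neighbour ordering are illustrative choices, not taken from [12].

```python
def lbp_code(window):
    """Return the LBP code of a 3x3 window given as a list of 3 rows."""
    center = window[1][1]
    # Neighbours visited clockwise, starting at the top-left corner (assumed order).
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        # A neighbour below the centre value becomes 0, otherwise 1.
        flag = 0 if window[r][c] < center else 1
        # Multiply by the weight template entry (1, 2, 4, ..., 128) and sum.
        code += flag << bit
    return code


def lbp_image(image):
    """Compute LBP codes for every interior pixel of a 2-D grayscale list."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(1, rows - 1):
        out_row = []
        for c in range(1, cols - 1):
            window = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            out_row.append(lbp_code(window))
        out.append(out_row)
    return out
```

Every interior pixel needs eight comparisons, which gives a feel for why the cost grows quickly on large image libraries.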

Gabor filtering belongs to the spatial domain method and wavelet transform belongs to the frequency domain method. Zhuo and Zhou [15] proposed a texture feature extraction algorithm based on Gabor filtering. Li et al. [16] overcame the deficiency of the feature extraction algorithm based on frequency domain and proposed a texture feature extraction algorithm based on wavelet. Prasetyo et al. [17] improved the speed of wavelet texture feature extraction and proposed a new texture feature extraction algorithm based on the wavelet. Bu et al. [18] used a texture feature extraction algorithm based on the tree-structured wavelet transform to decompose the high-frequency part, which overcame the shortcomings of the pyramid wavelet transform and could describe the texture information of the image in detail. Madhavi and Patnaik [19] proposed an image retrieval algorithm based on the wavelet subband histogram statistical model. Mistry [20] proposed a texture retrieval algorithm based on the generalized Gaussian model and K-L distance. Sana and Islam [21] proposed a texture feature extraction algorithm based on the double-tree complex wavelet transform. Ashraf et al. [22] proposed a new image retrieval algorithm based on texture features of the rotated complex wavelet filter. Suresh and Naik [23] proposed a new texture image retrieval algorithm based on the traditional wavelet transform. Mistry et al. [24] proposed a texture image retrieval algorithm based on the complex wavelet coefficient model. Although many image retrieval algorithms based on texture features have been proposed, they still have many shortcomings to be improved.

3. Method

3.1. Principle of Content-Based Image Retrieval Technology

Content-based image retrieval technology differs from traditional information retrieval. It integrates many advanced technologies such as information science, image processing, pattern recognition, and databases, and it adopts a new data model and representation. The system uses the content information of the image itself to automatically extract image features. An appropriate similarity measurement method is then used to retrieve images by similarity distance. To a large extent, this solves the problem of irrelevant answers (“what is answered is not what was asked”) in keyword-based image retrieval.

The content-based document image retrieval technology system is shown in Figure 1. The whole system is mainly composed of two parts. One part is the image index part, which mainly extracts the features of the sample document image and the document image library and establishes the corresponding index structure. The other part is the whole process of the document image query. After the user inputs the sample document image into the system, the retrieval system will automatically extract the feature according to its own feature extraction algorithm and perform similarity matching and then output the retrieved document image according to the calculated similarity.

3.2. Double Generalized Gaussian Distribution Model of Double-Tree Complex Wavelet Coefficients

The discrete wavelet transform (DWT) [25–27] produces a large overlap, resulting in distortion. This seriously weakens the ability of the wavelet coefficients to describe the texture features of the original image. In order to overcome these disadvantages, this paper uses the zero-mean double generalized Gaussian distribution to describe the distribution of the double-tree complex wavelet subband coefficients: the positive and negative parts of the wavelet domain coefficients are fitted separately, where the generalized Gaussian density function is

$$p(x; \alpha, \beta) = \frac{\beta}{2\alpha\,\Gamma(1/\beta)} \exp\left(-\left(\frac{|x|}{\alpha}\right)^{\beta}\right), \tag{1}$$

where $\Gamma(\cdot)$ is the gamma function, $\alpha$ fits the width of the peak of the probability density function, and $\beta$ models the speed at which the function curve changes. $x$ is the value being fitted (here, a subband coefficient). For this reason, $\alpha$ is often called the scale parameter and $\beta$ the shape parameter. When $\beta = 2$, the generalized Gaussian distribution is a Gaussian distribution. When $\beta = 1$, the generalized Gaussian distribution is the Laplace distribution.
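
The two special cases can be checked by direct substitution. The short derivation below assumes the generalized Gaussian density is written in its common form with scale parameter $\alpha$ and shape parameter $\beta$; these symbol names, together with $\sigma$ for the Gaussian standard deviation, are notation chosen here for illustration.

```latex
% Assumed common form of the generalized Gaussian density (GGD)
p(x;\alpha,\beta) = \frac{\beta}{2\alpha\,\Gamma(1/\beta)}
                    \exp\!\left(-\left(\frac{|x|}{\alpha}\right)^{\beta}\right)

% Case \beta = 2, using \Gamma(1/2) = \sqrt{\pi} and setting \alpha = \sqrt{2}\,\sigma:
p(x;\sqrt{2}\,\sigma,2) = \frac{2}{2\sqrt{2}\,\sigma\sqrt{\pi}}
                          \exp\!\left(-\frac{x^{2}}{2\sigma^{2}}\right)
                        = \frac{1}{\sigma\sqrt{2\pi}}
                          \exp\!\left(-\frac{x^{2}}{2\sigma^{2}}\right)
% which is the zero-mean Gaussian density with variance \sigma^{2}.

% Case \beta = 1, using \Gamma(1) = 1:
p(x;\alpha,1) = \frac{1}{2\alpha}\exp\!\left(-\frac{|x|}{\alpha}\right)
% which is the zero-mean Laplace density with scale \alpha.
```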

Figure 2 plots an image and the histogram of its subband coefficients. As with the general real wavelet transform, most coefficients of the double-tree complex wavelet transform are distributed near zero. However, the histograms of some texture images are not completely symmetric. Therefore, this paper introduces the double generalized Gaussian mixture model, which fits a generalized Gaussian distribution function separately in the negative and positive domains.

The specific modelling of the double generalized Gaussian mixture model is as follows:
(1) According to the DT-CWT principle, a multilayer wavelet transform is performed on each image to extract the coefficients of each double-tree complex wavelet subband.
(2) The matrix decomposition method is used to separate the positive and negative coefficients of the subband coefficient matrix, that is, $W = W^{+} + W^{-}$, where $W$ is the original coefficient matrix, $W^{+}$ is the positive part, and $W^{-}$ is the negative part.
(3) The two decomposed coefficient matrices are reconstructed into symmetric distributions: the combination of $W^{+}$ and $-W^{+}$ and the combination of $W^{-}$ and $-W^{-}$ form two standard symmetric distributions. The generalized Gaussian density function is used to fit the coefficient density distribution of each.
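
Steps (2) and (3) can be sketched as follows. This is a minimal illustration that assumes the subband coefficients are already available as a 2-D list of real numbers (the wavelet transform itself is not shown) and that coefficients exactly equal to zero are dropped; the function name `split_and_mirror` is hypothetical.

```python
def split_and_mirror(coeff_matrix):
    """Split subband coefficients by sign and mirror each half about zero.

    Returns two lists, each symmetric about zero, so that a zero-mean
    generalized Gaussian can be fitted to each one separately.
    """
    flat = [c for row in coeff_matrix for c in row]
    positive = [c for c in flat if c > 0]   # positive part of the matrix
    negative = [c for c in flat if c < 0]   # negative part of the matrix
    # Mirroring: combine each part with its sign-flipped copy.
    pos_symmetric = positive + [-c for c in positive]
    neg_symmetric = negative + [-c for c in negative]
    return pos_symmetric, neg_symmetric
```

Each returned list sums to zero by construction, which is what allows a symmetric zero-mean density to be fitted to each side of an asymmetric histogram.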

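The complexity of the algorithms for estimating the scale and shape parameters of the generalized Gaussian density function limits its practical application. In this paper, the efficiency of the algorithm is greatly improved by estimating the GGD shape and scale parameters through curve fitting of an inverse function. Because the GGD is symmetric, its first-order origin moment is 0, so the absolute moment method is used to estimate the parameters. For scale parameter $\alpha$ and shape parameter $\beta$, the first-order absolute moment $m_1$ and the second-order moment $m_2$ are

$$m_1 = E[|x|] = \alpha\,\frac{\Gamma(2/\beta)}{\Gamma(1/\beta)}, \qquad m_2 = E[x^2] = \alpha^2\,\frac{\Gamma(3/\beta)}{\Gamma(1/\beta)}. \tag{2}$$

Further, we can get an estimate of the shape parameter $\beta$:

$$\hat{\beta} = r^{-1}\!\left(\frac{m_1^2}{m_2}\right), \qquad r(\beta) = \frac{\Gamma^2(2/\beta)}{\Gamma(1/\beta)\,\Gamma(3/\beta)}. \tag{3}$$

Scale parameter $\alpha$ can be obtained according to Equation (2):

$$\hat{\alpha} = m_1\,\frac{\Gamma(1/\hat{\beta})}{\Gamma(2/\hat{\beta})}. \tag{4}$$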

Using curve fitting of the inverse function to quickly estimate the shape and scale parameters in the double generalized Gaussian distribution model, the experiments show that retrieval efficiency is greatly improved and the time complexity is reduced without affecting the accuracy of parameter estimation.
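
The absolute moment method can be sketched in a few lines. The sketch below assumes the common GGD parameterization with scale `alpha` and shape `beta`, and it inverts the moment ratio by bisection rather than by the fitted inverse-function curve used in the paper, so it shows the idea rather than the paper's fast estimator. The names `moment_ratio`, `shape_from_ratio`, and `estimate_ggd` are hypothetical.

```python
import math


def moment_ratio(beta):
    """r(beta) = Gamma(2/beta)^2 / (Gamma(1/beta) * Gamma(3/beta))."""
    return math.exp(2 * math.lgamma(2 / beta)
                    - math.lgamma(1 / beta)
                    - math.lgamma(3 / beta))


def shape_from_ratio(ratio, lo=0.05, hi=20.0, iters=60):
    """Invert r(beta) by bisection; r is increasing in beta on (0, inf)."""
    # Clamp the ratio to the values reachable inside the search interval.
    ratio = min(max(ratio, moment_ratio(lo)), moment_ratio(hi))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if moment_ratio(mid) < ratio:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def estimate_ggd(samples):
    """Estimate (alpha, beta) of a zero-mean GGD from a list of samples."""
    n = len(samples)
    m1 = sum(abs(x) for x in samples) / n    # first-order absolute moment
    m2 = sum(x * x for x in samples) / n     # second-order moment
    beta = shape_from_ratio(m1 * m1 / m2)
    # alpha = m1 * Gamma(1/beta) / Gamma(2/beta)
    alpha = m1 * math.exp(math.lgamma(1 / beta) - math.lgamma(2 / beta))
    return alpha, beta
```

For the double model, `estimate_ggd` would be called once on the mirrored positive coefficients and once on the mirrored negative coefficients, giving one scale and shape pair per side.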

3.3. Texture-Based Document Image Retrieval Algorithm

Although document image retrieval differs in some respects from general image retrieval, the basic framework of the retrieval system is the same. For document images, however, texture features are more effective than other salient features. Therefore, the texture features of the document image extracted by the algorithm are used for similarity retrieval.

Document images are generally obtained by scanning paper documents and usually contain considerable noise, which seriously affects retrieval accuracy. Therefore, the document image must first be denoised: histogram equalization and median filtering are used to enhance the edges, contours, and contrast of the image, remove noise, and reduce the impact of background information. In order to further improve retrieval efficiency, the document images also need to be binarized, and a new binarization method for document images is proposed by drawing on the idea of block coding.

Before binarization, the document image is divided into a series of subblocks of equal size. Then, we calculate the grayscale mean μ of all pixels in each subblock and compare the grayscale value of each pixel in the subblock with this mean: if the grayscale value is greater than μ, it is changed to 1; otherwise, it is changed to 0. In this way, the original image is binarized and a binary image is obtained. The binarization process is shown in Figure 3.
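
The block-mean binarization can be written as a short sketch. The block size of 4 used as the default below is an assumption for illustration, as is the function name `binarize_by_blocks`; the image is represented as a 2-D list of gray levels.

```python
def binarize_by_blocks(image, block=4):
    """Binarize a grayscale image block by block using each block's mean.

    A pixel becomes 1 if its gray level is greater than the mean of its
    block, and 0 otherwise. Blocks at the right and bottom borders may be
    smaller than `block` when the image size is not a multiple of it.
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            cells = [(r, c)
                     for r in range(r0, min(r0 + block, rows))
                     for c in range(c0, min(c0 + block, cols))]
            mean = sum(image[r][c] for r, c in cells) / len(cells)
            for r, c in cells:
                out[r][c] = 1 if image[r][c] > mean else 0
    return out
```

Because the threshold is local to each block, a uniformly gray block maps entirely to 0, while a block crossed by an edge keeps the edge as a 0/1 pattern.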

According to the visual characteristics of the human eye, when a person quickly glances at a document image, the edge areas along the main direction of the image attract the most attention. Therefore, the texture characteristics of the image can be defined by the edge blocks produced during binarization. The binarization of the document image can also be understood as producing binary blocks by comparing the grayscale value of each pixel with the mean within its block. The distribution of “0” and “1” in a binary block reflects, to some extent, the texture and shape distribution information of the image block itself.

Therefore, the texture characteristics of the document image can be defined by these binary blocks. The double-tree complex wavelet transform used in this paper can distinguish between different textures.

Combined with the binarization process of the document image, a new texture feature is extracted for retrieval. The binary blocks produced during binarization are defined as texture metadata, and the decimal value of the binary sequence obtained by reading a block from left to right and top to bottom is used as the serial number (or value) of the block. The distribution of these texture metadata in the document image is then counted to obtain the cooccurrence matrix of the texture metadata.
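
A minimal sketch of the block numbering and the counting step follows. It assumes 2x2 blocks (so 16 possible values), row-major bit order with the first bit as the most significant, and four-neighbour adjacency between blocks; `block_codes` and `cooccurrence` are hypothetical names.

```python
def block_codes(binary, block=2):
    """Map each full block of a binary image to the decimal value of its bits.

    Bits are read left to right, top to bottom. Partial blocks at the
    borders are ignored in this sketch.
    """
    rows, cols = len(binary), len(binary[0])
    codes = []
    for r0 in range(0, rows - block + 1, block):
        row_codes = []
        for c0 in range(0, cols - block + 1, block):
            value = 0
            for r in range(r0, r0 + block):
                for c in range(c0, c0 + block):
                    value = (value << 1) | binary[r][c]
            row_codes.append(value)
        codes.append(row_codes)
    return codes


def cooccurrence(codes, levels):
    """Count ordered pairs (code of a block, code of a four-neighbour block)."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(codes), len(codes[0])
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    matrix[codes[r][c]][codes[nr][nc]] += 1
    return matrix
```

Since every adjacent pair is visited from both sides, the resulting count matrix is symmetric.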

Suppose $f$ is a document image and the texture metadata values in the image are $T$. Variable $T(x, y)$ is the texture metadata value located at $(x, y)$. The elements $P(i, j)$ of the cooccurrence matrix are based on the texture metadata. Variable $N$ is the number of pixels of the image.

Variable $\eta(i, j)$ is the number of texture metadata pairs $(T(x, y), T(x', y'))$ with $T(x, y) = i$, where $(x', y')$ is a four-neighbour block of the texture block $(x, y)$ and the value of $T(x', y')$ is $j$. $P(i, j)$ is defined as follows:

$$P(i, j) = \frac{\eta(i, j)}{N}.$$

The texture of the document image is described by the texture metadata cooccurrence matrix, which reflects comprehensive information such as the direction and magnitude of change of texture and shape in the document image and its local neighbourhoods. By itself, however, it does not directly provide meaningful texture features for document image retrieval, so effective texture features need to be extracted further. In the algorithm, the following 4 typical statistics are extracted for retrieval. The formula is as follows:

After extracting the 4 statistics, we obtain the texture feature descriptor of the document image. The texture metadata cooccurrence matrix is used to extract the feature vector, and combining the directional information of the neighbourhood enhances the noise resistance of the algorithm and improves its retrieval performance.
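
As an illustration of how a four-component descriptor can be derived from a cooccurrence matrix, the sketch below computes energy, entropy, contrast, and inverse difference moment. These four are statistics commonly used with cooccurrence matrices and are an assumption here; the paper's own four statistics may differ. The function name `cooccurrence_stats` is hypothetical.

```python
import math


def cooccurrence_stats(matrix):
    """Return [energy, entropy, contrast, homogeneity] of a count matrix.

    The matrix is first normalized so that its entries sum to 1.
    """
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("cooccurrence matrix is empty")
    energy = entropy = contrast = homogeneity = 0.0
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if count == 0:
                continue  # zero entries contribute nothing to any statistic
            p = count / total
            energy += p * p                       # angular second moment
            entropy -= p * math.log2(p)           # in bits
            contrast += (i - j) ** 2 * p
            homogeneity += p / (1 + (i - j) ** 2)  # inverse difference moment
    return [energy, entropy, contrast, homogeneity]
```

The returned list plays the role of the texture feature descriptor: one short vector per document image, which is what the similarity measure operates on.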

3.4. Similarity Measures

Compared with other measures, Euclidean distance is a similarity measure that is easy to understand and apply. Because it depends on the scale of each feature component, the components are normalized before matching so that the distance does not depend on their original units of measurement. Therefore, Euclidean distance is used in this paper.

Let $Q$ be the sample document image to retrieve and $D$ be a document image in the library. The features extracted from them by the above feature extraction algorithm are

$$F_Q = (q_1, q_2, q_3, q_4), \qquad F_D = (d_1, d_2, d_3, d_4).$$

The similarity between the features is measured using the Euclidean distance, i.e.,

$$S(Q, D) = \sqrt{\sum_{k=1}^{4} (q_k - d_k)^2}.$$

Because the 4 extracted components have different physical meanings and different value ranges, they would carry different implicit weights in similarity matching and directly affect the matching results. The algorithm therefore applies Gaussian normalization to the components before matching.
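
The matching step can be sketched as follows. The sketch assumes Gaussian normalization in the form (f - mean) / (3 * std), computed per component over the library, which is a form often used in content-based retrieval; the paper's exact normalization formula is not given, so this is an assumption. The names `gaussian_normalize`, `euclidean`, and `rank_library` are hypothetical.

```python
import math


def gaussian_normalize(features):
    """Normalize each component over the library with (f - mean) / (3 * std)."""
    n, dims = len(features), len(features[0])
    means = [sum(f[k] for f in features) / n for k in range(dims)]
    # A component with zero spread gets std 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((f[k] - means[k]) ** 2 for f in features) / n) or 1.0
            for k in range(dims)]
    normalized = [[(f[k] - means[k]) / (3 * stds[k]) for k in range(dims)]
                  for f in features]
    return normalized, means, stds


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_library(query, library):
    """Return library indices sorted from most to least similar to the query."""
    normalized, means, stds = gaussian_normalize(library)
    q = [(query[k] - means[k]) / (3 * stds[k]) for k in range(len(query))]
    distances = [(euclidean(q, f), idx) for idx, f in enumerate(normalized)]
    return [idx for _, idx in sorted(distances)]
```

Normalizing with the library statistics and applying the same transform to the query keeps a component with a large numeric range from dominating the distance.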

4. Results and Discussion

4.1. Database and Evaluation Guidelines

In order to evaluate the retrieval performance of the proposed algorithm, the Corel-1k dataset, the Corel-10k dataset, and a self-built database are used in this paper. For the self-built database, we first construct a library of grayscale document images at different resolutions. The image library contains plain-text document images as well as mixed images that include text, pictures, and tables. In the experiment, 50 document images from 10 categories were selected as queries, and the averages of the accuracy and the retrieval rate over the 50 searches were used as the final results of the algorithm.

4.2. Performance Analysis of the Double Generalized Gaussian Model

It is very important to extract the texture characteristics of the image and the statistical distribution characteristics of each subband accurately and effectively in the multiscale transform domain. Observation of the density histograms of the coefficient matrices shows that the density distribution of the texture image coefficient matrix is not strictly symmetrical, so this paper extracts texture features based on the double generalized Gaussian mixture model. Some subclasses of images cannot be modelled accurately by the method of literature [26], because that method is sensitive to the choice of initial value: the estimated results are relatively accurate only if the selected initial value is close to the true value.

Using the DT-CWT principle, each image is transformed by the double-tree complex wavelet, and its subband coefficients are extracted. The histogram of the subband coefficients is used to approximate the probability distribution function of the wavelet coefficients, and we test whether this probability distribution curve is close to the double generalized Gaussian distribution model and compare it with the curve of the single generalized Gaussian density function. The results of the simulation are shown in Figure 4.

The experimental results in Figure 5 show a significant left-right difference in the fitted curve, with the distribution density to the left of zero slightly larger. In summary, the double generalized Gaussian mixture model can overcome the asymmetry of the wavelet coefficient distribution of texture images, reasonably describe the image content information, and fully characterize the texture.

4.3. Effectiveness Analysis of Texture Feature Algorithm

In order to compare the performance of various methods more intuitively, precision rate, recall rate, and MAP on the dataset are presented in this section. The performance curves of the method presented in this paper are all higher than those of other comparison methods, which show the effectiveness of the method to a certain extent.

In order to objectively analyse the validity of the texture characteristics, the image retrieval results of the algorithm in this paper are compared with those of methods based on the gray level cooccurrence matrix (GLCM) [28], the double-tree complex wavelet transform combined with the gray level cooccurrence matrix (DT-GLCM) [29], and the improved double-tree complex wavelet combined with the gray-gradient cooccurrence matrix (DT-GGCM) [30]. The results are shown in Figure 6.

As can be seen from Figure 6(a), when the first 10 images are retrieved, the average accuracy of the algorithm in this paper is 89%. When the first 50 images are returned, the retrieval accuracy is 75%. The ROC curves in Figure 6(b) show that the retrieval rate and accuracy of the algorithm in this paper are better than those of the three comparison algorithms. Therefore, the texture features extracted by the algorithm in this paper describe the texture content of the image more comprehensively, which improves the algorithm's ability to describe image texture and thus the retrieval accuracy.

In order to test how well the algorithm describes the image, the image retrieval results of the algorithm in this paper are compared with those based on the gray level cooccurrence matrix combined with colour characteristics (GLCM-H) [31], the double-tree complex wavelet transform and gray level cooccurrence matrix combined with shape features (DT-GLCM-Hu) [32–34], and the improved double-tree complex wavelet combined with the gray-gradient cooccurrence matrix (DT-GGCM).

As can be seen from Figure 7(a), when retrieving the first 10 images, the average accuracy of the algorithm in this paper is higher than that of similar related algorithms. As can be seen from the ROC curve of Figure 7(b), the retrieval rate and accuracy of the algorithm in this paper are better than those of the other three algorithms. Therefore, the proposed improved texture feature extraction algorithm applied to library image retrieval has some validity.

In addition, the above experiments involve the texture characteristics of library images. The results show that the algorithm can mine the texture characteristics of the image in depth and, to a certain extent, portray images effectively, which improves retrieval performance. To verify the performance of this method further, this section presents a comparative experiment with other methods on the Corel-1k dataset. In the comparison of precision and recall rate, 10 images were randomly selected from each image category as query images.

Table 1 and Figure 8 show the precision, recall rate, and MAP of different methods on the Corel-1k dataset. The table gives the average retrieval accuracy over all query images when the first 10 images are retrieved. Compared with the comparison method that obtains the highest retrieval accuracy, the retrieval accuracy of this method on the dataset is improved by 9.6%. Compared with the other methods, the recall rate of this method is increased by 9.71%. MAP is the mean of the average precision of all query images over the first 10 retrieved images. Compared with the other methods, the MAP of this method is improved by 6.93%. These results show that our method outperforms the other methods on the Corel-1k dataset.
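
The three evaluation measures can be made concrete with a small sketch. It assumes each retrieved image carries a category label and that an image is relevant when its label equals the query's label; average precision is normalized here by the number of relevant images found in the top k, which is one common variant and may differ from the definition used for Table 1. All function names are hypothetical.

```python
def precision_recall_at_k(ranked_labels, query_label, k, relevant_total):
    """Precision and recall over the first k retrieved images."""
    hits = sum(1 for label in ranked_labels[:k] if label == query_label)
    return hits / k, hits / relevant_total


def average_precision_at_k(ranked_labels, query_label, k):
    """Average of the precision values at each rank where a hit occurs."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries, k):
    """Mean of average precision over (ranked_labels, query_label) pairs."""
    return sum(average_precision_at_k(r, q, k) for r, q in queries) / len(queries)
```

For example, a ranking with labels a, b, a, a, c for a query of category a has hits at ranks 1, 3, and 4, so precision at 5 is 3/5 and the average precision is (1/1 + 2/3 + 3/4) / 3.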

5. Conclusion

Based on the theory of content-based image retrieval, a document image retrieval algorithm based on texture characteristics is proposed. Under the double-tree complex wavelet transform, the double generalized Gaussian mixture model is used to handle the fact that the density distribution of the wavelet coefficients is not completely symmetrical. Moreover, the parameter estimation method based on curve fitting keeps the estimation algorithm efficient. Secondly, according to the statistical feature method, combined with the visual characteristics of the human eye, the edge information in the document image is extracted. On this basis, this paper constructs meaningful texture features and uses them to define feature descriptors of document images. From these descriptors, the feature vector for retrieval is extracted, and rapid and accurate retrieval of the document image is realized. Experimental results show that the algorithm has good retrieval efficiency, especially for documents containing charts.

Text retrieval in images is what we will study next. By detecting the text in the image, the retrieval of digital books in the digital library can be realized.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no known competing economic interests or personal relationships that may affect the work reported in this paper.

Acknowledgments

This paper is one of the research results of the Guangzhou Philosophy and Social Science Planning 2020, “Guangzhou Online Library Construction and Youth Thought Leadership Relationship Research” (Project No. 2020GZGJ180).