Abstract

The clustering problem has been studied extensively over the last 50 years, yet it continues to attract the attention of researchers. This paper presents the topological basis of a pseudometric-based clustering model that takes into account the local and global topological properties of the data to be clustered through the definition of a homogeneity measure. The proposed approach considers the homogeneity effect produced when a new particle is added to a group. The new element can be added to the group if the group's local homogeneity is not altered; in that case, it is not necessary to test it against any other group. A new group is generated if the local homogeneity threshold of the group is exceeded. Theoretical results, their implementation, and their application to the problem of Content-Based Image Retrieval (CBIR) are presented. The tests were performed using three image databases widely used in the literature: “Vogel and Shiele,” “Oliva and Torralba,” and “L. Fei-Fei, R. Fergus and P. Perona.” The results are presented and compared with the most competitive methods available in the literature.

1. Introduction

Nowadays, the large amount of data on the Internet requires grouping or clustering in order to extract the relevant information from it. Typical clustering algorithms are used extensively in data science, data mining, and pattern recognition to group information with common characteristics or to determine the optimal number of groups.

The literature reports many clustering paradigms, among which the most important can be categorized into Partitional Clustering [1–5] and its variants [6, 7], Hierarchical Clustering [1, 3, 8–12], Density-based Clustering [1, 3, 13–16], Grid-based Clustering [1, 3, 17–20], Spectral Clustering [1, 3, 21–24], and Gravitational Clustering [25–29]. The literature review on clustering is given in the next section.

In this research paper, a new clustering theory inspired by the thermodynamic principle of energy and based on the topological paradigm is presented. The proposal works in accordance with the homogeneity effect that occurs when a group receives a new particle. A test against another group can be avoided if the new element does not alter the local homogeneity of the group to which it is added. Conversely, when the threshold value of the group is surpassed, a new group must be generated. The process continues until all the elements have been assigned to groups. Subsequently, the proposal is applied to Content-Based Image Retrieval (CBIR) in databases of natural scenery images.

The CBIR technique provides a support-oriented tool for image understanding in which the aim is to associate and recognize an image only by its content. Research and development in CBIR have enabled it to reach the market in the form of products such as QBIC-IBM (http://www.qbic.almaden.ibm.com/), VisualSEEk (http://www.ee.columbia.edu/ln/dvmm/researchProjects/MultimediaIndexing/VisualSEEk/VisualSEEk.htm), and MARS (https://www.ideals.illinois.edu/handle/2142/25947). The increasing attention it receives and the search for efficient methods have made it an active research area [30–33].

Regardless of the purpose of the users, each type of image has its own specific problems in terms of recognition and classification. Therefore, CBIR-based techniques are designed to focus on one type of image, with natural scenery being one of the most complicated due to its mixture of irregular shapes and colors.

Our novel approach was tested and validated using three databases widely used in the CBIR literature, which are described below:
(i) Vogel and Shiele (VS) [34]: 700 images classified as 144 coast, 103 forest, 179 mountain, 131 prairie, 111 river/lake, and 32 sky/cloud.
(ii) Oliva and Torralba (OT) [35]: 1472 images classified as 360 coast, 328 forest, 374 mountain, and 410 prairie.
(iii) L. Fei–Fei, R. Fergus, and P. Perona (FP) [36]: 373 images classified as 128 Bonsai, 60 Joshua Tree, 85 Sun Flower, 64 Lotus, and 36 Water Lily.

We will refer to these image databases as VS, OT, and FP (Caltech-101).

Our results are compared with those of [34, 37–39]. We would like to point out that our proposal achieves a recognition percentage of .

The system is implemented using the LabVIEW programming language (Laboratory Virtual Instrument Engineering Workbench) [40], chosen for its graphical ease of programming and the facility with which it interfaces with input-output devices. This aspect is considered important because commercial computer-based devices can use this software. In the future, we envision implementing an autonomous natural-scene recognition system mounted on a car and managed by LabVIEW [41–43]. Recognizing natural scenes with certainty during the navigation of an autonomous car, or possibly a drone, would make the proposed system an important element in the safety of autonomous vehicles.

The rest of the paper is organized as follows. In Section 2, related works are discussed. The proposed approach is presented in Section 3. A detailed description of the topology-based theory is given in Section 4. Section 5 presents the system methodology. Experimental results are given in Section 6. Finally, conclusions and future works are given in Section 7.

2. State-of-the-Art

Clustering algorithms are suitable for identifying and separating large databases into classes, and they are among the major pattern recognition techniques. A clustering algorithm is expected to divide the set of features into subsets that optimize intra-subset similarity and inter-subset dissimilarity, where the similarity measure is defined beforehand.

Clustering algorithms are used extensively in data science and data mining, where the objective is to group information that has common characteristics as well as to determine the optimal number of groups.

Several surveys give an overview of clustering techniques [44–50]; most of them divide clustering algorithms according to their paradigms. The most important clustering paradigms reported in the literature are Partitional Clustering, Hierarchical Clustering, Density-based Clustering, Spectral Clustering, and Gravitational Clustering, which are summarized in Table 1. Despite the large number of clustering algorithms, no single one works for all types of clustering problems. The algorithms reported in the literature have been developed and adapted for the particular data to be clustered. Their fundamental problem is that they overlook the topology of the data, both locally and globally, which limits them to working only on the kind of data for which they were developed.

This work proposes a homogeneity measure that takes into account the local and global topological properties of the data to be clustered (see Sections 3 and 4).

Our proposal is an attempt to overcome the main limitations of typical clustering algorithms. With the topological model based on a pseudometric, there is no longer any need to establish prior knowledge of the data, so it is not necessary to define any density-based function; moreover, unlike the partitional and gravitational paradigms, the model is not restricted to hyper-spherical separation functions. Because the model is based on the algebra of sets, it does not occupy excessive memory, as the density-based, spectral, and gravitational paradigms do, and the calculations are faster. This research paper aims to compare the performance of different paradigms in clustering and classifying databases of natural scenery through the CBIR methodology. The advantages and disadvantages of the different clustering algorithms are beyond the scope of this research paper.

The most competitive CBIR works addressing the same natural-scene databases and scenarios are the following:
(i) Serrano [39]: used the k-means algorithm (partitional paradigm) as the clustering technique and a k-nearest neighbor (K-NN) classifier.
(ii) Bosch [37]: used Probabilistic Latent Semantic Analysis (PLSA), which is based on a density function (density-based and partitional paradigms), with K-NN and support vector machine (SVM) classifiers.
(iii) Lazebnik [38]: used a multiresolution grouping methodology based on histograms (partitional paradigm) and an SVM classifier.
(iv) Vogel [34]: used grid segmentation (partitional paradigm) with K-NN and SVM classifiers.

Authors take into account eleven classes: coast, forest, mountain, prairie, schooner, bonsai, lotus flower, aquatic flower, sky, cloud, and Joshua’s tree. The classification performance of the authors does not reach (about ). Throughout our clustering method, is reached.

3. Proposal

The traditional models categorize the images of different natural sceneries (forest, desert, mountain, river, etc.) by recognizing the necessary and sufficient characteristics of that particular image or images. However, there is little consensus on what these characteristics should be.

The clustering process reflects the way visual information is used, associating common elements between different images. Our hypothesis is therefore that categories of natural scenery can be associated and created using mathematical functions that measure association, with the decision made on the basis of that measurement. The approach developed in this research paper compares the categorization patterns and decides whether or not the images belong to the same category.

Therefore, the following model of human thought is considered. A person groups different objects by evaluating the attributes and properties of each object with respect to a given class. This is done by generating association values based on a single function, the affinity function, which measures how close one image is to another. The individual generates these values, decides whether it is convenient to associate an object with a group, and thus forms partitions.

The principle of homogeneity states that, for a closed system, the maximum internal homogeneity is reduced to a minimum value at equilibrium. Following this line of thought, if we associate the ensembles with a homogeneity function, which is obtained in principle from the affinity function, a group of images will be formed by common elements as long as they are kept below the homogeneity level of the group.

A subset of a group, whose elements we call representatives, may be designated. An object is well associated with a group when there is affinity between the new object and the representatives of the group, that is, when adding the object keeps the homogeneity level of the group low and maintains the qualities of that group, so that the association of further objects becomes clearer. In such a case, it is required that even the lowest affinity of the object remains below the homogeneity level. If the object turns out to have no relation to this grouping, it is rejected, and a new group is created whose elements are labelled as ‘other’ because they have no properties similar to each other or to the created groups.

4. Topological-Based Clustering Theory

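Definition 1. Let $X$ be a nonempty set. The power set $\mathcal{P}(X)$ is the set of all subsets of $X$. The Cartesian product of sets $X$ and $Y$ is the set $X \times Y$ of all ordered pairs $(x, y)$ with $x \in X$ and $y \in Y$ [51].
A partition of a set $X$ is a set of nonempty subsets of $X$ such that every element $x \in X$ is in exactly one of these subsets. Equivalently, a family of sets $\{A_i\}_{i \in I}$ is a partition of $X$ if and only if the following conditions are satisfied:
(i) For each $i \in I$, $A_i \neq \emptyset$.
(ii) For $i, j \in I$, $i \neq j$ implies $A_i \cap A_j = \emptyset$.
(iii) $\bigcup_{i \in I} A_i = X$.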

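Definition 2. A pseudometric on a set $X$ is a function $d : X \times X \to [0, \infty)$ satisfying the conditions defined by (1), for all $x$, $y$, and $z \in X$:
$$d(x, x) = 0, \qquad d(x, y) = d(y, x), \qquad d(x, z) \le d(x, y) + d(y, z). \tag{1}$$
The pair $(X, d)$ is called a pseudometric space.
Besides, for all $x \in X$ and $r > 0$, an open-ball function of radius $r$ and around $x$ is defined as $B(x, r) = \{ y \in X : d(x, y) < r \}$.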

Definition 3. Let be a pseudometric space and be a subset of . If an element satisfies the condition (2), then is called representative of , denoted by .For all , .

Definition 4. Let be a pseudometric space and be a subset of with ; then the distance between and is defined asAlso, a homogeneity function , with , is defined aswhere .
It is noted that the calculation of the homogeneity of a set is independent of the choice of the representative of . By definition, if and are representative of , then, the following condition must be satisfied:A point is related to if .

Definition 5. Let be a pseudometric space and . A partition of is called a -partition if for each .

Definition 6. Let be a pseudometric space, be a subset of , and be the set of representatives of . The set (6) given below is called a -cover of .

Lemma 7. Let be a pseudometric space. For each finite set , the set of representatives is a nonempty set.

Proof. For the set , considerstake . There is an such that . For definition of it follows thatfor each . This proves that is a representative of .

Example 8. Consider the sets , and in with the Euclidean metric (see Figure 1). The elements of the set marked with a dot are distributed in two concentric circles with radii 1, and 2. The representative of the set is the element in the center of the circle marked with a star. Consider the subsets and as removing a circle from (see Figure 1). Also, it can be noted that the homogeneity of set , , and is , , and , respectively.
This example shows that the homogeneity is not invariant under subsets, for does not imply .
The following result shows one way to preserve the homogeneity.

Proposition 9. Let be a pseudometric space, , and and be the set of representatives and homogeneity of , respectively. The -star cover with satisfies .

Proof. Consider and ; then . Let .It can construct the succession which satisfied for and that . Therefore .

Theorem 10. Let be a pseudometric space and . There exists a partition of X, where is -partition and the element are not related to for each .

Proof. It will construct a family of sets which satisfies the following properties for each : Take . If then . Otherwise, define . It can assume without loss of generality that there exists such that condition is satisfied for . Let and ; then take and consider . By Proposition 9 the following inequalities hold:Therefore condition is satisfied. The properties (i) and (ii) are fully satisfied.
Suppose that, for , a succession of families of sets has been constructed satisfying the conditions (i)-(iii).
Let and take .
Case I ( is not related to any ). Define .
If then , and . By definition of the property , therefore property (ii) is satisfied. The property (i) and (ii) are preserved due to the family of sets is the same in and .
If then , , and .
Note and . Properties (i) and (ii) are easily verified. is a the condition (iii) is satisfied.
Case II (there exists such that is related to ). Consider the set of indexes and let . Defines . Since is related to the inequality implies property (iii). Note that . Therefore properties (i) and (ii) are satisfied.
By induction, there exists a decomposition of the set of the form , where the family of sets satisfies with for each . The theorem is proved.

Observe that the theorem indicates an algorithm which allows to create partitions of a set from a pseudometric and a parameter . This partition consists of a family of sets whose elements are clusters. Elements of the set do not belong to any class.
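The sequential procedure suggested by the theorem can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the authors' LabVIEW implementation: it assumes that the homogeneity of a finite set is the smallest radius, measured from one of its own members, that covers the whole set, and that an element joins the first group whose homogeneity stays at or below a threshold. The names `homogeneity`, `threshold_partition`, and `eps` are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def homogeneity(members: Sequence[T], dist: Callable[[T, T], float]) -> float:
    """Assumed homogeneity: the smallest covering radius over all candidate
    representatives taken from the set itself."""
    return min(max(dist(center, m) for m in members) for center in members)


def threshold_partition(
    points: Sequence[T], dist: Callable[[T, T], float], eps: float
) -> List[List[T]]:
    """Assign each point to the first group that stays within `eps`;
    otherwise open a new group. Every point ends up in exactly one group."""
    clusters: List[List[T]] = []
    for p in points:
        for cluster in clusters:
            # Accept the point only if the group's homogeneity is preserved.
            if homogeneity(cluster + [p], dist) <= eps:
                cluster.append(p)
                break  # no need to test the remaining groups
        else:
            clusters.append([p])  # threshold exceeded everywhere: new group
    return clusters
```

Any function satisfying the pseudometric axioms can be passed as `dist`, for example the absolute difference on real numbers or a histogram distance on images. The result depends on the order in which the points are presented.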

5. Methodology

In this section, we describe each stage of the proposed methodology (applying our topological clustering theory based on a pseudometric) for image retrieval of natural scenes from a database. As shown in Figure 2, the flow diagram for the whole proposal, there is a training phase and a testing phase explained as follows.

5.1. Training Stage

This stage is divided into four main phases, shown in the upper line of Figure 2 and described below: image database, histogram estimation, distance calculation, and addition of the object to the suitable cluster.

5.1.1. Image Database

To develop a CBIR system, it is necessary to use databases of images for the training of the system. In this phase, standard and free-use databases such as Vogel and Shiele (VS) [34], Oliva and Torralba (OT) [35], and L. Fei–Fei, R. Fergus, and P. Perona (FP) [36] can be used.

The classes used in both the training and the experimentation phases are shown in Figure 3.

5.1.2. Histogram Estimation

The histogram provides a summary of the distribution of pixel values in an image. The color histogram of an image is relatively invariant with respect to translation and rotation on the axes. The comparison of the colors contained in two images can be made using histograms. The color histograms are suitable for this investigation [52–54]. A histogram represents the number of occurrences of the values of a data set (see the following):

A good practice is to normalize the histogram so that it can be treated as a probability density function. This preprocessing step brings the values into a common interval regardless of the number of elements in the image.
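As an illustration of histogram estimation and normalization, the following Python sketch counts the occurrences of each value in one channel and divides by the number of pixels, so that the bins sum to 1. The function name `normalized_histogram` and the default of 256 bins (8-bit channels) are assumptions made for this example.

```python
from typing import List, Sequence


def normalized_histogram(values: Sequence[int], bins: int = 256) -> List[float]:
    """Count occurrences of each integer value in [0, bins) and normalize
    the counts so that they sum to 1 (an empty input yields all zeros)."""
    counts = [0] * bins
    for v in values:
        counts[v] += 1
    total = len(values)
    if total == 0:
        return [0.0] * bins
    return [c / total for c in counts]
```

Because of the normalization, two images of different sizes yield histograms that can be compared directly.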

In this part, the values of the image are still expressed in RGB color space. We also use HSI (Hue, Saturation, and Intensity) values. The first step for the conversion is to normalize the RGB values, as

After normalization, the HSI components are obtained by

For convenience, for neutral color values such as black, white, and gray, where values for R, G, and B are equal, H = 0. Similarly, the H, S, and I values are converted into the following ranges:
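One commonly used formulation of the RGB-to-HSI conversion is sketched below in Python. It is an assumption that this matches the formulation used here: the sketch first normalizes RGB by the channel sum, computes the hue from an arccosine expression, sets H = 0 for neutral colors, and finally rescales to the assumed ranges H in [0, 360], S in [0, 100], and I in [0, 255].

```python
import math
from typing import Tuple


def rgb_to_hsi(red: int, green: int, blue: int) -> Tuple[float, float, float]:
    """Convert 8-bit RGB to (H in degrees, S in percent, I in 0..255)."""
    total = red + green + blue
    if total == 0:
        return 0.0, 0.0, 0.0  # pure black: all components set to zero
    r, g, b = red / total, green / total, blue / total  # normalized RGB
    if red == green == blue:
        hue = 0.0  # neutral colors (gray, white): hue is set to 0 by convention
    else:
        num = 0.5 * ((r - g) + (r - b))
        den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
        theta = math.acos(max(-1.0, min(1.0, num / den)))  # clamp rounding error
        hue = theta if b <= g else 2.0 * math.pi - theta
    saturation = 1.0 - 3.0 * min(r, g, b)
    intensity = total / (3.0 * 255.0)
    return math.degrees(hue), saturation * 100.0, intensity * 255.0
```

With this formulation, pure red, green, and blue map to hues of approximately 0, 120, and 240 degrees, respectively.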

5.1.3. Calculating the Distance

Consider the set of images. There is an association of an image to the corresponding histogram . For two images and , the following relationship is defined:

In this case the function is a pseudometric.
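A small Python sketch can make the distinction between a metric and a pseudometric concrete: two different images with the same histogram are at distance zero. The sketch uses the L1 distance between normalized histograms purely for illustration, which is an assumption and not necessarily one of the functions used in this study; the names `histogram` and `image_distance` are hypothetical.

```python
from typing import List, Sequence


def histogram(pixels: Sequence[int], bins: int) -> List[float]:
    """Normalized histogram of integer pixel values in [0, bins)."""
    counts = [0] * bins
    for p in pixels:
        counts[p] += 1
    return [c / len(pixels) for c in counts]


def image_distance(img_a: Sequence[int], img_b: Sequence[int], bins: int = 4) -> float:
    """Distance between images induced by the L1 distance of their histograms."""
    ha, hb = histogram(img_a, bins), histogram(img_b, bins)
    return sum(abs(x - y) for x, y in zip(ha, hb))
```

For instance, the pixel sequences `[0, 1, 2, 3]` and `[3, 2, 1, 0]` are different images but have identical histograms, so their distance is zero, while symmetry and the triangle inequality still hold.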

Take , , and images in and , , and with their corresponding histograms. The properties of Definition 2 are fulfilled:

A pseudometric has thus been built on the space of images. Theorem 10 allows clusters to be created and image retrieval to be performed. For this study, we consider the following distance functions [55].

Intersection:

Chi-Square:

Correlation:

The calculation of the distances must be modified in order to eliminate the indeterminate forms. If the direct substitution produces an indeterminate form then these values are discarded.
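The three comparison functions can be sketched in Python as follows. The formulas follow the definitions commonly used for histogram comparison (for example, in OpenCV's `compareHist`); whether these are exactly the forms used in this study is an assumption. Terms or results that would produce an indeterminate form are discarded, as described above: chi-square skips bins whose denominator is zero, and correlation returns `None` when its denominator is zero.

```python
import math
from typing import Optional, Sequence


def intersection(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of bin-wise minima."""
    return sum(min(x, y) for x, y in zip(a, b))


def chi_square(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of (a_i - b_i)^2 / a_i, skipping bins where a_i is zero."""
    return sum((x - y) ** 2 / x for x, y in zip(a, b) if x != 0)


def correlation(a: Sequence[float], b: Sequence[float]) -> Optional[float]:
    """Pearson correlation of the bins; None if it is indeterminate."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return num / den if den != 0 else None
```

Note that intersection and correlation grow with similarity, whereas chi-square grows with dissimilarity, so the direction of the comparison has to be handled consistently when ranking images.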

5.1.4. Add the Object into the Suitable Cluster

Having the representatives of each cluster, the homogeneity of the element to be added to the representatives is calculated. Through Theorem 10 and Proposition 9 of Section 4, homogeneities are calculated for each representative. Then, the subject image is assigned to the cluster where homogeneity is not altered.

5.2. Testing Stage

This stage is divided into five main tasks and proceeds as depicted in the bottom line of Figure 2. A query image is presented to the system, and the same first two processing steps used for learning are applied to it: histogram estimation and distance calculation, in this case from the query to the representatives. At this point, Theorem 10 is applied, measuring through the defined pseudometric. By Proposition 9 of Section 4, homogeneities are calculated for each cluster defined in the learning phase. Then, the query image is assigned to the cluster whose homogeneity is not altered. Therefore, the -images of the cluster are recovered.
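The testing stage can be approximated by the following Python sketch. It is a simplification under stated assumptions rather than the procedure implemented in LabVIEW: the homogeneity test is replaced by a check that the query lies within a threshold `eps` of the nearest cluster representative, and the members of that cluster are then ranked by distance to the query. The names `retrieve`, `clusters`, `representatives`, `eps`, and `k` are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def retrieve(
    query: T,
    clusters: Dict[str, List[T]],
    representatives: Dict[str, T],
    dist: Callable[[T, T], float],
    eps: float,
    k: int = 5,
) -> Tuple[Optional[str], List[T]]:
    """Return the label of the matching cluster and its k closest members,
    or (None, []) when no representative is within `eps` of the query."""
    best = min(representatives, key=lambda label: dist(query, representatives[label]))
    if dist(query, representatives[best]) > eps:
        return None, []  # the query is not related to any cluster
    ranked = sorted(clusters[best], key=lambda item: dist(query, item))
    return best, ranked[:k]
```

In an image retrieval setting, the items would be image histograms and `dist` one of the histogram comparison functions.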

6. Experimental Results

In order to test the overall performance of this proposal, the three databases were concatenated and then divided by classes. The division was made into eleven classes (see Table 2) because natural scenery is one of the most complicated image types due to its textures, mixture of irregular shapes, and color variety.

All tests were carried out under a cross-validation scheme in which half of the images of each class are randomly selected from the database for training and the other half are used for testing. This procedure is repeated for 20 iterations, and the average retrieval percentage is reported.
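The evaluation protocol can be sketched in Python as a repeated random half split per class. This is a minimal sketch: the function name `repeated_half_split`, the `evaluate` callback (which is assumed to train on one half and return a retrieval score on the other), and the fixed `seed` are hypothetical and introduced only for illustration.

```python
import random
from typing import Callable, Dict, List


def repeated_half_split(
    items_by_class: Dict[str, List],
    evaluate: Callable[[Dict[str, List], Dict[str, List]], float],
    iterations: int = 20,
    seed: int = 0,
) -> float:
    """Average score over repeated random 50/50 splits of every class."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        train, test = {}, {}
        for label, items in items_by_class.items():
            shuffled = list(items)
            rng.shuffle(shuffled)
            half = len(shuffled) // 2
            train[label] = shuffled[:half]  # first half for training
            test[label] = shuffled[half:]   # remaining images for testing
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)
```

Splitting each class separately keeps the class proportions the same in the training and testing halves.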

The system was implemented using the LabVIEW programming language, a systems engineering software environment that facilitates testing, measurement, and control with quick access to hardware and data. Programs developed with LabVIEW are called Virtual Instruments, or VIs. LabVIEW originated in instrument control, although today it has expanded widely, not only to the control of all types of electronics but also to embedded programming, communications, mathematics, etc. Its main characteristics are its ease of use for professional programmers, the possibility of building relatively complex programs, and the facility to interface with input-output devices. In the future, we envision implementing an autonomous natural-scene recognition system mounted on a car or a drone and managed by LabVIEW [41–43]. Recognizing natural scenes with certainty during the navigation of an autonomous car, or possibly a drone, contributes to the safety of autonomous vehicles.

The developed LabVIEW application screenshot is shown in Figure 4 where the query image, the main feature of the query image which is the histogram, and the retrieved image can be seen at the top left corner, at the lower left corner, and at the center right, respectively. The most related image file can be seen at the right bottom of the figure. Also, the histogram of the query image and the histogram of the most similar images can be seen.

Several subroutines were made to integrate two main modules: training and testing tasks. Only the latter has the image display for performance and processing speed issues. For each image of the selected database, the histogram is computed and sent to a .tdms file.

The value of the parameter is estimated experimentally as an intermediate value between the representatives. For each class, the average of the attributes was taken and defined as the representative. Only the representatives were used to determine the parameter . Therefore, the parameter was defined. From this point on, this parameter is used for the whole clustering algorithm. It was observed that a smaller value of the parameter restricts the classes of objects too much and creates many classes that appear different but contain common elements. A larger value significantly increases the size of the sets, adding elements that do not share common characteristics.

As there are eleven possible classes of natural scenarios, only three examples of query and retrieved images are presented: Bonsai (see Figure 5(a)), Prairie (see Figure 5(b)), and Forest (see Figure 5(c)). The query image (red label in each subfigure) is presented, and the system was configured to return the 5 most similar images. Note how, in all three cases, the 5 retrieved images are very similar to the query image. Because the query image is included in the training database, the first image found in the search result is identical to the query.

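In order to evaluate the performance of the system, the precision and recall formulas are used [56]:
$$\text{precision} = \frac{\text{number of relevant images retrieved}}{\text{total number of images retrieved}}, \qquad \text{recall} = \frac{\text{number of relevant images retrieved}}{\text{total number of relevant images in the database}}$$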
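The standard definitions of precision and recall can be computed as in the following Python sketch. The function name `precision_recall` and the convention of returning 0.0 when a denominator is empty are assumptions made for this example.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall(
    retrieved: Iterable[Hashable], relevant: Iterable[Hashable]
) -> Tuple[float, float]:
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

For example, if 5 images are retrieved, 4 of them belong to the class of the query, and that class contains 8 images in total, the precision is 4/5 and the recall is 4/8.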

In the experiment, the precision vs. recall graph is used for a query image. Graphs are produced for the three distances in the RGB and HSI color formats; averages were computed for each measure used [57].

The precision-recall analysis for the mountain, bonsai, and forest scenarios is presented in Figures 6, 7, and 8. It can be seen that the best results are obtained with the distance function “Intersection” for the RGB color model (dark blue line). Besides, the worst result was obtained for the “correlation” distance function in the HSI color model (green line).

Applying our proposal to the CBIR problem, a query image is always retrieved in the correct scenario. If the query image belongs to the training database, then it is found first. Otherwise, it is in the right scenario within the first 4 retrieved images (as an example, see Figure 5 for bonsai, prairie, and forest classes).

6.1. Comparison to Previous Classification Results: Natural Scene Classification

The comparison between the method proposed in this research paper and other competitive methods reported in the literature focuses on natural scenarios. Therefore, eleven classes are taken into account in the comparison: coast, forest, mountain, prairie, schooner, bonsai, lotus flower, aquatic flower, sky, cloud, and Joshua’s tree. The most competitive results on the same natural-scene databases and scenarios are those of Serrano [39], Bosch [37], Lazebnik [38], and Vogel [34].

As shown in Table 3 and Figure 9, where the authors used the same database, the proposed method improves the results in all cases and for all classes. Note that in Table 3 and Figure 9, the zero values of the other authors indicate that they did not work with the respective image class.

It should be noted that Bosch [37] has the same performance for the forest class as our model, but Bosch worked with only four classes. Besides that, Vogel [34] has percentages above in four classes: forest, prairie, skies, and clouds classes; the proposed method improves the result about , i.e., from (mean value for the four best classification) to . Finally, compared with the other two authors (Lazebnik [38] and Serrano [39]), our results are far better.

Finally, we compare our results with articles [30, 58–60], which use other image databases (Corel A, Corel B, and Caltech) that include some natural scenarios: coasts, flowers, and mountains. Table 4 shows that this proposal is superior.

We consider that we have obtained better results than the previously published ones because the proposed clustering algorithm, based on topology theory, takes into account both the local and the global topology of the data to be clustered.

7. Conclusions

The proposed theory of a pseudometric-based clustering model and its application to Content-Based Image Retrieval worked as expected. The developed clustering method, based on topology theory, has operated successfully in the clustering of natural scenery. The proposal, based on a topological pseudometric, provides a new paradigm within the literature on clustering algorithms.

This proposal takes into account the local and global topological properties of the data to be clustered, in a definition of homogeneity measurement.

The proposal tries to overcome the main limitations of typical clustering algorithms: (1) there is no longer any need to establish a priori knowledge of the data, so it is not necessary to define any density-based function; (2) there is no need to work only with hyper-spherical separation functions (as in the partitional and gravitational paradigms); and (3) as our proposal is based on the algebra of sets, calculations are faster and do not consume excessive memory (as the density-based, spectral, and gravitational paradigms do).

In the comparison with the most competitive works using the same image databases, the proposed method improves the results in all cases and for all classes.

A query image is always retrieved in the correct scenario. If the query image belongs to the training database, then it is found first. Otherwise, it is in the right scenario within the first 4 retrieved images. As a consequence, the whole system has an efficiency of for eleven natural sceneries.

The application was deployed using the LabVIEW vision assistant. This aspect is considered important because commercial computer-based devices can use this software. The observed processing time was linear with respect to the number of elements in the database. There was a considerable reduction in the training and testing times.

Theorem 10 is not limited to images; the algorithm can be used in any space on which a pseudometric is defined. The results obtained are very promising, and future work includes the application of these functions to other sets and applications.

As future work, the authors propose developing a CBIR system in FPGA hardware or mobile device, such as a camera device to acquire the images to process in real time using this methodology. Finally, the authors have shown a real-world experiment, in which this function obtained very good results.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

I. Osuna-Galán, Y. Pérez-Pimentel, Carlos Aviles-Cruz, and Juan Villegas-Cortez contributed equally to this work.