A Greedy Clustering Algorithm Based on Interval Pattern Concepts and the Problem of Optimal Box Positioning
We consider a clustering approach based on interval pattern concepts. Exact algorithms developed within the framework of this approach are unable to produce a solution for high-dimensional data in a reasonable time, so we propose a fast greedy algorithm which solves the problem in geometrical reformulation and shows a good rate of convergence and adequate accuracy for experimental high-dimensional data. Particularly, the algorithm provided high-quality clustering of tactile frames registered by Medical Tactile Endosurgical Complex.
We consider the problem of clustering, that is, splitting a finite set into disjoint subsets (called clusters) in such a way that points from the same cluster are similar (with respect to some criterion) and points from different clusters are dissimilar (see, e.g., ). It is convenient to present the input data in the form of a numerical context (table) whose rows correspond to objects and columns correspond to attributes of the objects.
Formal concept analysis (FCA) is a data analysis method based on applied lattice theory and order theory. The object-attribute binary relation is visualized with the use of the line diagram of the concept lattice. Within the framework of this theory a formal concept is defined as a pair (extent, intent) obeying a Galois connection (for exact definitions see the monograph  by Ganter and Wille).
There exist several generalizations of FCA to fuzzy and numerical contexts. One of them is known as the theory of pattern structures introduced by Ganter and Kuznetsov in . An important particular case of pattern concepts, which are the key object in the theory of pattern structures, is interval pattern concepts with the operation of interval intersection. Interval pattern concepts allow one to apply cluster analysis to rows of formal numerical contexts. In this case the criterion of similarity consists in belonging of all the differences between the values of the corresponding attributes to given intervals.
It can be easily seen that the problem of finding an interval pattern concept of maximum extent size (i.e., cardinality) can be reformulated as the problem of the optimal positioning of a -dimensional box with given edge lengths for the given set , that is, finding a position of the box that maximizes the number of points of the set enclosed by the box (the details are given below in Section 2.2).
The existing algorithms that solve the problem of finding the optimal position of a box do not allow one to obtain an exact or at least approximate solution for high-dimensional data within a reasonable time (see a detailed survey in Section 2.2). The main goal of this paper is to propose a greedy algorithm which gives an approximate solution to this problem and a clustering algorithm based on the optimal positioning problem. We propose a clustering algorithm with worst-case time and space complexity, where denotes the number of iterations of the main stage of the algorithm, and parameters and regulate the duration of each iteration. Greater number of iterations and greater duration of each iteration provide better approximation.
The rest of the paper is organized as follows. In Section 2 we introduce the main definitions and formalize the statement of the problem. In Sections 3 and 4 we formulate our algorithms. In Sections 5 and 6 we describe the validation results and make some concluding remarks.
2. Main Definitions and Statement of the Problem
In this section we start with the main definitions from the theory of formal concepts and then present a geometrical reformulation of the problem of finding the interval pattern concept of maximum extent size (we call it simply the maximum interval pattern concept).
2.1. Formal Concepts
Definition 1. An upper (lower) semilattice is a partially ordered set such that for any elements there exists a unique least upper bound (greatest lower bound, resp.).
Definition 2. A semilattice operation on the set is a binary operation : that features the following properties for a certain and any elements : (i) (idempotency).(ii) (commutativity).(iii) (associativity).(iv).
Definition 3. A lattice is an ordered set which is at the same time an upper and a lower semilattice.
Definition 4. Let and be partially ordered sets. A Galois connection between these sets is a pair of maps : and : (each of them is referred to as a Galois operator) such that the following relations hold for any and : (i) (anti-isotone property).(ii) (anti-isotone property).(iii) and (isotone property).Applying the Galois operator twice, namely, and , defines a closure operator.
Definition 5. A closure operator on is a map that assigns a closure to each subset under the following conditions: (i) (monotony).(ii) (extensity).(iii) (idempotency).
Definition 6. A pattern structure is a triple , where is a set of objects, is a meet-semilattice of potential object descriptions, and : is a function that associates descriptions with objects.
The Galois connection between the subsets of the set of objects and the set of descriptions for the pattern structure is defined as follows:
Definition 7. A pattern concept of the pattern structure is a pair , where is a subset of the set of objects and is one of the descriptions in the semilattice, such that and ; is called the pattern extent of the concept and is the pattern intent.
A particular case of a pattern concept is the interval pattern concept. The set consists of the rows of a numerical context, which are treated as tuples of intervals of zero length. An interval pattern concept is a pair , where is a subset of the set of objects and is a tuple of intervals with ends determined by the smallest and the largest values of the corresponding component in the descriptions of all objects in .
Interval pattern concepts are convenient to use in the analysis of numerical contexts, when there is a need to divide all data into clusters that comprise objects in which the numerical data is similarly “distributed” in the rows.
For each component of an interval pattern concept we introduce the width : the difference between the largest and the smallest values of the component. Then a clustering procedure can be defined using a standard greedy approach. Specifically, at each step the maximum interval pattern concept is identified, that is, an interval pattern concept with the maximum number of objects, whose width with respect to each component does not exceed a predefined . The objects of the identified interval pattern concept are combined into a cluster and excluded from the set of objects analyzed at subsequent steps.
In Example presented in Table 1 the objects are pupils and the numerical data of the context consist of the grades they got at exams in various disciplines.
We need to divide the set of pupils into clusters in such a way that the grades of pupils in the same cluster differ by at most 1 for each of the disciplines. Such a setting corresponds to ; in this case we obtain 6 clusters (interval pattern concepts whose width is not greater than 1), each containing one pupil. In the case we arrive at the same 6 clusters.
When we have five clusters , , , , , and in the case we obtain three clusters , , .
Example 2. In the conditions of the previous example let us set , , , . Then the set of pupils can be divided into four clusters , , , :
Clustering methods based on interval pattern concepts find applications in the analysis of experimental data. For instance, applications of such methods to gene expression analysis were discussed in [4, 5].
Let be a set of points in () and be a set of positive real numbers.
Definition 8. A -orthotope (also called a box) with center and edge lengths is the Cartesian product of the intervals
It can be easily seen that the problem of identification of maximum interval pattern concept can be reformulated in terms of finding the optimal position of the box with the edge lengths , that is, maximizing the number of points of set enclosed by the box. This formulation can be generalized to the problem of finding the optimal position of a ball in an arbitrary metric space, since any box can be treated as a ball in the stretched metric in which the distance between the points and is defined as
The problem of optimal positioning has been well studied for : some lower and sharp upper bounds on complexity are known (see, e.g., [6, 7]). However, to the best of our knowledge for the case of an arbitrary dimension no lower bounds and no efficient exact algorithms are available so far. de Figueiredo and da Fonseca noted  that the problem can be solved exactly in roughly time by projecting the points onto a -dimensional paraboloid and using half-space range searching data structures . In the same paper for the case of weighted points under certain additional restrictions they also obtained a lower bound for exact algorithms and indicated that existing algorithms for the unweighted version of the problem do not beat this lower bound in the worst case. Eckstein et al. showed that a generalization of the problem of optimal positioning whose input also includes a set of prohibited points is NP-hard .
Known approximate algorithms for optimal positioning also have time complexity which depends on exponentially. For example, de Figueiredo and da Fonseca suggested an approximate algorithm  which solves the problem in worst-case time , where is a given approximation parameter. Due to exponential dependence on these approximate algorithms are also practically inapplicable in the case of high dimension, and there is a need to develop an algorithm which can produce an approximate solution in reasonable time.
3. A Greedy Algorithm for Finding an Approximately Optimal Position of a Box
In this section we present a greedy algorithm for finding an approximately optimal position of a box with edge lengths for a set (the order in which points are listed in is insignificant). This algorithm is auxiliary for the clustering method described in Section 4.
The algorithm has several input parameters: positive real numbers , , and a function : . The parameters , , and regulate the duration of one iteration. The function takes the values and as inputs and returns the number of iterations at the main stage of the algorithm. Greater number of iterations and greater duration of each iteration provide better approximation.
The algorithm includes two basic stages: the preprocessing stage and the main stage.
(1)At the first stage of our algorithm the box with the edge lengths is transformed into the unit cube (we call it simply the cube) by means of dividing the th coordinate of each point by , . This stage can be performed in operations.(2)We consider the integer lattice with edges of length 1, compute the number of points of in each cell, and denote the cell that contains the maximum number of points by . The cell is called the base cube. Let denote the center of . This stage requires operations as well.(3)At the final step of the preprocessing stage we build a - tree data structure (which is used at the main stage to organize the fast range search) in operations with the space complexity of (see [11, 12]).
3.2. The Main Stage
Let : denote the function which counts number of points of the set in an arbitrary subset of . The main idea of our algorithm consists in constructing a finite sequence of cubes that starts from a random point in the base cube and satisfies the condition that the next cube contains more points than the previous one. Let denote these cubes with centers , respectively. In our notation we have and for all . After iterations the algorithm returns a locally optimal cube .
Definition 9. The -neighborhood of a cube with center is the set consisting of all cubes with centers at points of the form for all , that is, all cubes obtained through translation of along one of the axes by the distance .
Now we describe the procedure of constructing the sequence of cubes. Let be an arbitrary point in the base cube and , be the cube with center at , . In order to get a definite estimate on the precision of the algorithm (see Theorem 11) we initialize the first iteration deterministically by taking the center of as . Other iterations are initialized randomly.
Suppose that the cubes with centers , respectively, and the numbers have been already constructed. There are two possible cases.(1)If there exists a cube in the -neighborhood of such that , then we set , take the center of as , and take . In other words, if there exists a cube in the -neighborhood of the current cube which contains more points of , then we move the current cube to this position.(2)If there are no such cubes (i.e., all cubes in the -neighborhood of the current cube contain at most the same number of points), then we set , , and (i.e., decrease the current step size). If (the step size threshold is reached), then the procedure is ended and is returned as the procedure result.
In order to obtain acceptable time complexity we impose additional restrictions on the selection of the next cube. These assumptions are necessary to avoid the situation where the length of the sequence grows exponentially with . Validation on experimental data confirmed that these restrictions do not essentially affect the clustering results.
Restriction 1. All cubes in the sequence must have common points with the base cube .
In Figure 1 we present an example of a set for which this requirement causes a significant difference between the exact solution and the solution produced by the algorithm. However, this difference is essentially reduced at further steps of the clustering algorithm as generally it affects only the order in which clusters are constructed.
Restriction 2. For each individual coordinate it is not allowed to translate the cube in the opposite directions at different steps of the procedure described above.
The above restrictions lead to the following lemma.
Lemma 10. The main stage of the algorithm has worst-case time complexity.
Proof. First we get an upper estimate for the length of the sequence of cubes (for an arbitrary ). Due to Restrictions and we have Thus, Restrictions 1 and 2 avoid the situation where the length of the sequence grows exponentially with . Each step of the procedure of constructing the sequence of cubes requires evaluations of the function for the cubes (i.e., range searches). With the use of a - tree the range search can be performed with worst-case time complexity (see ). The procedure of constructing the sequence of cubes involves iterations, so the above complexity bound holds.
Note that we also have a trivial estimate , as grows, and hence . Thus, without the imposed restrictions the worst-case complexity estimate holds, and hence the restrictions can be omitted without violation of practical feasibility in case if the number of objects has the same order as the dimensionality .
3.3. Precision and Complexity of the Algorithm
Theorem 11. Let be a center of , be a cube produced by an algorithm iteration which was initialized with (and so for this iteration ), and be an optimal cube (i.e., , where maximum is taken over all unit cubes in ). Then and this estimate is sharp.
Proof. The upper estimate is trivial. The lower estimate follows from the fact that is covered by at most cells of the integer lattice with edges of length 1, and hence An example that shows that the estimate is sharp is similar to the example from Figure 1. For example, we can locate the center of at the integer lattice node and put points in in such a way that each cell of the integer lattice contains at most one of these points. Then, we select an arbitrary cell of the integer lattice that is distant from and put one point to this cell, which becomes .
Theorem 12. The algorithm for finding an approximately optimal position of the box has worst-case time complexity and space complexity.
Proof. Combining the estimates for the time and space complexity of the preprocessing stage and the main stage of the algorithm gives the bounds mentioned above.
Note that omitting Restrictions and results in the worst-case time complexity estimate
4. Clustering Algorithm
Now let us consider the clustering problem, that is, the problem of splitting the given set into mutually disjoint subsets . Following interval pattern concept approach, we construct clusters with controlled interval pattern concept width. We propose a clustering algorithm based on the greedy approach and the procedure for finding an approximately optimal position of a box described in Section 3. The algorithm is not sensitive to the order in which points are given. The parameters of the algorithm include positive real numbers and all parameters of the positioning algorithm, namely, , and .
First, we put and find an approximately optimal position of the box with the edge lengths for the set . Now suppose that the sets and have been already constructed and let . If then the procedure is ended. Else we find an approximately optimal position of the box for the set . The output of this procedure is a set of clusters .
In order to avoid producing a lot of small clusters consisting of outliers we impose one more restriction.
Restriction 3. The resulting clusters must include at least objects.
With this restriction if the size of is less than then the procedure ends (and points belonging to are considered unclustered and referred to as outliers).
Restriction together with Theorem 12 immediately leads to the following theorem.
Theorem 13. The clustering algorithm has worst-case time complexity and space complexity.
If Restrictions – are omitted, the worst-case time complexity estimate is
Validation of the clustering algorithm developed in this study was performed on a dataset of tactile images registered by the Medical Tactile Endosurgical Complex (MTEC) during examination of artificial samples. MTEC allows intraoperative mechanoreceptor tactile examination of tissues and is already used in endoscopic surgery [14–16]. As methods for automated analysis of medical tactile images are still insufficient, validation results in particular and the developed clustering algorithm in general provide new opportunities for the medical domain applications.
The key component of MTEC is a tactile mechanoreceptor [17, Fig. 1]. Its operating head is equipped with 19 pressure sensors that perform synchronous measurements 100 times per second. Each measurement result (called “a tactile frame” and consisting of 19 values) is wirelessly transmitted to a computer that performs preprocessing and visualization. The sensors are located at the operating head surface which is a circle with diameter 20 mm.
In order to create a dataset of tactile images we utilized MTEC for tactile examinations of three types of artificial samples. The samples were similar to the L-samples utilized in the study —they were made using a soft silicone (Ecoflex 00-10, Shore hardness 00-10A) according to manufacturer’s instructions and had a shape of a rectangular block with length, width, and height of 40 mm, 35 mm, and 11 mm, respectively. The difference was in sizes and shapes of hard inclusions enclosed in the samples. For the first sample type (ST1) the inclusion had a form of a spherical cap with base diameter 8 mm and height 2.4 mm oriented for palpation from the convex side. For the second sample type (ST2) the inclusion had a form of a spherical cap with a base diameter 4.7 mm and height 1.7 mm also oriented for palpation from the convex side. For the third sample type (ST3) the inclusions were the same as for ST2, but they were oriented for palpation from the flat side. For all sample types the inclusions were located in the center at height of approximately 3 mm. Thus, sample types were similar with a difference in either size or convexity of the inclusion. These samples simulated tissue with malignant neoplasms.
Totally 55 tactile examinations of the described samples were performed using MTEC. The contact angle was kept approximately equal to , and inclusions were located close to the center of the operating head surface. We performed twenty-two, seventeen, and sixteen examinations for samples of ST1, ST2, and ST3 types, respectively. For each examination one tactile frame was selected, namely, the one with the largest standard deviation (SD) of values, and other tactile frames were disregarded. Visualization of tactile frames for each sample type is presented in Figures 2(a)–2(c).
Thus, each examination was associated with a point in , and the total number of points was 55. This set of points was clustered using the developed clustering algorithm, and the results were compared with the results of -means clustering (, Euclidean distance; see, e.g., ), which was used as a reference. Scikit-learn implementation  of -means algorithm was utilized. Adjusted and raw Rand indexes (clustering result versus original classes; see, e.g., ) were used as compared characteristics of the clustering quality. Note that both clustering algorithms use random initialization, so multiple runs were performed for clustering quality estimation (specifically, 100 runs were performed to estimate Rand index for each algorithm with given parameters).
The results produced by both the proposed algorithm and by the -means algorithm were unsatisfactory. However, the poor quality of the resulting clustering was predictable as examining of a single sample can result in tactile frames that are essentially different with respect to representation by a point in due to rotation and slight shifts of a tactile mechanoreceptor.
To get better results we mapped the data to the new 9-dimensional space of attributes. The new attributes included(i)SD of all values in a tactile frame;(ii)mean and SD of the values corresponding to 7 middle sensors;(iii)mean and SD of the values corresponding to 12 outer sensors;(iv)mean and SD of the values corresponding to sensors that belong to the main diagonals (3 diagonals each consisting of 5 sensors, 13 sensors in total; see Figure 2(d) for details);(v)mean and SD of the values corresponding to sensors that belong to the secondary diagonals (6 diagonals each consisting of 4 sensors, 12 sensors in total; see Figure 2(d) for details). These attributes are robust to rotations proportional to . The values of mean and SD were computed after scaling the values to segment.
Transition to the new attribute space essentially improved the clustering quality, but our algorithm left 10–14 points as outliers ( was set equal to 8; the values of , , and were set equal to , , and respectively, and was set equal to for all attributes). A representative result of one run is presented in Table 2. Then we placed outliers points to the obtained clusters by the -nearest neighbors algorithm (, unweighted; see, e.g., ). A representative result for one run is presented in Table 3.
Table 4 contains mean values and SDs for Rand indices and timing information.
As one can see, the proposed algorithm has an acceptable running time, and both our and -means algorithm reach mean quality plateau already at 20 iterations.
The advantage of the proposed algorithm over the -means algorithm with respect to the clustering quality was statistically significant. For example, for 20 iterations and adjusted Rand index the comparison of our algorithm with outliers and the -means on 100 runs resulted in Mann–Whitney -test two-tailed value equal to . As outliers are the points that are the most difficult for clustering, the advantage of our algorithm complemented by kNN-attributing of outliers to clusters over the -means was lower but still firmly significant with Mann–Whitney -test two-tailed value equal to .
Interestingly, the transition to the new attribute space improved the quality of our algorithm more than the quality of the -means clustering. For example, for 20 iterations, adjusted Rand index, and 100 runs, the comparison of the clustering quality for the initial attribute space and the new attribute space resulted in Mann–Whitney -test two-tailed values not exceeding for both “with outliers” and “no outliers” versions of our algorithm, while for -means the value was .
In this paper we proposed a greedy clustering algorithm based on interval pattern concepts. The obtained theoretical estimate on algorithm complexity proved computational feasibility for high-dimensional spaces, and the validation on experimental data demonstrated high quality of the resulting clustering in comparison with conventional clustering algorithms such as -means.
Particular results obtained during validation, such as a new attribute space for tactile frames registered by the Medical Tactile Endosurgical Complex, have individual significance as they provide new opportunities for the medical domain applications aimed at automated analysis of tactile images.
Dataset of tactile frames used for the validation and the Python script that implements the developed clustering algorithm are available upon request from the authors.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
The authors thank Dr. Alexey V. Galatenko and Dr. Vladimir V. Galatenko for valuable comments and discussions. The research was supported by the Russian Science Foundation (Project 16-11-00058 “The Development of Methods and Algorithms for Automated Analysis of Medical Tactile Information and Classification of Tactile Images”).
M. Kaytoue, S. Duplessis, S. O. Kuznetsov, and A. Napoli, “Two fca-based methods for mining gene expression data,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5548, pp. 251–266, 2009.View at: Publisher Site | Google Scholar
V. M. Staroverov, V. V. Galatenko, T. V. Zykova, Y. I. Rakhmatulin, D. V. Rukhovich, and V. E. Podol'skii, “Automated real-time correction of intraoperative medical tactile images: sensitivity adjustment and suppression of contact angle artifact,” Applied Mathematical Sciences, vol. 10, pp. 2831–2842, 2016.View at: Publisher Site | Google Scholar
T. Mitchell, Machine Learning, McGraw Hill, New York, NY, USA, 1997.