Research Article | Open Access
Spatial Cluster Analysis by the Bin-Packing Problem and DNA Computing Technique
Spatial cluster analysis is an important data mining task. Typical techniques include CLARANS, density- and gravity-based clustering, and other algorithms based on the traditional von Neumann computing architecture. The purpose of this paper is to propose a technique for spatial cluster analysis based on sticker systems of DNA computing. We adopt the idea of the Bin-Packing Problem and then design sticker programming algorithms. In the case when only the intracluster dissimilarity is taken into account, the time complexity of the proposed technique is polynomial in the number of data points, in contrast to the NP-completeness of spatial cluster analysis. The new technique provides an alternative to traditional cluster analysis methods.
1. Introduction
Spatial cluster analysis is a traditional problem in knowledge discovery from databases. It has wide applications since increasingly large amounts of data obtained from satellite images, X-ray crystallography, or other automatic equipment are stored in spatial databases. The classical spatial clustering technique is due to Ng and Han, who developed a variant of the PAM algorithm called CLARANS, while new techniques continue to be proposed in the literature, aiming to reduce the time complexity or to fit more complicated cluster shapes.
For example, Bouguila proposed some model-based methods for unsupervised discrete feature selection. Wang et al. developed techniques to detect clusters with irregular boundaries using minimum spanning tree-based clustering algorithms. By using an efficient implementation of the cut and the cycle property of minimum spanning trees, they obtain a performance better than $O(N^2)$, where $N$ is the number of data points. In another paper, Wang and Huang developed a new density-based clustering framework using a level set approach. Data points are grouped into their corresponding clusters by a valley seeking method.
Adleman and Lipton pioneered a new era of DNA computing in 1994 with experiments demonstrating that the tools of laboratory molecular biology could be used to solve computational problems. Based on Adleman and Lipton's research, a number of applications of DNA computing to combinatorially complex problems in areas such as factorization, graph theory, control, and nanostructures have emerged. Theoretical studies have also appeared, including work on DNA computers, which are programmable, autonomous computing machines whose hardware consists of biological molecules; see  for details.
According to Păun et al., common DNA systems in DNA computing include the sticker system, the insertion-deletion system, the splicing system, and H systems. Among these, the sticker system can represent bits in a way similar to silicon computer memory. In a recent work, Alonso Sanches and Soma propose an algorithm based on the sticker model of DNA computing to solve the Bin-Packing Problem (BPP), which belongs to the class NP-hard in the strong sense. The authors show that their proposed algorithms run in polynomial time, with a bound expressed in terms of the number of items to be put in the bins, and they describe their work as the first attempt to use DNA computing for the Bin-Packing Problem.
Inspired by the work of Alonso Sanches and Soma, in this paper we propose a new DNA computing approach for spatial cluster analysis based on the Bin-Packing Problem technique. The basic idea is to take clusters as bins and allocate data points to bins. To evaluate a clustering, we need to accumulate the dissimilarities within clusters. The sticker system allows us to accomplish these tasks. We also show that, in the case when only intracluster dissimilarity is considered, our algorithm has a time complexity that is polynomial in the number of data points. Notice that cluster analysis is NP-complete. To the best of our knowledge, the method in this paper is new in cluster analysis.
The rest of this paper is organized as follows. In Section 2, we present the Bin-Packing Problem formulation of the spatial clustering problem for the purpose of this paper. In Section 3, some basic facts on the sticker model are presented, together with the implementation of some new operations. Sections 4 and 5 are devoted to the coding of the problem and to the clustering algorithms with the sticker system. Finally, a brief conclusion is given.
2. Formulation of the Problem
Let $\mathbb{R}^d$ be the real Euclidean space of dimension $d$. A subset $X \subset \mathbb{R}^d$ is called a spatial dataset with $n$ points and $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i \in \mathbb{R}^d$ for each $i = 1, \ldots, n$. A clustering problem over $X$ is to group the dataset $X$ into $k$ partitions called clusters, where the intracluster similarity is maximal and the intercluster similarity is minimal. In this sense, clustering is a two-level optimization process: one level is a maximization and the other is a minimization. Here the integer $k$ indicates the number of clusters. There are two kinds of clustering when we consider $k$ as a parameter. The first kind is fixed number clustering, where the number of clusters $k$ is determined a priori. The second kind is flexible clustering, where the number $k$ is chosen as one of the parameters of the two-level optimization problem.
Now we denote a partition of by with and for . If we define as the intracluster dissimilarity measure for and as the intercluster similarity measure for , then the two kinds of clustering problems are formulated as follows:
To simplify the multiplicity of optimization, we often use the following variation of the above problems:
For the purpose of this paper, we will use a simplified version of the total energy as shown in the following equation:
In the case when the number of clusters is a variable, the total energy is computed for nonempty clusters and the optimized number of clusters is the counting of nonempty bins:
We now propose a Bin-Packing Problem (BPP) formulation of the clustering problem as stated above. The classical one-dimensional BPP is given as a set of $n$ items with respective weights $w_1, \ldots, w_n$. The aim is to allocate all items into bins of equal capacity using a minimum number of bins. For clustering purposes we assume that there are $k$ empty bins and we allocate all $n$ items into the $k$ bins with least energy. If we consider $k$ as a variable, then the problem is to allocate $n$ points into $n$ bins with least energy. The capacity restriction is removed. For the two cases of clustering, there are altogether $k^n$ ($n^n$, resp.) combinations of allocation, and the best solution can be found by brute force search.
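The brute force search over allocations can be illustrated on a conventional computer. The following is a minimal sketch, not the authors' sticker algorithm: it assumes that the energy of an allocation is the sum of pairwise Euclidean distances inside each bin and that every bin must be nonempty. The function names `energy` and `brute_force_clustering` and the sample points are hypothetical.

```python
from itertools import product
from math import dist


def energy(points, alloc, k):
    """Assumed energy: sum of pairwise Euclidean distances inside each bin."""
    total = 0.0
    for b in range(k):
        members = [p for p, a in zip(points, alloc) if a == b]
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                total += dist(members[i], members[j])
    return total


def brute_force_clustering(points, k):
    """Enumerate all k**n allocations and keep the feasible one with least energy."""
    best_alloc, best_energy = None, float("inf")
    for alloc in product(range(k), repeat=len(points)):
        if len(set(alloc)) < k:  # at least one bin is empty: skip
            continue
        e = energy(points, alloc, k)
        if e < best_energy:
            best_alloc, best_energy = alloc, e
    return best_alloc, best_energy


# Hypothetical data: two well separated pairs of points in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
alloc, e = brute_force_clustering(points, k=2)
print(alloc, e)
```

The loop visits every allocation once, so its running time grows exponentially with the number of points on sequential hardware. This is the search space that the sticker algorithms are meant to explore in parallel.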
First we consider the case when is fixed. To solve the problem, we consider an array of integers
The th bin (cluster) is defined as for . We will identify the allocation with its corresponding partition. Therefore the energy function is defined on of all allocations. In order to guarantee the bins are nonempty, we need to add a restriction that for , where denote the cardinality of the set .
Then the final problem is
Next when is a variable, the array is
The th bin (cluster) is defined as for . The energy function defined on to be optimized is
3. A Sticker DNA Model
First we recall some standard operations of DNA computing as shown in . They are merge, amplify, detect, separate, and append.
(i) merge: $\mathrm{merge}(T_1, T_2) = T_1 \cup T_2$. Two given tubes $T_1$ and $T_2$ are combined into one without changing the strands they contain.
(ii) amplify: Given a tube $T$, amplify produces two copies of $T$ and then makes $T$ empty.
(iii) detect: Given a tube $T$, return true if $T$ contains at least one DNA strand; otherwise return false.
(iv) separate: $+(T, w)$ and $-(T, w)$. Given a tube $T$ and a word $w$, a new tube $+(T, w)$ (or $-(T, w)$) is produced with the strands in $T$ which contain $w$ as a substring (resp., do not contain $w$).
(v) append: Given a tube $T$ and a word $w$, append affixes $w$ at the end of each sequence in $T$.
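To make the five operations concrete, the following sketch simulates them in software. It is only an illustration under stated assumptions: a tube is modeled as a Python list of strings over the alphabet A, C, G, T, and the function names mirror the operation names but are otherwise hypothetical. No laboratory behaviour is modeled.

```python
def merge(t1, t2):
    """Combine two tubes into one; the strands themselves are unchanged."""
    return t1 + t2


def amplify(tube):
    """Return two copies of the tube and leave the original tube empty."""
    copy1, copy2 = list(tube), list(tube)
    tube.clear()
    return copy1, copy2


def detect(tube):
    """True if the tube contains at least one strand."""
    return len(tube) > 0


def separate(tube, word):
    """Split into strands that contain `word` and strands that do not."""
    plus = [s for s in tube if word in s]
    minus = [s for s in tube if word not in s]
    return plus, minus


def append(tube, word):
    """Affix `word` at the end of every strand in the tube."""
    return [s + word for s in tube]


# Hypothetical strands, chosen only for illustration.
tube = ["ACGT", "TTGA", "GGCC"]
plus, minus = separate(tube, "TG")
extended = append(plus, "AA")
copy1, copy2 = amplify(tube)
```

After this run, `plus` holds the single strand containing `TG`, `extended` holds that strand with `AA` affixed, and `detect(tube)` is false because amplify emptied the original tube.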
The sticker model is based on the paradigm of Watson-Crick complementarity and was first proposed by Roweis et al. There are two kinds of single-stranded DNA molecules in this model: memory strands and sticker strands. A memory strand is $N$ bases in length and contains $K$ nonoverlapping substrands, each of which is $M$ bases long, where $N \geq MK$. A sticker is $M$ bases long and complementary to exactly one of the $K$ substrands in the memory strand. A specific substrand of a memory strand is either on or off and is called a bit. If a sticker is annealed to its matching substrand on a memory strand, then that substrand is said to be on; otherwise it is said to be off. These partially double-stranded molecules are called memory complexes.
The basic operations of the sticker model are merge, separate, set, and clear, listed as follows. Among these, merge is exactly the standard operation described above.
(i) separate: $+(T, i)$ and $-(T, i)$. Given a tube $T$, a new tube $+(T, i)$ (or $-(T, i)$) is produced containing the memory complexes of $T$ with the $i$th bit on (resp., off).
(ii) set: $\mathrm{set}(T, i)$. A new tube is produced from $T$ by turning the $i$th bit on.
(iii) clear: $\mathrm{clear}(T, i)$. A new tube is produced from $T$ by turning the $i$th bit off.
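The sticker operations can be sketched in the same simulated setting. The code below assumes that a memory complex is a tuple of bits (1 for on, 0 for off) and that bit positions are 0-based, which may differ from the indexing used in the paper. The names `separate_bit`, `set_bit`, and `clear_bit` are hypothetical.

```python
from itertools import product


def separate_bit(tube, i):
    """Split a tube into complexes with bit i on and complexes with bit i off."""
    on = [m for m in tube if m[i] == 1]
    off = [m for m in tube if m[i] == 0]
    return on, off


def set_bit(tube, i):
    """Turn bit i on in every memory complex of the tube."""
    return [m[:i] + (1,) + m[i + 1:] for m in tube]


def clear_bit(tube, i):
    """Turn bit i off in every memory complex of the tube."""
    return [m[:i] + (0,) + m[i + 1:] for m in tube]


# A tube holding every 3-bit memory complex.
tube = list(product((0, 1), repeat=3))
on, off = separate_bit(tube, 0)
all_on = set_bit(off, 0)
all_off = clear_bit(on, 0)
```

Here `on` and `off` each hold four complexes, and applying `set_bit` to `off` produces exactly the complexes found in `on`.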
Now we consider a test tube consisting memory complexes . We define the length of as the number of bits, that is, the number of substrands (stickers) contained in denoted by . Each numerical value is represented by -bit stickers, where is a constant designed for a certain problem. For a -bit stickers , the corresponding numerical value is denoted by . The substring in a memory complex from the th bit to the th bit of is denoted by , where is an integer with . Apart from the basic operations, we need more operations designed and inspired by Alonso Sanches and Soma  in order to handle numerical computations.(i)increment: . For each , generate a strand with and let be replaced by the collection of such new strands .(ii)add: . For each , generate a strand with and let be replaced by the collection of such new strands . Here is an integer in binary form with length .(iii)compare: . For each , if , then let ; if , then let ; else let .(iv)weigh: . For each , if , then let ; if , then let ; else let .(v)clearq: . For each strand in the tube , turn all bits off from th bit to th bit.
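A software sketch may help fix ideas for the numerical operations. It covers only increment, add, and clearq, and rests on several assumptions that are not taken from the text: a value occupies `q` consecutive bits starting at the 0-based position `start`, the most significant bit comes first, and sums wrap around modulo $2^q$. The helper names `value` and `encode` are hypothetical.

```python
def value(bits):
    """Decode a tuple of bits (most significant first) as an integer."""
    v = 0
    for b in bits:
        v = 2 * v + b
    return v


def encode(v, q):
    """Encode an integer as a q-bit tuple, most significant bit first."""
    return tuple((v >> (q - 1 - j)) & 1 for j in range(q))


def add(tube, start, q, a):
    """Add the integer a to the q-bit field of every complex (wraps mod 2**q)."""
    out = []
    for m in tube:
        v = (value(m[start:start + q]) + a) % (2 ** q)
        out.append(m[:start] + encode(v, q) + m[start + q:])
    return out


def increment(tube, start, q):
    """Increase the q-bit field of every complex by one."""
    return add(tube, start, q, 1)


def clearq(tube, start, q):
    """Turn off all q bits of the field in every complex."""
    return [m[:start] + (0,) * q + m[start + q:] for m in tube]


# One complex: a flag bit followed by a 3-bit field holding the value 3.
tube = [(1, 0, 1, 1)]
step1 = increment(tube, start=1, q=3)   # field becomes 4
step2 = add(step1, start=1, q=3, a=2)   # field becomes 6
step3 = clearq(step2, start=1, q=3)     # field becomes 0
```

In a sticker system each of these steps would be composed from separate, set, and clear operations applied bit by bit; the sketch only reproduces their net effect on the stored values.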
4. Sticker Algorithms for a Fixed Number of Clusters
Now we consider solving the spatial clustering problem as described in Section 2, where the number of clusters is fixed, and , for each . A partition of the dataset is denoted by which is an array of integers
For two points we use to denote the Euclidean distance between them. We use to denote the diameter of . Let the dissimilarity measure of be . Now we convert the dissimilarity measure into binary string consisting of “0”s and “1”s. For an acceptable given error rate to measure the dissimilarity , divide the interval into subintervals with equal width . Now choose an integer such that . Then we can use a bits string to represent the subintervals. For let its corresponding string be , where operator is the largest integer without exceeding it. We will use a sticker system with stickers in length that is capable of representing numbers between .
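The conversion of dissimilarities into bit strings can be illustrated with a short worked example. The sketch below is one possible reading of the construction and rests on assumptions: the interval is taken to be $[0, D]$ with $D$ the diameter of the dataset, the subinterval width is the error rate `eps`, the code of a distance $d$ is $\lfloor d / \varepsilon \rfloor$, and `q` is the smallest number of bits able to hold the largest code. The function name `quantize` and the sample points are hypothetical.

```python
from math import dist, floor


def quantize(points, eps):
    """Return (q, codes, bits): bit length, integer codes, and q-bit strings."""
    n = len(points)
    diameter = max(dist(p, r) for p in points for r in points)
    max_code = floor(diameter / eps)
    q = max(1, max_code.bit_length())  # smallest q with 2**q > max_code
    codes = [[floor(dist(points[i], points[j]) / eps) for j in range(n)]
             for i in range(n)]
    bits = [[format(c, f"0{q}b") for c in row] for row in codes]
    return q, codes, bits


# Hypothetical dataset on a line, so that all distances are exact.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
q, codes, bits = quantize(points, eps=0.5)
for row in bits:
    print(row)
```

With these three points the diameter is 4, so the largest code is 8 and four bits per entry suffice. Accumulating many such codes into one energy value would need additional bits, which this sketch does not address.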
Now we define the dissimilarity matrix as
For the partition , the th bin (cluster) is defined as for . A partition is called feasible if for . The energy of a partition defined by (9) has the following form:
For an integer , we use to represent the subsequence of stickers corresponding to . Conversely, if is a sequence, we use to denote the numerical value decoded by the -bit sticker . By the sticker model , a memory complex is designed as the coding of :
Then we append stickers representing and . Finally we append stickers to store the cardinality of clusters. The structure of stickers for our problem is shown in Figure 1. The clustering algorithm consists of four steps as shown in Algorithm 2.
5. Sticker Algorithms for a Variable Number of Clusters
In this section we consider cluster analysis when is a variable. In this case a partition of the dataset is denoted by which is an array of integers
Similar to the previous section, the th bin (cluster) is defined as for . Notice that may be empty and the final number of clusters is the counting of nonempty clusters. The energy of a partition defined by (11) has the following form:
Now the coding is . Then we append numbers of bits, that is, stickers representing and . Next we append values to store the counting number of the clusters. Finally we append a value to store the number of valid (nonempty) clusters. The structure of stickers in this case is shown in Figure 2.
The clustering algorithm consists of four steps as shown in Algorithm 3.
6. Conclusion
In this paper we presented a new DNA-based technique for spatial cluster analysis. Two cases are considered: the number of clusters is either predefined or left undetermined. If we take the scale of data , and the length of bits for a sticker , as variables, then clearly Algorithm 1 has a time complexity of . Among the four steps of Algorithm 2, the operator generate has a time complexity of , and the operator energy has a complexity of . The remaining two operators both have a complexity of . Thus the total time complexity for a fixed number of clusters is . In the other case, when is dynamic, the time complexities of the four steps change to , , , and . Hence the total complexity is . The reason why our complexity is worse than that of  (of course for a different problem) is that the summation of dissimilarities is time consuming. It would be interesting to reduce this complexity to .
Finally, we point out that, to the best of the authors' knowledge, this is the first research on cluster analysis by sticker DNA systems. It provides an alternative solution for this traditional knowledge engineering problem, which is not combinatorial in nature. Since most applications of DNA computing address combinatorial problems, this remains an interesting direction.
This research is supported by the Natural Science Foundation of China (nos. 61170038 and 60873058), the Natural Science Foundation of Shandong Province (no. ZR2011FM001), and the Shandong Soft Science Major Project (no. 2010RKMA2005).
References
- J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier, Singapore, 2nd edition, 2006.
- R. T. Ng and J. Han, “CLARANS: a method for clustering objects for spatial data mining,” IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 5, pp. 1003–1016, 2002.
- N. Bouguila, “A model-based approach for discrete data clustering and feature weighting using MAP and stochastic complexity,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 12, pp. 1649–1664, 2009.
- X. Wang, X. Wang, and D. M. Wilkes, “A divide-and-conquer approach for minimum spanning tree-based clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 7, pp. 945–958, 2009.
- X. F. Wang and D. S. Huang, “A novel density-based clustering framework by using level set method,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1515–1531, 2009.
- L. M. Adleman, “Molecular computation of solutions to combinatorial problems,” Science, vol. 266, no. 5187, pp. 1021–1024, 1994.
- R. J. Lipton, “DNA solution of hard computational problems,” Science, vol. 268, no. 5210, pp. 542–545, 1995.
- G. Păun, G. Rozenberg, and A. Salomaa, DNA Computing. New Computing Paradigms, Texts in Theoretical Computer Science. An EATCS Series, Springer, Berlin, Germany, 1998.
- C. A. Alonso Sanches and N. Y. Soma, “A polynomial-time DNA computing solution for the bin-packing problem,” Applied Mathematics and Computation, vol. 215, no. 6, pp. 2055–2062, 2009.
- S. Roweis, E. Winfree, R. Burgoyne et al., “A sticker based model for DNA computation,” Journal of Computational Biology, vol. 5, no. 4, pp. 615–629, 1998.
Copyright © 2013 Xiyu Liu and Jie Xue. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.