Abstract

Spatial cluster analysis is an important data mining task. Typical techniques include CLARANS, density- and gravity-based clustering, and other algorithms based on the traditional von Neumann computing architecture. The purpose of this paper is to propose a technique for spatial cluster analysis based on sticker systems of DNA computing. We adopt the Bin-Packing Problem idea and then design sticker-programming algorithms. The proposed technique has a favorable time complexity. In the case when only the intracluster dissimilarity is taken into account, this time complexity is polynomial in the number of data points, in contrast to the NP-complete nature of spatial cluster analysis. The new technique provides an alternative to traditional cluster analysis methods.

1. Introduction

Spatial cluster analysis is a traditional problem in knowledge discovery from databases [1]. It has wide applications, since increasingly large amounts of data obtained from satellite images, X-ray crystallography, or other automatic equipment are stored in spatial databases. The most classical spatial clustering technique is due to Ng and Han [2], who developed a variant of the PAM algorithm called CLARANS, and new techniques are continually proposed in the literature aiming to reduce the time complexity or to accommodate more complicated cluster shapes.

For example, Bouguila [3] proposed model-based methods for unsupervised discrete feature selection. Wang et al. [4] developed techniques to detect clusters with irregular boundaries by a minimum spanning tree-based clustering algorithm. By using an efficient implementation of the cut and cycle properties of minimum spanning trees, they obtain a running time better than $O(n^2)$, where $n$ is the number of data points. In another paper, Wang and Huang [5] developed a new density-based clustering framework using a level set approach. By a valley-seeking method, data points are grouped into their corresponding clusters.

Adleman [6] and Lipton [7] pioneered a new era of DNA computing in 1994 with experiments demonstrating that the tools of laboratory molecular biology could be used to solve computational problems. Based on Adleman's and Lipton's research, a number of applications of DNA computing to combinatorially complex problems, such as factorization, graph theory, control, and nanostructures, have emerged. Theoretical studies have also appeared, including DNA computers that are programmable, autonomous computing machines whose hardware consists of biological molecules; see [8] for details.

According to Păun et al. [8], common systems in DNA computing include the sticker system, the insertion-deletion system, the splicing system, and H systems. Among these, the sticker system can represent bits in a way similar to silicon computer memory. In a recent work, Alonso Sanches and Soma [9] proposed an algorithm based on the sticker model of DNA computing [10] to solve the Bin-Packing Problem (BPP), which is NP-hard in the strong sense. The authors show that their proposed algorithms have time complexities bounded by a polynomial in the number $q$ of items to be put into the bins, and theirs is the first attempt to use DNA computing for the Bin-Packing Problem.

Inspired by the work of Alonso Sanches and Soma [9], we propose in this paper a new DNA computing approach for spatial cluster analysis based on the Bin-Packing Problem technique. The basic idea is to treat clusters as bins and to allocate data points into the bins. In order to evaluate a clustering, we need to accumulate the dissimilarities within clusters, and the sticker system allows us to accomplish these tasks. We also show that, in the case when only the intracluster dissimilarity is considered, our algorithm has a time complexity that is polynomial in the number of data points, although cluster analysis itself is NP-complete. To the best of our knowledge, the method presented here is new in cluster analysis.

The rest of this paper is organized as follows: in Section 2, we present the Bin-Packing Problem formulation of the spatial clustering problem used in this paper. In Section 3, some basic facts about the sticker model are presented, together with the implementation of some new operations. The following two sections are devoted to the coding of the problem and the clustering algorithms in the sticker system. Finally, a brief conclusion is drawn.

2. Formulation of the Problem

Let $\mathbb{R}^d$ be the real Euclidean space of dimension $d$. A subset $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^d$ is called a spatial dataset with $n$ points, where $x_i \in \mathbb{R}^d$ for each $i = 1, 2, \ldots, n$. A clustering problem over $X$ is to group the dataset into $k$ partitions called clusters, where the intracluster similarity is maximal and the intercluster similarity is minimal. In this sense, clustering is an optimization process on two levels: one of maximization and one of minimization. Here the integer $k$ indicates the number of clusters. There are two kinds of clustering when we consider $k$ as a parameter. The first kind is fixed-number clustering, where the number $k$ of clusters is determined a priori. The second kind is flexible clustering, where the number $k$ is chosen as one of the parameters of the two-level optimization problem.

Now we denote a partition of $X$ by $C = \{C_1, C_2, \ldots, C_k\}$ with $\bigcup_{i=1}^{k} C_i = X$ and $C_i \cap C_j = \emptyset$ for $i \neq j$. If we define $d(C_i)$ as the intracluster dissimilarity measure for $C_i$ and $s(C_i, C_j)$ as the intercluster similarity measure for $C_i$ and $C_j$, then the two kinds of clustering problems are formulated as follows: minimize $\sum_{i=1}^{k} d(C_i)$ and $\sum_{i \neq j} s(C_i, C_j)$ over all partitions $C$ for a fixed number $k$ of clusters, and minimize the same pair of objectives over both $C$ and $k$ when the number of clusters is flexible.

To simplify the multiplicity of optimization, we often use a variation of the above problems.

Next we only consider the first kind of clustering problem or its variation above. In order to unite the two optimization formulas, we introduce the total energy function

$E(C) = \sum_{i=1}^{k} d(C_i) + \sum_{i \neq j} s(C_i, C_j).$

For the purpose of this paper, we will use a simplified version of the total energy in which only the intracluster dissimilarity is retained:

$E(C) = \sum_{i=1}^{k} d(C_i).$
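As a concrete illustration of this simplified total energy, taking the intracluster dissimilarity $d(C_i)$ to be the sum of pairwise Euclidean distances inside $C_i$ (the specialization used later in Section 4), the following short Python sketch evaluates it on a toy dataset; the function name and the sample data are ours and are not part of the sticker model.

# Illustrative sketch (not the DNA procedure): the simplified total energy,
# read here as the sum of pairwise Euclidean distances inside each cluster.
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)

def total_energy(points, clusters):
    # points: list of coordinate tuples; clusters: list of lists of point indices
    energy = 0.0
    for cluster in clusters:
        for p, q in combinations(cluster, 2):      # every unordered pair in the cluster
            energy += dist(points[p], points[q])   # intracluster dissimilarity only
    return energy

# Toy dataset with an obvious two-cluster structure
X = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
print(total_energy(X, [[0, 1], [2, 3]]))   # low energy: points grouped naturally
print(total_energy(X, [[0, 2], [1, 3]]))   # high energy: the clusters are mixed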

In the case when the number of clusters is a variable, the total energy is computed over the nonempty clusters, and the optimized number of clusters is the count of nonempty bins:

$E(C) = \sum_{C_i \neq \emptyset} d(C_i), \qquad k^{*} = \#\{\, i : C_i \neq \emptyset \,\}.$

We now propose a Bin-Packing Problem (BPP) formulation of the clustering problem stated above. The classical one-dimensional BPP is given as a set of $q$ items with respective weights $w_1, w_2, \ldots, w_q$; the aim is to allocate all items into bins of equal capacity using a minimum number of bins [9]. For the clustering purpose we assume that there are $k$ empty bins and we allocate all $n$ data points into the bins with least energy. If we consider $k$ as a variable, then the problem is to allocate the $n$ points into at most $n$ bins with least energy. The capacity restriction is removed. For the two cases of clustering, there are altogether $k^n$ (resp., $n^n$) combinations of allocation, and the best solution can be achieved by brute-force search, as sketched below.
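The brute-force search just mentioned can be sketched in conventional software as follows; this is only meant to make the $k^n$ enumeration concrete, with the energy form and the nonemptiness requirement assumed as in the simplified formulation, and all names are ours.

# Sketch of the brute-force search over all k**n allocations that the DNA
# algorithm parallelizes; the energy form is the assumed intracluster sum.
from itertools import product, combinations
from math import dist

def best_allocation(points, k):
    n = len(points)
    best = (float("inf"), None)
    for alloc in product(range(k), repeat=n):          # every way to drop n points into k bins
        bins = [[i for i in range(n) if alloc[i] == b] for b in range(k)]
        if any(len(b) == 0 for b in bins):             # feasibility: every bin (cluster) nonempty
            continue
        e = sum(dist(points[p], points[q])
                for b in bins for p, q in combinations(b, 2))
        if e < best[0]:
            best = (e, alloc)
    return best

X = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
print(best_allocation(X, 2))   # expected: points 0, 1 in one bin and 2, 3 in the other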

First we consider the case when $k$ is fixed. To solve the problem, we consider an array of $n$ integers

$c = (c_1, c_2, \ldots, c_n), \qquad c_i \in \{1, 2, \ldots, k\}.$

The $j$th bin (cluster) is defined as $C_j = \{x_i : c_i = j\}$ for $j = 1, 2, \ldots, k$. We will identify the allocation $c$ with its corresponding partition; therefore the energy function is defined on the set of all allocations. In order to guarantee that the bins are nonempty, we need to add the restriction that $\#C_j \geq 1$ for $j = 1, 2, \ldots, k$, where $\#C_j$ denotes the cardinality of the set $C_j$.
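A minimal sketch of this allocation-to-bin mapping and of the nonemptiness check follows; the helper name is hypothetical.

# Mapping an allocation array c (with c[i] in {1,...,k}) to its bins and
# checking that every bin is nonempty.
def allocation_to_bins(c, k):
    bins = {j: [] for j in range(1, k + 1)}
    for i, label in enumerate(c):
        bins[label].append(i)                 # point i goes into bin (cluster) label
    return bins

c = [1, 1, 2, 3, 2]                           # five points, three bins
bins = allocation_to_bins(c, 3)
print(bins)                                   # {1: [0, 1], 2: [2, 4], 3: [3]}
print(all(len(b) >= 1 for b in bins.values()))   # True: the allocation is feasible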

Then the final problem is

$\min_{c} E(c) \quad \text{subject to } \#C_j \geq 1 \text{ for } j = 1, 2, \ldots, k.$

Next, when $k$ is a variable, the array is

$c = (c_1, c_2, \ldots, c_n), \qquad c_i \in \{1, 2, \ldots, n\}.$

The $j$th bin (cluster) is defined as $C_j = \{x_i : c_i = j\}$ for $j = 1, 2, \ldots, n$. The energy function defined on the set of all allocations, to be optimized, is

$\min_{c} E(c) = \sum_{C_j \neq \emptyset} d(C_j).$

3. A Sticker DNA Model

First we recall some standard operations of DNA computing as given in [8]: merge, amplify, detect, separate, and append.
(i) merge$(T_1, T_2)$: two given tubes $T_1$ and $T_2$ are combined into one without changing the strands they contain.
(ii) amplify$(T, T_1, T_2)$: given a tube $T$, produce two copies $T_1$ and $T_2$ of $T$ and then make $T$ empty.
(iii) detect$(T)$: given a tube $T$, return true if $T$ contains at least one DNA strand, and false otherwise.
(iv) separate$(T, w)$: given a tube $T$ and a word $w$, a new tube $+(T, w)$ (or $-(T, w)$) is produced with the strands in $T$ which contain $w$ as a substring (resp., do not contain $w$).
(v) append$(T, w)$: given a tube $T$ and a word $w$, affix $w$ at the end of each sequence in $T$.

The sticker model is based on the paradigm of Watson-Crick complementarity and was first proposed in [10]. There are two kinds of single-stranded DNA molecules in this model, the memory strands and the sticker strands. A memory strand is $N$ bases in length and contains $K$ nonoverlapping substrands, each of which is $M$ bases long, where $N \geq KM$ [8]. A sticker is $M$ bases long and complementary to exactly one of the substrands in the memory strand. A specific substrand of a memory strand is either on or off and is called a bit. If a sticker is annealed to its matching substrand on a memory strand, then that particular substrand is said to be on; otherwise it is said to be off. These partially double-stranded molecules are called memory complexes.
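The on/off bit view of a memory complex can be mimicked in ordinary software as in the sketch below; the class and method names are ours and only mirror the annealing idea, not the chemistry.

# A memory complex with m substrands modeled as m bits; annealing a sticker
# to substrand i turns bit i on.
class MemoryComplex:
    def __init__(self, m):
        self.bits = [0] * m          # all substrands initially off (no stickers annealed)

    def anneal(self, i):
        self.bits[i] = 1             # sticker annealed: bit i is now on

    def __repr__(self):
        return "".join(map(str, self.bits))

strand = MemoryComplex(8)
strand.anneal(0)
strand.anneal(3)
print(strand)                        # 10010000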

The basic operations of the sticker model are merge, separate, set, and clear, listed as follows [8]. Among these, merge is exactly the standard operation shown before.
(i) separate$(T, i)$: given a tube $T$, a new tube $+(T, i)$ (or $-(T, i)$) is produced containing the strands with the $i$th bit on (resp., off).
(ii) set$(T, i)$: a new tube is produced from $T$ by turning the $i$th bit on in every strand.
(iii) clear$(T, i)$: a new tube is produced from $T$ by turning the $i$th bit off in every strand.
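A software analogy of these basic operations, with a tube modeled as a list of bit-vectors (one per memory complex), might look as follows; the function names are ours.

# merge, separate, set, and clear on a simulated tube.
def merge(t1, t2):
    return t1 + t2                                    # combine the strands of two tubes

def separate(tube, i):
    on = [s for s in tube if s[i] == 1]               # strands with bit i on
    off = [s for s in tube if s[i] == 0]              # strands with bit i off
    return on, off

def set_bit(tube, i):
    return [s[:i] + [1] + s[i + 1:] for s in tube]    # turn bit i on in every strand

def clear_bit(tube, i):
    return [s[:i] + [0] + s[i + 1:] for s in tube]    # turn bit i off in every strand

tube = [[0, 1, 0], [1, 1, 0]]
on, off = separate(tube, 0)
print(on, off)                                        # [[1, 1, 0]] [[0, 1, 0]]
print(set_bit(tube, 2))                               # [[0, 1, 1], [1, 1, 1]]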

Now we consider a test tube $T$ consisting of memory complexes $\sigma$. We define the length of $\sigma$ as the number of bits, that is, the number of substrands (stickers) contained in $\sigma$, denoted by $|\sigma|$. Each numerical value is represented by a $p$-bit sticker field, where $p$ is a constant designed for a certain problem. For a $p$-bit field $s$, the corresponding numerical value is denoted by $v(s)$. The substring in a memory complex $\sigma$ from the $i$th bit to the $j$th bit is denoted by $\sigma[i..j]$, where $i$ and $j$ are integers with $1 \leq i \leq j \leq |\sigma|$. Apart from the basic operations, we need more operations, designed and inspired by Alonso Sanches and Soma [9], in order to handle numerical computations.
(i) increment$(T, i, j)$: for each $\sigma \in T$, generate a strand $\sigma'$ with $v(\sigma'[i..j]) = v(\sigma[i..j]) + 1$ and let $T$ be replaced by the collection of such new strands $\sigma'$.
(ii) add$(T, i, j, a)$: for each $\sigma \in T$, generate a strand $\sigma'$ with $v(\sigma'[i..j]) = v(\sigma[i..j]) + a$ and let $T$ be replaced by the collection of such new strands $\sigma'$. Here $a$ is an integer in binary form with length at most $p$.
(iii) compare$(T, i, j, i', j')$: for each $\sigma \in T$, if $v(\sigma[i..j]) > v(\sigma[i'..j'])$, then place $\sigma$ in the tube $T_{>}$; if $v(\sigma[i..j]) < v(\sigma[i'..j'])$, then place $\sigma$ in $T_{<}$; else place $\sigma$ in $T_{=}$.
(iv) weigh$(T, i, j, a)$: for each $\sigma \in T$, if $v(\sigma[i..j]) > a$, then place $\sigma$ in the tube $T_{>}$; if $v(\sigma[i..j]) < a$, then place $\sigma$ in $T_{<}$; else place $\sigma$ in $T_{=}$.
(v) clearq$(T, i, j)$: for each strand $\sigma$ in the tube $T$, turn all bits off from the $i$th bit to the $j$th bit.
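Assuming the field convention described above, where a numerical value occupies $p$ consecutive bits read as an unsigned binary number, the numeric helpers can be imitated in software as in this sketch; add and clearq follow the stated semantics, while read_value and write_value are auxiliary names of ours.

# Numeric operations on a p-bit field starting at position lo of a strand.
def read_value(strand, lo, p):
    return int("".join(map(str, strand[lo:lo + p])), 2)   # decode p bits into an integer

def write_value(strand, lo, p, value):
    bits = [int(b) for b in format(value, "0{}b".format(p))]   # assumes value fits in p bits
    return strand[:lo] + bits + strand[lo + p:]

def add(strand, lo, p, c):
    return write_value(strand, lo, p, read_value(strand, lo, p) + c)   # add the constant c

def clearq(strand, lo, p):
    return strand[:lo] + [0] * p + strand[lo + p:]        # turn the whole field off

s = [0] * 8                      # one strand holding a single 8-bit field
s = add(s, 0, 8, 5)
s = add(s, 0, 8, 7)
print(read_value(s, 0, 8))       # 12
print(clearq(s, 0, 8))           # [0, 0, 0, 0, 0, 0, 0, 0]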

We only give DNA algorithms for weigh and clearq, as the other algorithms are presented in Alonso Sanches and Soma [9]. Suppose the binary digits of the integer $a$ are $a_1 a_2 \cdots a_p$ (see Algorithm 1).

weigh: the designated field is scanned from its most significant bit; at each position the tube is separated on the current bit and the resulting tubes are merged into $T_{>}$ or $T_{<}$ according to the corresponding binary digit $a_r$ of $a$, and the loop stops when the field is exhausted or detect reports an empty tube.
clearq: the clear operation is applied repeatedly to turn off every bit of the designated field, from the $i$th bit to the $j$th bit.
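Since only the overall loop structure of the listing is available, the following Python sketch shows one plausible bit-by-bit realization of the comparison behind weigh, scanning the field against the binary digits of $a$ from the most significant bit; it is an assumption consistent with the description above, not the authors' exact procedure.

# Bitwise comparison of a p-bit field (most significant bit first) with a constant c.
def weigh(strand, lo, p, c):
    c_bits = [int(b) for b in format(c, "0{}b".format(p))]
    for j in range(p):                         # scan from the most significant bit
        s_bit = strand[lo + j]
        if s_bit != c_bits[j]:                 # the first differing bit decides the outcome
            return "greater" if s_bit > c_bits[j] else "less"
    return "equal"

strand = [0, 1, 1, 0]                          # a 4-bit field holding the value 6
print(weigh(strand, 0, 4, 5))                  # greater
print(weigh(strand, 0, 4, 6))                  # equal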

4. Sticker Algorithms for Fixed $k$

Now we consider solving the spatial clustering problem described in Section 2, where the number $k$ of clusters is fixed. A partition of the dataset is denoted by $c$, which is an array of $n$ integers

$c = (c_1, c_2, \ldots, c_n), \qquad c_i \in \{1, 2, \ldots, k\} \text{ for each } i.$

For two points $x_i, x_j \in X$ we use $d(x_i, x_j)$ to denote the Euclidean distance between them. We use $R$ to denote the diameter of $X$, that is, the largest pairwise distance. Let the dissimilarity measure of a cluster be the accumulated pairwise distances within it. Now we convert the dissimilarity measure into binary strings consisting of 0s and 1s. For an acceptable given error rate $\varepsilon > 0$ in measuring the dissimilarity, divide the interval $[0, R]$ into subintervals of equal width $\varepsilon$. Now choose an integer $p$ such that $2^p \geq R/\varepsilon$. Then we can use a $p$-bit string to represent the subintervals. For a distance $d(x_i, x_j)$, let its corresponding string be the binary representation of $\lfloor d(x_i, x_j)/\varepsilon \rfloor$, where the operator $\lfloor \cdot \rfloor$ gives the largest integer not exceeding its argument. We will use a sticker system with stickers of length $p$ that is capable of representing numbers between $0$ and $2^p - 1$.
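The discretization just described can be checked with a small Python sketch; the parameter names R and eps and the clamping into the top subinterval are our assumptions.

# Map a distance in [0, R] to the p-bit code of the subinterval of width eps containing it.
from math import dist, ceil, log2

def encode_distance(d, R, eps):
    p = ceil(log2(R / eps))                     # p bits suffice for all R/eps subintervals
    index = min(int(d // eps), 2 ** p - 1)      # index of the subinterval containing d
    return format(index, "0{}b".format(p))

X = [(0.0, 0.0), (3.0, 4.0)]
R = 10.0                                        # assumed diameter bound of the dataset
print(encode_distance(dist(X[0], X[1]), R, eps=0.1))   # 7-bit code for the distance 5.0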

Now we define the dissimilarity matrix as

$D = \big( d(x_i, x_j) \big)_{i,j=1}^{n},$

where each entry is stored through its $p$-bit binary code.

For the partition $c$, the $j$th bin (cluster) is defined as $C_j = \{x_i : c_i = j\}$ for $j = 1, 2, \ldots, k$. A partition is called feasible if $\#C_j \geq 1$ for $j = 1, 2, \ldots, k$. The energy of a partition, in the simplified form introduced in Section 2, then takes the following form:

$E(c) = \sum_{j=1}^{k} \sum_{p < q,\; c_p = c_q = j} d(x_p, x_q).$

For an integer $a$, we use $\mathrm{bin}(a)$ to represent the subsequence of stickers corresponding to $a$. Conversely, if $s$ is a sequence of stickers, we use $v(s)$ to denote the numerical value decoded from the bits of $s$. By the sticker model [9], a memory complex is designed as the coding of the allocation $c$:

$\mathrm{bin}(c_1)\,\mathrm{bin}(c_2)\cdots\mathrm{bin}(c_n).$

Then we append stickers representing the energy values. Finally we append stickers to store the cardinalities of the clusters. The structure of stickers for our problem is shown in Figure 1. The clustering algorithm consists of four steps, as shown in Algorithm 2.

 (a) generate: Generate multiple copies of all the $k^n$ combinations of allocations as memory strands. Append the stickers
   carrying the position numbers of the data points. Then append the stickers that store the energies.
 (b) energy: Compute the dissimilarities of the clusters and store the energy.
 (c) prune: Discard infeasible partitions, that is, those in which some cluster is empty.
 (d) find: Find the best solution.
Now we present algorithms to implement the above procedures.
 (a) Generation of all the $k^n$ possible solutions. Append values in order to store the energies.
generate: nested loops over the $n$ data points set the allocation bits of the strands in all $k^n$ combinations; the stickers that store the energies and the bin counters are then appended and left switched off.
 (b) Energy computation. The problem is to compute the total of the energies over those pairs of points lying in the same bin,
   that is, the pairs with $c_i = c_j$, by accumulating the encoded distances $d(x_i, x_j)$. The total energy is stored in the energy sticker. At the same time,
   the counting number of each bin is stored in the following $k$ stickers.
energy: a double loop runs over all pairs of data points; for each pair the tube is separated on the allocation bits, and in the strands where the two points share a bin the encoded distance $d(x_i, x_j)$ is added to the energy sticker and the counter of that bin is incremented.
 (c) The third step is to eliminate infeasible partitions. This is done by checking the last $k$ stickers, which store the bin cardinalities.
prune: the tube is separated on the bin-counter stickers, and the strands in which some bin counter remains zero are discarded.
 (d) The last step is to find the best solution with the least energy. If detect returns a positive answer in the final step,
   then we obtain the optimal solution.
find: the energy sticker is scanned from its most significant bit downwards; at each position the tube is separated, and the strands with that bit off are retained whenever detect confirms that such strands exist, so that only the strands of least energy remain.
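To make the four steps concrete, the following Python sketch is a plain software analogue of generate, energy, prune, and find for fixed $k$; strands are simulated as tuples, so it reproduces only the logic, not the massive parallelism of the molecular operations, and all names are ours.

# Software analogue of the four steps (generate, energy, prune, find) for fixed k.
from itertools import product, combinations
from math import dist

def cluster_fixed_k(points, k):
    n = len(points)
    tube = list(product(range(1, k + 1), repeat=n))             # (a) generate all k**n allocations
    scored = []
    for alloc in tube:                                           # (b) energy: accumulate intracluster distances
        bins = {j: [i for i in range(n) if alloc[i] == j] for j in range(1, k + 1)}
        energy = sum(dist(points[p], points[q])
                     for b in bins.values() for p, q in combinations(b, 2))
        counts = [len(bins[j]) for j in range(1, k + 1)]
        scored.append((alloc, energy, counts))
    feasible = [s for s in scored if all(c >= 1 for c in s[2])]  # (c) prune partitions with empty clusters
    return min(feasible, key=lambda s: s[1])                     # (d) find the least-energy strand

X = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
alloc, energy, counts = cluster_fixed_k(X, 2)
print(alloc, round(energy, 3), counts)     # e.g. (1, 1, 2, 2) with the two natural clusters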

5. Sticker Algorithms for Variable $k$

In this section we consider cluster analysis when $k$ is a variable. In this case a partition of the dataset is denoted by $c$, which is an array of $n$ integers

$c = (c_1, c_2, \ldots, c_n), \qquad c_i \in \{1, 2, \ldots, n\}.$

Similar to the previous section, the $j$th bin (cluster) is defined as $C_j = \{x_i : c_i = j\}$ for $j = 1, 2, \ldots, n$. Notice that $C_j$ may be empty, and the final number of clusters is the count of nonempty clusters. The energy of a partition, in the variable-$k$ form introduced in Section 2, then takes the following form:

$E(c) = \sum_{C_j \neq \emptyset} \sum_{p < q,\; c_p = c_q = j} d(x_p, x_q).$

Now the coding of the allocation $c$ is $\mathrm{bin}(c_1)\,\mathrm{bin}(c_2)\cdots\mathrm{bin}(c_n)$. Then we append stickers representing the energy values. Next we append $n$ values to store the counting numbers of the clusters. Finally we append a value to store the number of valid (nonempty) clusters. The structure of stickers in this case is shown in Figure 2.

The clustering algorithm consists of four steps as shown in Algorithm 3.

 (a) generate: Generate multiple copies of all the $n^n$ combinations of allocations as memory strands. Append the stickers
   carrying the position numbers of the data points. Then append the stickers that store the energies.
 (b) energy: Compute the dissimilarities of the possible clusters and store the energy.
 (c) find: Find the best solution.
 (d) count: Count the number of clusters.
Now we present algorithms to implement the above procedures.
 (a) Generation of all the $n^n$ possible solutions. Append values in order to store the energies.
generate: nested loops over the $n$ data points set the allocation bits of the strands in all $n^n$ combinations; the stickers that store the energies, the bin counters, and the cluster count are then appended and left switched off.
 (b) Energy computation. The problem is to compute the total of the energies over those pairs of points lying in the same bin,
  that is, the pairs with $c_i = c_j$, by accumulating the encoded distances $d(x_i, x_j)$. The total energy is stored in the energy sticker. At the same time,
   the counting number of each bin is stored in the following $n$ stickers.
energy: a double loop runs over all pairs of data points; for each pair the tube is separated on the allocation bits, and in the strands where the two points share a bin the encoded distance $d(x_i, x_j)$ is added to the energy sticker and the counter of that bin is incremented.
 (c) The next step is to find the best solution with the least energy. If detect returns a positive answer in the final step,
   then we obtain the optimal solution. The final number of clusters is stored in the last sticker.
find: the energy sticker is scanned from its most significant bit downwards; at each position the tube is separated, and the strands with that bit off are retained whenever detect confirms that such strands exist, so that only the strands of least energy remain.
 (d) The final step is to count the number of clusters; the count is stored in the last sticker.
count: the bin-counter stickers are examined one after another; whenever detect confirms a nonempty bin, the cluster-count sticker is incremented.
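In software terms, the count step reduces to counting the distinct bin labels that actually occur in the winning allocation, as in this small sketch (the function name is ours).

# Number of nonempty clusters in an allocation.
def count_clusters(alloc):
    return len(set(alloc))                  # distinct bin labels actually used

print(count_clusters((1, 1, 3, 3, 3)))      # 2 nonempty clusters among the available bins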

6. Conclusion

In this paper we presented a new DNA-based technique for spatial cluster analysis. Two cases, in which the number of clusters is predefined and in which it is left undetermined, are considered. If we take the scale of the data, $n$, and the length in bits of a sticker field, $p$, as variables, then Algorithm 1 clearly has a time complexity proportional to the length $p$ of the field it scans. Among the four steps of Algorithm 2, the operator energy, which accumulates the pairwise dissimilarities, dominates the running time, so the total time complexity for a fixed number of clusters is polynomial in $n$ and $p$. In the other case, when $k$ is dynamic, the complexities of the four algorithms change accordingly, and the total complexity remains polynomial in $n$ and $p$. The reason why our complexity is worse than that of [9] (of course, for a different problem) is that the summation of the dissimilarities is time consuming. It would be interesting to reduce this complexity further.

Finally, we point out that, to the best of the authors' knowledge, this is the first research on cluster analysis by sticker DNA systems. It provides an alternative solution to this traditional knowledge engineering problem, which is not combinatorial in nature. Compared with the many applications of DNA computing that address mainly combinatorial problems, this is still of interest.

Acknowledgments

This research is supported by the Natural Science Foundation of China (nos. 61170038 and 60873058), the Natural Science Foundation of Shandong Province (no. ZR2011FM001), and the Shandong Soft Science Major Project (no. 2010RKMA2005).