Abstract

We consider some modifications of the neural gas algorithm. First, fuzzy assignments as known from fuzzy c-means and neighborhood cooperativeness as known from self-organizing maps and neural gas are combined to obtain a basic Fuzzy Neural Gas. Further, a kernel variant and a simulated annealing approach are derived. Finally, we introduce a fuzzy extension of the ConnIndex to obtain an evaluation measure for clusterings based on fuzzy vector quantization.

1. Introduction

Prototype based vector quantization (VQ) is an established method to cluster and compress very large data sets. Prototype based implies that the data are represented by a much smaller number of prototypes. Famous methods are c-means [1], self-organizing maps (SOM) [2], and neural gas (NG) [3]. These methods have in common that each data point is uniquely assigned to its closest prototype. Therefore, they are also called crisp vector quantizers. Yet, in practical applications, data are often overlapping, making it hard to separate clusters. For this kind of data, fuzzy vector quantization algorithms have been developed, for example, fuzzy c-means (FCM) [4] and fuzzy SOM (FSOM) [5]. Now, each data point can be partially assigned to each prototype. The FSOM is an extension of the FCM taking the neighborhood cooperativeness into account. Yet, as common to SOM, this neighborhood is bound to an external topological structure like a grid. In this paper we combine FCM with NG, thus exploiting the advantages of each: fuzziness from FCM and dynamic neighborhood cooperativeness without structural restrictions from NG. Our new approach is called Fuzzy Neural Gas (FNG).

Besides its basic functionality we also introduce some variations of FNG. First, we propose the kernel fuzzy neural gas (KFNG) where we consider differentiable kernels to adapt the metric. This allows the algorithm to operate in the same structural space as support vector machines (SVM) [6], which are known to deliver respectable results [7]. In [6], it has been shown that this modified optimization space is equivalent and isometric to a reproducing kernel Hilbert or Banach space, which proves to be beneficial for unsupervised VQ and hence also for FNG.

For another variant of FNG we were inspired by simulated annealing (SA), a method which allows temporary deterioration of an optimization process to stabilize its long-term behavior. To obtain an SA-like approach, we introduce negative learning and call the new method Pulsing Neural Gas (PNG). The idea can also be transferred to FNG, resulting in the Pulsing Fuzzy Neural Gas (PFNG).

Clustering in general is an ill-posed problem and it is difficult to validate a cluster solution. In particular, the validation of very large data sets, where a cluster might be represented by more than one prototype, turns out to be a challenge. There exist a number of validity measures based on separation and compactness, yet most of them presume that each cluster is represented by exactly one prototype. Taşdemir and Merényi proposed the ConnIndex [8], which is suited to evaluate crisp clusterings where each cluster contains more than one prototype. This ConnIndex takes the neighborhood structure between the learned prototypes into account to transfer the information of the full data set to the cluster validation process. We propose a modification for fuzzy cluster solutions and use this Fuzzy ConnIndex in the experimental section.

In the experimental section, we use three different data sets, an artificial one and two real world problems, to compare the cluster solutions obtained by FNG with those obtained by FCM. For evaluation purposes the Fuzzy ConnIndex is applied. Further, we demonstrate the performance of Pulsing Neural Gas on a checkerboard data set. This type of problem is highly multimodal and usually the algorithms do not find all clusters.

2. Fuzzy Neural Gas

The Fuzzy Neural Gas algorithm is a vector quantizer suitable for overlapping data, resulting in fuzzy cluster solutions. It is a combination of the Neural Gas (NG) algorithm, which incorporates neighborhood relations between data points and prototypes, and the Fuzzy c-Means (FCM), which provides a way to obtain fuzzy data point assignments. In the following section, the NG and the FCM are presented briefly to reproduce the derivation of the FNG originally published in [9]. Besides providing an understanding of the basic functioning of the FNG, the description of the underlying algorithms is also useful in preparation of Section 4, where a fuzzy cluster validation method called Fuzzy ConnIndex (fConn) is presented.

2.1. Neural Gas

The Neural Gas vector quantizer [3] is an approach which utilizes the dynamic neighborhood between the prototypes $w_j \in \mathbb{R}^n$, $j = 1, \ldots, N$, to obtain a clustering of data samples $v_i$, $i = 1, \ldots, M$, from a data set $V \subseteq \mathbb{R}^n$. This neighborhood function is based on a winner ranking of the prototypes for each data point. The rank of prototype $w_j$ is obtained by
$$k_j(v, W) = \sum_{l=1}^{N} H\bigl(d(v, w_j) - d(v, w_l)\bigr) \tag{1}$$
with the Heaviside function $H(x) = 1$ if and only if $x > 0$ and $H(x) = 0$ else, and a dissimilarity measure $d(v, w_j)$ which determines the distance between data point $v$ and prototype $w_j$. Usually the Euclidean distance is used for $d$.

The neighborhood of a data point $v$ is specified by
$$h_\sigma(v, w_j) = \exp\left(-\frac{k_j(v, W)}{\sigma}\right) \tag{2}$$
where the rank $k_j(v, W)$ of prototype $w_j$ is an essential part. For the neighborhood only the prototypes within a certain range $\sigma > 0$ according to their rank are considered, giving the closest prototype the highest emphasis. The neighborhood range $\sigma$ is a free parameter.
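To make the ranking concrete, the following minimal NumPy sketch computes the winner ranks (1) and the neighborhood weights (2) for a single data point under the Euclidean distance; the function names and the concrete value of sigma are our own illustrative choices, not from [3].

```python
import numpy as np

def ng_ranks(v, W):
    """Winner ranks (1): rank 0 for the closest prototype.
    v: data point of shape (n,); W: prototypes of shape (N, n)."""
    d = np.linalg.norm(W - v, axis=1)     # Euclidean distances d(v, w_j)
    return np.argsort(np.argsort(d))      # k_j = number of closer prototypes

def ng_neighborhood(ranks, sigma):
    """Neighborhood weights (2), exponentially decaying in the rank."""
    return np.exp(-ranks / sigma)

v = np.array([0.2, 0.7])
W = np.random.rand(5, 2)                  # five prototypes in 2D
h = ng_neighborhood(ng_ranks(v, W), sigma=1.0)
```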

The neighborhood can be used to calculate the local costs
$$lc(v, w_j) = h_\sigma(v, w_j)\, d(v, w_j) \tag{3}$$
which resemble the local distortions around prototype $w_j$ weighted by the neighborhood cooperativeness.

The Neural Gas cost function, which has to be minimized for optimal clustering, directly embeds the local costs:
$$E_{NG} = \frac{1}{2C(\sigma)} \sum_{j=1}^{N} \int P(v)\, lc(v, w_j)\, dv \tag{4}$$
The normalization constant $C(\sigma)$ depends on $\sigma$, and $P(v)$ is the data density.

The minimization of the cost function (4) is performed by stochastic gradient descent with respect to the prototypes. Given a data point $v$, the prototype update rule yields
$$\Delta w_j = \epsilon\, h_\sigma(v, w_j)\,(v - w_j) \tag{5}$$
where $\epsilon > 0$ is the learning rate [3].
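A single online NG step (5) might then look as follows; this is a sketch under the assumption of the Euclidean distance, with learning rate and neighborhood range chosen arbitrarily (in practice both are annealed over time).

```python
import numpy as np

def ng_update(v, W, eps, sigma):
    """One stochastic NG step (5): every prototype moves towards v,
    weighted by its rank-based neighborhood function (2)."""
    d = np.linalg.norm(W - v, axis=1)
    ranks = np.argsort(np.argsort(d))
    h = np.exp(-ranks / sigma)
    W += eps * h[:, None] * (v - W)       # Delta w_j = eps * h * (v - w_j)
    return W

W = np.random.rand(10, 2)
for v in np.random.rand(1000, 2):         # stream of data points
    W = ng_update(v, W, eps=0.05, sigma=1.5)
```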

After convergence of the algorithm the whole data set is approximated by the set of prototypes. The receptive field of each prototype $w_j$ is defined as
$$\Omega_j = \{v \in V \mid d(v, w_j) \le d(v, w_l)\ \forall l\} \tag{6}$$

For crisp clusterings it has been shown in [3] that the NG algorithm results in better cluster solutions than Self-Organizing Maps (SOM) [2] due to its flexible neighborhood compared to the fixed grid of a SOM.

2.2. Fuzzy c-Means

The Fuzzy c-Means [10] is also a vector quantizer where each cluster is represented by a prototype located in its center of gravity. Yet contrary to NG, a data point can be assigned to more than one prototype. The cost function to minimize is given by
$$E_{FCM} = \sum_{i=1}^{M} \sum_{j=1}^{N} u_{ij}^m\, d(v_i, w_j) \tag{7}$$
where the fuzzy assignment of data point $v_i$ to prototype $w_j$ is described by $u_{ij} \in [0, 1]$. If the restriction $\sum_{j=1}^{N} u_{ij} = 1$ is valid, the clustering is called probabilistic, otherwise possibilistic. The exponent $m > 1$ regulates the fuzziness, and according to [10] it should be set to $m = 2$. Again, the distance $d$ is usually chosen to be the Euclidean distance.

The algorithm itself is an alternating optimization of prototypes and fuzzy assignments. The update of the prototypes is carried out by keeping the assignments fixed, and vice versa the assignments are adapted based on fixed prototypes:
$$w_j = \frac{\sum_{i=1}^{M} u_{ij}^m\, v_i}{\sum_{i=1}^{M} u_{ij}^m} \tag{8}$$
$$u_{ij} = \frac{d(v_i, w_j)^{-\frac{1}{m-1}}}{\sum_{l=1}^{N} d(v_i, w_l)^{-\frac{1}{m-1}}} \tag{9}$$
Since the definition of the receptive field (6) does not reflect the information contained in the fuzzy assignments, we define the fuzzy receptive field as
$$\Omega_j^f = \{v_i \in V \mid u_{ij} \ge u_{il}\ \forall l\} \tag{10}$$
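The alternating optimization can be sketched as below; the closed-form updates assume squared Euclidean distances, and the small constant guarding against division by zero is an implementation detail of ours.

```python
import numpy as np

def fcm(V, N, m=2.0, iters=100, rng=np.random):
    """Alternating FCM optimization: prototype update (8), assignment
    update (9). V: data of shape (M, n); N: number of prototypes."""
    M = len(V)
    U = rng.dirichlet(np.ones(N), size=M)          # random probabilistic init
    for _ in range(iters):
        Um = U ** m
        W = (Um.T @ V) / Um.sum(axis=0)[:, None]   # prototype update (8)
        d = ((V[:, None, :] - W[None, :, :]) ** 2).sum(axis=2) + 1e-12
        U = d ** (-1.0 / (m - 1))
        U /= U.sum(axis=1, keepdims=True)          # enforce sum_j u_ij = 1
    return W, U
```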

2.3. Combining NG and FCM into the Fuzzy Neural Gas

As mentioned above, the Fuzzy Neural Gas can now be obtained by combining NG and FCM. Thereby, the FCM distance $d(v_i, w_j)$ in (7) is replaced by local costs similar to the NG local costs (3):
$$lc_{ij} = \sum_{l=1}^{N} h_\sigma(w_j, w_l)\, d(v_i, w_l) \tag{11}$$
yielding the cost function
$$E_{FNG} = \sum_{i=1}^{M} \sum_{j=1}^{N} u_{ij}^m\, lc_{ij} \tag{12}$$
The local costs (11) take the dynamic neighborhood structure into account according to
$$h_\sigma(w_j, w_l) = c_\sigma \exp\left(-\frac{k_l(w_j, W)}{\sigma}\right)$$
where the value $\sigma > 0$ is the neighborhood range and the constant $c_\sigma$ assures that $\sum_{l=1}^{N} h_\sigma(w_j, w_l) = 1$. For optimal performance $\sigma$ should be decreased adiabatically in the course of optimization. Note that the neighborhood $h_\sigma$, contrary to the NG neighborhood, is based on the winning ranks according to the best matching prototype and not, as known from NG, according to the data. The ranks are calculated similar to (1):
$$k_l(w_j, W) = \sum_{p=1}^{N} H\bigl(d(w_j, w_l) - d(w_j, w_p)\bigr) \tag{13}$$
where again $H$ is the Heaviside function.

Analogous to FCM, the update of the prototypes and the fuzzy assignments follows an alternating optimization scheme to minimize the FNG cost function (12). The update scheme consists of two update steps: updating the prototypes while keeping the fuzzy assignments fixed and updating the assignments while retaining the prototypes. The update rules are obtained by Lagrange optimization taking the side condition $\sum_{j=1}^{N} u_{ij} = 1$ into account.

A batch update considering all the data samples at once is possible if the Euclidean distance is used for the calculation of the local costs (11). The resulting equations can be solved for $w_j$ and $u_{ij}$, respectively, yielding
$$w_j = \frac{\sum_{i=1}^{M} \sum_{l=1}^{N} u_{il}^m\, h_\sigma(w_l, w_j)\, v_i}{\sum_{i=1}^{M} \sum_{l=1}^{N} u_{il}^m\, h_\sigma(w_l, w_j)} \tag{14}$$
$$u_{ij} = \frac{(lc_{ij})^{-\frac{1}{m-1}}}{\sum_{l=1}^{N} (lc_{il})^{-\frac{1}{m-1}}} \tag{15}$$
Note that the update of the fuzzy assignments is similar to the FCM assignment update (9), yet instead of the distances the local costs (11) are considered.
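A batch FNG iteration following (11), (13), (14), and (15) could be sketched like this; the rank-based prototype neighborhood and the normalization constant are implemented as we understand them from the text, so treat the details as assumptions rather than the reference implementation of [9].

```python
import numpy as np

def fng_batch_step(V, W, U, m=2.0, sigma=1.0):
    """One batch FNG step: local costs (11), prototype update (14),
    assignment update (15). Assumes squared Euclidean distance."""
    # ranks of prototypes relative to each prototype, cf. (13)
    dW = ((W[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    ranks = np.argsort(np.argsort(dW, axis=1), axis=1)
    h = np.exp(-ranks / sigma)
    h /= h.sum(axis=1, keepdims=True)       # normalization constant c_sigma
    d = ((V[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # (M, N)
    lc = d @ h.T                            # local costs lc_ij, eq. (11)
    Um = U ** m
    coef = Um @ h                           # weights sum_l u_il^m h(l, j)
    W = (coef.T @ V) / coef.sum(axis=0)[:, None]             # eq. (14)
    U = (lc + 1e-12) ** (-1.0 / (m - 1))
    U /= U.sum(axis=1, keepdims=True)       # assignment update, eq. (15)
    return W, U
```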

For other distances besides the Euclidean distance, the equation obtained by Lagrange optimization might not be solvable for $w_j$. In that case, the prototypes have to be adapted online via stochastic gradient descent in order to minimize the FNG cost function (12). The corresponding update rule is
$$\Delta w_l = -\epsilon \sum_{j=1}^{N} u_{ij}^m\, h_\sigma(w_j, w_l)\, \frac{\partial d(v_i, w_l)}{\partial w_l} \tag{16}$$
Since the derivative of the distance has to be considered, the distance measure is required to be differentiable with respect to $w_l$. Any measure fulfilling this restriction is suitable; that is, as alternatives to the commonly used Euclidean distance, generalized divergences as well as (differentiable) kernels might be used, depending on the specific problem at hand. The latter aspect concerning (differentiable) kernels is investigated in detail in the next subsection.

2.4. Fuzzy Neural Gas with Differentiable Kernels

For vector quantizers the distance between prototypes and data samples is determined by a distance measure $d$. For FNG this distance has to be differentiable, since the derivative of the distance function is considered in the prototype update rule (16) to minimize the cost function. This implies that basically any differentiable distance measure is applicable. The common Euclidean distance can be used as well as generalized divergences [11] or (differentiable) kernels [12]. Each reproducing kernel $\kappa$ uniquely corresponds to a kernel feature map $\Phi \colon V \to \mathcal{H}$, where $\mathcal{H}$ is a Hilbert space, in a canonical manner [13]. Denote $I_\Phi$ to be the image of $\Phi$. The inner product of $\mathcal{H}$ is consistent with the kernel; that is, $\langle \Phi(v), \Phi(w) \rangle_{\mathcal{H}} = \kappa(v, w)$. Universal continuous kernels ensure the injectivity and continuity of the map $\Phi$. Further, in that case $I_\Phi$ is a subspace of $\mathcal{H}$ [13]. The inner product defines a metric by
$$d_\kappa(v, w) = \sqrt{\kappa(v, v) - 2\kappa(v, w) + \kappa(w, w)} \tag{17}$$
The nonlinear mapping into the Hilbert space provides large topological richness for the mapped data, which is used for classification in SVMs. Likewise, this topological structure of the image may result in better clustering abilities for unsupervised vector quantization.

An example of a universal kernel is the widely known Gaussian kernel
$$\kappa_\sigma(v, w) = \exp\left(-\frac{\|v - w\|^2}{2\sigma^2}\right) \tag{18}$$
where $\|\cdot\|$ is the Euclidean norm. This kernel and the distance metric based thereon can be differentiated easily and are therefore suitable to be used with FNG. A disadvantage is that the parameter $\sigma$ has to be estimated, which is known to be a crucial task.
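As a minimal sketch, the kernel-induced distance (17) with the Gaussian kernel (18) can be written as follows; note that $\kappa_\sigma(v, v) = 1$ for the Gaussian kernel, which simplifies the expression under the square root.

```python
import numpy as np

def gaussian_kernel(v, w, sigma):
    """Gaussian kernel (18)."""
    return np.exp(-np.sum((v - w) ** 2) / (2.0 * sigma ** 2))

def kernel_distance(v, w, kappa):
    """Kernel-induced metric (17): distance in the feature space H."""
    return np.sqrt(kappa(v, v) - 2.0 * kappa(v, w) + kappa(w, w))

v, w = np.array([0.0, 1.0]), np.array([1.0, 1.0])
d = kernel_distance(v, w, lambda a, b: gaussian_kernel(a, b, sigma=0.5))
```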

Another simple yet effective kernel is the ELM kernel (extreme learning machine) [14]. The kernel function is defined as
$$\kappa_{ELM}(v, w) = \frac{\langle \phi(v), \phi(w) \rangle}{L} \tag{19}$$
and is simply the normalized dot product in the $L$-dimensional feature space of the hidden-layer mapping $\phi$. In the context of FNG the number $L$ of hidden variables corresponds to the number of intrinsic dimensions [15] of the data. In case the mapping $\phi$ is not known, for $L \to \infty$ the kernel can be estimated by an analytic expression, the so-called asymptotic ELM kernel, which depends on the Gaussian distribution of the data [16].

3. Pulsing Neural Gas

It has been shown that the Neural Gas algorithm converges to a global minimum in infinite time [3]. Yet in practice, time is limited and prototypes might only have reached a local minimum by the time the algorithm stops.

The proposed method in this section, called Pulsing Neural Gas, is a combination of NG and Simulated Annealing (SA), another widely known technique for solving optimization problems. SA is a probabilistic metaheuristic which accepts a random solution with a certain probability following the Boltzmann-Gibbs distribution $P \propto \exp(-\Delta E / T)$. This probability depends on the difference $\Delta E$ between the costs of a random solution and of the formerly accepted solution and on a temperature $T$ which is decreasing over time and converges to zero. Caused by the cooling, respectively annealing, of the temperature $T$, towards the end of the optimization process a deterioration of the cost function is accepted with lower probability than at the beginning. This leads to a stable behavior in the periphery of the global minimum.
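The SA acceptance rule can be summarized in a few lines; the Metropolis-style criterion below is the standard formulation, while the geometric cooling schedule is an arbitrary illustrative choice.

```python
import numpy as np

def sa_accept(delta_E, T, rng=np.random):
    """Accept a worse solution (delta_E > 0) with Boltzmann-Gibbs
    probability exp(-delta_E / T); improvements are always accepted."""
    return delta_E <= 0 or rng.rand() < np.exp(-delta_E / T)

T = 1.0
for t in range(1000):
    # ... propose a random solution, compute its cost difference delta_E ...
    T *= 0.995            # illustrative geometric cooling towards zero
```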

To transfer this idea to (Fuzzy) Neural Gas, a correspondent to the deterioration in SA has to be found. For the common NG the cost function (4) is minimized by performing stochastic gradient descent learning. Although it cannot be guaranteed, on average the value of the cost function decreases, which we consider as positive learning. We now introduce negative learning; that is, we allow the algorithm to perform a negative learning step, which increases the cost function temporarily. Hence, on average the algorithm performs positive learning, but once in a while, with a certain decreasing probability following a Gibbs distribution, a negative learning step causes a disturbance. Possibly this helps to overcome local minima and speeds up convergence to the global minimum.

First considerations took gradient ascent learning into account. However, investigations have shown that this strategy leads to an unstable learning behavior. Instead we suggest a reverse prototype ranking
$$k_j^{rev}(v, W) = \sum_{l=1}^{N} H\bigl(d(v, w_l) - d(v, w_j)\bigr) \tag{21}$$
for a given data point $v$. This ranking reverses the known (positive) ranking (1) such that the prototype with the largest distance now obtains the best (lowest) rank (see Figure 1); that is, the update of the prototypes is performed in reverse order and in opposite direction. The prototype update rule is formulated as
$$\Delta w_j = -\epsilon\, h_\sigma^{rev}(v, w_j)\,(v - w_j) \tag{22}$$
where the neighborhood function $h_\sigma^{rev}$ depends on the reverse rankings (21). Now, in contrast to the common positive NG update step, the prototypes are not moved towards the presented data point. Instead, according to their reverse ranks they are pushed away, causing little change on the prototypes close to the data point and larger shifts of the prototypes located farther away. Figure 1 depicts this difference between the common NG and the Pulsing NG incorporating negative learning motivated by Simulated Annealing.
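A negative learning step following (21) and (22) might be sketched as below; the signs and the reverse ranking are implemented exactly as described in the text, while the names and parameter values are our own choices.

```python
import numpy as np

def png_negative_step(v, W, eps, sigma):
    """Negative PNG step: reverse ranks (21), prototypes pushed away
    from v (22); the farthest prototype gets rank 0 and thus the
    largest shift, the closest prototype moves least."""
    d = np.linalg.norm(W - v, axis=1)
    rev_ranks = np.argsort(np.argsort(-d))   # largest distance -> rank 0
    h = np.exp(-rev_ranks / sigma)
    W -= eps * h[:, None] * (v - W)          # opposite direction to (5)
    return W
```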

Unfortunately, this strategy is not directly transferable to the batch variants of NG [17] and FNG (12). Here all the data points are presented at once and the relocation of the prototypes at each update step depends on all data points. For this variant the idea of Simulated Annealing is realized differently. Instead of a reverse ranking, now only a random nonempty subset $V' \subset V$ of the data samples is presented at a randomly chosen update step. The probability for performing such a disturbed update step again follows a Gibbs distribution decreasing with proceeding training. This way, the trend of the relocations is interrupted, enabling the prototypes to leave prospective local minima yet possibly causing higher costs temporarily.
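For the batch variant the occasional disturbance thus reduces to swapping in a random nonempty subset at a randomly chosen iteration. The sketch below shows the control flow only; the subset size and the concrete Gibbs-like probability schedule are assumed placeholders, not prescribed by the text.

```python
import numpy as np

def maybe_subset(V, t, T0=1.0, cooling=0.99, rng=np.random):
    """With a Gibbs-like probability decreasing over training time t,
    return a random nonempty subset of V instead of the full batch."""
    T = T0 * cooling ** t
    if rng.rand() < np.exp(-1.0 / max(T, 1e-12)):
        idx = rng.choice(len(V), size=max(1, len(V) // 10), replace=False)
        return V[idx]                        # disturbed update step
    return V                                 # regular batch step
```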

One can visualize this procedure as a more or less smooth process approximating some local optimum, while once in a while the whole system is shaken up, resulting in a temporary increase of the cost function and causing a reorientation of the whole adaptation process. We name this modification of the NG algorithm Pulsing Neural Gas (PNG) and its fuzzy variant Pulsing Fuzzy Neural Gas (PFNG).

4. Fuzzy ConnIndex for the Evaluation of Fuzzy Clusterings

A strategy to cluster very large data sets is to perform vector quantization followed by a clustering of the obtained prototypes. If it can be assured that each of the resulting clusters is represented by more than one prototype, the ConnIndex [8] as proposed by Taşdemir and Merényi can be used for validation purposes. Yet, the ConnIndex is suitable only if crisp vector quantization has been performed in the first step. Since we need a method to evaluate cluster solutions based on fuzzy vector quantization, we modified the original ConnIndex. In the following, first we recapitulate the index as proposed by Taşdemir and Merényi and subsequently we derive a fuzzy version of the ConnIndex.

Original ConnIndex. In general, the original ConnIndex balances the overall cluster compactness and separation by combining the intercluster connectivity $Inter\_Conn$ and the intracluster connectivity $Intra\_Conn$:
$$Conn\_Index = Intra\_Conn \cdot (1 - Inter\_Conn) \tag{24}$$
Thereby, $Intra\_Conn$ measures the compactness of the clusters and $Inter\_Conn$ evaluates the separation between them. A value of $Conn\_Index$ close to one suggests a good cluster solution.

For the estimation of the connectivity a nonsymmetric cumulative adjacency matrix $CADJ$ with respect to the receptive fields (6) is considered. For each data point $v_i$, the response matrix $R(v_i)$ is the zero $N \times N$-matrix except the element referring to the best matching unit $w_{s_1}$ and the second best matching unit $w_{s_2}$, which is set to a positive constant $c$, usually chosen as $c = 1$:
$$[R(v_i)]_{s_1 s_2} = c \tag{25}$$
The cumulative adjacency matrix is obtained by summing the responses over all data points:
$$CADJ = \sum_{i=1}^{M} R(v_i) \tag{26}$$
As pointed out in [8], the row vector $[CADJ]_j$ describes the density distribution within the receptive field $\Omega_j$ with respect to the other prototypes.

The symmetric connectivity matrix
$$CONN = CADJ + CADJ^{T} \tag{27}$$
reflects the topological relations between the prototypes based on the receptive field evaluation. Thereby, the elements $[CONN]_{jl}$ reflect the connectedness between the prototypes based on the local data densities.

Now, having the matrices $CADJ$ and $CONN$ defined, the aforementioned connectivities $Intra\_Conn$ and $Inter\_Conn$ can be evaluated. The intracluster connectivity is based on the cumulative adjacency matrix (26):
$$Intra\_Conn(C_k) = \frac{\sum_{j, l \in C_k} [CADJ]_{jl}}{\sum_{j \in C_k} \sum_{l=1}^{N} [CADJ]_{jl}} \tag{28}$$
for each cluster $C_k$; the overall value $Intra\_Conn$ is the average over all clusters. The greater the compactness of a cluster, the closer its intraconnectivity is to one. Note again that, as mentioned above, each cluster $C_k$ is made up of more than one prototype $w_j$.

The intercluster connectivity $Inter\_Conn$ evaluates the separation between the clusters. Analogously, it is the average over the local intercluster connectivities $Inter\_Conn(C_k, C_l)$ of all clusters, evaluating the separation of each cluster $C_k$ from the other clusters $C_l$. Thereby, $Inter\_Conn(C_k, C_l)$ judges the separation of cluster $C_k$ from cluster $C_l$ based on the connectivity matrix (27): it relates the connectivity between the two clusters to the total connectivity of the sets $V_{kl} \subseteq C_k$, which describe the neighborhood relations between the clusters $C_k$ and $C_l$ based on the contained prototypes. In contrast to $Intra\_Conn$, the value of $Inter\_Conn$ decreases with better separability.

Generalization of the ConnIndex. The ConnIndex by Taşdemir and Merényi considers the best and second best matching units $w_{s_1}$ and $w_{s_2}$ only, discarding any information provided by higher ranked prototypes. A generalized version of the index is obtained by incorporating higher winning ranks as known from Neural Gas [3]; see (1). Obviously $k_{s_1}(v, W) = 0$ is the rank of the best matching prototype $w_{s_1}$. Analogously, the $j$th winner is denoted by $w_{s_j(v)}$ with rank $k_{s_j}(v, W) = j - 1$. If it is clear from the context, we will abbreviate $s_j = s_j(v)$ in the following.

To incorporate the higher winning ranks the response matrix has to be redefined to involve the full response of the whole vector quantizer model for a given input $v$. The new response matrix $R^g(v)$ is a zero matrix of the same size as $R(v)$ except the row vector regarding the winner $w_{s_1}$. This row is set to
$$[R^g(v)]_{s_1} = \mathbf{r}(v) = (r_1(v), \ldots, r_N(v)) \tag{31}$$
where $\mathbf{r}(v)$ is the so-called response vector of all prototype responses for a given input $v$. The vector element of the $j$th prototype is defined as
$$r_j(v) = f\bigl(k_j(v, W)\bigr) \tag{32}$$

with $f$ being an arbitrary monotonically decreasing function in the rank $k_j(v, W)$. A simple choice for this function is the exponential function $f(k_j) = \exp(-k_j / \gamma)$. The parameter $\gamma$ determines the range of influence and should be determined carefully. If for the vector quantization an algorithm incorporating neighborhood cooperativeness in learning like Neural Gas [3] or self-organizing maps [18] was used, the $\gamma$-parameter should be chosen according to the neighborhood range used there. Yet, an alternative approach could be the direct utilization of the distances $d(v, w_j)$ instead of the winning ranks.
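The generalized response vector (31), (32) with the exponential choice for $f$ can be computed as follows; gamma is assumed to match the neighborhood range of the preceding vector quantizer, as suggested above.

```python
import numpy as np

def response_vector(v, W, gamma):
    """Response vector (32): r_j = f(k_j) with f(k) = exp(-k / gamma),
    evaluated for all prototypes; rank 0 denotes the best matching unit."""
    d = np.linalg.norm(W - v, axis=1)
    ranks = np.argsort(np.argsort(d))    # winner ranks, cf. (1)
    return np.exp(-ranks / gamma)
```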

This generalized version of the ConnIndex uses for the calculation of the cumulative adjacency matrix in (26) the new response matrices $R^g(v_i)$ instead of the original response matrices $R(v_i)$.

Note that this version is in concordance with the original version if $f$ is chosen as $f(k_j) = c$ for $k_j = 1$ and $f(k_j) = 0$ otherwise.

Fuzzy ConnIndex. Up to now we assumed that the vector quantization model is based on a crisp mapping. For these models a winner ranking is available and the response information of the network is collected in the response vector $\mathbf{r}(v)$, reflecting the topological relation between the prototypes. In fuzzy vector quantization algorithms this information is no longer available because each data point is gradually assigned to all prototypes. Yet, the fuzzy data point assignments, which can be stored in an $M \times N$ assignment matrix $U = (u_{ij})$, also reflect the topography of the underlying data. The assignment vector $\mathbf{u}_i = (u_{i1}, \ldots, u_{iN})$ is the specific row vector of $U$ which contains the assignment values of data point $v_i$ to all of the prototypes and is comparable to the response vector $\mathbf{r}(v_i)$ used for the Generalized ConnIndex. Therefore, the assignments can be used directly to determine the response matrix by substituting the response vector in (32). Consequently, the best matching prototype for a given data vector $v_i$ can be seen as the prototype with the highest fuzzy assignment:
$$s_1(v_i) = \arg\max_{j} u_{ij}$$
Now, the row vector of the redefined response matrix $R^f(v_i)$ can simply be chosen as the fuzzy response vector:
$$[R^f(v_i)]_{s_1} = \mathbf{u}_i$$
Again, the cumulative adjacency matrix is calculated as before for the original ConnIndex and the Generalized ConnIndex according to (26). Further calculations remain unaffected.
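Accumulating the fuzzy response matrices into the cumulative adjacency matrix can be sketched as follows; it mirrors (26) with the assignment vectors taking the place of the response vectors, per our reading of the construction above.

```python
import numpy as np

def fuzzy_cadj(U):
    """Cumulative adjacency matrix from fuzzy assignments U (M x N):
    for each data point the row of the best matching prototype
    (highest assignment) accumulates the full assignment vector."""
    M, N = U.shape
    cadj = np.zeros((N, N))
    for i in range(M):
        s1 = np.argmax(U[i])     # best matching prototype for v_i
        cadj[s1] += U[i]         # fuzzy response vector as matrix row
    return cadj                  # nonsymmetric, analogous to (26)
```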

Hence, the resulting new fuzzy ConnIndex is the counterpart of the generalized ConnIndex in case of fuzzy vector quantization models.

5. Performance

To evaluate the performance of FNG we designed different experiments to compare this method with crisp vector quantizers and the Fuzzy c-Means. We also conducted an experiment examining the Pulsing Neural Gas. To perform the tests we used artificial and real world data sets.

For the evaluation of the cluster results we used the ConnIndex or the Fuzzy ConnIndex, respectively. This evaluation measure, described in the previous section, is relatively new [8, 19], but it seems to be well suited for the evaluation of cluster solutions in terms of separation and compactness.

Additionally, for the first data set (Smiley) we also calculated the Kappa value [20], which is a measure to judge the agreement of two cluster solutions. A variant thereof is suitable for fuzzy data [21]. Unfortunately, this measure can be used only for cluster solutions with a low number of clusters, since the clusters of the different solutions have to be matched, which is hard for clusterings containing a higher number of clusters.

5.1. Artificial Dataset: Smiley

In the first setting we used the Smiley data set [19]. This two-dimensional data set consists of three clusters with varying shapes, numbers of data samples, variances, and distances to each other (see Figure 2). It contains a total of 809 data points.

In the first step we apply c-Means and NG to perform crisp vector quantization and FCM and FNG to perform fuzzy vector quantization with the fuzziness parameter $m$ set to different values. All algorithms result in acceptable solutions; for suitable choices of $m$ the FNG cost function settles at the lowest value, and likewise FCM reaches its lowest costs for a particular $m$. The obtained FNG prototypes are depicted in Figure 2; the FCM results look similar. Visual evaluation confirms an intuitively good distribution in the data space.

A more objective evaluation is obtained with the help of the (Fuzzy) ConnIndex. Yet, to apply this measure the prototypes themselves have to be grouped into clusters of at least two prototypes each. In this simple experiment this step is done manually, following the obvious inherent structure of the data set consisting of three clusters.

The obtained ConnIndex values are listed in Table 1 and show, as expected, a clear discrepancy between the values obtained by crisp and those obtained by fuzzy vector quantization. This is due to the influence of the data points located in the gaps between the main clusters on the calculation of the index. It is also evident that NG and FNG perform better than c-Means and Fuzzy c-Means, respectively. The overall best ConnIndex value is obtained for FNG, which is not surprising since this algorithm is a combination of FCM and NG, taking the beneficial features of each: NG neighborhood and FCM fuzzy assignments.

The Kappa values for crisp [20] and fuzzy [21] cluster solutions measure the agreement of two cluster solutions. The closer the values are to one, the higher is the agreement. Comparing the given data structure with the results obtained by the four different clustering methods yields high values indicating substantial to perfect agreement (according to [22]); see Table 1. It can be observed that the two crisp methods NG and c-Means performed almost equally, while the discrepancy between FCM and FNG is remarkable, indicating superior performance of FNG. Note that the values of the crisp and fuzzy solutions cannot be compared to each other since two different Kappa measures are applied.

5.2. Practical Example: Indian Pine

Indian Pine is a publicly available data set taken by the NASA Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) consisting of 145 × 145 pixels [23]. Data samples cover 220 bands of 10 nm width from 400 to 2500 nm. Due to the atmospheric water absorption, 20 noisy bands can be identified (104–108, 150–163, and 220) and removed safely [24]. The data set is labeled according to 16 identified classes, but we do not use this information for the current experimental setting.

The processing of the data consists of two steps. First, vector quantization is performed to position the prototypes. For this step the same algorithms as in the last experiment are used: the crisp methods c-Means and NG and the fuzzy methods FCM and FNG. In the second step the obtained prototypes are grouped by affinity propagation (AP) [25] to be able to apply the (Fuzzy) ConnIndex for evaluation. Special care has to be taken to fulfill the requirement that each cluster (i.e., prototype cluster) is represented by more than one prototype. For this reason a sufficiently high number of prototypes has to be chosen. We set this number to 64 (four times the number of known classes).

The calculation of the Generalized ConnIndex for the crisp methods is straightforward. For the fuzzy variants the fuzziness parameter $m$ again has to be chosen carefully. The obtained ConnIndex values are listed in Table 1.

Although the prototype clustering by Affinity Propagation always results in crisp cluster assignments, the clustering based on FNG vector quantization still yields better ConnIndex values than the other methods.

5.3. Practical Example: Colorado

The Colorado data set [26] is a LANDSAT TM image from the Colorado area, USA. The image covers a large region, yielding approximately 2 million data points. These are labeled by experts according to different vegetation types and geological formations found in this region. Among them are aspen, mixed pine forest, water, moist meadow, and dry meadow, to name a few. The original data samples are 7-dimensional, yet one band (the thermal band) is removed due to its low resolution. Generally, the bands are highly correlated [26].

For the experiment we neglected the class information and randomly selected a subset of the data with a representative class distribution. The number of prototypes is set to 56. Besides that, the setup of the experiment is identical to the setup for the Indian Pine data and consists of the two processing steps described there.

It can be observed that the FCM training is much faster than the FNG training and requires considerably fewer training cycles. Yet, the Fuzzy ConnIndex yields much better results for FNG than for FCM (see Table 1), indicating a better prototype distribution in terms of inter- and intraconnectivity of the obtained clusters. The reason for the prolonged processing time can be found in the computational costs of calculating all neighborhood relations anew in each processing step.

5.4. Artificial Data Set: Checkerboard

This artificial data set [27], consisting of compact yet well-separated clusters arranged in a checkerboard-like manner, is well suited to demonstrate the performance of the Pulsing Neural Gas compared to the common Neural Gas. The data set contains two-dimensional data vectors, which are grouped in normally distributed clusters whose standard deviation is small compared to the mean distance between two neighboring cluster centers. Due to the low dimensionality the data set is well suited for visualization; see Figure 3(a).

For both algorithms, NG and PNG, all prototypes are initialized in the center of the data set. Subsequently, both algorithms are run for the same number of steps. For comparison the values of the energy functions according to (4) are used. The experiment showed that in the long run both algorithms performed well. For online learning the effect of the pulsing variant is negligible, yet the batch version shows significant improvements. The cost function of the Pulsing Neural Gas reaches lower values. The negative learning steps show as little bumps in the plot of the energy functions (see Figure 3(b)), indicating a temporary deterioration. In Figure 3(a) the prototype distribution after learning is visualized. Obviously the number of misplaced NG prototypes is higher than the number of misplaced PNG prototypes. This finding is in accordance with the lower value of the PNG energy function.

6. Conclusion

We proposed in this paper a fuzzy version of the Neural Gas. By combining the concept of neighborhood cooperativeness as known from NG with the FCM fuzzy assignments we obtain the Fuzzy Neural Gas. This algorithm outperforms FCM by taking dynamic neighborhood relations into account, a paradigm proven to be well suited for crisp vector quantization. The resulting FNG shows good performance compared to standard FCM and crisp NG. Due to the neighborhood cooperativeness this algorithm is insensitive to the initialization of the prototypes.

It is straightforward to introduce other distance measures besides the commonly used Euclidean distance. The only prerequisite is that the measure has to be differentiable; for example, differentiable kernels might be used.

A further variant of NG and of its fuzzy version, respectively, is the Pulsing Neural Gas, imitating a Simulated Annealing-like behaviour. This modification, which allows temporary deterioration of the cost function, stabilizes the learning procedure in the long run and helps the algorithm to overcome local minima more easily. This effect was demonstrated on a checkerboard data set, for which it is known that the algorithms usually do not find all clusters.

Finally, we extended the original crisp cluster evaluation ConnIndex [8] to be used for fuzzy clustering. It is based on a generalization of the index considering all prototypes instead of the first and second best matching units only. The fuzzy version additionally takes the information provided by the fuzzy data point assignments into account. Like the original, the Fuzzy ConnIndex requires more than one prototype per cluster. The index was used for the evaluation of the experiments.

Acknowledgment

Marika Kaden was funded by the European Social Fund (ESF), Saxony.