Abstract

We propose a new distance metric, based on the linkage of genes, in the search space of genetic algorithms. This second-order distance measure is derived from the gene interaction graph and first-order distance, which is a natural distance in chromosomal spaces. We show that the proposed measure forms a metric space and can be computed efficiently. As an example application, we demonstrate how this measure can be used to estimate the extent to which gene rearrangement improves the performance of genetic algorithms.

1. Introduction

Distance metrics are fundamental tools for organizing search spaces, because the introduction of a metric is the simplest way to induce a topology [1]. Different metrics produce different topologies and thus change the shape of the search space. When a space is to be searched by a genetic algorithm (GA), a good distance metric facilitates navigation of the space [2–5] and can also improve the effectiveness of search [6–12]. Hamming distance is a popular metric in a discrete space that is to be searched by a GA. Hamming distance has also been widely used in analyses of solution spaces [13–15].

Fitness distance correlation (FDC), proposed by Jones and Forrest [14], is a measure of the effectiveness of a distance metric in a space to be searched by a GA. An FDC is obtained by measuring the correlation between fitness and the distance to the nearest global optimum for a number of sample solutions. FDC coefficients range from $-1$ to $1$, where higher values suggest increased difficulty in maximizing fitness and decreased difficulty in minimizing fitness. When a GA is hybridized with local optimization, the population consists entirely of local optima, and it is then more useful to determine FDCs of local-optimum spaces.
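As a rough illustration of how an FDC might be computed, the following Python sketch uses the usual correlation form: the covariance of fitness and distance to the nearest global optimum, divided by the product of their standard deviations. The names `hamming` and `fdc` and the OneMax toy problem are assumptions made for this example only; they do not come from the paper.

```python
from itertools import product
from statistics import mean, pstdev


def hamming(a, b):
    """Number of positions in which two equal-length chromosomes differ."""
    return sum(u != v for u, v in zip(a, b))


def fdc(samples, fitness, optima, dist=hamming):
    """Correlation between fitness and distance to the nearest global optimum."""
    f = [fitness(s) for s in samples]
    d = [min(dist(s, o) for o in optima) for s in samples]
    sf, sd = pstdev(f), pstdev(d)
    if sf == 0 or sd == 0:
        raise ValueError("FDC is undefined when fitness or distance is constant")
    fm, dm = mean(f), mean(d)
    cov = mean([(fi - fm) * (di - dm) for fi, di in zip(f, d)])
    return cov / (sf * sd)


# Toy maximization problem (an assumption of this sketch): OneMax on 4 bits,
# whose single global optimum is the all-ones string.
samples = list(product((0, 1), repeat=4))
onemax_fdc = fdc(samples, fitness=sum, optima=[(1, 1, 1, 1)])
print(round(onemax_fdc, 6))  # -1.0: fitness falls exactly as distance grows
```

For OneMax the distance to the optimum is 4 minus the fitness, so the coefficient sits at the easy end of the range for a maximization problem.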

In this paper, we propose a new distance measure which takes account of gene interaction and show that it forms a metric space. We use this metric to compute FDCs of search spaces and show that FDCs obtained in this way correlate better with the improvement in GA performance that can be obtained by gene rearrangement. The remainder of this paper is organized as follows. In Section 2, we review gene rearrangement in GAs. In Section 3, we propose a new distance measure for GAs, show that it forms a metric space, and demonstrate an application. Finally, we draw conclusions in Section 4.

2. Gene Rearrangement

Holland’s schema theorem [16] shows that schemata (i.e., groups of genes) with high fitness, short defining length, and low order have high probabilities of survival in a standard GA.

These durable schemata are called building blocks. They make a major contribution to fitness and have a high degree of mutual interaction. The performance of a GA is strongly dependent on the survival and reproduction of these building blocks.

The survival probability of a gene group through a crossover is strongly affected by the positions of genes in the chromosome. Schemata consisting of genes in scattered positions tend to be too long to survive. Thus, the strategy used for placing genes significantly affects the performance of a GA. Inversion is an operator which changes the location of genes while a GA is running [17], and the process of rearranging genes dynamically to improve performance is called linkage learning [18]. Messy GA [19] is an example of a technique that implicitly uses dynamic gene rearrangement.
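The effect of gene positions on schema survival can be made concrete with the standard one-point crossover bound from the schema theorem: a schema of defining length δ on a chromosome of length l survives a crossover with probability at least 1 - p_c · δ / (l - 1). The sketch below is only illustrative; the function names and the `*` wildcard notation are assumptions of this example.

```python
def defining_length(schema):
    """Distance between the first and last fixed positions of a schema."""
    fixed = [i for i, c in enumerate(schema) if c != "*"]
    return fixed[-1] - fixed[0] if fixed else 0


def schema_order(schema):
    """Number of fixed (non-wildcard) positions."""
    return sum(c != "*" for c in schema)


def crossover_survival_bound(schema, pc=1.0):
    """Lower bound on survival under one-point crossover applied with rate pc."""
    return 1.0 - pc * defining_length(schema) / (len(schema) - 1)


# The same two interacting genes, placed adjacently or at opposite ends.
adjacent = "11****"
scattered = "1****1"
print(schema_order(adjacent), schema_order(scattered))  # 2 2
print(crossover_survival_bound(adjacent))   # about 0.8
print(crossover_survival_bound(scattered))  # 0.0
```

Both schemata have order 2, yet the bound drops from about 0.8 to 0 when the two genes are moved apart, which is the effect that gene rearrangement tries to avoid.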

It has been observed that the performance of GAs on problems with a locus-based encoding can be improved by rearranging the indices of the genes before running the GA. Static gene rearrangement was first suggested by Bui and Moon [20, 21], who rearrange genes within a chromosomal representation to improve the quality of schemata and to help the GA to preserve the better schemata. Many studies on the static rearrangement of gene positions [20–24] have shown performance improvements. However, the improvement in performance achieved in this way has been shown to vary greatly between problem instances. This motivated us to develop a distance metric to improve our ability to estimate how much improvement in the performance of a GA on a particular problem instance can be expected through gene rearrangement.

3. A Linkage-Based Distance Measure

3.1. Second-Order Distance Measure

The most common first-order distance measure in discrete space is the Hamming distance, which is also a natural distance in chromosomal space, although there are other first-order distance measures, such as the quotient metric in redundant encoding [11]. We now define a second-order distance measure derived from first-order distance. Given a problem instance $P$, consider the unweighted undirected graph $G = (V, E)$ representing first-order gene interaction [23], which is the pairwise interaction of genes. For convenience, we will assume that each gene has an interaction with itself, so that $(v, v) \in E$ for each gene $v \in V$. Let $M$ be the adjacency matrix of $G$ and consider $M$ as a binary matrix over $\mathbb{Z}_2$ [25–27].

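Definition 1. Suppose that the inverse of the adjacency matrix $M$ exists as a binary matrix over $\mathbb{Z}_2$; that is, $\det(M) \neq 0$ in $\mathbb{Z}_2$. One defines the second-order distance measure $D$ as follows:
$$D(x, y) = \| M^{-1}(x \oplus y) \|,$$
where $\oplus$ is a vector summation operator, which performs a Boolean XOR (i.e., $0 \oplus 0 = 0$, $0 \oplus 1 = 1$, $1 \oplus 0 = 1$, and $1 \oplus 1 = 0$) in each coordinate, and $\| \cdot \|$ is a norm derived from the first-order distance metric $d$ (i.e., $\| x \| = d(x, \mathbf{0})$).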

Theorem 2. The second-order distance $D$ is a metric.

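Proof. It is enough to show the following four conditions [1].
(i) Nonnegativity: since $D(x, y) = d(M^{-1}(x \oplus y), \mathbf{0})$ and $d$ is a metric, $D(x, y) \geq 0$ for all $x$ and $y$ in $\mathbb{Z}_2^n$.
(ii) Identity of indiscernibles: consider
$$D(x, y) = 0 \iff M^{-1}(x \oplus y) = \mathbf{0} \iff x \oplus y = \mathbf{0} \iff x = y.$$
(iii) Symmetry: consider
$$D(x, y) = \| M^{-1}(x \oplus y) \| = \| M^{-1}(y \oplus x) \| = D(y, x).$$
(iv) Triangle inequality: consider
$$D(x, y) + D(y, z) = \| M^{-1}(x \oplus y) \| + \| M^{-1}(y \oplus z) \| \geq \| M^{-1}(x \oplus y) \oplus M^{-1}(y \oplus z) \| = \| M^{-1}(x \oplus z) \| = D(x, z).$$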
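The four metric conditions can also be checked exhaustively on a small case. The sketch below assumes that the second-order distance is the Hamming weight of the inverse interaction matrix applied to the XOR of the two chromosomes, and it uses a hypothetical 3-gene path graph (with self-interactions) that is not taken from the paper. Rather than inverting the matrix, it tabulates the map z -> M z over all 8 chromosomes and reads the table backwards.

```python
from itertools import product

# Hypothetical interaction matrix: path graph 0 - 1 - 2 plus self-interactions.
M = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]


def matvec_gf2(matrix, z):
    """Multiply a binary matrix by a binary vector over Z_2."""
    return tuple(sum(m & v for m, v in zip(row, z)) % 2 for row in matrix)


def xor(x, y):
    """Coordinate-wise Boolean XOR of two chromosomes."""
    return tuple(a ^ b for a, b in zip(x, y))


space = list(product((0, 1), repeat=len(M)))

# If M is invertible over Z_2, z -> M z is a bijection, so this table maps
# every vector b to the unique z with M z = b.
inverse_image = {matvec_gf2(M, z): z for z in space}
assert len(inverse_image) == len(space), "M is singular over Z_2"


def second_order(x, y):
    """Hamming weight of the preimage of x XOR y under M."""
    return sum(inverse_image[xor(x, y)])


def is_metric(dist, points):
    """Brute-force check of the four metric conditions on a finite set."""
    for x in points:
        for y in points:
            d = dist(x, y)
            if d < 0 or (d == 0) != (x == y) or d != dist(y, x):
                return False
            if any(dist(x, z) > d + dist(y, z) for z in points):
                return False
    return True


print(is_metric(second_order, space))          # True
print(second_order((1, 0, 0), (0, 0, 0)))      # 2, although the Hamming distance is 1
```

An exhaustive check of this kind is only feasible for a handful of genes; it is meant as a sanity test of the definition, not as a substitute for the proof.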

If the inverse of does not exist, we can extend the scope of the distance metric using the following well-defined formulation: We note that if the inverse of exists, then , which implies , and hence . Our second-order distance and its extension can be computed in $O(n^3)$ time by a variant of Gauss-Jordan elimination [28], where $n$ is the number of genes.
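The invertible case can be sketched with plain Gauss-Jordan elimination over Z_2, where row addition becomes XOR. This is a minimal sketch under the assumption that the distance is the Hamming weight of the inverse matrix applied to the XOR of the two chromosomes. It is not the authors' variant and does not cover the extension to singular matrices. The function names and the two small matrices are hypothetical.

```python
def gf2_inverse(matrix):
    """Invert a binary matrix over Z_2 by Gauss-Jordan elimination.

    Returns the inverse as a list of rows, or None if the matrix is singular.
    The elimination performs on the order of n^3 bit operations.
    """
    n = len(matrix)
    # Augment each row with the corresponding row of the identity matrix.
    aug = [list(row) + [int(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col]), None)
        if pivot is None:
            return None  # no pivot in this column: singular over Z_2
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r][col]:
                # Row addition over Z_2 is a coordinate-wise XOR.
                aug[r] = [a ^ b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def second_order_distance(m_inv, x, y):
    """Hamming weight of m_inv applied to x XOR y, computed over Z_2."""
    diff = [a ^ b for a, b in zip(x, y)]
    z = [sum(m & d for m, d in zip(row, diff)) % 2 for row in m_inv]
    return sum(z)


# Hypothetical path graph 0 - 1 - 2 with self-interactions on the diagonal.
M = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
M_inv = gf2_inverse(M)
print(M_inv)                                           # [[0, 1, 1], [1, 1, 1], [1, 1, 0]]
print(second_order_distance(M_inv, [1, 1, 0], [0, 0, 0]))  # 1 (Hamming distance is 2)

# Two mutually interacting genes give an all-ones matrix, which is singular.
print(gf2_inverse([[1, 1], [1, 1]]))                   # None
```

Once the inverse is available, each distance evaluation is a single matrix-vector product, so the elimination cost is paid only once per problem instance.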

3.2. An Application

Intuitively, our measure of the distance between two chromosomes can be understood as the minimum number of bits that must be changed to transform one chromosome into the other in the genetic process using optimal gene rearrangement.

Given an undirected graph $G = (V, E)$ with edge weights $w_{ij}$, the max-cut problem is that of finding a subset $S \subseteq V$ which maximizes the sum of the weights of the edges that cross the cut $(S, V \setminus S)$ [29–31]. Consider the 6-node max-cut problem instance , which is to maximize the following expression: where a vertex belongs to the position and $\oplus$ is the Boolean XOR operator. In this problem instance, edges and increase the fitness and edges and reduce the fitness. In the max-cut problem, the given graph with its edge weights removed can be regarded as the first-order gene interaction graph (see, e.g., Figure 1(a)). Figure 1(b) shows an example in which the Hamming and second-order distances between two chromosomes and are obtained by optimal gene arrangement of the gene interaction graph . In this example, , , and hence . If we use the normalized Hamming distance (developed for the 2-grouping problem) [32, 33] as the first-order distance measure, the FDC of this problem is . But when our second-order distance is used, the FDC becomes .
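The max-cut encoding described above can be sketched in a few lines. The objective used here is the standard form of max-cut, in which an edge contributes its weight exactly when its endpoints lie on different sides, that is, when the XOR of the two genes is 1. The 4-node instance and its weights are hypothetical and are not the 6-node instance of Figure 1. The helper `interaction_matrix` drops the weights and adds self-interactions on the diagonal.

```python
from itertools import product


def maxcut_fitness(x, weights):
    """Sum of the weights of edges whose endpoints lie on different sides."""
    return sum(w * (x[i] ^ x[j]) for (i, j), w in weights.items())


def interaction_matrix(n, weights):
    """Unweighted gene interaction matrix with self-interactions on the diagonal."""
    matrix = [[int(i == j) for j in range(n)] for i in range(n)]
    for i, j in weights:
        matrix[i][j] = matrix[j][i] = 1
    return matrix


# Hypothetical weighted 4-cycle; the negative edge lowers fitness when it is cut.
weights = {(0, 1): 2, (1, 2): -1, (2, 3): 3, (0, 3): 1}
n = 4

print(maxcut_fitness((0, 1, 1, 0), weights))  # 5: edges (0, 1) and (2, 3) are cut

# Exhaustive search is feasible only for tiny instances such as this one.
best = max(product((0, 1), repeat=n), key=lambda x: maxcut_fitness(x, weights))
print(maxcut_fitness(best, weights))          # 5

print(interaction_matrix(n, weights)[0])      # [1, 1, 0, 1]
```

The resulting binary matrix is the kind of input a second-order distance computation would start from, provided it is invertible over Z_2.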

Given a graph and its adjacency matrix , the graph bipartitioning problem is that of minimizing the following expression: where , a vertex belongs to the position , and is a positive constant introduced to penalize unbalanced partitions. If we ignore the second balancing term altogether, we can regard the given graph as the first-order gene interaction graph of the given problem instance. Bui and Moon [21] tried gene rearrangement in a GA for graph bipartitioning and obtained dramatic improvements in performance for some graphs. We hypothesized that FDCs calculated using our second-order distance would help identify graphs that could benefit most from gene rearrangement, in terms of GA performance. Figure 2 shows the relationship between FDC and the performance improvement of a GA on 16 benchmark graphs (8 random graphs and 8 random geometric graphs) that were used in [34–40].

Here, the performance improvement means the difference in percentage between the average performances of a GA with and without gene rearrangement (data from [21]). The FDC values were approximated from 10,000 randomly generated local optima. When the first-order (normalized Hamming) distance was used, there was little correlation with the change in performance, but our second-order distance provided a clear correlation (see Figure 2(b) and Table 1).

4. Concluding Remarks

In most previous work, distances among chromosomes in GAs have usually been first-order distances, in particular the Hamming distance. We have proposed a second-order distance measure for GAs, which we consider to be more meaningful. We have shown that this distance measure forms a metric space and that it can be computed efficiently.

Using second-order distance allows us to see problem spaces from a different viewpoint. We have demonstrated its value in predicting the effectiveness of gene rearrangement, and we envisage it providing further understanding of the working mechanism of GAs.

Disclosure

A preliminary version of this paper appeared in the Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1393–1399, 2005.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the Gachon University research fund of 2014 (GCU-2014-0121).