Abstract

Excessive detectors, high time complexity, and loopholes are the main problems that current negative selection algorithms face, and they greatly limit the practical application of these algorithms. This paper proposes a real-valued negative selection algorithm based on clonal selection. First, the algorithm analyzes the spatial distribution of the self set and obtains the set of outlier selves and several clusters. Then, the algorithm treats the cluster centers as antigens, randomly generates an initial immune cell population within a limited range, and executes the clonal selection algorithm. Afterwards, the algorithm changes the limited range and continues iterating until the coverage rate of non-self space meets expectations. When the algorithm terminates, the mature detector set and the boundary self set are obtained. The main contributions are (1) introducing the clonal selection algorithm and randomly generating candidate detectors within stratified limited ranges based on the cluster centers of the self set: big-radius candidate detectors are generated first to cover space far from selves, which reduces the number of detectors, and small-radius candidate detectors are then generated to gradually cover the boundary space between selves and non-selves, which reduces the number of holes; (2) dividing selves into outlier selves, boundary selves, and internal selves, which lets the algorithm adapt to noise data in the self set; (3) for anomaly detection, testing with the mature detector set and the boundary self set at the same time, which can effectively improve the detection rate and reduce the false alarm rate. Theoretical analysis and experimental results show that the algorithm has better time efficiency and detector generation quality than classic negative selection algorithms.

1. Introduction

The negative selection algorithm (NSA), first proposed by the American scholar Forrest [1], is one of the most important anomaly detection algorithms in the field of artificial immune systems. The idea of the negative selection algorithm comes from the negative selection of T lymphocytes during immune tolerance in the thymus [2]. The immunological explanation is as follows. During tolerance in the thymus, T lymphocytes that recognize self antigens undergo apoptosis or are inactivated, whereas cells that do not recognize selves mature after a period of tolerance and exercise their immune function in peripheral lymphoid tissues. The introduction of the negative selection algorithm greatly promoted research and application in anomaly detection with artificial immune systems. Specifically, the idea of negative selection is often applied in areas such as fault detection, virus detection, network intrusion detection, and machine learning [2–4].

The negative selection algorithm put forward by Forrest used binary strings to represent antigens and antibodies, adopted the r-continuous matching rule to compute the matching degree between antibody and antigen, and was successfully applied to anomaly detection systems. Then, Balthrop et al. [5] pointed out the holes of the r-continuous matching rule and put forward the improved r-chunk matching mechanism. Zhang [6] proposed the r-variable negative selection algorithm, and He [7] proposed a negative selection algorithm with variable-length detectors.

However, the binary representation is inadequate for numeric data and multidimensional spaces, so Gonzalez and Dasgupta [8] proposed a real-valued negative selection algorithm with position-developed detectors (RNSA). Subsequent work on real-valued negative selection algorithms is summarized as follows. The work in [9] introduced the hyperellipsoid detector into the negative selection algorithm, and the work in [10] introduced the hyperrectangle detector; both needed fewer detectors than spherical detectors to achieve the same coverage. Ji [11] and Ji and Dasgupta [12] put forward a variable-sized real-valued negative selection algorithm (V-Detector), which dynamically determines the radius of a mature detector by calculating the smallest distance between the center of the candidate detector and the self antigens. The work in [13, 14] proposed negative selection algorithms based on grids. These algorithms divide the space into several grids, which reduces the tolerance range of randomly generated candidate detectors. The work in [15] proposed a negative selection algorithm based on hierarchical clustering of the self set, which improved the detectors' coverage of non-self space through preprocessing of the self set. The work in [16] divided detectors into self detectors and non-self detectors, which cover self space and non-self space, respectively, and used self detectors instead of self elements to reduce the computational cost.

Some studies introduced other artificial intelligence technologies into the negative selection algorithm to improve the efficiency of detector generation. The work in [17–19] proposed negative selection algorithms based on genetic principles. The work in [20] combined a particle swarm optimization strategy with the negative selection algorithm. The work in [21] introduced the wavelet transform into the negative selection algorithm.

Many problems in negative selection algorithms such as the representation of detectors, the affinity calculation method, and detector generation mechanisms were studied. There are many achievements, but some problems have not been solved effectively.

(1) Loopholes. Loopholes (holes) generally refer to the non-self space that is not covered by detectors in negative selection algorithms. They can be divided into two categories: the first kind is non-self space that cannot be covered by detectors even in theory, because of restrictions in detector encoding and implementation; the second kind is non-self space that could be covered in theory but is not covered by the generated detectors. Figure 1 illustrates the two kinds of holes when the detector size is fixed.

In order to obtain good detection performance, it is necessary to reduce loopholes. For the first kind of holes, regardless of the detector representation (binary string or real value), there are two main solutions: one is the variable detector shape scheme [9, 10] and the other is the variable detector size scheme [11, 12]. As shown in Figure 2, both schemes can eliminate the first kind of holes in theory. Although the first solution is possible in theory, its implementation is very difficult. In comparison, the second solution is more feasible, and its concrete implementation performs better [11–16].

For the second kind of loopholes, there are two solutions at present. One is to generate all the detectors exhaustively [22], and the other is to randomly generate detectors that satisfy certain requirements [11, 12]. The main problems of the first solution are high time complexity and excessive detectors. It can only be applied when the number of possible detectors is limited, so it cannot be used in many practical applications (for instance, with real-valued coding the number of possible detectors is unlimited). The second solution is widely used in practice, but it is subject to randomness and cannot completely eliminate loopholes. Although the second kind of holes can be covered by detectors in theory, effectively covering the holes located between self and non-self space is very difficult. Figure 3 shows a schematic diagram. In the figure, the self space consists of a single self element, and it can be seen that, even in this simplest setting, a finite number of detectors cannot completely cover all the loopholes of the second kind.

(2) Too Many Detectors. The work in [11, 22] pointed out that, in the negative selection algorithm proposed by Forrest et al., the detector generation efficiency is very low. Candidate detectors become mature by negative selection. Let $N_S$ be the size of the self training set, $P_m$ the matching probability between any antigen and antibody, and $P_f$ the failure rate (the probability that a non-self antigen is not matched by any antibody). The number of candidate detectors is $N_{R_0} = -\ln(P_f) / \bigl(P_m (1 - P_m)^{N_S}\bigr)$, and the algorithm's time complexity is $O(N_{R_0} \cdot N_S)$. Therefore, as the training set grows, the number of candidate detectors increases exponentially, and the time cost rises in both the detector generation phase and the detection phase. In addition, excessive detectors cause redundant coverage between detectors. Some negative selection algorithms merge or cluster mature detectors to form big-radius detectors that replace the original ones [19].
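The growth described above can be checked numerically. The following is a minimal sketch that assumes the classic estimate for binary negative selection, $N_{R_0} = -\ln(P_f) / (P_m (1 - P_m)^{N_S})$; the function name `candidate_detectors` and the parameter values are illustrative assumptions, not taken from the paper.

```python
import math


def candidate_detectors(n_self: int, p_match: float, p_fail: float) -> float:
    """Estimated number of random candidates needed (assumed classic formula).

    n_self  : size of the self training set (N_S)
    p_match : probability that a random antibody matches a random antigen (P_m)
    p_fail  : accepted probability that a non-self antigen goes undetected (P_f)
    """
    return -math.log(p_fail) / (p_match * (1.0 - p_match) ** n_self)


if __name__ == "__main__":
    # The estimate grows exponentially with the size of the self set.
    for n_self in (50, 100, 200):
        print(n_self, f"{candidate_detectors(n_self, 0.05, 0.1):.3e}")
```

Doubling the self set from 100 to 200 elements multiplies the estimate by $(1 - P_m)^{-100}$, which is the exponential growth the text refers to.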

(3) Contradiction between Detection Rate and False Alarm Rate. For anomaly detection, a high detection rate and a low false alarm rate are two competing goals. Existing negative selection algorithms focus on improving the efficiency of detector generation in order to cover as much non-self space as possible. They do not consider the balance between the detection rate and the false alarm rate and therefore lack a good mechanism for adjusting the two rates. The classic V-Detector algorithm [11, 12] proposed two solutions, point-aware and boundary-aware, but both have problems: the detection rate of the point-aware solution is low, and the false alarm rate of the boundary-aware solution is too high.

(4) Not Considering Noise Data in Self Training Set. In many anomaly detection applications, the self training set contains noise data. Negative selection algorithms, however, assume reliable training data, so they generally cannot adapt to the presence of noise.

How to generate efficient detector set is the key to the negative selection algorithms. This paper proposes a real-valued negative selection algorithm based on clonal selection, named CB-RNSA. The main contributions lie in introducing the clonal selection algorithm and randomly generating candidate detectors within the stratified limited ranges based on clustering centers of self set; generating big-radius candidate detectors first and making them cover space far from selves, which reduces the number of detectors; then generating small-radius candidate detectors and making them gradually cover boundary space between selves and non-selves, which reduces the number of holes; distinguishing selves and dividing them into outlier selves, boundary selves, and internal selves, which can adapt to the interference of noise data from selves; for anomaly detection, using mature detector set and boundary self set to test at the same time, which can effectively improve the detection rate and reduce the false alarm rate.

In this paper, the rest of the sections are arranged as follows. The second section introduces the background of this paper, including the typical real-valued negative selection algorithm RNSA, the variable-sized real-valued negative selection algorithm (V-Detector), and immune theory-clonal selection algorithm. The third section introduces the implementation strategies of the algorithm, including the basic idea, outlier selves discovery mechanism, cluster discovery mechanism, clonal selection mechanism, etc. The fourth section analyzes the algorithm, including time complexity analysis and detector self-reaction rate analysis. The fifth section verifies the effectiveness of the algorithm through experiments which use 2D comprehensive data sets and UCI data sets and compare with classic negative selection algorithms. Conclusions are given in the sixth section.

2.1. Basic Concepts of NSA

The system state can be expressed by the feature vector $x = (x_1, x_2, \ldots, x_n)$, where n is the system dimension and each feature of the vector is normalized to the real-valued interval $[0, 1]$. The entire state space of the system can be expressed as $[0, 1]^n$. The system state space can be further divided into self space Self and non-self space Nonself. In anomaly detection, self space Self is composed of the states in which the system is normal, and non-self space Nonself consists of the states in which the system is abnormal.

In artificial immune systems, antigens represent the entire state space of the system, the self set represents the self space of the system, and the non-self set represents the non-self space. They are defined as follows.

Definition 1 (antigens). $Ag = \{ag \mid ag = \langle x, r_s \rangle,\ x = (x_1, \ldots, x_n),\ x_i \in [0, 1],\ 1 \le i \le n,\ r_s \in \mathbb{R}^{+}\}$ represents all the samples in the space. An antigen $ag$ in the set consists of two parts, $x$ and $r_s$. The parameter $x$ is the position of the sample in the real-valued space. The parameter $r_s$ is the radius of $ag$ and represents its variation threshold. Therefore, $ag$ is a hypersphere in the space.

Definition 2 (self set). Self represents all the normal samples in the antigen set; $Self \subset Ag$.

Definition 3 (non-self set). Nonself represents all the abnormal samples in the antigen set; $Nonself \subset Ag$. Self and non-self have different meanings in different areas. For network intrusion detection, non-self represents network anomalies, and self represents normal network activities. For virus detection, non-self represents virus signatures, and self represents legitimate code. Self and Nonself satisfy (1):

$$Self \cup Nonself = Ag, \qquad Self \cap Nonself = \emptyset \tag{1}$$

Definition 4 (training set). $Train \subset Self$ is a subset of Self and is the a priori knowledge for detection.

Definition 5 (detector set). $D = \{d \mid d = \langle y, r_d \rangle,\ y = (y_1, \ldots, y_n),\ y_i \in [0, 1],\ 1 \le i \le n,\ r_d \in \mathbb{R}^{+}\}$. A detector $d$ is an element of the detector set, and its structure is the same as that of an antigen: it consists of two parts, $y$ and $r_d$. The parameter $y$ represents the position of detector $d$, and the parameter $r_d$ is the radius of detector $d$.

Definition 6 (matching rule). $f(ag, d)$ represents the affinity between antigen $ag$ and detector $d$, that is, the matching degree between the two data structures. In real-valued space, the affinity can be measured by the distance between two feature vectors, which is usually expressed by the Minkowski distance. For two vectors $x$ and $y$ in n-dimensional space, the m-order Minkowski distance function is

$$f(x, y) = \Bigl(\sum_{i=1}^{n} |x_i - y_i|^{m}\Bigr)^{1/m} \tag{2}$$

The Minkowski distance is often called the m-order norm. For real-valued negative selection algorithms, different values of m give the detection range of a detector a different geometry. This paper adopts the 2-order norm, namely, the Euclidean distance, to express the matching rule. Then, (2) is rewritten as

$$f(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^{2}} \tag{3}$$
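The matching rule can be illustrated with a few lines of code. This is a minimal sketch; the function names `minkowski_distance` and `euclidean_distance` are illustrative and do not come from the paper.

```python
def minkowski_distance(x, y, m=2):
    """m-order Minkowski distance between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)


def euclidean_distance(x, y):
    """The 2-order norm, which this paper adopts as the matching rule."""
    return minkowski_distance(x, y, m=2)


if __name__ == "__main__":
    print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))       # Euclidean distance
    print(minkowski_distance((0.0, 0.0), (3.0, 4.0), m=1))  # Manhattan distance
```

With m = 1 the detection range of a detector is a cross-polytope (a diamond in two dimensions), while with m = 2 it is a hypersphere, which is the geometry assumed throughout the paper.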

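Definition 7 (detection system). DS consists of three parts, namely, $\langle f, r_s, r_d \rangle$. In the process of detector generation, if $f(ag, d) < r_s + r_d$ for some self $ag$, detector $d$ causes an immune self-reaction and cannot become a mature detector. In the process of detection, if $f(ag, d) < r_d$, detector $d$ recognizes $ag$ as a non-self. Let A be the mapping from the self set Self and a candidate detector $d$ to a classification in $\{0, 1\}$, where 0 indicates that $d$ is immature and 1 means that $d$ is mature. A is the function

$$A(Self, d) = \begin{cases} 0, & \exists\, ag \in Self:\ f(ag, d) < r_s + r_d \\ 1, & \text{otherwise} \end{cases} \tag{4}$$

Let B be the mapping from the detector set D and an antigen $ag$ to be identified to a classification in $\{0, 1\}$, where 0 indicates that $ag$ is self and 1 means that $ag$ is non-self. B is the function

$$B(D, ag) = \begin{cases} 1, & \exists\, d \in D:\ f(ag, d) < r_d \\ 0, & \text{otherwise} \end{cases} \tag{5}$$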

When the detection system is working, true positives (TP) are the non-selves correctly recognized by detectors, and true negatives (TN) are the selves correctly recognized as selves. Two kinds of errors may occur: a false positive (FP) happens when a self sample is identified as a non-self, and a false negative (FN) occurs when a non-self sample is identified as a self. Given a test set that consists of a self set $S_{test}$ and a non-self set $N_{test}$, the four pattern collections can be defined as follows, where B is the classification function of the detection system (1 for non-self, 0 for self).

When FP happens, the pattern collection can be defined as

$$FP = \{ag \mid ag \in S_{test},\ B(D, ag) = 1\} \tag{6}$$

When FN happens, the pattern collection can be defined as

$$FN = \{ag \mid ag \in N_{test},\ B(D, ag) = 0\} \tag{7}$$

When TP happens, the pattern collection can be defined as

$$TP = \{ag \mid ag \in N_{test},\ B(D, ag) = 1\} \tag{8}$$

When TN happens, the pattern collection can be defined as

$$TN = \{ag \mid ag \in S_{test},\ B(D, ag) = 0\} \tag{9}$$

Definition 8 (detection rate). DR is the ratio of the number of non-self samples correctly identified by detectors to the total number of non-selves and is expressed as

$$DR = \frac{|TP|}{|TP| + |FN|} \tag{10}$$

Definition 9 (false alarm rate). FAR is the ratio of the number of self samples wrongly identified as non-selves by detectors to the total number of selves and is expressed as

$$FAR = \frac{|FP|}{|FP| + |TN|} \tag{11}$$
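To make the two rates concrete, the following minimal sketch counts TP, FN, FP, and TN on a labeled test set. It assumes that a detector is a pair (center, radius) and that a sample is flagged as non-self when its Euclidean distance to some detector center is smaller than that detector's radius; the helper names `is_nonself` and `detection_rates` are illustrative.

```python
import math


def is_nonself(sample, detectors):
    """Return True if any detector (center, radius) covers the sample."""
    return any(math.dist(sample, center) < radius for center, radius in detectors)


def detection_rates(self_test, nonself_test, detectors):
    """Return (detection rate, false alarm rate) on non-empty test sets."""
    tp = sum(is_nonself(s, detectors) for s in nonself_test)  # non-selves caught
    fn = len(nonself_test) - tp                               # non-selves missed
    fp = sum(is_nonself(s, detectors) for s in self_test)     # selves flagged
    tn = len(self_test) - fp                                  # selves accepted
    return tp / (tp + fn), fp / (fp + tn)


if __name__ == "__main__":
    detectors = [((0.8, 0.8), 0.2)]
    self_test = [(0.2, 0.2), (0.7, 0.7)]
    nonself_test = [(0.8, 0.9), (0.1, 0.9)]
    print(detection_rates(self_test, nonself_test, detectors))
```

In this toy example one of the two non-selves is covered and one of the two selves is wrongly covered, so both rates equal 0.5.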

2.2. RNSA

The artificial immune system uses the negative selection algorithm to remove those immune cells that respond to selves, so as to realize self tolerance. RNSA adopts real-valued vectors to describe the configuration space and uses fixed-size detectors. The algorithm terminates when a preset number of detectors has been reached [8]. Algorithm 1 shows the process of the negative selection algorithm.

Procedure. The negative selection algorithm RNSA
Begin
  Generate a large number of candidate detectors at random;
  While a detector set of the given size has not been generated do
    Calculate the affinities between the candidate detector and every self element;
    If the candidate matches any element of the self set
      Then discard the candidate;
      Else put the candidate in the detector set;
  End;
  Use the collection of mature detectors to detect abnormal variations;
End.
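The procedure above can be sketched in runnable form. This is an illustrative reading, not the reference implementation of [8]: it assumes Euclidean distance, a unit hypercube state space, and that a candidate matches a self when the distance between their centers is smaller than the sum of the self radius and the fixed detector radius. The function name `rnsa` and the `max_tries` safeguard are assumptions added for the example.

```python
import math
import random


def rnsa(train, self_radius, detector_radius, n_detectors, max_tries=100_000, seed=0):
    """Generate fixed-radius detectors by negative selection.

    train is a list of self positions in [0, 1]^n. A candidate is discarded
    when it matches (overlaps) any self element; otherwise it is kept.
    """
    rng = random.Random(seed)
    dim = len(train[0])
    detectors = []
    for _ in range(max_tries):
        if len(detectors) >= n_detectors:
            break
        candidate = tuple(rng.random() for _ in range(dim))
        matches_self = any(
            math.dist(candidate, s) < self_radius + detector_radius for s in train
        )
        if not matches_self:
            detectors.append(candidate)
    return detectors


if __name__ == "__main__":
    selves = [(0.5, 0.5), (0.55, 0.5)]
    found = rnsa(selves, self_radius=0.05, detector_radius=0.1, n_detectors=20)
    print(len(found), "detectors generated")
```

Because every detector has the same radius, the only way to raise coverage is to add more detectors, which is the source of the redundancy discussed in the introduction.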

Figure 4 shows the relationships between the self set, the non-self set, the detector set, and related sets. The antigen space and the detector space are drawn separately for clarity, although in many cases they coincide. Self is the self set, Nonself is the non-self set, CD is the candidate detector set, D is the detector set selected from CD, NonselfCD is the set of non-selves identified by candidate detectors, and NonselfD is the set of non-selves identified by detectors. Therefore, the collection of holes is Nonself − NonselfCD, and the collection of non-selves that cannot be detected by detectors is Nonself − NonselfD.

2.3. V-Detector

V-Detector uses vectors in real-valued space to express detectors and antigens, and the radius of detectors is variable [11, 12]. First, the algorithm randomly generates the center of the candidate detector $d = \langle y, r_d \rangle$ and then calculates the distance $f(ag, d)$ between $d$ and every self $ag = \langle x, r_s \rangle$ in the training set. If $f(ag, d) > r_s$ for every self, the detector is accepted. The radius of the detector is calculated by the following formula:

$$r_d = \min_{ag \in Train} f(ag, d) - r_s \tag{12}$$

Point-aware and boundary-aware are two techniques for determining the detector radius. The boundary-aware technique yields a high detection rate but also a high false alarm rate, because it lets the detection range of detectors overlap the coverage of some selves. Non-self space is then almost completely covered, which improves the detection rate, but this contradicts the assumption of negative selection algorithms that "vectors close to selves are selves, and vectors far from selves are non-selves" [1, 11], and so leads to a high false alarm rate. Therefore, this paper uses the point-aware technique.
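The point-aware radius rule can be sketched as follows. The sketch assumes that the radius of a candidate is the distance from its center to the nearest self minus the self radius, and that candidates with a non-positive radius are rejected; the function name `point_aware_detector` is illustrative.

```python
import math


def point_aware_detector(center, train, self_radius):
    """Return (center, radius) for a candidate, or None if it lies inside a self.

    The radius stops at the surface of the nearest self hypersphere, so the
    detector never overlaps the region claimed by any training self.
    """
    nearest = min(math.dist(center, s) for s in train)
    radius = nearest - self_radius
    if radius <= 0:
        return None  # the candidate is recognized by a self and is discarded
    return center, radius


if __name__ == "__main__":
    selves = [(0.5, 0.5), (0.1, 0.5)]
    print(point_aware_detector((0.9, 0.5), selves, self_radius=0.1))
    print(point_aware_detector((0.55, 0.5), selves, self_radius=0.1))
```

A candidate far from all selves receives a large radius and a candidate near the self boundary receives a small one, which is what allows large detectors to cover open space and small detectors to fill holes.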

Figure 5 shows the contrast between RNSA and V-Detector when the expected coverage is 50%. The self data are the first 25 elements of the class "Iris-versicolor" from the IRIS data set [23]. For convenient display in two-dimensional space, we take each element's sepalL and sepalW as the self antigen's properties. Blue filled circles are self elements, cyan filled circles are mature detectors, and unfilled areas are holes. In RNSA, the detector size is constant and is difficult to determine accurately, which causes many loopholes in the non-self space and a low detection rate. In V-Detector, the detector size is variable, and the algorithm makes big detectors cover most of the non-self space and small detectors cover holes, which reduces both the number of detectors and the number of holes. However, both algorithms still suffer from the problems described in the introduction, such as loopholes that lower the detection rate, too many candidate detectors, redundant coverage between mature detectors, and a high time cost of detection.

2.4. Clonal Selection Algorithm

The clonal selection principle describes the basic features of the immune response to antigen stimulation in the immune system [24–26]. When external bacteria or viruses invade the body, B cells begin to clone in large numbers and destroy the invaders. The cells that can recognize antigens proliferate by asexual reproduction according to their degree of recognition: the higher the affinity between a cell and the antigens, the more offspring the cell produces. In the process of cell division, individual cells also undergo mutation, which can result in higher affinity with antigens; the higher the affinity between a parent cell and the antigens, the less the parent cell mutates. Algorithm 2 shows the process of the clonal selection algorithm.

Procedure. Clonal selection algorithm
Begin
  Randomly generate a population of immune cells;
  While the convergence condition is not met do
    While not all antigens have been searched do
      Choose those cells which have high affinity with the antigen;
      Generate copies of immune cells; the higher the affinity is, the more copies there are;
      Mutate according to the affinity; the higher the affinity is, the smaller the variation is;
    End;
  End;
End.
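The loop above can be illustrated with a small runnable sketch for a single antigen. It is a generic clonal selection search, not the variant used later in CB-RNSA: affinity is taken as the negative Euclidean distance to the antigen, better-ranked cells receive more clones and a smaller Gaussian mutation, and a fixed number of generations stands in for the convergence condition. All names and parameter values are illustrative assumptions.

```python
import math
import random


def clonal_selection(antigen, pop_size=20, n_select=5, max_clones=10,
                     generations=50, seed=0):
    """Search [0, 1]^n for a cell with high affinity to a single antigen."""
    rng = random.Random(seed)
    dim = len(antigen)

    def affinity(cell):
        return -math.dist(cell, antigen)  # closer cells have higher affinity

    population = [tuple(rng.random() for _ in range(dim)) for _ in range(pop_size)]
    for _ in range(generations):
        # Immune selection: keep the cells with the highest affinity.
        best = sorted(population, key=affinity, reverse=True)[:n_select]
        clones = []
        for rank, cell in enumerate(best):
            n_clones = max(1, max_clones // (rank + 1))  # higher affinity, more copies
            sigma = 0.1 * (rank + 1) / n_select          # higher affinity, less mutation
            for _ in range(n_clones):
                clones.append(tuple(
                    min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in cell
                ))
        # Parents are kept, so the best affinity never decreases.
        population = sorted(best + clones, key=affinity, reverse=True)[:pop_size]
    return population[0]


if __name__ == "__main__":
    print(clonal_selection((0.3, 0.7)))
```

Keeping the selected parents alongside their clones is a design choice of this sketch; it guarantees that the best cell found so far is never lost to mutation.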

3. The Algorithm Theory

3.1. The Process of the Algorithm

The main idea of the algorithm is as follows. First, the algorithm analyzes the spatial distribution of the self set and obtains the set of outlier selves and several clusters. Then, the algorithm treats the cluster centers as antigens, randomly generates an initial immune cell population within a limited range, and executes the clonal selection algorithm; that is, it carries out the immune selection, clonal amplification, and hypermutation operations on the immune cell population. The affinity between antigens and immune cells is inversely proportional to their distance, and the convergence condition of the clonal selection algorithm is that the expected coverage of non-self space is reached. At this point, the limited range for generating immune cells is far from self space, where the coverage rate is low. At the end of the clonal selection algorithm, the immune cell population is viewed as the candidate detectors of the first level. The radius of a candidate detector is determined dynamically by computing the distance between its center and the closest self, and the candidate detector then joins the mature detector set through tolerance. In addition, the algorithm adds the selves that are closest to detectors to the boundary self set. The candidate detectors of this level have the biggest radius and cover non-self space far from selves. Afterwards, the algorithm changes the limited range so that the range of the next level lies closer to the selves. The algorithm again treats the cluster centers as antigens, randomly generates an initial immune cell population within the new limited range, and executes the clonal selection algorithm. When the clonal selection algorithm ends, the immune cell population forms the candidate detectors of the second level. The candidate detectors of this level still have a relatively large radius, are close to the detectors of the first level, and cover the non-self space somewhat nearer to selves. This process is repeated until candidate detectors cover the non-self space close to selves, namely, the boundary areas between self space and non-self space. When the algorithm terminates, the mature detector set and the boundary self set are obtained. Algorithm 3 shows the process of CB-RNSA.

Procedure. CB-RNSA
Input: self training set Train, expected coverage rate p0
Output: detector set D, boundary self set Selfo, outlier self set Selfd
n0: the number of sampling times in non-self space
i: the number of non-self samples
m: the number of non-self samples which are covered by detectors
CD: candidate detector set
Clusters: cluster set
l: the level of candidate detectors
Begin
  Initialize the self training set Train, the counters i = 0 and m = 0, the candidate detector set CD, the level l, and the sample size n0;
  Initialize the outlier self set Selfd according to the outlier selves discovery algorithm;
  Initialize the cluster set Clusters according to the clusters discovery algorithm;
  While l does not reach the maximum number of levels for candidate detectors do
    Take the centers of Clusters as antigens, and randomly generate the initial immune cell population in the limited range;
    While true do
      Select immune cells;
      Generate copies of immune cells;
      Mutate according to affinities;
      Compute the distances between the mutated individual dnew and every self in the training set Train;
      If dnew is recognized by some self Then discard dnew;
      Else
        Find the closest self to dnew, and add it to the boundary self set Selfo;
        i ++;
        Compute the distances between dnew and every detector in the detector set D;
        If dnew is not identified by any detector Then put it into the candidate detector set CD;
        Else m ++;
      End if;
      If the number of non-self samples i reaches the sample size n0 Then
        Compute the current coverage rate p;
        If p reaches the expected coverage rate p0, break;
        Else merge the candidate detector set CD into D, and reset i, m, CD;
      End if;
    End;
    l ++;
    Change the limited range of candidate detectors;
  End;
End.
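The stopping test in the procedure relies on an estimate of how much of the non-self space is already covered. The sketch below isolates that step under simplifying assumptions: sample points are drawn uniformly from the unit hypercube instead of coming from mutated immune cells, a point recognized by a self is not counted, and the coverage rate is taken as p = m / i. The function name `estimate_coverage` and the tuple layout of detectors are illustrative.

```python
import math
import random


def estimate_coverage(train, self_radius, detectors, n_samples, seed=0):
    """Estimate the fraction of non-self space covered by detectors.

    i counts sampled points that fall in non-self space, and m counts those
    among them that lie inside at least one detector (center, radius).
    """
    rng = random.Random(seed)
    dim = len(train[0])
    i = m = 0
    while i < n_samples:
        point = tuple(rng.random() for _ in range(dim))
        if any(math.dist(point, s) < self_radius for s in train):
            continue  # recognized by a self, so not a non-self sample
        i += 1
        if any(math.dist(point, center) < radius for center, radius in detectors):
            m += 1
    return m / i


if __name__ == "__main__":
    selves = [(0.5, 0.5)]
    detectors = [((0.2, 0.2), 0.25), ((0.8, 0.8), 0.25)]
    print(estimate_coverage(selves, 0.05, detectors, n_samples=2000))
```

Generation would stop once this estimate reaches the expected coverage rate p0; otherwise the new candidates are merged into the detector set and sampling starts again.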

The main contributions are as follows: (1) introducing the clonal selection algorithm and randomly generating candidate detectors within stratified limited ranges based on the cluster centers of the self set, where big-radius candidate detectors are generated first to cover space far from selves, which reduces the number of detectors, and small-radius candidate detectors are then generated to gradually cover the boundary space between selves and non-selves, which reduces the number of holes; (2) dividing selves into outlier selves, boundary selves, and internal selves, which lets the algorithm adapt to noise data in the self set; (3) for anomaly detection, testing with the mature detector set and the boundary self set at the same time, which can effectively improve the detection rate and reduce the false alarm rate. Theoretical analysis and experimental results show that the algorithm has better time efficiency and detector generation quality than classic negative selection algorithms.

Figures 6 and 7 show the contrast between CB-RNSA, RNSA, and V-Detector. The self data are the first 25 elements of the class "Iris-versicolor" from the IRIS data set [23]. Blue filled circles are self elements, cyan filled circles are mature detectors, and unfilled areas are holes. In RNSA and V-Detector, as the coverage rate rises, redundant coverage between mature detectors in non-self space increases, which causes an excessive number of detectors and unnecessary self tolerance. In CB-RNSA, because the clonal selection algorithm is introduced to limit the range of randomly generated candidate detectors, detectors are preferentially generated in regions of low coverage, which reduces the number of detectors and their redundancy.

3.2. Classification of Selves

Most negative selection algorithms do not distinguish between selves. In a continuous self space, however, different selves carry different information. We divide selves into three groups, outlier selves, boundary selves, and internal selves, as shown in Figure 8. Magenta filled circles are outlier selves, cyan filled circles are boundary selves, and blue filled circles are internal selves.

Suppose the collection $Nei(ag, r)$ of elements whose distance to self $ag$ is less than $r$ is expressed as (13); these elements are called the neighbors of $ag$.

$$Nei(ag, r) = \{ag' \mid ag' \in Train,\ ag' \ne ag,\ f(ag, ag') < r\} \tag{13}$$

Definition 10 (outlier self set). $Self_d \subset Self$. The outliers may be caused by noise data. A self is classified as an outlier when the number of its neighbors $|Nei(ag, r)|$ is less than a given threshold, which is the outlier parameter.

Definition 11 (boundary self set). $Self_o \subset Self$, and its elements are distributed along the edges between self space and non-self space. The number of boundary selves is far smaller than the number of internal selves, and many false positives and missed detections appear at the borders. Because self space and non-self space are complementary, we use the detector set to define the boundary self set; that is to say, the self that is closest to a detector is a boundary self: $Self_o = \{ag \mid ag \in Train,\ \exists\, d \in D:\ f(ag, d) = \min_{ag' \in Train} f(ag', d)\}$.

Definition 12 (internal self set). The internal selves are the selves that are neither outlier selves nor boundary selves, and they are surrounded by boundary selves.
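The detector-based definition of boundary selves can be sketched directly: for each detector, the closest training self is marked as a boundary self. This is a minimal illustration that assumes Euclidean distance and detectors stored as (center, radius) pairs; the function name `boundary_selves` is illustrative.

```python
import math


def boundary_selves(train, detectors):
    """Return the set of selves that are the nearest self to some detector."""
    return {
        min(train, key=lambda s: math.dist(s, center))
        for center, _radius in detectors
    }


if __name__ == "__main__":
    selves = [(0.2, 0.2), (0.3, 0.3), (0.4, 0.4)]
    detectors = [((0.9, 0.9), 0.3), ((0.0, 0.0), 0.1)]
    print(boundary_selves(selves, detectors))
```

Selves that are never the nearest neighbor of any detector are left out of the set; once outliers are also removed, what remains corresponds to the internal selves.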

3.3. Outlier Selves Discovery Mechanism

Outlier detection is one of the important research fields of knowledge discovery, and many effective outlier detection algorithms exist at present [27–29]. This paper adopts the distance-based algorithm proposed by Knorr [27], and Algorithm 4 shows the process of the algorithm.

Procedure. Outlier selves discovery algorithm
Input: the self training set Train
Output: outlier self set Selfd
Begin
  While some self ag in Train has not been examined do
    m = 0;
    For every other self ag' in the self training set Train do
      Compute the distance f(ag, ag') between ag and ag';
      If f(ag, ag') < r then m ++;
    End;
    If m is less than the outlier threshold then add ag into the outlier self set Selfd;
  End;
End.
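The distance-based outlier test can be written as a short function. This is a minimal sketch that assumes Euclidean distance; the function name `find_outliers` and the parameter name `min_neighbors` (standing for the outlier threshold) are illustrative.

```python
import math


def find_outliers(train, r, min_neighbors):
    """Return the selves that have fewer than min_neighbors neighbors within r."""
    outliers = []
    for idx, current in enumerate(train):
        # m counts the other selves that lie closer than r to the current self.
        m = sum(
            1
            for other_idx, other in enumerate(train)
            if other_idx != idx and math.dist(current, other) < r
        )
        if m < min_neighbors:
            outliers.append(current)
    return outliers


if __name__ == "__main__":
    selves = [(0.10, 0.10), (0.12, 0.10), (0.10, 0.12), (0.90, 0.90)]
    print(find_outliers(selves, r=0.1, min_neighbors=1))
```

The double loop costs O(N²) distance computations for N selves, which is acceptable as a one-off preprocessing step on the training set.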
3.4. Clusters Discovery Mechanism

In anomaly detection, it is generally assumed that most of the data are normal and that abnormal data are the minority. In this paper, the algorithm first analyzes the spatial distribution of selves and performs a clustering preprocessing step on the self set. Similar selves are assigned to the same cluster, so a number of clusters are generated. The cluster centers are used as the reference points for generating candidate detectors. The purpose of clustering the self set is to determine the random generation scope of candidate detectors: candidates are generated within a limited range in order to avoid detector redundancy in regions of high coverage.

Definition 13 (cluster). Clusters is the set of clusters. Each cluster consists of three parts: the center vector of the cluster in n-dimensional space (written cluster.x below), the radius of the cluster, and the collection of selves within the cluster.

The algorithm first randomly selects a self as the initial element of a cluster and then judges whether any element in the cluster and another self are neighbors, that is, whether their distance is less than r; if so, that self is put into the cluster. After the other selves have been judged, if there is a self that does not belong to any cluster, the algorithm randomly selects such a self as the initial element of a new cluster. These operations are repeated until every self belongs to a cluster. The cluster center is computed by formula (15). Algorithm 5 shows the process of the clusters discovery algorithm.

Procedure. Clusters discovery algorithm
Input: the self training set Train
Output: cluster set Clusters
Begin
  While some self ag does not belong to any cluster do
    Generate a new cluster for ag;
    For every other self ag' among the selves which are not classified do
      If ag' is a neighbor of any element in the cluster, then put ag' into the cluster;
    End;
  End;
End.
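The clustering step can be sketched as follows. The sketch makes three assumptions that the text does not fix: the seed of each cluster is the first unassigned self rather than a random one (for reproducibility), a cluster keeps growing until no unassigned self is a neighbor of any member, and the cluster center is taken as the mean of the member positions. The function name `find_clusters` is illustrative.

```python
import math


def find_clusters(train, r):
    """Group selves so that each member is within r of another member.

    Returns a list of (center, members) pairs, where center is the mean
    position of the members (an assumption made for this sketch).
    """
    unassigned = list(train)
    clusters = []
    while unassigned:
        members = [unassigned.pop(0)]  # seed a new cluster
        frontier = list(members)
        while frontier:
            current = frontier.pop()
            near = [s for s in unassigned if math.dist(current, s) < r]
            for s in near:
                unassigned.remove(s)
            members.extend(near)
            frontier.extend(near)
        center = tuple(sum(values) / len(members) for values in zip(*members))
        clusters.append((center, members))
    return clusters


if __name__ == "__main__":
    selves = [(0.10, 0.10), (0.15, 0.10), (0.20, 0.10), (0.90, 0.90)]
    for center, members in find_clusters(selves, r=0.06):
        print(center, len(members))
```

In the example, the first three selves form a chain in which consecutive points are neighbors, so they end up in one cluster even though the two end points are farther apart than r.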
3.5. Clonal Selection Mechanism

The clonal selection algorithm is often used to solve optimization problems. This paper adopts it to search for optimal detectors in non-self space. The center of each cluster is viewed as an antigen. The initial immune cell population is randomly generated within a limited scope, and the clonal selection algorithm is performed on this population. The limits constrain the position vector of an immune cell in each dimension and are defined as a hyperspherical shell around the cluster center cluster.x. The distance between an immune cell d and the cluster center must lie in [rlow, rhigh]; that is, rlow ≤ f(d, cluster.x) ≤ rhigh, where rlow is the minimum distance between the immune cell and the cluster center (its smallest value is 0) and rhigh is the maximum distance between the two. Candidates are generated layer by layer. Detectors are produced in order of radius, from large to small, so that detectors with larger coverage are generated first; this avoids repeated coverage with existing mature detectors and lets fewer detectors cover as much non-self space as possible. Let the level be l; the limited scope of level l satisfies (16), subject to the constraint (17).
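Generating an immune cell inside the limited scope amounts to sampling a point in a hyperspherical shell around a cluster center. The sketch below is one possible way to do this and is not taken from the paper: it draws a uniformly random direction, draws the distance uniformly from [rlow, rhigh] (which is not uniform by volume), and rejects points that leave the unit hypercube. The function name `sample_in_shell` and the `max_tries` safeguard are assumptions.

```python
import math
import random


def sample_in_shell(center, r_low, r_high, rng, max_tries=10_000):
    """Sample a point in [0, 1]^n whose distance to center lies in [r_low, r_high]."""
    for _ in range(max_tries):
        # A normalized Gaussian vector is uniformly distributed over directions.
        direction = [rng.gauss(0.0, 1.0) for _ in center]
        norm = math.sqrt(sum(v * v for v in direction))
        if norm == 0.0:
            continue
        distance = rng.uniform(r_low, r_high)
        point = tuple(c + distance * v / norm for c, v in zip(center, direction))
        if all(0.0 <= v <= 1.0 for v in point):
            return point
    raise RuntimeError("no point of the shell lies inside the unit hypercube")


if __name__ == "__main__":
    rng = random.Random(1)
    cell = sample_in_shell((0.5, 0.5), r_low=0.2, r_high=0.4, rng=rng)
    print(cell, math.dist(cell, (0.5, 0.5)))
```

Shrinking the pair (rlow, rhigh) from one level to the next moves the sampled cells from the far non-self space toward the self boundary, which mirrors the layer-by-layer generation described above.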

Figure 9 shows the limited scopes of the first, second, and third levels of candidate detectors based on one of the clusters, together with the generated candidate detectors. Figure 10 shows the limited scopes of the top three levels based on all the clusters, together with the generated candidate detectors. Blue filled circles and green filled circles are self elements, cyan filled circles are mature detectors, and the regions between two concentric dotted circles are the random generation scopes.

In the clonal selection algorithm, the main operations are immune selection, clonal amplification, and hypermutation. The immune selection operation normally chooses a certain number of high-affinity immune cells so that the search concentrates on more valuable space. In this algorithm, because immune cells should cover as much of the non-self space as possible, it is not necessary to select by affinity, and the whole population of immune cells goes into the next operation. Since every cell of the population before this operation is retained in the population after it, each cell is selected with probability 1, as expressed in (18).

The clonal amplification operation simulates the cloning mechanism of the immune response: the higher the affinity a cell has with antigens, the more offspring it can produce. In this algorithm, the number of copies nc depends on the level of the candidate detector. nc is computed by (19), where ncmax is the maximum number of copies and ncmin is the minimum.

Formula (19) reflects the clone expansion scale for immune cells in different limited scopes. When the level is low, candidate detectors lie in low-coverage areas and their radii are larger. The aim there is to cover more non-self space with fewer detectors, so the cloning scale is smaller. When the level is high, candidate detectors lie in high-coverage areas and their radii are smaller. The aim there is to cover holes on the boundary between self space and non-self space, so the cloning scale is larger.
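
The trend described above (few clones at low levels, many clones at high levels) can be illustrated with a small Python sketch. The exact form of (19) is not reproduced here; the linear interpolation between ncmin and ncmax, the function name `clone_count`, and the parameter `max_level` are assumptions made purely for illustration.

```python
def clone_count(level, max_level, nc_min, nc_max):
    """Number of clones for a candidate detector at the given level.

    Hypothetical stand-in for (19): grows linearly from nc_min at level 1
    to nc_max at max_level, so deeper levels are cloned more heavily.
    """
    if max_level <= 1:
        return nc_max
    frac = (level - 1) / (max_level - 1)
    return round(nc_min + frac * (nc_max - nc_min))
```

Any monotonically increasing schedule bounded by ncmin and ncmax would reproduce the same qualitative behavior.
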

The hypermutation operation produces immune cells with higher affinity and enhances the diversity of the population. A mutation operator can produce both small and large disturbances, which gives the mutation both local and global search ability and strengthens the search performance of the algorithm. In this paper, Gaussian mutation is adopted, as given by (20).

The terms in (20) are the mutated immune cell, a Gaussian random variable with mean 0 and deviation 1, and a control parameter that adjusts the mutation amplitude. In this paper, the control parameter is dynamically adaptive and depends only on the candidate detector level; the mutation mechanism is given by (21). During the generation of candidate detectors, when the level is low, individuals search with larger probability, which favors global search. When the level is high, individuals search with smaller probability, which is more conducive to local search.

The cut-off point in (21) marks the transition of the algorithm from global search with large probability to local search with small probability.
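
A level-dependent Gaussian mutation of this kind can be sketched as follows. The sketch assumes an additive perturbation and a simple two-valued step schedule for the control parameter; neither (20) nor (21) is reproduced exactly, and the names `mutation_scale`, `gauss_mutate`, `alpha_global`, and `alpha_local` as well as their default values are hypothetical.

```python
import random


def mutation_scale(level, cutoff, alpha_global=0.5, alpha_local=0.05):
    """Assumed schedule: large steps up to the cut-off level, small after."""
    return alpha_global if level <= cutoff else alpha_local


def gauss_mutate(cell, level, cutoff, rng=random):
    """Return a mutated copy of cell (a list of coordinates).

    Assumption: each coordinate is perturbed additively by alpha * N(0, 1),
    where alpha depends only on the candidate detector level.
    """
    alpha = mutation_scale(level, cutoff)
    return [x + alpha * rng.gauss(0.0, 1.0) for x in cell]
```

With the same random draws, a cell mutated at a level below the cut-off moves ten times farther than one mutated above it under these default values, which mirrors the shift from global to local search.
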

After mutation, immune cells must still satisfy (16); new cells that do not are discarded. Generating and then discarding cells in this way produces a large number of useless cells and wastes a large amount of computation. To avoid this, we take the cluster center as the origin and use n-spherical coordinates to rewrite (16) as (22).

In (22), the radial coordinate is a random variable in [rlow, rhigh], the first n − 2 angular coordinates are random variables in [0, π], and the last angular coordinate is a random variable in [0, 2π]. They are expressed as (23), (24), and (25).
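
Generating a cell directly inside the limited scope by means of n-spherical coordinates can be sketched as follows. This is a minimal sketch under the usual convention for n-spherical coordinates (radius in [rlow, rhigh], the first n − 2 angles in [0, π], the last angle in [0, 2π]); the function name `sample_in_shell` is hypothetical, and the exact expressions (23) to (25) are not reproduced.

```python
import math
import random


def sample_in_shell(center, r_low, r_high, rng=random):
    """Draw a point whose distance to center lies in [r_low, r_high].

    Uses n-spherical coordinates with the cluster center as the origin,
    so no generated point has to be discarded for violating the range.
    """
    n = len(center)
    if n < 2:
        raise ValueError("n-spherical coordinates need at least 2 dimensions")
    r = rng.uniform(r_low, r_high)
    angles = [rng.uniform(0.0, math.pi) for _ in range(n - 2)]
    angles.append(rng.uniform(0.0, 2.0 * math.pi))
    point = []
    sin_prod = 1.0  # running product sin(phi_1) * ... * sin(phi_{i-1})
    for i, phi in enumerate(angles):
        point.append(center[i] + r * sin_prod * math.cos(phi))
        sin_prod *= math.sin(phi)
    # The last coordinate uses the full product of sines.
    point.append(center[-1] + r * sin_prod)
    return point
```

Note that drawing the angles uniformly does not give a uniform distribution over the shell for n greater than 2, and a generated point may still fall outside the unit hypercube; both aspects would need separate handling.
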

3.6. Anomaly Detection

Anomaly detection determines whether data are abnormal. In this paper, the anomaly detection process combines the detector set and the boundary self set for testing. We therefore modify the definition of B, which is expressed as (26). B is a mapping from the detector set D, the boundary self set, and the antigen to be identified to a class in {0, 1}, where 0 indicates that the antigen is self and 1 that it is non-self.
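
One possible reading of such a combined test is sketched below. The decision rule is an assumption, since (26) is not reproduced here: an antigen is treated as self if it lies within the self radius of any boundary self, and otherwise as non-self only if some mature detector covers it. The names `classify` and `self_radius` are hypothetical.

```python
import math


def classify(antigen, detectors, boundary_selves, self_radius):
    """Return 0 for self and 1 for non-self.

    detectors is a list of (center, radius) pairs.
    Assumed rule: boundary selves take precedence over detectors.
    """
    near_boundary_self = any(
        math.dist(antigen, s) <= self_radius for s in boundary_selves
    )
    if near_boundary_self:
        return 0
    covered = any(math.dist(antigen, c) <= r for c, r in detectors)
    return 1 if covered else 0
```

Under this rule, a point that a detector covers but that also lies next to a boundary self is still reported as self, which is one way in which testing against both sets could lower the false alarm rate.
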

3.7. Coverage Rate of Non-Self Space

The greater the coverage of the non-self space by detectors, the greater the probability of detecting non-self. Because detectors overlap, calculating the coverage rate of the non-self space directly is very difficult, so the Monte Carlo method is used to compute an approximate value. Given a randomly generated non-self set, the non-self space coverage rate of the detector set is estimated by (27).

A target coverage rate is set in advance. When the estimated coverage rate is greater than or equal to the target, the detectors are considered to meet the coverage requirement. Because the estimate is a random variable, the actual coverage rate may nevertheless be less than the target. To reduce the probability of this situation, the algorithm applies a hypothesis test.

Let α be the significance level of the hypothesis test and zα the α quantile of the standard normal distribution; the remaining quantities are the sample size and x, the number of samples covered by detectors. The maximum xmax is given by (28).

Obviously , then , that is, . To meet this requirement and to make the estimated coverage rate approximately normally distributed, the sample size is chosen according to (29).

Because , the above equation can be rewritten as .

Let p be the actual coverage rate of the non-self space by the detectors. The null hypothesis is that p has not yet reached the target coverage rate, and the alternative hypothesis is that it has. During the algorithm, if x is less than or equal to xmax, the null hypothesis is accepted and the detector set continues to be updated; if x is greater than xmax, the alternative hypothesis is accepted and the algorithm exits.
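
The stopping test can be illustrated with the following sketch. It relies on the usual normal approximation to the binomial distribution and on the common rule of thumb that both n·p and n·(1 − p) should be at least 5; these choices, the use of the upper α quantile for zα, and the names `sample_size`, `x_max`, and `coverage_reached` are assumptions and are not claimed to match (28) and (29) exactly.

```python
import math
from statistics import NormalDist


def sample_size(p_target):
    """Smallest n with n*p >= 5 and n*(1-p) >= 5 (rule-of-thumb assumption)."""
    return math.ceil(max(5.0 / p_target, 5.0 / (1.0 - p_target)))


def x_max(n, p_target, alpha):
    """Largest covered count still compatible with 'target not reached'."""
    z = NormalDist().inv_cdf(1.0 - alpha)  # upper alpha quantile (assumed)
    return n * p_target + z * math.sqrt(n * p_target * (1.0 - p_target))


def coverage_reached(points, detectors, p_target, alpha):
    """Monte Carlo test: count sampled points covered by any detector.

    detectors is a list of (center, radius) pairs; returns True when the
    covered count x exceeds x_max, i.e. the algorithm may exit.
    """
    x = sum(
        any(math.dist(pt, c) <= r for c, r in detectors) for pt in points
    )
    return x > x_max(len(points), p_target, alpha)
```

In practice the sample is usually taken much larger than the minimum returned by `sample_size`, so that the estimate is less noisy.
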

4. Analysis of the Algorithm

4.1. Time Complexity Analysis

Theorem 14. The time complexity of CB-RNSA for detectors generation is O(), where n is the spatial dimension, is the size of self set, is the size of detector set, P is the self-reaction rate of detectors, is the scale of initial immune cell population, and ncmax is the maximum number of clonal copies.

Proof. CB-RNSA carries out preprocessing operations on the self set in the first place and then executes clonal selection algorithm in every qualified level to generate candidate detectors.
The number of computations of the outlier selves discovery algorithm does not exceed , and its time complexity is O(). The number of computations of the clusters discovery algorithm does not exceed , and its time complexity is O(). We mainly consider the time complexity of the detector generation process and do not count the preprocessing operations.
In the layer, the time complexity of randomly generating initial immune cell group is O(ng). Immune selection operation chooses the entire immune cells, and the time complexity is not considered. For each immune cell, the calculation number of the clonal amplification operation is not more than ncmax, the time complexity is O(ncmax) which is a constant value, and the time complexity of hypermutation operation is O(ncmaxn). The time complexity of calculating whether immune cells fall into self space is O(). Suppose the probability of immune cells falling into self space is , the number of immune cells in level is , and then the number of immune cells which are not excluded in this step is . The time complexity of computing whether immune cells are covered by mature detectors is O(). Then viewing this immune cell as a candidate detector, the time complexity of calculating the radius is O(). Results of previous operations can be used for this operation, so, we do not consider the time complexity.
Therefore, the overall time complexity of generating detectors in the level is O(). The maximum number of is ; the overall time complexity of generating detectors for CB-RNSA is O(). Suppose is the average self-reaction rate of detectors, the number of candidate detectors is , and then the time complexity of CB-RNSA is O(). Proved.

NSA, RNSA, and V-Detector are influential negative selection algorithms that are widely used in intrusion detection, abnormal diagnosis, pattern recognition, and other areas. Table 1 compares the time complexity of these three negative selection algorithms with that of CB-RNSA, where is the probability of detectors identifying any antigen and is the detection failure rate. As the table shows, the time complexity of the traditional negative selection algorithms grows exponentially with the self set size . When the number of self elements increases, the time cost rises rapidly and can become prohibitive. The time complexity of CB-RNSA depends on the spatial dimension n, the size of the self set , the detector set size , and the self-reaction rate . It has no exponential relationship with , and its detector set is far smaller than those of the other three algorithms, which reduces the time complexity and improves the efficiency of detector generation.

4.2. Self-Reaction Rate Analysis of Detectors

Under the established matching rules, the matching probability of any given detector with any antigen is a constant [1]. For the r-continuous matching rule of NSA, it satisfies , where l is the length of the string and r is the consecutive matching digits.
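
For illustration, a commonly cited approximation for the r-contiguous matching rule is P ≈ m^(−r)·[(l − r)(m − 1)/m + 1], valid when m^(−r) is much smaller than 1, where m is the alphabet size (m = 2 for binary strings). Whether this is exactly the expression intended in the text is an assumption; the symbol m and the function name below are not taken from the text.

```python
def r_contiguous_match_probability(l, r, m=2):
    """Approximate probability that two random strings of length l over an
    alphabet of size m match in at least r contiguous positions.

    Commonly cited approximation; assumes m ** (-r) is much smaller than 1.
    """
    return m ** (-r) * ((l - r) * (m - 1) / m + 1)


# Binary strings of length 32 with r = 8: 2**-8 * (24 / 2 + 1) = 13 / 256
p = r_contiguous_match_probability(32, 8)
```

The approximation shows that the matching probability falls off exponentially in r and grows only linearly in l.
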

In real-valued algorithms, the calculation is different. The work in [3, 4] pointed out that, in RNSA and V-Detector, P can be obtained as the ratio of the self capacity to the total antigen capacity; P is also known as the reaction rate of detectors, that is, the probability that a detector covers self space. In addition, the number of all candidate detectors can be used to measure the detector generation cost. Suppose that, to achieve the desired non-self space coverage rate, the number of detectors is ; the number of candidate detectors can then be calculated by , where the number of selves is num. The larger the self-reaction rate of detectors, the more candidate detectors are needed to generate the mature detectors, and the higher the detector generation cost. To simplify the discussion, it is assumed that selves do not overlap.

For RNSA and V-Detector, detectors are randomly generated in the whole space. Therefore, the self-reaction rate of detectors is the ratio of the total volume of the self hyperspheres to the volume of the unit hypercube. In terms of the volume of a single self, the self-reaction rate is expressed as (30).
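
As a worked illustration of this ratio, the sketch below uses the standard volume of an n-dimensional ball, π^(n/2)·r^n / Γ(n/2 + 1), and assumes non-overlapping selves of equal radius inside the unit hypercube. The function names and the cap at 1 are illustrative choices, not taken from the text.

```python
import math


def ball_volume(n, r):
    """Volume of an n-dimensional ball of radius r."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * r ** n


def random_self_reaction_rate(n, num_selves, self_radius):
    """Fraction of the unit hypercube occupied by non-overlapping selves,
    i.e. the chance that a uniformly random detector center hits self space.
    """
    return min(1.0, num_selves * ball_volume(n, self_radius))
```

For instance, 100 non-overlapping selves of radius 0.05 in two dimensions occupy 100·π·0.05² ≈ 0.785 of the unit square, so most randomly placed candidates would react with self.
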

For CB-RNSA, detectors are randomly generated within different limited spaces. Therefore, the self-reaction rate of detectors is the ratio of the volume of the limited space that is covered by selves to the volume of the limited space. In a given layer, the limited space is a super sphere loop between two concentric spheres, an inner one and an outer one. Selves may or may not intersect this loop. To simplify the discussion, suppose that no self only partly intersects the loop, so that each self lies either entirely inside or entirely outside it. The self-reaction rate of detectors is then given by (31), in terms of the number of selves within the loop and the volumes of the spheres bounding the layer.
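
The same kind of calculation for one layer can be sketched as follows, under the simplifying assumptions stated above (non-overlapping selves of equal radius, each lying entirely inside the loop). The function names and parameters are hypothetical, and the sketch is not claimed to reproduce (31) term by term.

```python
import math


def ball_volume(n, r):
    """Volume of an n-dimensional ball of radius r."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * r ** n


def shell_volume(n, r_inner, r_outer):
    """Volume of the region between two concentric n-dimensional spheres."""
    return ball_volume(n, r_outer) - ball_volume(n, r_inner)


def shell_self_reaction_rate(n, selves_in_shell, self_radius, r_inner, r_outer):
    """Fraction of the shell occupied by the selves that lie inside it."""
    occupied = selves_in_shell * ball_volume(n, self_radius)
    return min(1.0, occupied / shell_volume(n, r_inner, r_outer))
```

For example, 10 selves of radius 0.05 inside a two-dimensional loop with radii 0.5 and 1.0 occupy about 3.3% of the loop.
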

To compare the self-reaction rates of the three algorithms, we consider the ratio of the self-reaction rate of CB-RNSA to that of RNSA and V-Detector, which is expressed as (32).

When l is small, the super sphere loop is far from the center of the cluster. Selves of this cluster are disjoint from the loop, while selves of other clusters may intersect it. As l increases, the loop moves closer to the cluster center; selves of this cluster become likely to intersect the loop, and other selves are disjoint from it. So, . When this ratio is less than 1, the self-reaction rate of CB-RNSA is lower than that of RNSA and V-Detector, which means that the detector generation cost of CB-RNSA is lower. Figure 11 shows how the ratio varies with the data dimension n and the limited level l. As can be seen from the figure, when n and l are small, the ratio is far less than 1.

5. Experimental Results and Analysis

This section verifies the effectiveness of CB-RNSA through experiments. The representative real-valued negative selection algorithms RNSA and V-Detector are chosen for comparison. The experiments use two types of data sets that are common in such studies: 2D comprehensive data sets [30] and UCI data sets [23]. The 2D comprehensive data sets are provided by the team of Professor Dasgupta at the University of Memphis and are an authoritative benchmark for testing the performance of real-valued negative selection algorithms [11, 12, 14]. The UCI data sets are classical in machine learning and are widely used in performance tests and in tests of detector generation efficiency [816].

In specific comparisons, in order to avoid the influence of different exit conditions on the algorithms, all algorithms adopt the same exit criteria “to reach the expected non-self space coverage rate”. The number of mature detectors DN, detection rate DR, false alarm rate FAR, and time cost of detector generation DT are adopted to measure the effectiveness of algorithms.
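
For reference, the detection rate and false alarm rate can be computed from labeled test results as sketched below. The sketch assumes the standard definitions, DR = TP / (TP + FN) and FAR = FP / (FP + TN), with label 1 for non-self and 0 for self; the function name is hypothetical.

```python
def detection_and_false_alarm_rates(labels, predictions):
    """Return (DR, FAR) for binary labels and predictions (1 = non-self).

    DR  = non-selves correctly flagged / all non-selves
    FAR = selves wrongly flagged / all selves
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    dr = tp / (tp + fn) if tp + fn else 0.0
    far = fp / (fp + tn) if fp + tn else 0.0
    return dr, far
```
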

5.1. 2D Comprehensive Data Sets

The data sets contain a number of subdata sets; Figure 12 shows distributions in the two-dimensional space of self data from three subdata sets, Cross, Intersection, and Ring. Without loss of generality, experiments chose the three data sets.

The self set size of each of these three data sets is 1000. The training set is composed of randomly selected selves, and the test data are composed of random points in the space. Each experiment was repeated 20 times and the results were averaged. Tables 2 and 3 show the results; values in parentheses are variances. Table 2 compares the detection rate and false positive rate of CB-RNSA under the same expected coverage rate of 90%, the same training set size of 500, and different self radii. Detectors trained with a smaller self radius have a higher detection rate and a higher false positive rate, whereas detectors trained with a larger self radius have a lower detection rate and a lower false positive rate. Therefore, a smaller self radius should be used for applications that are sensitive to abnormal data, and a larger one for applications that are sensitive to false positives. Table 3 compares the detection rate and false positive rate of CB-RNSA under the same expected coverage rate of 90%, the same self radius of 0.05, and different training set sizes. As the training set size rises, the detection rate increases gradually and the false positive rate decreases gradually. This is because more selves participating in training allows detectors to be screened more effectively, which reduces the number of self-reacting detectors and makes the detectors cover the non-self space more accurately.

Figures 13 and 14 compare a single run of RNSA, V-Detector, and CB-RNSA on the Cross data set and the Intersection data set, respectively. Blue filled circles are self elements, cyan filled circles are mature detectors, and unfilled areas are holes. As can be seen from the figures, the detectors of CB-RNSA overlap less, and fewer of them are needed to achieve the same expected coverage.

The 2D comprehensive data sets are clean. To test the case in which the self set contains noise, we added a small amount of noise data to the Ring data set. Figure 15 compares a single run of RNSA, V-Detector, and CB-RNSA on the Ring data set. Blue filled circles are self elements, white filled circles are noise data, cyan filled circles are mature detectors, and unfilled areas are holes. CB-RNSA detects the outlier selves and ignores them because they are noise data. Under this noise interference, the detectors of CB-RNSA still fully covered the non-self space, whereas RNSA and V-Detector could not handle the noise effectively.

5.2. UCI Data Sets

The experiments used four standard UCI data sets: Haberman’s Survival, Abalone, Breast Cancer Wisconsin Original (BCW1 for short), and Breast Cancer Wisconsin Diagnostic (BCW2 for short). The experimental parameters are shown in Table 4. For each of these four data sets, the self set and non-self set were randomly selected, as were the training set and testing set. Each experiment was repeated 20 times and the results were averaged.

5.2.1. Comparisons of the Number of Detectors

Figure 16 compares the number of mature detectors of RNSA, V-Detector, and CB-RNSA. As can be seen from the diagram, as the expected coverage rate increases, the number of mature detectors of all three algorithms rises correspondingly, but CB-RNSA needs fewer detectors than the other algorithms. For the Haberman’s Survival data set, to achieve the expected coverage rate of 99%, RNSA needs 1033.1 mature detectors, V-Detector needs 351.4 detectors, and CB-RNSA needs 165.2 detectors, a reduction of 84.0% and 53.0% relative to RNSA and V-Detector, respectively. For the large Abalone data set, to achieve the expected coverage rate of 99%, RNSA needs 12893.2 mature detectors, V-Detector needs 1194.0 detectors, and CB-RNSA needs 615.4 detectors, a reduction of 95.2% and 48.5%, respectively. Hence, for the same expected coverage rate, under different data dimensions and different training sets, the number of mature detectors of CB-RNSA is greatly reduced compared with RNSA and V-Detector.
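
The reduction percentages quoted above follow from the reported detector counts as 1 − (count for CB-RNSA) / (count for the baseline). A small sketch that reproduces the arithmetic (the function name is hypothetical):

```python
def reduction_percent(baseline, value):
    """Relative reduction of value with respect to baseline, in percent,
    rounded to one decimal place."""
    return round(100.0 * (1.0 - value / baseline), 1)


# Haberman's Survival, expected coverage 99%
haberman = (reduction_percent(1033.1, 165.2), reduction_percent(351.4, 165.2))
# Abalone, expected coverage 99%
abalone = (reduction_percent(12893.2, 615.4), reduction_percent(1194.0, 615.4))
```

Here `haberman` evaluates to (84.0, 53.0) and `abalone` to (95.2, 48.5), in agreement with the figures reported in the text.
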

5.2.2. Comparisons of the Cost of Detectors Generation

Figure 17 compares the cost of detector generation of RNSA, V-Detector, and CB-RNSA. For the Haberman’s Survival data set, when the expected coverage rate rises from 90% to 99%, the time cost of RNSA increases from 7.7 s to 291.0 s, that of V-Detector from 0.6 s to 29.8 s, and that of CB-RNSA from 0.4 s to 13.7 s. For the Abalone data set, when the expected coverage rate rises from 90% to 99%, the time cost of RNSA increases from 241.7 s to 2412.8 s, that of V-Detector from 2.4 s to 228.6 s, and that of CB-RNSA from 1.3 s to 82.5 s. Therefore, as the expected coverage rate rises, the time costs of RNSA and V-Detector increase very quickly, whereas the time cost of CB-RNSA increases more slowly.

5.2.3. Comparisons of Detection Rate and False Alarm Rate

Figures 18 and 19 compare the detection rates and false alarm rates of RNSA, V-Detector, and CB-RNSA. As can be seen from the diagrams, when the expected coverage rate is greater than 90%, the detection rates of the three algorithms differ little, with that of RNSA being the lowest, while the false alarm rate of CB-RNSA is clearly lower than those of RNSA and V-Detector. For the BCW1 data set, when the expected coverage rate is 99%, the false alarm rate of RNSA is 55.2%, that of V-Detector is 30.1%, and that of CB-RNSA is 20.1%, a reduction of 63.6% and 33.2%, respectively. For the high-dimensional BCW2 data set, when the expected coverage rate is 99%, the false alarm rate of RNSA is 25.1%, that of V-Detector is 20.5%, and that of CB-RNSA is 12.6%, a reduction of 49.8% and 38.5%, respectively. On the one hand, CB-RNSA introduces the clonal selection algorithm and limits the generation range of detectors, so that detectors are generated in the low-coverage non-self space and the coverage rate improves. On the other hand, detectors and boundary selves are used for testing at the same time, so the definition of an anomaly is stricter, which reduces the false positive rate.

The ROC curve is a graphical method for evaluating a classification model based on detection rates and false alarm rates. Figure 20 compares the ROC curves of RNSA, V-Detector, and CB-RNSA on two data sets, BCW1 and BCW2. The curve of a good classification model should lie as close as possible to the top-left corner of the plot. As can be seen from the diagram, CB-RNSA is superior to RNSA and V-Detector.

6. Conclusions

Excessive detectors, high time complexity, and loopholes are the main problems that current negative selection algorithms face, and they greatly limit the practical application of these algorithms. This paper proposes a real-valued negative selection algorithm named CB-RNSA. The algorithm introduces the clonal selection algorithm and randomly generates candidate detectors within stratified limited ranges based on the clustering centers of the self set, which reduces the number of detectors and the number of holes. Selves are divided into outlier selves, boundary selves, and internal selves, which allows the algorithm to adapt to the interference of noise data. When the algorithm is used for anomaly detection, the mature detector set and the boundary self set are used at the same time, which effectively improves the detection rate and reduces the false alarm rate. Theoretical analysis and experimental results show that the algorithm has better time efficiency and detector generation quality than classic negative selection algorithms.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors would like to thank Sichuan Provincial Education Department of China Funded Project (035Z2258) for providing financial aid.