Abstract

The negative selection algorithm (NSA) is an important one-class classification model, but its low efficiency limits its use in the big data era. In this paper, we propose a new NSA based on Voronoi diagrams: VorNSA. VorNSA changes the detector generation scheme from the traditional “Random-Discard” model to a “Computing-Designated” model. Furthermore, we present an immune detection process of VorNSA under the Map/Reduce framework (VorNSA/MR) to further reduce the time consumption on massive data in the testing stage. Theoretical analyses show that the time complexity of VorNSA decreases from the exponential level to the logarithmic level. Experiments are performed to compare the proposed technique with other NSAs and one-class classifiers. The results show that, on the UCI skin dataset, the time cost of VorNSA is on average 87.5% lower than that of traditional NSAs.

1. Introduction

NSA was proposed by Forrest et al. in 1994 [1]; it generates immune detectors based on the “Random-Discard” model. Initially, a large number of immature detectors are randomly generated, and then those covering the self-areas are discarded. González et al. presented the real-valued negative selection algorithm (RNSA) in 2003 [2], in which the detectors and antigens are studied in real-valued space. Ji and Dasgupta proposed the V-Detector algorithm [3, 4], which turns the fixed-size detectors of RNSA into variable-sized detectors to enlarge the detection areas. In 2015, Cui et al. developed BIORV-NSA [5], in which the self-radius can vary and detectors that are recognized by other mature detectors are replaced by new ones to eliminate “detection holes.”

In the big data era, the low efficiency of NSA has become an important challenge that largely limits its applications. In this paper, we design a new NSA based on Voronoi diagrams, named VorNSA. In VorNSA, a bounded Voronoi diagram is first constructed from the whole training set. Then, two types of detectors are generated separately at specific locations of the Voronoi diagram. To accelerate the testing stage of NSA, in particular for large-scale datasets, a new testing strategy, VorNSA/MR (VorNSA with Map/Reduce), is proposed. Unlike the testing stage of classic NSAs, the data are divided into small groups whose labels are computed separately in the Map stage. The final labels are then obtained by merging and sorting in the Reduce stage.

The contributions of this work can be summarized as follows. (1) Based on Voronoi diagrams, the optimal position of detectors is calculated directly rather than in a stochastic way. Therefore, the time wasted on excessive invalid detectors is avoided. (2) In the Map/Reduce framework, data are partitioned into several small parts by VorNSA/MR and can be processed in parallel to enhance the efficiency of self/non-self discrimination.

The rest of the paper is organized as follows. In Section 2, we describe the definitions of VorNSA. The original contribution of the paper is presented in Section 3. Experimental results on synthetic datasets and real-world datasets are shown and discussed in Section 4. Conclusions appear in Section 5.

2. Basic Definition of VorNSA

VorNSA is based on the Voronoi diagram, a structure from computational geometry used to search for nearest neighbors, which has been widely used in the life sciences [6], material sciences [7], and mobile navigation [8]. The basic definitions are listed as follows.

Definition 1 (site). A site set is a set of distinct points in the feature space. In VorNSA, all the training samples are defined as site points: $S = \{s_1, s_2, \ldots, s_n\}$.

Definition 2 (Voronoi diagram). $\mathrm{Vor}(S)$ divides the feature space into non-overlapping cells based on the given site set $S$, and each cell contains exactly one site in $S$, such that any point $q$ in the cell of site $s_i$ satisfies $\mathrm{dist}(q, s_i) < \mathrm{dist}(q, s_j)$ for all $s_j \in S$, $j \neq i$, where $\mathrm{dist}$ can be any distance metric.

Definition 3 (cell). All the cells constitute a mathematical partition of the feature space, and the cell corresponding to site $s_i$ is denoted by $\mathrm{cell}(s_i)$.

Definition 4 (largest empty circle). The largest circle with center $q$ that does not contain any site of $S$ in its interior is denoted by $C_S(q)$.

Theorem 5. A point $q$ is a vertex of $\mathrm{Vor}(S)$ iff $C_S(q)$ contains at least three sites on its boundary [9].

Definition 6 (I-detector). An I-detector is a pair $\langle p, r \rangle$, where $p$ is the detector position in the feature space and $r$ is the detector radius, such that $p$ corresponds to one vertex of the Voronoi diagram $\mathrm{Vor}(S)$.

Theorem 7. Given that $p$ is the center of an I-detector, there are at least three sites located on the boundary of $C_S(p)$, and these sites are the nearest neighbors of each other.

Proof. According to Definitions 2 and 6, the center $p$ of the I-detector is an intersection of three or more cells. Suppose that $p$ is the intersection of three cells $\mathrm{cell}(s_i)$, $\mathrm{cell}(s_j)$, $\mathrm{cell}(s_k)$, whose sites are $s_i$, $s_j$, $s_k$. According to Definition 4 and Theorem 5, there is a largest empty circle $C_S(p)$ that does not contain any site of $S$ in its interior, and $s_i$, $s_j$, $s_k$ are located on its boundary. So $s_i$, $s_j$, and $s_k$ are the nearest sites of $p$ among the site set $S$.

Theorem 8. The bisector between sites $s_i$ and $s_j$ defines an edge of $\mathrm{Vor}(S)$ iff there is a point $q$ on the bisector such that $C_S(q)$ contains both $s_i$ and $s_j$ on its boundary with no other site [9].

Definition 9 (II-detector). An II-detector is a pair $\langle p, r \rangle$, where $p$ is the detector position in the feature space and $r$ is the detector radius, such that $p$ corresponds to the junction of the edges of $\mathrm{Vor}(S)$ and the unit hypercube $[0, 1]^d$.

Theorem 10. Given that $p$ is the center of an II-detector, there are two sites located on the boundary of $C_S(p)$, and these sites are the nearest neighbors of each other.

Proof. According to Definitions 2 and 9, the center $p$ of the II-detector is an intersection of two cells. Suppose that $p$ is the intersection of two cells $\mathrm{cell}(s_i)$, $\mathrm{cell}(s_j)$, whose sites are $s_i$, $s_j$. According to Definition 4 and Theorem 8, there is a largest empty circle $C_S(p)$ that does not contain any site of $S$ in its interior, and $s_i$, $s_j$ are located on its boundary. So $s_i$ and $s_j$ are the nearest sites of $p$ among the site set $S$.

As an example, Figure 1 shows 10 sites, and the space is divided into 10 cells by the Voronoi diagram. The green circle is the largest empty circle centered at a Voronoi vertex, and three sites are located on its boundary. The red and purple circles are largest empty circles centered at junctions of Voronoi edges with the boundary of the unit hypercube, and each has two sites on its boundary. The center of the green circle is the center of an I-detector, while the centers of the red and purple circles are the centers of II-detectors.
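The vertex criterion of Theorem 5 can be illustrated in two dimensions with a minimal sketch. It assumes Euclidean distance; the function names `circumcenter` and `is_voronoi_vertex` are hypothetical and not part of the paper. The sketch computes the point equidistant from three sites and then checks that the largest empty circle centered there touches at least three sites.

```python
import math

def circumcenter(a, b, c):
    """Return the 2-D point equidistant from a, b, c (None if collinear)."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy)

def is_voronoi_vertex(q, sites, tol=1e-9):
    """Theorem 5 check: the largest empty circle at q touches >= 3 sites."""
    r = min(math.dist(q, s) for s in sites)  # radius of the largest empty circle
    on_boundary = [s for s in sites if abs(math.dist(q, s) - r) <= tol]
    return len(on_boundary) >= 3

sites = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
q = circumcenter(sites[0], sites[1], sites[2])
print(q, is_voronoi_vertex(q, sites))
```

If a fourth site lies strictly inside the circle through the first three, the circle is no longer empty and the check fails, which is why a candidate center has to be tested against the whole site set.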

3. The Details of VorNSA

3.1. The Detector Generation Process of VorNSA
3.1.1. Space Partition Stage

First, all the training data are normalized to the unit feature space $[0, 1]^d$, where $d$ is the data dimension. The normalized training set is denoted by $S$. Second, a bounded Voronoi diagram $\mathrm{Vor}(S)$ is constructed based on $S$ to divide the unit feature space into $n$ cells, where $n = |S|$. Finally, the set $VS = \{\langle v, s \rangle\}$, where $v$ and $s$ are a vertex and the site of a cell, can be constructed.
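The normalization step can be sketched as follows. The paper does not state which normalization is used, so per-feature min-max scaling is an assumption here, and the function name `normalize_unit_cube` is hypothetical.

```python
def normalize_unit_cube(samples):
    """Scale every feature to [0, 1] by min-max scaling (assumed scheme)."""
    dims = len(samples[0])
    lows = [min(s[k] for s in samples) for k in range(dims)]
    highs = [max(s[k] for s in samples) for k in range(dims)]
    normalized = []
    for s in samples:
        normalized.append(tuple(
            # A constant feature carries no information; map it to 0.0.
            (s[k] - lows[k]) / (highs[k] - lows[k]) if highs[k] > lows[k] else 0.0
            for k in range(dims)
        ))
    return normalized

print(normalize_unit_cube([(0, 10), (5, 20), (10, 30)]))
```

After this step every site lies inside the unit hypercube, so the bounded Voronoi diagram and the detector radii are expressed on a common scale.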

3.1.2. I-Detector Generation Stage

According to Definition 6 and Theorem 7, the center of an I-detector is the intersection of three or more cells, and the sites located in these cells are the nearest neighbors of each other. So a new set $P_{\mathrm{I}} = \{\langle p, s \rangle\}$, where $p$ is the position of an I-detector and $s$ is one of its nearest sites, can be obtained by $p \in V_i \cap V_j \cap V_k$, where $V_i$, $V_j$, and $V_k$ are the vertex sets of cells. Then, a mature detector is generated simply through self-tolerance with $s$. According to the principle of self-tolerance, the radius of the I-detector can be calculated with

$$r = \mathrm{dist}(p, s) - r_s, \quad (1)$$

where $r$ is the radius of the I-detector, $p$ is the center of the I-detector, $s$ is one of its nearest sites, and $r_s$ is the radius of self-antigens.

Furthermore, a threshold $r_{\min}$ on the detector radius is introduced to guard against overfitting: if the detector radius is less than $r_{\min}$, the detector is discarded. Otherwise, it matures.
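The radius computation and the threshold test can be sketched as below. This is a minimal illustration that assumes Eq. (1) has the usual self-tolerance form (distance from the detector center to its nearest site minus the self radius) and uses Euclidean distance; `mature_detectors` and its parameter names are hypothetical.

```python
import math

def mature_detectors(centers, sites, self_radius, min_radius):
    """Keep candidate centers whose self-tolerant radius reaches min_radius."""
    detectors = []
    for p in centers:
        nearest = min(math.dist(p, s) for s in sites)
        radius = nearest - self_radius  # assumed form of Eq. (1)
        if radius >= min_radius:        # smaller detectors are discarded
            detectors.append((p, radius))
    return detectors

sites = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
candidates = [(0.5, 0.5), (0.05, 0.0)]
print(mature_detectors(candidates, sites, self_radius=0.1, min_radius=0.005))
```

In VorNSA the nearest sites of a Voronoi vertex are already known from the cells that meet there, so the `min` over all sites in this sketch would not be needed; it is used only to keep the example self-contained.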

3.1.3. II-Detector Generation Stage

The main difference between the I-detector and the II-detector is the location of the detector centers. According to Definition 9 and Theorem 10, the position of an II-detector is located at the junction of two cells and the unit hypercube. The sites in the two cells are the nearest neighbors of each other. So a new set $P_{\mathrm{II}} = \{\langle p, s \rangle\}$, where $p$ is the position of an II-detector and $s$ is one of its nearest sites, can be obtained by $p \in V_i \cap V_j$, where $V_i$ and $V_j$ are the vertex sets of cells. Similarly, the radius of the II-detector can be computed by (2), and a threshold on the detector radius is introduced to guard against overfitting:

$$r = \mathrm{dist}(p, s) - r_s, \quad (2)$$

where $r$ is the radius of the II-detector, $p$ is the position of the II-detector, $s$ is one of its nearest sites, and $r_s$ is the radius of self-antigens.
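For intuition, candidate II-detector centers can be located in two dimensions by intersecting the perpendicular bisector of two sites with the boundary of the unit square and keeping only the points whose largest empty circle contains no closer site (Theorem 8). The sketch below is an illustration under those assumptions, not the construction used by the paper's toolbox; `bisector_boundary_points` is a hypothetical name.

```python
import math

def bisector_boundary_points(a, b, sites, tol=1e-9):
    """Unit-square boundary points equidistant from a and b with no closer site."""
    (ax, ay), (bx, by) = a, b
    # Bisector of a and b: 2 (b - a) . p = |b|^2 - |a|^2
    nx, ny = 2.0 * (bx - ax), 2.0 * (by - ay)
    rhs = bx * bx + by * by - ax * ax - ay * ay
    candidates = []
    for c in (0.0, 1.0):
        if abs(ny) > tol:  # crossing with the side x = c
            candidates.append((c, (rhs - nx * c) / ny))
        if abs(nx) > tol:  # crossing with the side y = c
            candidates.append(((rhs - ny * c) / nx, c))
    centers = []
    for p in candidates:
        inside = -tol <= p[0] <= 1.0 + tol and -tol <= p[1] <= 1.0 + tol
        r = math.dist(p, a)
        empty = all(math.dist(p, s) >= r - tol for s in sites)
        if inside and empty and p not in centers:
            centers.append(p)
    return centers

a, b = (0.25, 0.5), (0.75, 0.5)
print(bisector_boundary_points(a, b, [a, b]))
print(bisector_boundary_points(a, b, [a, b, (0.5, 0.1)]))
```

In the second call the third site lies closer to the bottom crossing than the two original sites, so only the top crossing remains a valid center.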

Details of the VorNSA can be found in Algorithm 1.

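Algorithm 1: Detector generation process of VorNSA.
Input: Training set $T$, Self radius $r_s$, Minimum detector radius $r_{\min}$
Output: Detector set $D$
(1) normalize $T$ into $S$
(2) construct Voronoi diagram $\mathrm{Vor}(S)$ by sites $S$
(3) get all cells in $\mathrm{Vor}(S)$
(4) construct $VS$ by the cells
(5) foreach $v$ in $VS$
(6)   if $v$ has three or more same values in $VS$
(7)   then add $v$ to $P_{\mathrm{I}}$
(8) foreach $p$ in $P_{\mathrm{I}}$
(9)   compute the detector radius $r$ using Eq. (1)
(10)   if $r \geq r_{\min}$ then add $\langle p, r \rangle$ to $D$
(11) foreach $v$ in $VS$
(12)   if $v$ has two same values in $VS$
(13)   then add $v$ to $P_{\mathrm{II}}$
(14) foreach $p$ in $P_{\mathrm{II}}$
(15)   compute the detector radius $r$ using Eq. (2)
(16)   if $r \geq r_{\min}$ then add $\langle p, r \rangle$ to $D$
(17) return $D$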
3.2. The Immune Detection Process of VorNSA under Map/Reduce Framework

In the testing stage of traditional NSAs, each piece of data has to be compared with all the detectors to determine its label. This strategy is too time-consuming for the big data era. To improve the efficiency of the testing stage, an immune detection process of VorNSA under the Map/Reduce framework (VorNSA/MR) is proposed. Map/Reduce is a parallel computation framework that splits the sample set into a group of small datasets and handles them on many cluster nodes simultaneously.

VorNSA/MR (Figure 2) is divided into two main parts: the Map stage and the Reduce stage. First, the testing dataset is split into several parts by VorNSA/MR. In the Map stage, each cluster node takes one part of the split data and computes the distances between its samples and the mature detectors. If any distance is less than the detection radius, the testing sample is labeled as a non-self-antigen; otherwise it is labeled as a self-antigen. The cluster nodes then write their results to the intermediate values. The Reducer receives the intermediate values, sorts them, and merges them into the final results.

The implementations of the Map and Reduce stages are given in Algorithms 2 and 3.

Algorithm 2: Map stage of VorNSA/MR.
Input: Detector set D, Split data T
Output: Intermediate Value IV
(1) foreach t in T
(2)    foreach detector ⟨p, r⟩ in D
(3)      Compute the Euclidean distance dist between t and p
(4)      if dist < r
(5)        t is Non-self Antigen, t.Label = 0
(6)        go to line (1)
(7)    t is Self Antigen, t.Label = 1
(8) IV.Value = T
(9) return IV

Algorithm 3: Reduce stage of VorNSA/MR.
Input: Intermediate Value IV
Output: Final Value FV
(1) While IV.next ~= END
(2)    add IV.Value to FV.Value
(3) Sort FV.Value by sample number (no)
(4) return FV
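The Map and Reduce stages can be sketched in a few lines of Python. This is a minimal, single-process illustration of the data flow only; in an actual deployment each `map_stage` call would run on a separate cluster node. The function names, the round-robin split, and the `parts` parameter are assumptions made for the example. Labels follow the convention of the Map stage: 0 for non-self and 1 for self.

```python
import math

def map_stage(detectors, split):
    """Label each (no, sample) pair: 0 = non-self, 1 = self."""
    intermediate = []
    for no, sample in split:
        label = 1  # self unless some detector covers the sample
        for center, radius in detectors:
            if math.dist(sample, center) < radius:
                label = 0
                break  # one matching detector is enough
        intermediate.append((no, label))
    return intermediate

def reduce_stage(intermediate_values):
    """Merge all intermediate values and sort them by sample number."""
    final = []
    for iv in intermediate_values:
        final.extend(iv)
    final.sort(key=lambda pair: pair[0])
    return final

def detect(detectors, samples, parts=4):
    """Split the samples, map each split, then reduce (sequential stand-in)."""
    indexed = list(enumerate(samples))
    splits = [indexed[i::parts] for i in range(parts)]
    return reduce_stage(map_stage(detectors, s) for s in splits)

detectors = [((0.5, 0.5), 0.2)]
samples = [(0.5, 0.5), (0.0, 0.0), (0.6, 0.5), (0.9, 0.9), (0.45, 0.55)]
print(detect(detectors, samples))
```

Carrying the sample number through the Map stage is what allows the Reduce stage to restore the original order after the splits have been processed independently.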
3.3. Theoretical Analysis

Theorem 11. The time complexity of VorNSA is $O(n \log n + n^{\lceil d/2 \rceil} + |D|)$, where $n$ is the size of the training dataset, $d$ is the dimension of the training dataset, and $|D|$ is the number of detectors.

Proof. Since VorNSA is divided into three stages, we analyze the time complexity of each stage separately.
The main work in the space partition stage is to build a Voronoi diagram, so we borrow the analysis of Voronoi diagrams to estimate the time complexity. The literature [9–12] proves that a Voronoi diagram with $n$ sites can be computed in optimal time $O(n \log n + n^{\lceil d/2 \rceil})$ in $d$-dimensional space. Therefore, the time complexity can be denoted by $O(n \log n + n^{\lceil d/2 \rceil})$, where $n$ is the size of the training set and $d$ is the dimension of the training set.
In the second and third stages, the main work is to compute the distance between detectors and sites. Although several detectors are discarded by the threshold $r_{\min}$, their number is very small compared with the total, so we use the number of detectors $|D|$ instead. According to (1) and (2), we can infer that the time complexity of these two stages is $O(|D|)$.
Combining the above, the time complexity of VorNSA is $O(n \log n + n^{\lceil d/2 \rceil} + |D|)$.

The time complexity of traditional NSAs is shown in Table 1, in terms of the match probability between detectors and antigens, the failure rate, the size of the self-set, the number of detectors, and the data dimension. As shown in Table 1, the time complexity of VorNSA is at the logarithmic level in the size of the training set, which is much lower than the exponential level of the traditional NNSA [1], RNSA [2], and V-Detector [4].

4. Experiments and Discussion

In the experiments, we use two performance criteria, DR (Detection Rate) and FAR (False Alarm Rate), which are widely reported in the literature [2, 3, 13]. They are defined as

$$\mathrm{DR} = \frac{TP}{TP + FN}, \qquad \mathrm{FAR} = \frac{FP}{FP + TN},$$

where TP and FN are the numbers of true positives and false negatives among non-self-antigens, respectively, and TN and FP are the numbers of true negatives and false positives among self-antigens, respectively.
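As a small worked example, the two criteria can be computed from true and predicted labels as sketched below, taking DR = TP / (TP + FN) and FAR = FP / (FP + TN) with non-self as the positive class. The label encoding (0 for non-self, 1 for self) and the function name `detection_metrics` are assumptions made for the illustration.

```python
def detection_metrics(true_labels, predicted_labels):
    """Return (DR, FAR) with non-self (label 0) as the positive class."""
    tp = fn = tn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == 0:          # actual non-self-antigen
            if predicted == 0:
                tp += 1
            else:
                fn += 1
        else:                   # actual self-antigen
            if predicted == 1:
                tn += 1
            else:
                fp += 1
    dr = tp / (tp + fn) if tp + fn else 0.0
    far = fp / (fp + tn) if fp + tn else 0.0
    return dr, far

truth = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 0, 1]
print(detection_metrics(truth, predicted))
```

Here three of the four non-self samples are detected and one of the four self samples raises a false alarm.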

4.1. Experiments on Synthetic Dataset (SDS)

To evaluate the performance of VorNSA on different datasets, four SDS proposed by the intelligence security laboratory of Memphis University are used in this section. Each original dataset [3] contains 1000 records. We expand each dataset to 10,000 records to better simulate a big data environment. The distributions of the datasets are depicted in Figure 3, in which self-antigens are represented by red dots and non-self-antigens by blue dots. The details of the datasets are listed in Table 2. The experiment parameters are set as follows: the self-radius is 0.04, the number of randomly selected self-antigens ranges from 50 to 1000, and the minimum detector radius is 0.005. Each experiment is repeated 25 times independently.

As Figure 4 shows, the trends of the experimental results on the four SDS are approximately the same, which indicates that VorNSA is applicable to different datasets. In Figure 4(a), the DR decreases from 95% to 80% as the number of self-antigens increases. In Figure 4(b), the FAR drops from 60% to zero. This phenomenon can be explained as follows: when fewer self-antigens are used for training, some self-antigens are not covered by the scope of self, so they are identified as non-self-antigens by VorNSA. Because of this strong detection ability, the DR and FAR are both high. As the number of training samples increases, all self-antigens become covered. Furthermore, some non-self-antigens are also covered and identified as self-antigens, in particular those located at the edge of the self-set. Therefore, the DR decreases slightly while the FAR drops sharply to zero.

Figure 4(c) shows that the number of detectors generated by VorNSA does not increase remarkably with the growth of the training set but stays within a relatively stable range. This implies that VorNSA can effectively control the growth of the detector set. According to Definition 2, as the number of training samples increases, the space is partitioned into smaller cells. Because we introduce the minimum detector radius, the inefficient tiny detectors are discarded.

In Figure 4(d), the time consumption of VorNSA on different datasets is similar, and the time cost rises slowly even with a large number of self-antigens. This suggests that the performance of VorNSA is less affected by the distribution of the dataset, because the optimal position of detectors is calculated directly rather than in a stochastic way.

To sum up, VorNSA can generate fewer but more effective detectors. The fewer self-antigens are used for training, the higher the FAR will be. As the number of self-antigens increases, the FAR decreases significantly. Increasing the training set leads to a rise in time consumption and a slight decrease in the DR. Hence, a smaller self-set is a smart choice in VorNSA.

4.2. Experiments on Skin Segmentation Dataset

In this section, VorNSA is tested in a group of comparison experiments. The compared algorithms include the classic NSAs (RNSA, V-Detector) and BIORV-NSA, an NSA proposed in 2015. To cover a different family of methods, we also include a classic statistical algorithm for one-class classification, OC-SVM [14], which is implemented with LibSVM [15]. All algorithms run on a computer with an Intel Pentium [email protected] G, and the implementation of VorNSA relies on an open-source computational geometry toolbox, MPT 3.0 [16].

The Skin Segmentation dataset is a UCI dataset. It was collected by randomly sampling the B, G, and R values of skin textures derived from the FERET and PAL databases. The total sample size is 245,057, of which 50,859 records are skin samples and 194,198 records are non-skin samples.

In this experiment, 50 skin samples are randomly selected as self-antigens. To verify the performance of VorNSA and VorNSA/MR on a large-scale dataset, we use all 245,057 records in the dataset for testing. The experiments are performed 20 times independently, and the evaluation criteria include DR, FAR, detector number (DN), data training time (DT), and data testing time (DTT). The simulation parameters are set as follows: OC-SVM uses the RBF kernel function, with nu set to 0.5 and gamma set to 0.33. The self-radius of RNSA, V-Detector, and VorNSA is set to the same value (0.1). The maximum number of detectors is 3000 in RNSA, and the detector radius is 0.1. The estimated coverage and the maximum self-coverage are 99%. The maximum number of detectors is 1000 in BIORV-NSA, the self-set edge inhibition parameter is 0.8, and the detector self-inhibition parameter is 1.2. The minimum detector radius is 0.005 in VorNSA and VorNSA/MR. The experimental results are shown in Table 3.

Table 3 shows that the FAR of OC-SVM is 51.2%, which is unacceptably high. Because OC-SVM is implemented on a different platform, its time consumption is not reported in this paper. The DR of VorNSA (99.2%) is close to that of BIORV-NSA (99.42%) and better than that of the classic NSAs. In addition, the FAR of VorNSA (1.48%) is lower than that of BIORV-NSA (3.29%). This indicates that the detectors generated by VorNSA are more applicable than those of BIORV-NSA and more effective than those of the classic NSAs.

Moreover, the DN, DT, and DTT of VorNSA are significantly lower than those of the other NSAs, especially when it is combined with the Map/Reduce testing framework. For example, the average number of detectors generated by VorNSA is 172.25, which is 63.3% lower than V-Detector and 82.8% lower than BIORV-NSA. The average training time of VorNSA is 1.91, which is 78% lower than RNSA, 94.1% lower than V-Detector, and 90.5% lower than BIORV-NSA. So the training time of VorNSA is on average 87.5% lower than that of traditional NSAs. The testing time of VorNSA/MR is 426.7, which is 36.4% lower than VorNSA, 55% lower than V-Detector, 77.8% lower than BIORV-NSA, and 94.3% lower than RNSA.
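The 87.5% figure is consistent with the arithmetic mean of the three reported training-time reductions, as the following short check shows. Reading the average as a plain mean of the three percentages is an assumption; the values themselves are the ones quoted in the text.

```python
# Reported training-time reductions of VorNSA relative to each baseline (%).
reductions = {"RNSA": 78.0, "V-Detector": 94.1, "BIORV-NSA": 90.5}

# Arithmetic mean of the three percentages.
average = sum(reductions.values()) / len(reductions)
print(round(average, 1))
```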

The main reasons for these results are as follows. In traditional NSAs, a large number of immature detectors are generated randomly, without any optimization, and each must be checked for self-tolerance against all self-antigens to decide whether it matures. As a result, much time is wasted. The detector generation scheme of VorNSA is quite different from that of other NSAs: the optimal position of detectors is calculated directly. Thus, the time spent on discarding many randomly generated but inappropriate detectors is avoided.

5. Conclusions

In this paper, we propose a new one-class classification algorithm based on Voronoi diagrams (VorNSA) and an immune detection process of VorNSA under the Map/Reduce framework (VorNSA/MR) to cope with the challenge of big data. VorNSA changes the detector generation mechanism from the “Random-Discard” model to the “Computing-Designated” model. VorNSA/MR divides the sample set into several small parts that can be processed in parallel. Theoretical analyses show that the time complexity of VorNSA decreases from the exponential level to the logarithmic level. Experimental results show that the time consumption of VorNSA is significantly reduced.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key Research and Development Program of China (Grant nos. 2016YFB0800605 and 2016YFB0800604) and Natural Science Foundation of China (Grant nos. 61402308 and 61572334).