A Large-Scale k-Nearest Neighbor Classification Algorithm Based on Neighbor Relationship Preservation
Because it makes no assumptions about the underlying data distribution and has strong generalization ability, the k-nearest neighbor (kNN) classification algorithm is widely used in face recognition, text classification, emotion analysis, and other fields. However, kNN must compute the similarity between the unlabeled instance and all the training instances during prediction, so it is difficult to apply to large-scale data. To overcome this difficulty, an increasing number of acceleration algorithms based on data partition have been proposed, but they lack a theoretical analysis of the effect of data partition on classification performance. This paper analyzes this effect theoretically using empirical risk minimization and proposes a large-scale k-nearest neighbor classification algorithm based on neighbor relationship preservation. The search for the nearest neighbors is converted into a constrained optimization problem, and the difference between the optimal objective function values with and without data partition is estimated. According to this estimate, minimizing the similarity between instances in different divided subsets largely reduces the effect of data partition. The minibatch k-means clustering algorithm is chosen to perform data partition for its effectiveness and efficiency. Finally, the nearest neighbors of the test instance are searched in the set generated by successively merging candidate subsets until the neighbors no longer change, where the candidate subsets are selected according to the similarity between the test instance and the cluster centers. Experimental results on public datasets show that the proposed algorithm largely keeps the same nearest neighbors as the original kNN classification algorithm, shows no significant difference from it in classification accuracy, and obtains better results than two state-of-the-art algorithms.
1. Introduction
The k-nearest neighbor (kNN) classification algorithm is a lazy learning method that does not require a training process but simply stores the training instances. Given a test instance, the kNN classification algorithm first calculates the similarity between the given instance and all instances in the training set, then finds the k nearest instances according to the similarity, and finally predicts the label by majority voting over the categories of these instances. Owing to its solid theoretical foundation, strong generalization performance, and lack of assumptions about the data distribution, the kNN classification algorithm has been widely used in many fields [2–6]. It has been selected as one of the top 10 classic algorithms in data mining.
With the rapid development of sensing and Internet technology, data from all walks of life is increasing by orders of magnitude; big data has become a focus of government, academia, and industry; and the results of data analysis and mining research have been widely used in the Internet of Things, healthcare, e-commerce, finance, and so on. However, kNN needs to compute the similarity between the test instance and all the training instances, so its execution efficiency faces a great challenge in the big data environment. An increasing number of acceleration algorithms have been proposed to improve the efficiency of kNN classification on large-scale data [8–10]. From the perspective of data preprocessing, the existing acceleration algorithms for kNN classification can usually be divided into two categories: kNN classification based on data partition (DP-kNN) and kNN classification based on instance selection (IS-kNN) [11, 12].
The basic idea of the DP-kNN algorithm is to divide the training set into several subsets by partitioning the feature space and then to classify the test instances using only some of the divided subsets. Specifically, the feature space of the training set is divided into several subregions, the subregion to which the test instance belongs is determined, and the k nearest neighbors are finally found in the subset of instances corresponding to that region. These algorithms mainly take advantage of the local learning characteristic of the kNN classification algorithm: the predicted label of a test instance depends only on the most similar instances in the training set. Therefore, they try to ensure that the k nearest neighbors of each instance in its divided subset are consistent with those in the original dataset. However, most of the existing data partition algorithms scarcely analyze this consistency from a theoretical point of view, so it is difficult for them to guarantee high generalization performance.
Different from the DP-kNN algorithm, the IS-kNN algorithm does not use all the training instances. Instead, it finds the k nearest neighbors of the test instance in a representative subset of the training set, where the subset is obtained by an instance selection algorithm. Because the representative subset is smaller than the original training set, the efficiency of finding neighbors for the test instance can be greatly improved. Instance selection is an important data preprocessing method; it removes noisy instances and instances far away from the classification decision boundary from the training set according to the similarity and label differences of the training instances. Since most datasets contain more instances far from the decision boundary than close to it, an instance selection algorithm can greatly reduce the size of the training set while keeping the classification accuracy relatively unchanged. However, the time complexity of most existing instance selection algorithms is quadratic in the training set size, which makes it difficult to process large-scale data effectively. Furthermore, IS-kNN uses only part of the data rather than all of it, so its generalization performance could be negatively affected.
To address the lack of consistency analysis of nearest neighbors under data partition, this paper analyzes the classification performance theoretically from the perspective of optimization. The contributions of this paper are as follows:
(1) The effect of data partition on the classification performance of the kNN classification algorithm is analyzed theoretically, and a measure of the difference between the k nearest neighbors obtained with and without data partition is given.
(2) Based on the theoretical analysis, it is shown that minimizing the similarity between instances in different divided subsets largely reduces the effect of data partition on classification.
(3) The minibatch k-means clustering algorithm is adopted to perform data partition, because it divides the dataset into several subsets with a large difference in similarity.
(4) The nearest neighbors of a test instance are searched in the union of several candidate divided subsets, where the candidate subsets are selected according to the similarity between the test instance and the cluster centers.
(5) Compared with two existing typical algorithms, experimental results on public datasets show that the proposed algorithm largely preserves the nearest neighbors and the classification accuracy of the original kNN classification algorithm.
The rest of this paper is organized as follows. Section 2 reviews related work on kNN classification acceleration algorithms. Section 3 analyzes the effect of data partition on the classification performance of the kNN classification algorithm and proposes a novel algorithm, called the large-scale kNN classification algorithm based on neighbor relationship preservation (NPR-kNN algorithm). Section 4 reports the experimental results in comparison with existing methods. Section 5 concludes this paper and discusses future work.
2. Related Work
From the perspective of data preprocessing, the existing acceleration algorithms for k-nearest neighbor classification can be categorized into acceleration algorithms based on data partition (DP-kNN algorithms) and acceleration algorithms based on instance selection (IS-kNN algorithms).
The DP-kNN algorithm consists of three main steps: first, the feature space of the training instances is divided into several subregions; then, the subregion in which the test instance lies is determined; finally, the k nearest neighbors are found in the subset of instances within this region. Because the kNN classification algorithm is a local learning algorithm, the partition of the training set should keep the neighbor sequence of each instance consistent before and after data partition. Most of the existing data partitioning algorithms for kNN classification are based on a binary tree structure: starting from the original set, the current data is recursively divided into two subsets of similar size until the termination condition is met. Friedman et al. first proposed the concept of the KD tree, which uses the attributes of the data to recursively divide the multidimensional feature space into several subregions and treats the data falling in each region as a subset. However, for high-dimensional complex data, some attributes carrying a large amount of information may not be used in the process of building the tree. To solve this problem, Verma et al. proposed the KD tree that maximizes variance (MKD-tree) algorithm, which selects the attribute with the largest variance on the current data as the node for division. The MKD-tree algorithm uses only a single attribute each time the current data is divided, which causes partial information loss. For this reason, a binary tree algorithm based on principal component analysis has been proposed, which divides the current data according to the scores on the first principal component and the corresponding median value. In addition, there are data partition algorithms based on the structure of the nearest neighbor graph and on hash approximation [16–19].
However, most of the existing data partitioning algorithms do not theoretically study the effect of data partitioning on the kNN classification algorithm.
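As an illustration of the tree-based partitions reviewed above, the following is a minimal Python sketch of a maximum-variance split in the spirit of the MKD-tree. The function names `max_variance_split` and `build_leaves`, the median cut point, and the `leaf_size` stopping rule are assumptions made for this example, not details taken from the cited work.

```python
from statistics import median, pvariance


def max_variance_split(points):
    """Split points in two along the attribute with the largest variance."""
    dims = range(len(points[0]))
    # Attribute whose values vary most on the current data.
    axis = max(dims, key=lambda j: pvariance([p[j] for p in points]))
    # Assumed cut point: the median, so both halves have similar size.
    cut = median(p[axis] for p in points)
    left = [p for p in points if p[axis] <= cut]
    right = [p for p in points if p[axis] > cut]
    return axis, cut, left, right


def build_leaves(points, leaf_size):
    """Recursively split until each subset holds at most leaf_size points."""
    if len(points) <= leaf_size:
        return [points]
    _, _, left, right = max_variance_split(points)
    if not left or not right:  # no further split is possible on this axis
        return [points]
    return build_leaves(left, leaf_size) + build_leaves(right, leaf_size)
```

Each returned leaf plays the role of one divided subset; a test instance would be routed to a leaf by comparing its attribute values with the stored cut points.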
The IS-kNN algorithm searches for the k nearest neighbors of the test instance in a relatively small representative subset of the training set. The representative subset is obtained by various instance selection algorithms. Hart proposed the condensed nearest neighbor (CNN) algorithm based on 1NN, which obtains a subset $R$ of the training set $T$ such that the instances in $T$ are correctly classified by $R$. That is, the instances in $T$ have the same labels as their nearest neighbors in $R$. The CNN algorithm first randomly selects an instance from the training set $T$ and puts it into the set $R$. It then repeatedly selects an instance from $T$ and checks whether its label is the same as that of its nearest neighbor in $R$; an instance whose label is not consistent is put into $R$, and the process is repeated until no instance remains to be examined. Although the CNN algorithm can obtain a relatively small subset $R$, it is very sensitive to the order in which the data is read, and its time complexity is quadratic in the number of training instances. To overcome this difficulty, Angiulli proposed the fast condensed nearest neighbor (FCNN) algorithm. The FCNN algorithm first selects the instances closest to each center and puts them into the set $R$. It then iteratively selects representative instances from the instances in $T$ that are not correctly classified by $R$, puts them into $R$, and repeats the selection until $R$ correctly classifies all instances in $T$. The FCNN algorithm is independent of the order in which the data is read, but its time complexity still depends on the size of the set $R$. To improve the efficiency of the FCNN algorithm on large-scale data, an FCNN algorithm based on parallel distributed computing has been proposed. Although this algorithm greatly reduces the data size, it does not consider the impact of noisy instances.
To solve this problem, many editing algorithms have been proposed; their main idea is to remove instances whose labels are inconsistent with those of their nearest neighbors. The CNN family of algorithms and the editing algorithms greatly reduce the size of the training set while keeping the training error relatively unchanged. However, none of these algorithms considers the local sparsity of the training instances in the feature space, which negatively affects the classification performance of the kNN algorithm. For this reason, Nikolaidis et al. proposed a boundary preservation algorithm, which first uses an editing algorithm to remove the noisy instances from the training set and then uses the geometric characteristics of the latent distribution of the training instances in the feature space to divide the training set into border instances and interior instances. Finally, representative instances are selected from these two kinds of instances and merged into the final selected subset. Furthermore, there are many improved kNN classification algorithms based on graphics and search algorithms. However, most kNN classification algorithms based on instance selection need to calculate the similarity between all pairs of instances, which makes it difficult to process large-scale data.
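To make the condensed nearest neighbor idea reviewed above concrete, here is a minimal Python sketch of the standard rule: an instance is kept whenever the current subset misclassifies it under 1NN. The function names, the `seed` parameter, and the repeated-pass stopping rule are assumptions of this sketch; it is not the exact procedure of any cited paper.

```python
import math
import random


def nearest_label(x, store):
    """Label of the stored instance closest to x (1NN, Euclidean distance)."""
    return min(store, key=lambda s: math.dist(x, s[0]))[1]


def condensed_nn(train, seed=0):
    """Return a subset of train that labels the training instances correctly by 1NN.

    train is a list of (features, label) pairs. Assumes that no two identical
    feature vectors carry different labels.
    """
    rng = random.Random(seed)
    order = list(range(len(train)))
    rng.shuffle(order)                 # the result depends on this reading order
    kept = {order[0]}                  # start from one randomly chosen instance
    store = [train[order[0]]]
    changed = True
    while changed:                     # repeat passes until nothing is added
        changed = False
        for i in order:
            x, y = train[i]
            if i not in kept and nearest_label(x, store) != y:
                kept.add(i)            # misclassified by the subset: keep it
                store.append(train[i])
                changed = True
    return store
```

Every pass compares the remaining instances with the current subset, which is where the quadratic worst-case cost and the order sensitivity mentioned above come from.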
3. Main Content
3.1. Related Concepts
Let $T=\{(x_i,y_i)\}_{i=1}^{n}$ be the labeled training set of $n$ instances from $c$ different classes, where each instance $x_i$ is expressed by a $d$-dimensional feature vector, $x_{ij}$ is its $j$th feature value, $y_i$ is the label of the instance $x_i$, and $d$ and $n$ are the numbers of features and instances, respectively.
The kNN classification algorithm makes predictions by comparing the similarity between the unlabeled instance and all the training instances. Given a test instance $x$, kNN first calculates its similarity with each instance in the training set, then sorts all the training instances in decreasing order of similarity and takes the first $k$ instances as the $k$ nearest neighbors of $x$. Finally, the class with the largest number of instances among these neighbors is taken as the label of the instance $x$.
The basic ideas of the existing kNN classification acceleration algorithms based on data partition are quite similar. After dividing the training set into several subsets of approximately equal size, they determine which divided subset the test instance is most similar to and find its neighbors in this subset. The kNN algorithm is a local learning algorithm, and the predicted label of the test instance depends only on the labels of its nearest neighbors in the training set. To obtain classification performance similar to that obtained using all the training instances, the DP-kNN algorithm needs to keep the $k$ nearest neighbors of the test instance in its divided subset as consistent as possible with those in the original training set. Specifically, the test instance and its $k$ nearest neighbors in the training set should remain in the same divided subset. Therefore, the data partition should be studied carefully.
The test instance is random and unknown before prediction, so its location in the feature space is difficult to determine. This makes it hard to design a good data partition. Fortunately, empirical risk minimization in statistical learning theory offers an effective way to solve this problem. Minimizing the empirical risk $\frac{1}{n}\sum_{i=1}^{n}L(y_i,\hat{y}_i)$ yields the optimal solution, where $L(y_i,\hat{y}_i)$ is the loss function between the true label $y_i$ and the label $\hat{y}_i$ predicted by the $k$ nearest neighbors of the training instance $x_i$. Accordingly, the partition should ensure that each training instance and its $k$ nearest neighbors in the training set remain in the same divided subset. To this end, we analyze the effect of data partitioning on the neighbor relationship from the perspective of optimization.
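The prediction rule just described can be sketched in a few lines of Python. This is a minimal brute-force version that assumes Euclidean distance as the (inverse) similarity; the name `knn_predict` and the `(features, label)` data layout are choices made for this example.

```python
import math
from collections import Counter


def knn_predict(train, x, k):
    """Predict the label of x by majority vote among its k nearest training instances."""
    # Rank all training instances by Euclidean distance to x (smaller = more similar).
    ranked = sorted(train, key=lambda inst: math.dist(inst[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Every call compares the test instance with the whole training set, which is the cost that data partition aims to avoid.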
3.2. Transformation of the Problem
For a given training set $T$ of $n$ instances, finding the $k$ nearest neighbors of each instance in $T$ can be transformed into solving the following optimization problem:
$$\max_{Z}\ \operatorname{tr}(SZ^{\top})\quad \text{s.t.}\ \sum_{j=1}^{n}z_{ij}=k,\ z_{ii}=0,\ z_{ij}\in\{0,1\},\ i,j=1,\dots,n, \tag{1}$$
where $Z=[z_{ij}]$ is an $n\times n$ boolean matrix, $z_{ij}=1$ when the instance $x_j$ is one of the $k$ nearest neighbors of the instance $x_i$, and $z_{ij}=0$ when it is not; $S=[s_{ij}]$ is the similarity matrix, and each element $s_{ij}$ is the similarity between the instance $x_i$ and the instance $x_j$; and $\operatorname{tr}(SZ^{\top})$ is the trace of the product of the matrix $S$ and the matrix $Z^{\top}$, i.e., $\operatorname{tr}(SZ^{\top})=\sum_{i=1}^{n}\sum_{j=1}^{n}s_{ij}z_{ij}$. The optimization problem (1) has only one optimal solution under the assumption that different instances have different similarities. Let $Z^{*}$ be the optimal solution to the optimization problem (1).
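The following Python sketch illustrates this reformulation under stated assumptions: the objective is taken to be the trace of the similarity matrix times the transposed indicator matrix, each row of the indicator matrix selects exactly k other instances, and the similarity is exp(-distance), a positive similarity chosen only for this example. The function names are hypothetical.

```python
import math


def similarity_matrix(points):
    """s_ij = exp(-||x_i - x_j||): an assumed positive similarity, larger for closer points."""
    return [[math.exp(-math.dist(p, q)) for q in points] for p in points]


def topk_indicator(S, k):
    """Boolean matrix Z with z_ij = 1 iff j is among the k instances most similar to i (j != i)."""
    n = len(S)
    Z = [[0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: S[i][j], reverse=True)
        for j in others[:k]:
            Z[i][j] = 1
    return Z


def objective(S, Z):
    """tr(S Z^T), i.e., the sum over all i, j of s_ij * z_ij."""
    return sum(s * z for row_s, row_z in zip(S, Z) for s, z in zip(row_s, row_z))
```

Because the objective is a sum of independent row terms, choosing the k most similar instances in each row maximizes it, which is why solving the optimization problem and running a nearest neighbor search give the same answer.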
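Suppose the training set $T$ is divided into $m$ disjoint subsets $T_1,\dots,T_m$, where $T=\bigcup_{l=1}^{m}T_l$. The kNN classification algorithm based on data partition aims at finding the $k$ nearest neighbors of each instance within its divided subset. For each divided subset $T_l$, searching for the $k$ nearest neighbors of each of its elements within $T_l$ can be transformed into solving the following optimization problem:
$$\max_{Z_l}\ \operatorname{tr}(S_lZ_l^{\top})\quad \text{s.t.}\ \sum_{j}[Z_l]_{ij}=k,\ [Z_l]_{ii}=0,\ [Z_l]_{ij}\in\{0,1\}, \tag{2}$$
where $l=1,\dots,m$; $Z_l$ is an $n_l\times n_l$ boolean matrix whose element for the pair $(x_i,x_j)$ equals 1 if and only if the instance $x_j$ is one of the $k$ nearest neighbors of the instance $x_i$ within $T_l$, and 0 otherwise; the matrix $S_l$ is the submatrix of the similarity matrix $S$ with row and column indexes $I_l=\{i: x_i\in T_l\}$; and $n_l$ is the size of the set $T_l$. Let $\hat{Z}_l$ be the optimal solution of the optimization problem (2) for the instance subset $T_l$, where $l=1,\dots,m$.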
3.3. The Estimation of the Effect of Data Partition
Let $\pi_i$ be the index of the divided subset to which the instance $x_i$ belongs, e.g., $\pi_i=l$ when $x_i\in T_l$, where $i=1,\dots,n$ and $l=1,\dots,m$. In fact, the nearest neighbor algorithm based on data partitioning approximately decomposes the optimization problem (1) into $m$ suboptimization problems (2) and solves each of them independently. The optimal solutions of these subproblems (2) are combined into a new $n\times n$ matrix $\hat{Z}$ whose element $\hat{z}_{ij}$ equals $\delta(\pi_i,\pi_j)$ times the entry of $\hat{Z}_{\pi_i}$ for the pair $(x_i,x_j)$, where the binary function $\delta(\pi_i,\pi_j)=1$ if and only if $\pi_i=\pi_j$; otherwise, $\delta(\pi_i,\pi_j)=0$. The matrix $\hat{Z}$ is an approximation of the optimal solution matrix $Z^{*}$ of the optimization problem (1). To ensure the performance of the algorithm, the difference between $\hat{Z}$ and $Z^{*}$ should be minimized. To measure this difference, we introduce the following lemma and theorem.
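Lemma 1. $\hat{Z}$ is the optimal solution to the following problem:
$$\max_{Z}\ \operatorname{tr}(\hat{S}Z^{\top})\quad \text{s.t.}\ \sum_{j=1}^{n}z_{ij}=k,\ z_{ii}=0,\ z_{ij}\in\{0,1\},\ i,j=1,\dots,n, \tag{3}$$
where the matrix $\hat{S}=[\hat{s}_{ij}]$ with $\hat{s}_{ij}=\delta(\pi_i,\pi_j)s_{ij}$.
Proof. According to the definition of $\hat{S}$, each element satisfies the following rule:
$$\hat{s}_{ij}=\begin{cases}s_{ij}, & \pi_i=\pi_j,\\ 0, & \pi_i\neq\pi_j.\end{cases} \tag{4}$$
Moreover, the following result can be seen from the calculation properties of the block matrix:
$$\operatorname{tr}(\hat{S}Z^{\top})=\sum_{l=1}^{m}\operatorname{tr}(S_lZ_l^{\top}), \tag{5}$$
where $Z_l$ is the submatrix of $Z$ whose row and column indexes belong to the $l$th subset. Therefore, when the similarities are nonnegative and every subset contains more than $k$ instances, the original optimization problem (3) can be decomposed into $m$ suboptimization problems (2), i.e.,
$$\max_{Z}\ \operatorname{tr}(\hat{S}Z^{\top})=\sum_{l=1}^{m}\max_{Z_l}\ \operatorname{tr}(S_lZ_l^{\top}). \tag{6}$$
On the other hand, all suboptimization problems are independent and the matrix $\hat{Z}_l$ is the optimal solution of the suboptimization problem (2), so the matrix $\hat{Z}$ is the optimal solution of the problem (3).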
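Theorem 2. For the given training set $T$ and its partition index set $\pi=\{\pi_1,\dots,\pi_n\}$, then
$$0\le f(Z^{*})-f(\hat{Z})\le D(\pi), \tag{7}$$
where $f(Z)=\operatorname{tr}(SZ^{\top})$, $D(\pi)=\sum_{i,j:\,\pi_i\neq\pi_j}|s_{ij}|$.
Proof. Let $\hat{f}(Z)=\operatorname{tr}(\hat{S}Z^{\top})$, where $\hat{s}_{ij}=s_{ij}$ if $\pi_i=\pi_j$ and $\hat{s}_{ij}=0$ otherwise. According to the definition of the matrix $\hat{S}$ and $\hat{Z}$, and because $\hat{z}_{ij}=0$ whenever $\pi_i\neq\pi_j$, we have
$$\hat{f}(\hat{Z})=f(\hat{Z}), \tag{8}$$
$$f(Z^{*})-\hat{f}(Z^{*})=\sum_{i,j:\,\pi_i\neq\pi_j}s_{ij}z^{*}_{ij}\le D(\pi). \tag{9}$$
Combining equations (8) and (9) and the result $\hat{f}(Z^{*})\le\hat{f}(\hat{Z})$ according to Lemma 1, we have
$$f(Z^{*})\le\hat{f}(Z^{*})+D(\pi)\le\hat{f}(\hat{Z})+D(\pi)=f(\hat{Z})+D(\pi). \tag{10}$$
The lower bound holds because $\hat{Z}$ is a feasible solution of the optimization problem (1) and $Z^{*}$ is its optimal solution.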
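A small numerical check can make the role of the partition tangible. The Python sketch below assumes the objective is the total similarity between each instance and its selected k neighbors, and it compares the gap between the unpartitioned and partitioned optima with the total similarity between instances placed in different subsets. The similarity exp(-distance), the function names, and the requirement that every subset hold more than k instances are assumptions of this example.

```python
import math


def sim(p, q):
    """Assumed positive similarity: larger for closer points."""
    return math.exp(-math.dist(p, q))


def topk(i, candidates, points, k):
    """Indices of the k candidates most similar to points[i]."""
    return sorted(candidates, key=lambda j: sim(points[i], points[j]), reverse=True)[:k]


def objective_gap(points, part, k):
    """Return (full optimum, partitioned optimum, cross-subset similarity).

    part[i] is the subset index of instance i; every subset needs more than k instances.
    """
    n = len(points)
    f_full = f_part = 0.0
    for i in range(n):
        everyone = [j for j in range(n) if j != i]
        same = [j for j in everyone if part[j] == part[i]]
        f_full += sum(sim(points[i], points[j]) for j in topk(i, everyone, points, k))
        f_part += sum(sim(points[i], points[j]) for j in topk(i, same, points, k))
    # Total similarity between instances that were placed in different subsets.
    cross = sum(sim(points[i], points[j])
                for i in range(n) for j in range(n) if part[i] != part[j])
    return f_full, f_part, cross
```

For two well-separated groups of three points, a partition that follows the groups leaves the objective unchanged, whereas a partition that mixes them loses objective value, and in both cases the loss stays below the cross-subset similarity.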
In the kNN classification task, it is often assumed that the similarities between each training instance and the other instances in the training set are all different. This assumption ensures that each training instance has fixed k nearest neighbors regardless of the order in which the data is read, and that the optimization problem (1) has a unique solution. Combined with the above theorem, reducing the difference between the objective function values under the approximate solution and under the optimal solution helps reduce the difference between the two solutions themselves. Therefore, we need to minimize the estimated difference; i.e., the similarity between instances that are not in the same partitioned subset should be made as small as possible.
To achieve this aim, the minibatch k-means clustering (MKC) algorithm is adopted to perform data partition, because it deals with large data efficiently and effectively [25–27]. The MKC algorithm is a kind of two-step k-means clustering algorithm: it first runs the k-means algorithm on instances randomly sampled from the original data to obtain the cluster centers; then, each remaining instance is assigned to a cluster according to its similarity to the cluster centers. The MKC algorithm is efficient because its time complexity depends on the size of the sampled subset. An additional advantage of this algorithm is that the number of clusters is usually bounded and the divided subsets tend to have uniform sizes, which provides an important reference for determining the number of divided subsets.
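The two-step procedure described above can be sketched as follows. This is a simplified stdlib-only illustration (k-means on a random sample, then nearest-center assignment of every instance), not a reproduction of any particular minibatch k-means implementation; the function names, `sample_size`, `iters`, and `seed` are hypothetical parameters of the example.

```python
import math
import random


def nearest_center(x, centers):
    """Index of the center closest to x (Euclidean distance)."""
    return min(range(len(centers)), key=lambda c: math.dist(x, centers[c]))


def sampled_kmeans_partition(points, m, sample_size, iters=20, seed=0):
    """Cluster a random sample with k-means, then assign every instance to its nearest center."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centers = rng.sample(sample, m)            # initial centers drawn from the sample
    for _ in range(iters):                     # Lloyd iterations on the sample only
        groups = [[] for _ in range(m)]
        for x in sample:
            groups[nearest_center(x, centers)].append(x)
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
    labels = [nearest_center(x, centers) for x in points]
    return centers, labels
```

Only the sample takes part in the iterative step, so the cost of clustering is decoupled from the size of the full training set; the final assignment is a single pass over the data.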
3.4. NPR-kNN Algorithm
Suppose the training set is divided into disjoint clusters using the MKC algorithm, and each cluster is taken as a divided subset. An important step after partitioning is to determine the k nearest neighbors of a given test instance. The traditional way assigns the test instance to a single divided subset according to the similarity between the instance and the cluster centers and then finds the k nearest neighbors within this target subset. However, this may be ineffective for instances far from the target cluster center, because it is difficult to guarantee that these instances and their neighbors in the training set are in the same cluster. Such instances and their nearest neighbors are very likely to lie in several adjacent clusters, because they are more similar to each other within a small local region of the feature space than to other instances. Therefore, cluster fusion is used to solve this problem. The target set in which the test instance searches for its nearest neighbors is extended to the union of several candidate clusters, namely the clusters whose centers are most similar to the test instance, where the number of candidate clusters is an integer greater than 1. This largely increases the probability that the test instance finds the same nearest neighbors as in the original training set. A fixed number of candidate clusters is not desirable across datasets, because the sparseness of the data distribution differs greatly. We therefore adopt an early stopping rule to determine this number adaptively for each test instance: candidate clusters are merged successively, in decreasing order of the similarity between their centers and the test instance, until the k nearest neighbors of the test instance in the merged set no longer change. The following algorithm shows the detailed procedure of the NPR-kNN algorithm.
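A minimal Python sketch of the prediction step as described in this subsection is given here; it is an illustration under assumptions, not the authors' listing. The name `npr_knn_predict`, the `(features, label)` layout, and the use of Euclidean distance as the inverse of similarity are choices made for the example.

```python
import math
from collections import Counter


def npr_knn_predict(x, clusters, centers, k):
    """Label x by kNN over successively merged clusters, stopping when the neighbors settle.

    clusters[c] is a list of (features, label) pairs; centers[c] is the center of clusters[c].
    """
    # Visit clusters in order of increasing distance between their centers and x.
    order = sorted(range(len(centers)), key=lambda c: math.dist(x, centers[c]))
    merged, neighbors = [], None
    for c in order:
        merged.extend(clusters[c])
        current = sorted(merged, key=lambda inst: math.dist(x, inst[0]))[:k]
        if current == neighbors:   # one more cluster changed nothing: stop early
            break
        neighbors = current
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0], neighbors
```

At least two clusters are always examined when more than one exists, because the stopping test needs a previous neighbor list to compare against. A test instance near a cluster border therefore still recovers neighbors that lie in the adjacent cluster.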
3.5. Complexity Analysis of the Proposed Algorithm
Besides classification performance, execution efficiency is another important evaluation criterion. The NPR-kNN algorithm includes a data partition stage and a prediction stage. In the first stage, data partition is performed by the minibatch k-means clustering algorithm, which is designed for big data and is several times more efficient than the traditional k-means clustering algorithm. In the prediction stage, the test instance searches for its k nearest neighbors in the target divided subset rather than among all the training instances, and the target subset is obtained by computing only the similarity between the cluster centers and the test instance. Moreover, the NPR-kNN algorithm has the additional advantage of allowing distributed storage of large-scale training data, because the training data is divided into several disjoint subsets with no intersection among them. Therefore, the proposed algorithm can effectively deal with large-scale data.
4. Experiments
To test the proposed algorithm, extensive experimental comparisons with two representative kNN classification acceleration algorithms based on data partition have been carried out on real datasets.
4.1. Experiment Setup
Two representative algorithms are selected for comparison in this paper: the kNN classification algorithm based on the KD-tree (MKD-kNN) and the kNN classification algorithm based on the PCA tree (PCA-kNN). Ten large-scale public datasets are chosen to make a fair comparison and to verify the effectiveness of the proposed algorithm [30, 31]; each dataset contains more than 90000 instances. Information about the ten selected datasets is shown in Table 1.
The NPR-kNN, MKD-kNN, and PCA-kNN algorithms are all approximations of the kNN classification algorithm. To evaluate how consistent the k nearest neighbors of each training instance are before and after the training set is divided, we use the training matching ratio, defined as the number of training instances whose k nearest neighbors in the divided subsets are the same as in the whole training set, divided by the size of the training set. The larger the training matching ratio, the better the algorithm preserves the locality of the data; the smaller it is, the weaker the preserved locality. The test accuracy is an important index for evaluating the performance of a classifier; it characterizes whether the predicted label of a test instance is consistent with its true label. However, it does not reflect whether the nearest neighbor sequence of the test instance obtained by the approximate nearest neighbor algorithm is consistent with that obtained by the original kNN algorithm. To this end, we also calculate the test matching ratio, defined as the number of test instances whose k nearest neighbors obtained by the approximate nearest neighbor algorithm are the same as those obtained by the original kNN algorithm, divided by the number of all test instances. Tenfold cross-validation is used to estimate the three performance indexes on the different datasets. In addition, the Wilcoxon signed-rank test [32, 33] is adopted to test whether there is a significant difference in performance between the NPR-kNN algorithm and the other algorithms.
In the following experiments, all attribute values of the datasets are normalized to a common interval to avoid the influence of different scales between attributes. The Euclidean distance is used to measure the similarity between instances. The performance of an approximate nearest neighbor algorithm based on data partition is affected by the size of the subsets, so the algorithms need to be compared under different numbers of divided subsets. Following a suggestion from the literature, we choose four different values as the threshold of the divided subset size. The number of divided subsets is determined from this threshold by a formula that rounds up to the nearest integer. Moreover, the value of k is chosen based on experimental results reported in the literature. The significance level is 0.05.
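The training matching ratio can be computed as in the following Python sketch. It assumes that "the same" neighbors means equal neighbor sets (ignoring order) and that Euclidean distance ranks the neighbors; the function names and the `part` encoding (one subset index per instance) are choices made for this example.

```python
import math


def k_nearest(i, candidates, points, k):
    """Set of the k candidate indices closest to points[i]."""
    return set(sorted(candidates, key=lambda j: math.dist(points[i], points[j]))[:k])


def training_matching_ratio(points, part, k):
    """Fraction of instances whose k nearest neighbors are unchanged inside their own subset."""
    n = len(points)
    matched = 0
    for i in range(n):
        everyone = [j for j in range(n) if j != i]
        same = [j for j in everyone if part[j] == part[i]]
        if k_nearest(i, everyone, points, k) == k_nearest(i, same, points, k):
            matched += 1
    return matched / n
```

The test matching ratio follows the same pattern, with the neighbors of each test instance returned by the approximate algorithm compared against those returned by brute-force kNN on the whole training set.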
4.2. Experiment Analysis
The following experiment analysis is made from the three indicators of training matching ratio, test matching ratio, and test accuracy.
4.2.1. Training Matching Ratio
The training matching ratio measures the consistency of the k nearest neighbors of all the training instances searched under data partition. To make a fair comparison with the other algorithms, each training instance finds its k nearest neighbors in a single divided subset rather than in the union of several divided subsets. Tables 2–5 list the training matching ratio of the different algorithms under different subset-size thresholds, where two descriptive statistics and the p value of the Wilcoxon signed-rank test between the NPR-kNN algorithm and each of the other algorithms are also listed in the last three lines of the tables.
The following analysis is organized by the value of the subset-size threshold, because it has a great effect on the training matching ratio. From the results in Tables 2–4 under the three smaller threshold values, the training matching ratio of the NPR-kNN algorithm is several times that of the MKD-kNN and PCA-kNN algorithms on most datasets, the exception being the Skin-noskin dataset. Besides the threshold, the sparseness of the data distribution also affects the training matching ratio. If most instances of a dataset are densely distributed in the input space, data partition has a small effect on the preservation of the nearest neighbor relationship, and the ratio can be large. The Skin-noskin dataset is relatively densely distributed and its divided subsets contain hundreds of instances, so the training matching ratios of all three algorithms on it are larger than 0.9 under the different threshold values. Meanwhile, the mean and median of the training matching ratio of the NPR-kNN algorithm over the datasets are close to or greater than 0.5 and are the largest among the three algorithms under these thresholds. Finally, the p value of the Wilcoxon signed-rank test between the NPR-kNN algorithm and each of the other algorithms is smaller than the given significance level. Therefore, the NPR-kNN algorithm obtains the best training matching ratio compared with the MKD-kNN and PCA-kNN algorithms under the smaller thresholds.
For the results in Table 5 under the largest threshold value, the three algorithms have similar training matching ratios on all the datasets except the Acoustic and Aloi datasets. The means of the ratio for these algorithms are 0.832, 0.825, and 0.825, and the medians are 0.854, 0.850, and 0.851. Moreover, these algorithms obtain larger ratios under the largest threshold than under the smaller ones. The reason is that a divided subset contains many elements under a large threshold, which largely increases the probability that each element and its k nearest neighbors are still in the same divided subset. Nevertheless, the p value of the Wilcoxon signed-rank test between the NPR-kNN algorithm and each of the other algorithms is smaller than the given significance level. Therefore, there is a significant difference between the NPR-kNN algorithm and each of the other algorithms, and the NPR-kNN algorithm also obtains the best result under the largest threshold. In conclusion, the experimental results show that the NPR-kNN algorithm largely keeps the instances and their k nearest neighbors in the same divided subsets, which also supports Theorem 2.
4.2.2. Test Matching Ratio
Besides the training performance measured by the training matching ratio, we pay more attention to the test performance of the algorithm. The test matching ratio is adopted to measure the extent to which the test instances and their k nearest neighbors are still in the same divided subset, and it also evaluates whether the improved algorithm using data partition can obtain performance similar to that of the original k-nearest neighbor classification algorithm. Tables 6–9 list the results of the three algorithms under different subset-size thresholds, and the statistical results are also listed in the last three lines of the tables.
The results in Tables 6–9 show that the test matching ratio of the NPR-kNN algorithm is several times larger than that of the other algorithms on all the datasets except the Skin-noskin dataset under the different threshold values. The test matching ratios of the three algorithms on the Skin-noskin dataset are larger than 0.9; this is because its elements are densely distributed in the input space, and each divided subset has hundreds of elements. Meanwhile, the mean and median of the test matching ratio of the NPR-kNN algorithm over the datasets are close to or greater than 0.8 and are the largest among the three algorithms. Finally, the p value of the Wilcoxon signed-rank test between the NPR-kNN algorithm and each of the other algorithms is smaller than the given significance level. Therefore, the NPR-kNN algorithm obtains the best test matching ratio compared with the MKD-kNN and PCA-kNN algorithms. The NPR-kNN algorithm searches for the k nearest neighbors in the union of several divided subsets rather than in only one divided subset. This increases the probability that the test instance and its k nearest neighbors are still in the candidate subset.
On the other hand, the subset-size threshold affects the test matching ratios of these algorithms differently. The test matching ratio of the NPR-kNN algorithm changes little on each dataset under different threshold values, while that of the other algorithms varies widely. The reason is that these algorithms use different numbers of divided subsets to find the k nearest neighbors, and the threshold controls the number of instances in a divided subset. For both the MKD-kNN and PCA-kNN algorithms, each test instance finds its k nearest neighbors using only one divided subset, so the threshold has a great effect on the test matching ratio. In the NPR-kNN algorithm, the k nearest neighbors are continually updated by successively merging the divided subsets until they no longer change, rather than being taken from only one divided subset, and this operation greatly reduces the effect of the threshold. Therefore, the test matching ratio of the NPR-kNN algorithm is not sensitive to the threshold, and this advantage makes the algorithm more practical for real problems.
4.2.3. Test Performance
The generalization ability can be measured by the classification accuracy on the test data, which is the most commonly used performance indicator. Tables 10–13 list the results of the three algorithms under different values of the parameter.
The results in Tables 10–13 show that the classification accuracy of the NPR-kNN algorithm is not less than that of the MKD-kNN and PCA-kNN algorithms on all the datasets under different values of the parameter, and that it is similar to the accuracy of the original kNN classification algorithm. Moreover, the NPR-kNN algorithm has much better test performance than the MKD-kNN and PCA-kNN algorithms on the multiclass dataset Aloi, which has 1000 classes. In terms of the mean and median classification accuracy over all the datasets, the NPR-kNN algorithm outperforms the other improved algorithms and obtains results similar to those of the kNN classification algorithm. Finally, the p-values of the Wilcoxon signed-rank test between the NPR-kNN algorithm and each of the other improved algorithms are smaller than the given significance level of 0.05 under different values of the parameter, so the NPR-kNN algorithm achieves significantly better classification accuracy than they do. There is no significant difference between the NPR-kNN algorithm and the kNN classification algorithm because the p-values between them are larger than 0.05. The reason is that, compared with the other algorithms, the NPR-kNN algorithm obtains k-nearest neighbors that are most likely to be the same as those of the original algorithm, a conclusion that is also supported by the experimental results in the preceding subsection.
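For readers unfamiliar with the test used in this comparison, the sketch below computes an exact two-sided Wilcoxon signed-rank p-value for paired per-dataset accuracies by enumerating all sign patterns. It assumes there are no zero differences and no tied absolute differences. The accuracy values are made up for illustration and are not results from this paper, and the function name is hypothetical.

```python
from itertools import product

def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Assumes no zero differences and no tied absolute differences,
    so the ranks are simply 1..n. Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y)]
    if any(d == 0 for d in diffs) or len({abs(d) for d in diffs}) != len(diffs):
        raise ValueError("this sketch does not handle zeros or ties")
    n = len(diffs)
    # Rank the absolute differences from smallest (rank 1) to largest (rank n).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    # W_plus is the sum of the ranks that belong to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null hypothesis every sign pattern is equally likely, so
    # enumerate all 2**n patterns and count those at least as extreme.
    mean = n * (n + 1) / 4
    extreme = sum(
        1
        for signs in product((0, 1), repeat=n)
        if abs(sum(r * s for r, s in zip(range(1, n + 1), signs)) - mean)
        >= abs(w_plus - mean)
    )
    return w_plus, extreme / 2 ** n

# Hypothetical accuracies of two classifiers on six datasets (illustrative only).
acc_a = [0.91, 0.85, 0.78, 0.96, 0.88, 0.73]
acc_b = [0.89, 0.80, 0.77, 0.93, 0.80, 0.69]

w_plus, p_value = wilcoxon_signed_rank(acc_a, acc_b)
print(w_plus, p_value)
```

With six datasets on which the first classifier always wins, the two-sided p-value is 2/64 = 0.03125, which is below 0.05. In practice a library routine that handles ties and larger samples would normally be used instead of full enumeration.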
5. Conclusions
We have proposed a novel algorithm that accounts for the effect of data partition on the classification performance of the kNN classification algorithm and largely keeps the same nearest neighbors as the original algorithm. Unlike previous improved kNN classification algorithms based on data partition, the proposed algorithm is derived from a theoretical study of the effect of data partition from the perspective of optimization, which shows that keeping the similarity between instances in different partitioned subsets small is the key factor for the generalization ability of the classifier. To this end, the minibatch k-means clustering algorithm is adopted to perform the data partition because of its efficiency and effectiveness, and an early stopping rule is designed to search for the k-nearest neighbors in the divided subsets. Moreover, the algorithm can effectively deal with large-scale data because of its linear time complexity. Experimental results on multiple real datasets show that the proposed algorithm obtains k-nearest neighbors and classification performance similar to those of the original kNN classification algorithm and better results than two state-of-the-art algorithms. The method in this paper provides a paradigm for handling large-scale data and offers a promising route to scalable algorithms based on data partition. In future work, we will study how to combine the results of multiple data partitions to improve the performance of the kNN classification algorithm.
Data Availability
All the datasets used can be downloaded from the LIBSVM dataset collection (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/) and the UCI machine learning repository (https://archive.ics.uci.edu/ml/index.php).
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was also supported by the Shandong Provincial Natural Science Foundation, China (No. ZR2020MF146), the Major Scientific and Technological Innovation Project of Shandong Province (No. 2019JZZY010716), and the Open Project Foundation of Intelligent Information Processing Key Laboratory of Shanxi Province (No. CICIP2021002).
References
M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, MIT Press, 2018.
A.-J. Gallego, J. Calvo-Zaragoza, J. J. Valero-Mas, and J. R. Rico-Juan, “Clustering-based k-nearest neighbor classification for large-scale data with neural codes representation,” Pattern Recognition, vol. 74, pp. 531–543, 2018.
J. Gou, W. Qiu, Z. Yi, X. Shen, Y. Zhan, and W. Ou, “Locality constrained representation-based k-nearest neighbor classification,” Knowledge-Based Systems, vol. 167, pp. 38–52, 2019.
E. Prasetyo, R. Purbaningtyas, and R. Adityo, “Cosine K-nearest neighbor in milkfish eye classification,” International Journal of Intelligent Engineering and Systems, vol. 13, no. 3, pp. 11–25, 2020.
B. Tang, H. He, and S. Zhang, “MCENN: a variant of extended nearest neighbor method for pattern recognition,” Pattern Recognition Letters, vol. 133, pp. 116–122, 2020.
D. Jiangyi and F. Bian, “A privacy-preserving and efficient k-nearest neighbor query and classification scheme based on k-dimensional tree for outsourced data,” IEEE Access, vol. 8, pp. 69333–69345, 2020.
X. Wu, V. Kumar, J. Ross Quinlan et al., “Top 10 algorithms in data mining,” Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.
E. Marchiori, “Class conditional nearest neighbor for large margin instance selection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 2, pp. 364–370, 2010.
Y. Li and L. Maguire, “Selecting critical patterns based on local geometrical and statistical information,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1189–1201, 2011.
M. Muja and D. G. Lowe, “Scalable nearest neighbor algorithms for high dimensional data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 11, pp. 2227–2240, 2014.
T. Liu, A. W. Moore, A. G. Gray, and K. Yang, “An investigation of practical approximate nearest neighbor algorithms,” in Advances in Neural Information Processing Systems, pp. 825–832, MIT Press, Vancouver, Canada, 2004.
S. Garcia, J. Derrac, J. Cano, and F. Herrera, “Prototype selection for nearest neighbor classification: taxonomy and empirical study,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 417–435, 2012.
J. H. Friedman, J. L. Bentley, and R. A. Finkel, “An algorithm for finding best matches in logarithmic expected time,” ACM Transactions on Mathematical Software, vol. 3, no. 3, pp. 209–226, 1977.
N. Verma, S. Kpotufe, and S. Dasgupta, “Which spatial partition trees are adaptive to intrinsic dimension?” in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 565–574, Montreal, Canada, 2009.
B. McFee and G. R. G. Lanckriet, “Large-scale music similarity search with spatial trees,” in Proceedings of the 12th International Society for Music Information Retrieval Conference, pp. 55–60, Miami, USA, 2011.
M. Slaney and M. Casey, “Locality-sensitive hashing for finding nearest neighbors [lecture notes],” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 128–131, 2008.
C. García-Osorio, A. de Haro-García, and N. García-Pedrajas, “Democratic instance selection: a linear complexity instance selection algorithm based on classifier ensemble concepts,” Artificial Intelligence, vol. 174, no. 5–6, pp. 410–441, 2010.
A. de Haro-García, N. García-Pedrajas, and J. A. R. del Castillo, “Large scale instance selection by means of federal instance selection,” Data & Knowledge Engineering, vol. 75, pp. 58–77, 2012.
C.-J. Hsieh, S. Si, and I. Dhillon, “A divide-and-conquer solver for kernel support vector machines,” in Proceedings of the 31st International Conference on Machine Learning, pp. 566–574, Beijing, China, 2014.
P. Hart, “The condensed nearest neighbor rule (corresp.),” IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 515–516, 1968.
F. Angiulli, “Fast nearest neighbor condensation for large data sets classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 11, pp. 1450–1464, 2007.
F. Angiulli and G. Folino, “Distributed nearest neighbor-based condensation of very large data sets,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 12, pp. 1593–1606, 2007.
K. Nikolaidis, J. Y. Goulermas, and Q. H. Wu, “A class boundary preserving algorithm for data condensation,” Pattern Recognition, vol. 44, no. 3, pp. 704–715, 2011.
H. Zhang and G. Sun, “Optimal reference subset selection for nearest neighbor classification by tabu search,” Pattern Recognition, vol. 35, no. 7, pp. 1481–1490, 2002.
D. Sculley, “Web-scale k-means clustering,” in Proceedings of the 19th International Conference on World Wide Web, pp. 1177–1178, Raleigh, USA, 2010.
A. Hidayat, D. Jamaluddin, and D. S. Maylawati, “Data analytics for effectiveness evaluation of Islamic higher education using k-means algorithm,” International Journal of Advanced Science and Technology, vol. 29, no. 3, pp. 4149–4161, 2020.
T. Li, Y. Ma, and T. Endoh, “Normalization-based validity index of adaptive k-means clustering for multi-solution application,” IEEE Access, vol. 8, pp. 9403–9419, 2020.
R. Tibshirani, G. Walther, and T. Hastie, “Estimating the number of clusters in a data set via the gap statistic,” Journal of the Royal Statistical Society, vol. 63, no. 2, pp. 411–423, 2001.
S. C. Hicks, R. Liu, Y. Ni, E. Purdom, and D. Risso, “Mbkmeans: fast clustering for single cell data using mini-batch k-means,” PLoS Computational Biology, vol. 17, no. 1, pp. 1–18, 2021.
C.-C. Chang and C.-J. Lin, “LIBSVM,” ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 1–27, 2011.
K. Bache and M. Lichman, “UCI machine learning repository,” 2021, http://archive.ics.uci.edu/ml/index.php.
F. Wilcoxon, “Individual comparisons by ranking methods,” in Breakthroughs in Statistics, pp. 196–202, Springer, 1992.
J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” The Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
M. Kordos, M. Blachnik, and D. Strzempa, “Do we need whatever more than k-NN?” in International Conference on Artificial Intelligence and Soft Computing, pp. 414–421, Heidelberg, Germany, 2010.