Abstract

Probability density function (p.d.f.) estimation plays an important role in the field of data mining. The kernel density estimator (KDE) is the most widely used technique for estimating the unknown p.d.f. of a given dataset. Existing KDEs are usually inefficient when handling the p.d.f. estimation problem for stream data because a brand-new KDE has to be retrained on the combination of the current data and the newly coming data. This process increases the training time and wastes computational resources. This article proposes an incremental kernel density estimator (I-KDE) which deals with the p.d.f. estimation problem in the manner of data stream computation. The I-KDE updates the current KDE dynamically and gradually with the newly coming data rather than retraining a brand-new KDE on the combination of the current data and the newly coming data. The theoretical analysis proves the convergence of the I-KDE provided that the estimated p.d.f. of the newly coming data converges to its true p.d.f. In order to guarantee the convergence of the I-KDE, a new multivariate fixed-point iteration algorithm based on the unbiased cross validation (UCV) method is developed to determine the optimal bandwidth of the KDE. The experimental results on 10 univariate and 4 multivariate probability distributions demonstrate the feasibility and effectiveness of the I-KDE.

1. Introduction

Probability density function (p.d.f.) estimation [1] determines the p.d.f. of a random variable (r.v.) from a given dataset in a nonparametric way. It plays an important role in the field of data mining because many machine learning tasks are related to p.d.f. estimation, for example, Bayesian classification [2], density-based clustering [3], feature selection [4], time series analysis [5], and image processing [6]. The most widely used p.d.f. estimator is the Parzen window estimator [7], which is also termed the kernel density estimator (KDE). The KDE uses the superposition of multiple kernels (e.g., triangular, Epanechnikov, biweight, triweight, cosine, and Gaussian kernels) to fit the unknown p.d.f. of a given dataset. Selecting an appropriate window bandwidth (kernel size) is the core of training an effective KDE: a bandwidth that is too large oversmooths the estimate, while one that is too small undersmooths it. Until now, studies on constructing KDEs have mainly focused on the selection of optimal bandwidths.

In order to select the optimal bandwidth, an effective error criterion should be deliberately designed [8]. The mean integrated square error (MISE) is a typical error criterion which measures the expected error between the estimated p.d.f. and the true p.d.f. Because this error criterion includes an unknown term, i.e., the true p.d.f., bandwidth selection methods have to use different approximation strategies to replace it and then determine the optimal bandwidths for specific applications. Many bandwidth selection methods have been developed based on the MISE criterion; the representative works are summarized as follows. The rule of thumb (RoT) [9] is the simplest method: it determines a quick normal-scale bandwidth by assuming the data obey a normal distribution. However, when the data are not close to normal, RoT tends to oversmooth and mask important features of the data [8]. The bootstrap method [10] uses the p.d.f. estimated from resampled data to replace the true p.d.f. and then minimizes the bootstrap criterion function to determine the bandwidth for p.d.f. estimation. The unbiased cross validation (UCV) method [11], which is also termed least squares cross validation (LSCV), uses the leave-one-out strategy to estimate the true p.d.f. in the error criterion; the expected value of UCV equals the difference between the expected value of the integrated square error (ISE) and a constant related to the true p.d.f. The biased cross validation (BCV) method [12] derives a smoothed objective function to optimize the bandwidth based on an asymptotic MISE, and theoretical analysis shows that BCV has a good convergence rate for the optimal bandwidth. A solve-the-equation approach [13] for univariate p.d.f. estimation was studied based on the plug-in strategy to approximate the true p.d.f. in the asymptotic MISE, and a corresponding iterative algorithm was designed to find the optimal bandwidth. The experimental results demonstrate that the aforementioned MISE-based KDEs and their variants [14–16] obtain good performance when handling p.d.f. estimation problems in a stationary way.

As far as we know, no existing KDE is specially designed to estimate the p.d.f. of stream data, i.e., to estimate the p.d.f. in a nonstationary or incremental way. Stream data [17] can be regarded as a dynamic dataset whose size gradually increases with the continuous collection of new data. When the data arrive at the learning system in batches, existing KDEs have to reestablish brand-new p.d.f.s based on the combination of the current data and the newly coming data. This training mode severely wastes computational resources because the information in the p.d.f. estimated from the current data is discarded entirely. In addition, it significantly increases the computation time because the p.d.f. is estimated by putting the current data and the newly coming data together. Existing KDEs cannot work well once the total amount of data accumulated over the data batches exceeds the memory capacity of the computer. Incremental learning [18] provides an efficient paradigm for dealing with stream data, especially in the big data age, and it also gives a flexible and feasible way to analyze large-scale data.

This article proposes an incremental kernel density estimator (I-KDE) for stream data mining. The I-KDE updates the current KDE dynamically and gradually with the newly coming data rather than retraining a brand-new KDE on the combination of the current data and the newly coming data; that is, the I-KDE deals with the p.d.f. estimation problem in the manner of data stream computation. The theoretical analysis proves the convergence of the I-KDE provided that the estimated p.d.f. of the newly coming data converges to its true p.d.f. In order to guarantee the convergence of the I-KDE, a new multivariate fixed-point iteration algorithm based on unbiased cross validation (UCV) [11] is developed to determine the optimal bandwidth of the KDE. The experimental results on 10 univariate and 4 multivariate probability distributions demonstrate the feasibility and effectiveness of the I-KDE for estimating the p.d.f. of stream data. The remainder of this article is organized as follows. In Section 2, we introduce the problem formulation. The I-KDE is presented in Section 3. In Section 4, we report the experimental results that demonstrate the effectiveness of the I-KDE. Finally, we give our conclusions and suggestions for future research in Section 5.

2. Problem Formulation

Assume there are $K$ datasets $D_1, D_2, \ldots, D_K$ with $n_1, n_2, \ldots, n_K$ $d$-dimensional samples, respectively, where $D_k = \{\mathbf{x}_{k1}, \mathbf{x}_{k2}, \ldots, \mathbf{x}_{kn_k}\}$, $k = 1, 2, \ldots, K$.

Let $D = D_1 \cup D_2 \cup \cdots \cup D_K$ represent the combination of the datasets $D_1, D_2, \ldots, D_K$ and $N = \sum_{k=1}^{K} n_k$ denote its size. For any $\mathbf{x}_i \in D$ there exist $k \in \{1, \ldots, K\}$ and $i' \in \{1, \ldots, n_k\}$ such that $\mathbf{x}_i = \mathbf{x}_{ki'}$. If we want to estimate the underlying p.d.f. for the dataset $D$, the mathematical model of the classical KDE [7] is described as

$$\hat{f}_{D}(\mathbf{x}) = \frac{1}{N\prod_{j=1}^{d} h_j}\sum_{i=1}^{N} K\!\left(\frac{x_1 - x_{i1}}{h_1}, \frac{x_2 - x_{i2}}{h_2}, \ldots, \frac{x_d - x_{id}}{h_d}\right), \tag{3}$$

where $K(\mathbf{u}) = (2\pi)^{-d/2}\exp\left(-\frac{1}{2}\mathbf{u}^{\top}\mathbf{u}\right)$ is the standard $d$-variate Gaussian kernel function and $\mathbf{h} = (h_1, h_2, \ldots, h_d)$ is the optimal window bandwidth, which can be determined with different bandwidth selection strategies, e.g., the bootstrap method [10], the unbiased cross validation (UCV) method [11], the biased cross validation (BCV) method [12], and the solve-the-equation method [13].
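As an illustration of equation (3), the following Python sketch implements a product-kernel Gaussian KDE with a per-dimension bandwidth vector. The function names and interface are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def gaussian_kde(data, h):
    """Return a callable KDE built from `data` (n x d) with bandwidth vector `h` (d,)."""
    data = np.asarray(data, dtype=float)
    h = np.asarray(h, dtype=float)
    n, d = data.shape
    norm = n * np.prod(h) * (2.0 * np.pi) ** (d / 2.0)

    def pdf(x):
        # u has shape (m, n, d): scaled differences between query points and samples
        u = (np.atleast_2d(x)[:, None, :] - data[None, :, :]) / h
        kernel = np.exp(-0.5 * np.sum(u ** 2, axis=2))  # standard d-variate Gaussian kernel
        return kernel.sum(axis=1) / norm                # averaged kernel mass at each query

    return pdf

# Example: 1000 samples from a 2-D normal, evaluated at two query points
rng = np.random.default_rng(0)
f_hat = gaussian_kde(rng.normal(size=(1000, 2)), h=[0.3, 0.3])
print(f_hat(np.array([[0.0, 0.0], [1.0, 1.0]])))
```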

In fact, the estimation paradigm shown in equation (3) has the following three limitations. First, equation (3) cannot retain the information of the current p.d.f. when the data arrive at the learning system sequentially. Second, the computation times are obviously different even for newly coming data of the same size: the earlier the data arrive, the less training time is required. Third, equation (3) does not work well when the data amount exceeds the memory capacity of the computer. The main reason for these limitations is that the existing KDE must put all the data together to estimate a brand-new p.d.f. Our solution focuses on retaining the current p.d.f. and using the newly coming data to update it. The research problem of this article is stated as follows.

Problem Statement. Assume the current p.d.f. estimated by the learning system based on the data $D_1 \cup D_2 \cup \cdots \cup D_K$ is $\hat{f}_{1:K}(\mathbf{x})$, where the positive integer $K$ is the number of data batches received so far. Our objective is to use the p.d.f. $\hat{f}_{K+1}(\mathbf{x})$ estimated with the newly coming data $D_{K+1}$ to update $\hat{f}_{1:K}(\mathbf{x})$ and then obtain the estimated p.d.f. for the data $D_1 \cup \cdots \cup D_{K+1}$, i.e., $\hat{f}_{1:K+1}(\mathbf{x}) = \hat{f}_{1:K}(\mathbf{x}) \oplus \hat{f}_{K+1}(\mathbf{x})$, where the mathematical symbol $\oplus$ represents the incremental updating of the previous result.

3. I-KDE: An Incremental Kernel Density Estimator

This section presents an incremental KDE (I-KDE) for stream data mining. Assume the learning system has received $K$ batches of data $D_1, D_2, \ldots, D_K$ and the newly coming dataset is $D_{K+1} = \{\mathbf{x}_{(K+1)1}, \mathbf{x}_{(K+1)2}, \ldots, \mathbf{x}_{(K+1)n_{K+1}}\}$.

The estimated p.d.f.s $\hat{f}_{1:K}(\mathbf{x})$, $\hat{f}_{K+1}(\mathbf{x})$, and $\hat{f}_{1:K+1}(\mathbf{x})$ with the data $D_1 \cup \cdots \cup D_K$, $D_{K+1}$, and $D_1 \cup \cdots \cup D_{K+1}$ take the form of equation (3), respectively, where $\mathbf{h}^{(K+1)}$ is the optimal bandwidth of the p.d.f. estimated with the newly coming data $D_{K+1}$.

Comparing these estimators, we can find that it consumes less time to determine the bandwidth $\mathbf{h}^{(K+1)}$ than to determine $\mathbf{h}^{(1:K)}$ and $\mathbf{h}^{(1:K+1)}$ when $n_{K+1}$ is smaller than $\sum_{k=1}^{K} n_k$ and $\sum_{k=1}^{K+1} n_k$. Thus, the I-KDE only considers how to determine the optimal bandwidth for the newly coming data $D_{K+1}$ and gives up calculating the bandwidths for both the current data and the combination of current data and newly coming data. Then, $\hat{f}_{1:K+1}(\mathbf{x})$ can be represented as

$$\hat{f}_{1:K+1}(\mathbf{x}) = \frac{\sum_{k=1}^{K} n_k}{\sum_{k=1}^{K+1} n_k}\,\hat{f}_{1:K}(\mathbf{x}) + \frac{n_{K+1}}{\sum_{k=1}^{K+1} n_k}\,\hat{f}_{K+1}(\mathbf{x}). \tag{11}$$

According to equation (11), we can iteratively derive the following equations:

$$\hat{f}_{1:k}(\mathbf{x}) = \frac{\sum_{j=1}^{k-1} n_j}{\sum_{j=1}^{k} n_j}\,\hat{f}_{1:k-1}(\mathbf{x}) + \frac{n_k}{\sum_{j=1}^{k} n_j}\,\hat{f}_{k}(\mathbf{x}), \quad k = K, K-1, \ldots, 2. \tag{12}$$

Because the learning system initially receives only the data $D_1$, $\hat{f}_{1:1}(\mathbf{x}) = \hat{f}_{1}(\mathbf{x})$ holds for $k = 1$. Substituting equation (12) into equation (11) yields the mathematical model of the I-KDE as

$$\hat{f}_{1:K+1}(\mathbf{x}) = \sum_{k=1}^{K+1} \frac{n_k}{\sum_{j=1}^{K+1} n_j}\,\hat{f}_{k}(\mathbf{x}), \tag{14}$$

where $\mathbf{h}^{(k)}$ is the optimal bandwidth of the p.d.f. $\hat{f}_{k}(\mathbf{x})$ estimated with the data $D_k$. Equation (14) reveals that the KDE trained on the union of different data blocks can be decomposed into different KDEs which are trained on the corresponding data blocks in an independent way. From equation (14), we can see that the I-KDE is an asymptotic integration model of a series of KDEs which are trained gradually on the different data blocks. The I-KDE provides an effective way to deal with the p.d.f. estimation problem for stream data and meanwhile makes it possible to estimate the p.d.f. for large-scale data by partitioning it into different data blocks. Figure 1 depicts the procedure of the I-KDE.
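The following Python sketch illustrates the idea of equation (14): the I-KDE keeps each batch's KDE with its own bandwidth and evaluates their sample-size-weighted mixture, so a new block is absorbed without retraining on previous batches. The class and method names are illustrative, `select_bandwidth` stands in for whatever bandwidth selector (e.g., the fixed-point Algorithm 1 described later) is plugged in, and the `gaussian_kde` helper sketched in Section 2 is reused.

```python
import numpy as np

class IncrementalKDE:
    """Weighted mixture of per-batch Gaussian KDEs (a sketch of the I-KDE idea)."""

    def __init__(self, select_bandwidth):
        self.select_bandwidth = select_bandwidth  # callable: batch -> bandwidth vector
        self.batches = []                         # list of (data, bandwidth) pairs
        self.total = 0                            # total number of samples seen so far

    def update(self, batch):
        """Absorb a newly coming data block; only its own bandwidth is optimized."""
        batch = np.asarray(batch, dtype=float)
        h = np.asarray(self.select_bandwidth(batch), dtype=float)
        self.batches.append((batch, h))
        self.total += batch.shape[0]

    def pdf(self, x):
        """Evaluate the mixture sum_k (n_k / N) * f_hat_k(x)."""
        x = np.atleast_2d(x)
        value = np.zeros(x.shape[0])
        for data, h in self.batches:
            # gaussian_kde is the per-batch KDE helper sketched in Section 2
            value += (data.shape[0] / self.total) * gaussian_kde(data, h)(x)
        return value
```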

We provide a brief analysis of the computation complexity of the I-KDE. The classical unbiased cross validation (UCV) method [11] is used in this article to determine the optimal bandwidth. Its computation complexity is $O(n^{2}d)$, where $n$ is the number of samples and $d$ is the number of sample dimensions. Assume the learning system receives $K$ batches of data $D_1, D_2, \ldots, D_K$; the time complexity of the I-KDE is

$$T_{\text{I-KDE}} = O\!\left(d\sum_{k=1}^{K} n_k^{2}\right). \tag{15}$$

If we reestablish a brand-new p.d.f. for each batch of data by training the KDE on the combination of the current data and the newly coming data, the time complexity of the full retraining scheme is

$$T_{\text{retrain}} = O\!\left(d\sum_{k=1}^{K}\Bigl(\sum_{j=1}^{k} n_j\Bigr)^{2}\right), \tag{16}$$

where the $k$-th term is the time complexity of estimating the p.d.f. for the data $D_1 \cup D_2 \cup \cdots \cup D_k$. By comparing equation (15) with equation (16), we can see that the time complexity of the I-KDE is far less than that of the retraining scheme, i.e., $T_{\text{I-KDE}} \ll T_{\text{retrain}}$. In addition, comparing equation (15) with the cost $O\!\left(d\bigl(\sum_{k=1}^{K} n_k\bigr)^{2}\right)$ of estimating the p.d.f. on all the data at once (equation (17)), we can see that the I-KDE is able to deal with the p.d.f. estimation for large-scale data because $\sum_{k=1}^{K} n_k^{2} \le \bigl(\sum_{k=1}^{K} n_k\bigr)^{2}$. Figure 2 presents the comparison of time complexity between the I-KDE and the retraining scheme.
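As a worked example (illustrative numbers of our own choosing, not taken from the experiments): with $K = 10$ equally sized batches of $n_k = 1000$ samples in $d = 2$ dimensions,

$$T_{\text{I-KDE}} \propto d\sum_{k=1}^{10} n_k^{2} = 2\times 10^{7}, \qquad T_{\text{retrain}} \propto d\sum_{k=1}^{10}\Bigl(\sum_{j=1}^{k} n_j\Bigr)^{2} = 2\times 10^{6}\sum_{k=1}^{10} k^{2} = 7.7\times 10^{8},$$

so the retraining scheme already costs roughly 38 times more than the I-KDE after only ten batches, and the gap widens as batches keep arriving.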

Now, we give the theoretical analysis of the convergence of the I-KDE. First, a lemma regarding the consistency of the probability distribution function is given.

Lemma 1. Assume the probability distribution functions of the $d$-dimensional random variables $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_K$ are $F_1(\mathbf{x}), F_2(\mathbf{x}), \ldots, F_K(\mathbf{x})$, respectively. $D_k$ includes $n_k$ observations which obey the probability distribution function $F_k(\mathbf{x})$, $k = 1, 2, \ldots, K$. If $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_K$ are mutually independent, then

$$F(\mathbf{x}) = \sum_{k=1}^{K} \frac{n_k}{N}\,F_k(\mathbf{x}),$$

where $N = \sum_{k=1}^{K} n_k$, $\mathbf{X}$ denotes the random variable underlying the pooled observations $D_1 \cup D_2 \cup \cdots \cup D_K$, and its probability distribution function is $F(\mathbf{x})$.

Proof. For any $\mathbf{x} \in \mathbb{R}^{d}$, the number of observations in $D_k$ satisfying $\mathbf{X}_k \le \mathbf{x}$ is $n_k F_k(\mathbf{x})$. Thus, the number of observations in $D_1 \cup D_2 \cup \cdots \cup D_K$ satisfying $\mathbf{X} \le \mathbf{x}$ is $\sum_{k=1}^{K} n_k F_k(\mathbf{x})$. Then, the probability distribution function of the pooled observations is $F(\mathbf{x}) = \frac{1}{N}\sum_{k=1}^{K} n_k F_k(\mathbf{x})$. This concludes the proof.

Theorem 1 (convergence of I-KDE). Assume $D_k$ includes $n_k$ observations of the $d$-dimensional random variable $\mathbf{X}_k$ of which the p.d.f. is $f_k(\mathbf{x})$, $k = 1, 2, \ldots, K$, and $D_{K+1}$ includes $n_{K+1}$ observations of the $d$-dimensional random variable $\mathbf{X}_{K+1}$ of which the p.d.f. is $f_{K+1}(\mathbf{x})$. If $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{K+1}$ are mutually independent and there exists $\varepsilon > 0$ such that $\left|\hat{f}_{k}(\mathbf{x}) - f_k(\mathbf{x})\right| < \varepsilon$ for $k = 1, 2, \ldots, K+1$, then

$$\left|\hat{f}_{1:K+1}(\mathbf{x}) - f(\mathbf{x})\right| < \varepsilon,$$

where $\hat{f}_{k}(\mathbf{x})$ is the p.d.f. estimated with the KDE on the data $D_k$, $\hat{f}_{1:K+1}(\mathbf{x})$ is the p.d.f. estimated with the I-KDE on the data $D_1 \cup \cdots \cup D_{K+1}$, and $f(\mathbf{x}) = \sum_{k=1}^{K+1}\frac{n_k}{N} f_k(\mathbf{x})$ with $N = \sum_{k=1}^{K+1} n_k$ is the true p.d.f. of the pooled data.

Proof. Based on Lemma 1, the true p.d.f. of the pooled data $D_1 \cup \cdots \cup D_{K+1}$ is $f(\mathbf{x}) = \sum_{k=1}^{K+1}\frac{n_k}{N} f_k(\mathbf{x})$. Then,

$$\left|\hat{f}_{1:K+1}(\mathbf{x}) - f(\mathbf{x})\right| = \left|\sum_{k=1}^{K+1}\frac{n_k}{N}\bigl(\hat{f}_{k}(\mathbf{x}) - f_k(\mathbf{x})\bigr)\right| \le \sum_{k=1}^{K+1}\frac{n_k}{N}\left|\hat{f}_{k}(\mathbf{x}) - f_k(\mathbf{x})\right|.$$

For the given $\varepsilon > 0$, $\left|\hat{f}_{k}(\mathbf{x}) - f_k(\mathbf{x})\right| < \varepsilon$ holds for $k = 1, 2, \ldots, K+1$. Thus, we can derive $\left|\hat{f}_{1:K+1}(\mathbf{x}) - f(\mathbf{x})\right| < \sum_{k=1}^{K+1}\frac{n_k}{N}\varepsilon = \varepsilon$. This completes the proof.
Note. When $\hat{f}_{k}(\mathbf{x}) \to f_k(\mathbf{x})$ for $k = 1, 2, \ldots, K+1$, then $\hat{f}_{1:K+1}(\mathbf{x}) \to f(\mathbf{x})$. It means that $\hat{f}_{1:K+1}(\mathbf{x})$ can approximate $f(\mathbf{x})$ well if each $\hat{f}_{k}(\mathbf{x})$ can approximate $f_k(\mathbf{x})$ well.
Theorem 1 demonstrates the convergence of the I-KDE when each estimated p.d.f. $\hat{f}_{k}(\mathbf{x})$ converges to the corresponding true p.d.f. $f_k(\mathbf{x})$. In order to obtain an accurate p.d.f. estimation for $D_{K+1}$ in equation (14), we use a multivariate fixed-point iteration to design the bandwidth optimization method based on the UCV error criterion. The formulation [12] of the UCV error criterion for $n$ $d$-dimensional observations is

$$\mathrm{UCV}(\mathbf{h}) = \int_{\mathbb{R}^{d}} \hat{f}(\mathbf{x})^{2}\,\mathrm{d}\mathbf{x} - \frac{2}{n}\sum_{i=1}^{n}\hat{f}_{-i}(\mathbf{x}_i),$$

where $\hat{f}_{-i}$ is the leave-one-out KDE trained on all observations except $\mathbf{x}_i$. The optimal bandwidth for the estimated p.d.f. satisfies $\mathbf{h}^{*} = \arg\min_{\mathbf{h}} \mathrm{UCV}(\mathbf{h})$. We calculate the partial derivative of $\mathrm{UCV}(\mathbf{h})$ with respect to each $h_j$ and let

$$\frac{\partial\,\mathrm{UCV}(\mathbf{h})}{\partial h_j} = 0, \quad j = 1, 2, \ldots, d. \tag{31}$$

It is very difficult to calculate the analytic solution of $h_j$, $j = 1, 2, \ldots, d$, from equation (31). However, an iterative function with respect to $h_j$ can be derived by simplifying equation (31):

$$h_j = \varphi_j(h_1, h_2, \ldots, h_d), \quad j = 1, 2, \ldots, d.$$

Furthermore, a multivariate fixed-point iteration algorithm (Algorithm 1) can be designed based on the aforementioned iterative function to determine the optimal bandwidth $\mathbf{h}^{*}$.
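For reference, in the univariate Gaussian-kernel case the UCV criterion has a well-known closed form, which the Python sketch below evaluates. This is the textbook expression rather than the multivariate derivation above, and the function name is our own.

```python
import numpy as np

def ucv_1d(x, h):
    """Unbiased cross validation score for a univariate Gaussian KDE with bandwidth h."""
    x = np.asarray(x, dtype=float)
    n = x.size
    diff = (x[:, None] - x[None, :]) / h            # pairwise scaled differences
    # Integral of the squared KDE: two Gaussians convolve to one with variance 2*h^2
    integral_sq = np.exp(-0.25 * diff ** 2).sum() / (n ** 2 * h * 2.0 * np.sqrt(np.pi))
    # Leave-one-out term: exclude the n diagonal (i == j) pairs, whose value is exp(0) = 1
    off_diagonal = np.exp(-0.5 * diff ** 2).sum() - n
    loo = 2.0 * off_diagonal / (n * (n - 1) * h * np.sqrt(2.0 * np.pi))
    return integral_sq - loo
```

Minimizing `ucv_1d` over h (for instance on a grid of candidate bandwidths) gives the same optimal bandwidth that the fixed-point iteration of Algorithm 1 is designed to reach.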

4. Experimental Results and Analysis

In this section, we conduct three experiments to demonstrate the feasibility and effectiveness of our proposed I-KDE based on 10 univariate and 4 multivariate probability distributions of which the details are listed in Table 1 [19]. First, we validate the convergence of Algorithm 1. Second, we demonstrate the convergence of the I-KDE. Third, we test the p.d.f. estimation performance of the I-KDE.

(1)Input: the given data $D = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$; the initial bandwidth $\mathbf{h} = (h_1, h_2, \ldots, h_d)$; the stopping threshold $\varepsilon > 0$.
(2)Output: the optimal bandwidth $\mathbf{h}^{*}$.
(3)Repeat
(4)  For $j = 1, 2, \ldots, d$ do
(5)   $e_j = \left|\varphi_j(h_1, h_2, \ldots, h_d) - h_j\right|$;
(6)   $h_j \leftarrow \varphi_j(h_1, h_2, \ldots, h_d)$;
(7)  End For
(8)Until $\max_{1 \le j \le d} e_j < \varepsilon$
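A minimal Python rendering of Algorithm 1's control flow is sketched below. The per-coordinate update map `phi` is passed in as a callable because the closed-form iterative function derived from equation (31) is not reproduced here; the sketch only shows the coordinate-wise fixed-point loop and its stopping rule, with illustrative names of our own.

```python
import numpy as np

def fixed_point_bandwidth(data, phi, h0, eps=1e-6, max_iter=500):
    """Multivariate fixed-point iteration for the bandwidth vector.

    phi(data, h, j) must return the updated j-th bandwidth given the current vector h;
    it stands in for the iterative function derived from the UCV derivative.
    """
    h = np.asarray(h0, dtype=float).copy()
    for _ in range(max_iter):
        change = 0.0
        for j in range(h.size):          # coordinate-wise sweep over the d bandwidths
            h_new = phi(data, h, j)
            change = max(change, abs(h_new - h[j]))
            h[j] = h_new
        if change < eps:                 # stop when the largest coordinate change is tiny
            break
    return h
```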
4.1. Convergence of Algorithm 1

In this experiment, we check the convergence of Algorithm 1 on 2 univariate (beta and normal) and 2 multivariate (2-dimensional bimodal and quadrimodal normal) probability distributions. For each distribution, 1000 samples are randomly generated in the Matlab programming environment. The experimental results, presented in Figure 3, confirm that the designed fixed-point iteration algorithm can find the optimal bandwidths for the univariate and multivariate UCV error criteria.

For a univariate probability distribution, e.g., the beta distribution in Figure 3(a), we can see that the UCV error first decreases and then increases as the bandwidth grows. For either initial bandwidth value, 0.001 or 1, the designed fixed-point iteration algorithm finds the optimal bandwidth 0.048 for the I-KDE. For a multivariate probability distribution, e.g., the 2-dimensional quadrimodal normal distribution in Figure 3(d), the UCV error attains the minimum value −0.020, and Algorithm 1 determines the optimal bandwidth vector (0.451, 0.575). The learning curves of the bandwidth optimization demonstrate the convergence of Algorithm 1.

4.2. Convergence of I-KDE

The objective of this experiment is to check whether the estimate of the I-KDE, which is trained by means of data stream computation, converges to the estimate of the KDE, which is trained on the combination of the current data and the newly coming data. The initial bandwidths in Algorithm 1 are determined with the RoT method. The experimental results are presented in Figure 4.

For the univariate case, we use the 10 different probability distributions shown in Table 1. For each distribution, 5000 samples are randomly generated, and the parameter settings of the p.d.f.s follow Table 1. In Figure 4(a), we can see that the red and green curves almost coincide as the new data arrive; the two curves are very hard to distinguish.

A similar situation exists in the multivariate case, for which 5 different probability distributions are selected; the parameter settings of the p.d.f.s in Figure 4(b) also follow Table 1. We can see that the I-KDE and the KDE obtain similar probability distributions via different estimation schemes. The experimental results visually reflect that the I-KDE converges to the KDE in an incremental manner.

In addition, we use the Jensen–Shannon (JS) divergence to measure the similarity between the two p.d.f.s estimated with the KDE and the I-KDE, respectively. The JS divergence is a symmetrized variant of the Kullback–Leibler (KL) divergence [20]; the smaller the JS divergence, the more similar the two p.d.f.s. In Figure 4, we can see that the JS divergences are small, and this quantitative measure also reflects that the differences between the KDEs and the I-KDEs are small.
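The JS divergence between two estimated p.d.f.s evaluated on a common grid can be computed as in the sketch below; this follows the standard definition (the average of two KL terms against the mixture), and the discretization and function name are illustrative choices of ours.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two densities sampled on the same grid."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()                          # normalize to discrete probability vectors
    q /= q.sum()
    m = 0.5 * (p + q)                     # mixture distribution
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```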

We also test the convergence performance of the I-KDE on the testing data corresponding to the beta and 2-dimensional trimodal normal distributions, as shown in Figure 5. For each distribution, 2000 training samples are randomly generated and partitioned into 20 data blocks. We check the ratio of testing samples with the smaller absolute error to all testing samples. In Figure 5, we can see that the win ratios (i.e., the proportions of data points with smaller estimation errors) of the I-KDE and the KDE are almost the same as the number of data blocks increases, which indicates that the I-KDE has an estimation capability equivalent to that of the KDE. Figure 5 also compares the training time: the training time of the I-KDE is far less than that of the KDE.

Figure 6 shows the convergence tendency of the I-KDE on the 2-dimensional bimodal normal distribution. 2000 training samples are randomly generated and partitioned into 20 data blocks, and 20000 testing samples are selected with incremental steps of (0.01, 0.01) in the region [−9.9, 10] × [0.1, 10]. In Figure 6, the black points are the training samples and the red points are the testing samples estimated by the I-KDE with the smaller absolute error. We can see that the estimation performance of the I-KDE gradually converges to a stable state as the number of training samples increases.

4.3. Estimation Performance of I-KDE

Table 2 presents the comparison between the KDE and the I-KDE on the 10 univariate and 4 multivariate distributions, where the parameter settings of the beta, chi-squared, exponential, F, gamma, lognormal, Rayleigh, Student's T, Weibull, 2-dimensional bimodal normal, and 2-dimensional trimodal normal distributions are the same as in Figures 4(a), 5, and 6. The p.d.f. of the bimodal normal distribution and the parameters of the 2-dimensional quadrimodal normal distribution are given in Table 1, and the 2-dimensional standard normal distribution has zero mean vector and identity covariance matrix. In addition, the initial bandwidths in Algorithm 1 are determined with the RoT method. For each distribution, the mean absolute error (MAE) and the number of win samples are averaged over 10 independent training runs. We give a statistical analysis of the experimental results in Table 2 and confirm that the I-KDE is statistically equivalent to the KDE. Taking the exponential distribution as an example, we use the Wilcoxon signed-ranks test [21] to check whether the difference between the I-KDE and the KDE is significant. Table 3 lists the comparative results of the I-KDE and the KDE on 10 different training datasets of the exponential distribution.

Let $R^{+}$ be the sum of ranks of the training datasets on which the KDE outperforms the I-KDE and $R^{-}$ be the sum of ranks of the training datasets on which the I-KDE outperforms the KDE. Both rank sums can be calculated from Table 3.

We construct the statistic $z$, which is distributed approximately normally. The null hypothesis that the KDE and the I-KDE perform equally well can be rejected if $z$ is smaller than −1.96 at the 0.05 significance level. Because $z$ is larger than −1.96 here, we accept the null hypothesis, i.e., the difference between the KDE and the I-KDE is not significant. Similar results hold for the other distributions in Table 2. To sum up, the I-KDE obtains p.d.f. estimation performance equivalent to that of the KDE with far less training time. The experimental results reveal that it is feasible to use the I-KDE to deal with the p.d.f. estimation problem of large-scale data in the manner of data stream computation.
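For reference, the normal approximation of the Wilcoxon signed-ranks statistic for $N$ paired results takes the textbook form [21]

$$z = \frac{T - \frac{1}{4}N(N+1)}{\sqrt{\frac{1}{24}N(N+1)(2N+1)}}, \qquad T = \min(R^{+}, R^{-}),$$

so with $N = 10$ training datasets, $T$ has mean 27.5 and standard deviation about 9.81 under the null hypothesis.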

5. Conclusions and Future Works

In this article, we proposed an incremental kernel density estimator (I-KDE) to deal with the probability density function (p.d.f.) estimation problem in the manner of data stream computation. The I-KDE updates the current KDE dynamically and gradually with the newly coming data rather than retraining a brand-new KDE on the combination of the current data and the newly coming data. The theoretical analysis proved the convergence of the I-KDE provided that the estimated p.d.f. of the newly coming data converges to its true p.d.f. The experimental results on 10 univariate and 4 multivariate probability distributions demonstrated that the I-KDE obtains p.d.f. estimation performance equivalent to that of the KDE with far less training time, which indicates that the I-KDE can be used to deal with the p.d.f. estimation problem of large-scale data in the manner of data stream computation. In future work, we will combine the I-KDE with the random sample partition (RSP) model [22, 23] of big data and seek practical applications of the I-KDE, e.g., Bayesian classification, density-based clustering, and big data reduction [24].

Data Availability

The data used in our manuscript can be accessed by readers via our BaiduPan at https://pan.baidu.com/s/1wgj-gTzEZL51WTtl2RpCzA with the extraction code “kai5”.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors thank Dr. Salman Salloum for improving the language quality and Dr. Xiaoliang Zhang for help with the mathematical derivations. This study was supported by the National Key R&D Program of China (2017YFC0822604-2), the Scientific Research Foundation of Shenzhen University for Newly Introduced Teachers (2018060), and the National Training Program of Innovation and Entrepreneurship for Undergraduates (201910590017).