Abstract

To improve the utilization rate of high-dimensional data features, an ensemble learning method based on feature selection is developed for entity resolution. Entity resolution is regarded as a binary classification problem, and an optimization model is designed to maximize each base classifier's classification accuracy, maximize the dissimilarity between classifiers, and minimize the cardinality of selected features. A modified multiobjective ant colony optimization algorithm is employed to solve the model for each base classifier: two pheromone matrices are set up, the weighted product method is applied to aggregate the values of the two pheromone matrices, and each feature's Fisher discriminant rate over the records' similarity vectors is calculated as heuristic information. A solution, called a complementary subset, is selected from the Pareto archive according to the descending order of the three objectives and is used to train the given base classifier. After all base classifiers are trained, their classification outputs are aggregated by the max-wins voting method to obtain the ensemble classifier's final result. A simulation experiment is carried out on three classical datasets. The results show the effectiveness of our method, as well as better performance compared with two other methods.

1. Introduction

Entity resolution (ER) aims to find ambiguous denotations that refer to the same real-world entity. ER has been studied for a long time and is a crucial stage in data cleaning. It is also called record linkage in statistics, disambiguation in information retrieval [1], and data matching or coreference disambiguation in computer science, among other names [2].

In the big data era, ER research for big data has become a hot topic [3–5]. Big data has new characteristics such as large volume, high velocity, and high dimensionality, and high dimensionality is one of the most important of them [6]; it brings great challenges for current ER methods. There are two existing ways to handle high-dimensional data. One is to adopt parallel technologies, such as crowdsourcing and MapReduce, to reduce the time for computing the similarity vector of each candidate pair over all features of the two records so as to identify whether the two records are matches (similar) or nonmatches (dissimilar). The other is to use feature selection to reduce the dimensionality and then calculate similarity vectors for ER.

Crowdsourcing is a new ER approach [7] which distributes candidate records to human workers to identify matching records [8]. Abboura et al. [9] used crowdsourcing to find matching records in training data and created matching dependencies with the Apriori algorithm to identify matching records in testing data. Zhang et al. [10] proposed the CrowdLink model to reduce human workers' difficulty in identifying similar records while tolerating human mistakes. The existing problems of crowdsourcing methods are the instability of human workers and the dependency on how the ER questions are posed. Besides, human judgment cannot guarantee correct answers, so algorithms are needed to handle such errors.

Priya et al. [11] adopted the Hadoop framework to design an ER system for stream data; they used thirteen similarity functions to measure the similarity of records and generated matching rules from matching records based on the average of the thirteen similarity functions. To solve the ER problem in high-dimensional data in a parallel way, Fries et al. [12] proposed a method called parallel similarity self-join based on MapReduce, which simply applies the MapReduce framework to reduce computing time. Besides, there are other ER models based on MapReduce, such as HadoopDB [13], Hadoop++ [14], and PACT [15]. Though parallel technologies effectively reduce computing time, obtaining similarity vectors of records from all features remains inefficient. Moreover, high-dimensional data is likely to contain irrelevant features which may obscure the effect of the relevant ones [16], so using all features, possibly including noise, does little to improve an algorithm's ER performance.

To overcome the shortage of human participation for ER in high-dimensional data, Cheng et al. [17] selected the key features from an entity's descriptive features and sorted and recombined them to improve their readability and discriminative ability. To cope with the effect of incomplete and incorrect publication information on name disambiguation, Song et al. [18] employed a Named Entity Recognition model to choose organization features together with publication features to improve ER performance. Gueereiro et al. [19] developed a name disambiguation framework and applied five different types of features to implement the ER process. Treerapituk et al. [20] adopted a random forest method based on binary classifiers to select features, and experimental results showed that using only some key features could improve ER accuracy. Current feature selection methods cannot make full use of the rich information of high-dimensional data because the final number of selected features is usually no more than fifteen [21]. Besides, different classification results may be found in different subspaces, so global filtering of features is not sufficient (the local feature relevance problem) [16]. So far, very few works consider the high-dimensional characteristic of big data when addressing the ER problem [22].

To overcome the difficulties of current ER methods for high-dimensional data, we propose an ensemble learning method based on feature selection in this paper. We regard the ER process as a binary classification problem, that is, classifying a record pair as matches (similar) or nonmatches (dissimilar). We then define measures of classification performance and of the similarity between binary classifiers, which employ SVM as the base classifier. Three objectives are applied to optimize each base classifier's performance: maximizing the classifier's classification accuracy rate, maximizing the dissimilarity between classifiers, and minimizing the cardinality of features. A modified multiobjective ant colony optimization (MOACO) algorithm is designed to solve the optimization model and select a complementary feature subset, which is adopted to train the base classifier. In the end, several binary base classifiers are combined by an ensemble learning method to improve ER performance in high-dimensional data.

The paper is organized as follows: Section 2 describes the ER concept and process; Section 3 defines the binary classifier's classification performance and similarity measures; Section 4 presents our method's components and how it works; Section 5 reports an experiment evaluating our method against two other methods; Section 6 closes with conclusions and discussion.

2. ER’s Processing Description

According to the machine learning technologies involved, ER methods can be divided into four categories: methods based on probability, methods based on rules, methods based on clustering, and methods based on classification [23, 24].

In this paper, we regard ER as a binary classification problem whose results contain two classes: the matches class and the nonmatches class. First, a candidate record pair is represented by a similarity vector obtained by computing the similarity of their corresponding features. Second, the similarity vector is used as the input of a binary classifier to identify whether the pair is a match or a nonmatch. The process is illustrated in Figure 1.

Without loss of generality, we only discuss records in one table. Suppose a record has $n$ features, and the set of features is denoted as $F = \{f_1, f_2, \ldots, f_n\}$. The value of the $k$th feature of the $i$th record is denoted as $r_i^k$, $k = 1, 2, \ldots, n$; then the $i$th record can be denoted as $r_i = (r_i^1, r_i^2, \ldots, r_i^n)$. The $k$th feature's similarity value between the $i$th and $j$th records is denoted as $s_{ij}^k$; then we obtain the similarity vector of the $i$th and $j$th records, $s_{ij} = (s_{ij}^1, s_{ij}^2, \ldots, s_{ij}^n)$. Finally, $s_{ij}$ is input to the binary classifier, which identifies the $i$th and $j$th records as matches or nonmatches.
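To make this construction concrete, the following Python sketch (our illustration, not part of the original formulation) builds the similarity vector of a candidate record pair from numerically encoded features; here the per-feature similarity is the absolute difference used later in Section 5.1, but any per-feature similarity function could be substituted.

import numpy as np

def similarity_vector(record_i, record_j):
    # Per-feature similarity of two records with n numeric features,
    # taken here as the absolute difference of normalized feature values.
    r_i = np.asarray(record_i, dtype=float)
    r_j = np.asarray(record_j, dtype=float)
    return np.abs(r_i - r_j)

# Example: two records with three features each; the resulting vector
# s_ij is what is fed to the binary classifier.
s_ij = similarity_vector([0.2, 0.9, 0.5], [0.25, 0.1, 0.5])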

3. Measures of Classification Performance and Classifier’s Similarity

3.1. Measures of Binary Classifier’s Classification Performance

As discussed above, we regard ER as a binary classification problem in this paper, so how to measure classifier’s performance is the key to improving its classification effectiveness. In this section, we define some indicators to measure binary classifier’s performance.

Generally speaking, a binary classifier can be seen as a many-to-one mapping function from the sample space to the class space; that is, it maps the classes in the sample space to two classes in the class space. As Figure 2 shows, the binary classifier may map two or more classes in the sample space to the same class in the class space. To simplify the statement, we regard matches as the positive class and nonmatches as the negative class.

Now we define the measures of a binary classifier for high-dimensional ER.

Classification accuracy rate $P_a$, the proportion of samples that are classified correctly:
$$P_a = \frac{\text{number of correctly classified samples}}{\text{total number of samples}}. \quad (1)$$

False alarm rate $P_f$, the proportion of nonmatches that are classified as matches:
$$P_f = \frac{\text{number of nonmatches classified as matches}}{\text{number of nonmatches}}. \quad (2)$$

Missed match rate $P_m$ (i.e., the rate at which true matches are not recognized), the proportion of matches that are classified as nonmatches:
$$P_m = \frac{\text{number of matches classified as nonmatches}}{\text{number of matches}}. \quad (3)$$

The binary classifier's output distribution matrix is
$$D = \begin{pmatrix} d_{11} & d_{12} \\ d_{21} & d_{22} \end{pmatrix}, \quad (4)$$
where
$$d_{uv} = \text{number of samples of class } u \text{ classified into class } v, \qquad u, v \in \{1, 2\}. \quad (5)$$

Then $P_a^{(u)}$ is the classification accuracy rate of class $u$, and it can be calculated by
$$P_a^{(u)} = \frac{d_{uu}}{d_{u1} + d_{u2}}. \quad (6)$$

Then the classification accuracy rate can be given by
$$P_a = \sum_{u=1}^{2} p_u P_a^{(u)}, \quad (7)$$
where $p_u$ is the samples' prior probability in class $u$. Given a testing set of samples, $p_u$ can be expressed as
$$p_u = \frac{N_u}{N_1 + N_2}, \quad (8)$$
where $N_1$ is the number of samples in class 1 and $N_2$ is the number of samples in class 2.

So the false alarm rate of the binary classifier can be written as (1 denotes matches; 2 denotes nonmatches)
$$P_f = \frac{d_{21}}{d_{21} + d_{22}}. \quad (9)$$

The missed match rate of the binary classifier can be expressed as (1 denotes matches; 2 denotes nonmatches)
$$P_m = \frac{d_{12}}{d_{11} + d_{12}}. \quad (10)$$

Then $P_a$, $P_f$, and $P_m$ are related by
$$P_a = p_1 (1 - P_m) + p_2 (1 - P_f). \quad (11)$$

Based on (11) and the definitions of $P_f$ and $P_m$, we can see that they conflict with each other: lowering $P_f$ tends to raise $P_m$, and vice versa. The classification accuracy rate reflects both $P_f$ and $P_m$ effectively, so we adopt the classification accuracy rate to measure a binary classifier's classification performance.
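As a concrete illustration of these measures (a minimal sketch under the notation reconstructed above, with class 1 = matches and class 2 = nonmatches; the function name is ours), the following Python snippet computes $P_a$, $P_f$, and $P_m$ from an output distribution matrix and checks relation (11) numerically.

import numpy as np

def performance_measures(D):
    # D[u, v] = number of class-(u+1) samples classified as class (v+1);
    # row 0 corresponds to matches, row 1 to nonmatches.
    D = np.asarray(D, dtype=float)
    n_match, n_nonmatch = D[0].sum(), D[1].sum()
    p_a = (D[0, 0] + D[1, 1]) / (n_match + n_nonmatch)  # accuracy rate
    p_f = D[1, 0] / n_nonmatch                          # false alarm rate
    p_m = D[0, 1] / n_match                             # missed match rate
    return p_a, p_f, p_m

D = np.array([[40, 10], [5, 145]])
p_a, p_f, p_m = performance_measures(D)
p1, p2 = 50 / 200, 150 / 200
assert abs(p_a - (p1 * (1 - p_m) + p2 * (1 - p_f))) < 1e-12  # relation (11)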

3.2. Measures of Binary Classifier’s Similarity

When many classifiers exist, classifiers that produce similar outputs will give similar results on the same ambiguous data, so combining them cannot improve classification performance. On the other hand, when there are differences between classifiers' results, their ensemble can improve classification performance to a certain extent, as illustrated in Figure 3.

In Figure 3(a), classifiers A and B have similar classification results, so their ensemble cannot identify the four shaded negative samples correctly. In Figure 3(b), classifiers A and B differ in their outputs, so their ensemble fails only on the two shaded positive samples, which leads to an improvement in the classification accuracy rate. In this case, we say that their classification results are complementary to each other.

Given a set of samples $S$ and a feature subset $F_{\mathrm{sub}} \subseteq F$ (denoting the corresponding feature vectors of the samples), if we use the feature vectors of the training samples to train a binary classifier and the feature vectors of the testing samples to test it, then we map $F_{\mathrm{sub}}$ into a fixed binary classifier $C$ and an output distribution matrix $D$:
$$F_{\mathrm{sub}} \;\longrightarrow\; (C, D). \quad (12)$$

So classifiers with complementary results can be obtained by training them on complementary feature subsets. It is also clear that the similarity of two binary classifiers can be measured either by the similarity of their feature subsets $F_{\mathrm{sub}}$ or by the similarity of their output distribution matrices $D$, which we call input similarity and output similarity, respectively.
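A minimal sketch of this mapping, assuming scikit-learn is available and using an SVM base classifier as in Section 5 (the function name and parameter defaults are ours):

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

def subset_to_classifier(X_train, y_train, X_test, y_test, feature_subset):
    # Map a feature subset to a trained binary classifier and its output
    # distribution matrix D (rows: true class, columns: predicted class).
    idx = list(feature_subset)
    clf = SVC(kernel="rbf")
    clf.fit(X_train[:, idx], y_train)
    y_pred = clf.predict(X_test[:, idx])
    D = confusion_matrix(y_test, y_pred)
    return clf, D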

Next we define the measures of similarity between classifiers (similar to diversity) [25].

Definition 1. Two binary classifiers' input similarity is defined as the similarity of their input feature subsets. For the feature subsets $F_A$ and $F_B$ of two binary classifiers (both nonempty), we adopt the Tanimoto coefficient to measure their similarity [26]:
$$T(F_A, F_B) = \frac{|F_A \cap F_B|}{|F_A \cup F_B|}. \quad (13)$$
From (13), we know that $0 \le T(F_A, F_B) \le 1$. When $T(F_A, F_B) = 0$, there is no common element between the two subsets. When $T(F_A, F_B) = 1$, the two subsets are identical and the classifiers trained on the corresponding feature vectors of the training samples are also identical. So the bigger $T(F_A, F_B)$, the higher the similarity of the two feature subsets and of the two classifiers' inputs.

Definition 2. Two binary classifiers' output similarity is defined as the similarity of their output distribution matrices. Given two binary classifiers' output distribution matrices $D_A = (d_{uv}^A)$ and $D_B = (d_{uv}^B)$, where $u, v \in \{1, 2\}$, we use the normalized Pearson correlation coefficient to measure their similarity:
$$R(D_A, D_B) = \frac{\sum_{u}\sum_{v}\bigl(d_{uv}^A - \bar{D}_A\bigr)\bigl(d_{uv}^B - \bar{D}_B\bigr)}{\sqrt{\sum_{u}\sum_{v}\bigl(d_{uv}^A - \bar{D}_A\bigr)^2}\,\sqrt{\sum_{u}\sum_{v}\bigl(d_{uv}^B - \bar{D}_B\bigr)^2}}, \quad (14)$$
where $\bar{D}_A$ and $\bar{D}_B$ are the averages of the entries of matrices $D_A$ and $D_B$, respectively:
$$\bar{D}_A = \frac{1}{4}\sum_{u}\sum_{v} d_{uv}^A, \qquad \bar{D}_B = \frac{1}{4}\sum_{u}\sum_{v} d_{uv}^B, \quad (15)$$
so that $-1 \le R(D_A, D_B) \le 1$. When $R(D_A, D_B) = 1$, the two output distribution matrices are fully positively correlated and the classification results of the corresponding classifiers are identical. When $R(D_A, D_B) = -1$, the two output distribution matrices are fully negatively correlated and the classification results of the corresponding classifiers are completely different from each other.
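The two similarity measures in Definitions 1 and 2 can be written compactly as follows (a sketch assuming the reconstructed formulas (13)–(15); function names are ours):

import numpy as np

def input_similarity(subset_a, subset_b):
    # Tanimoto similarity of two nonempty feature subsets, Eq. (13).
    a, b = set(subset_a), set(subset_b)
    return len(a & b) / len(a | b)

def output_similarity(D_a, D_b):
    # Pearson correlation of two output distribution matrices, Eqs. (14)-(15).
    a = np.asarray(D_a, dtype=float).ravel()
    b = np.asarray(D_b, dtype=float).ravel()
    a_c, b_c = a - a.mean(), b - b.mean()
    return float(a_c @ b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c))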

Theorem 3. If two binary classifiers use the same training samples, the same testing samples, and the same learning algorithm, and $T(F_A, F_B) = 1$, then one has $R(D_A, D_B) = 1$.

Proof. Suppose $R(D_A, D_B) \ne 1$; then $D_A \ne D_B$. However, according to (12) and the precondition, $T(F_A, F_B) = 1$ implies $F_A = F_B$, so the two trained classifiers are identical and hence $D_A = D_B$; that is to say, if $T(F_A, F_B) = 1$, then $R(D_A, D_B) = 1$. So the hypothesis is wrong and the theorem is correct.

From the above analysis, we know that the output similarity of classifiers is a stronger notion than their input similarity, so we adopt the output similarity to measure the similarity of classifiers. It is also clear that the more dissimilar the classifiers trained on complementary subsets are, the more complementary they are, which can improve the utilization rate of high-dimensional data features.

4. Ensemble Classifiers Based on Feature Selection

No single classification algorithm can solve all kinds of problems, according to the "no free lunch" theorems [27]; each classifier typically has a different domain of competence under different problems and conditions. However, if we have a pool of different classifiers and adopt an ensemble learning method to combine them, which is an algorithm-independent way to improve classification performance, the classification accuracy can be improved efficiently [28]. So we use ensemble learning to combine several binary classifiers to improve ER performance.

4.1. Model of Ensemble Classifiers

For high-dimensional ER, the proposed feature-selection-based model used to train the binary classifiers can be described as follows. Suppose the ensemble contains $L$ ($L$ is an odd number) binary classifiers, $P_a^i$ denotes the classification accuracy rate of the $i$th classifier, and $|F_i|$ represents the cardinality of the $i$th classifier's input feature subset. The input features of the $i$th classifier are selected according to the following optimization objectives:
$$\max \; P_a^i, \quad (16)$$
$$\max \; d_i, \quad (17)$$
$$\min \; |F_i|, \quad (18)$$
where $d_i$ denotes the dissimilarity between the $i$th classifier and the already trained classifiers, derived from the output similarity $R$ defined in Section 3.2.

Equation (16) maximizes the classification accuracy rate of the $i$th classifier, which ensures that the classifier is not a poor one for ER. Equation (17) maximizes the dissimilarity between the $i$th classifier and the other classifiers, which means that the selected features constitute a complementary feature subset. Equation (18) minimizes the number of selected features, which leads to better classification efficiency. Since the three objectives conflict with one another, they constitute a multiobjective optimization problem.

The decision function adopted by our ensemble classifier is the "max-wins" voting method. Suppose $y_i^j$ denotes the output of the $j$th binary classifier for the $i$th sample, taking the value 1 if the sample is classified as a match and 0 if it is classified as a nonmatch.

If the ensemble contains $L$ ($L$ is an odd number) binary classifiers, then the final output for the $i$th sample is the class that receives the majority of the $L$ votes; that is to say, we choose the majority classification result as the final output.
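A minimal sketch of this max-wins decision (assuming outputs are encoded as 1 for matches and 0 for nonmatches; the encoding and function name are ours):

import numpy as np

def max_wins_vote(outputs):
    # outputs: array of shape (L, n_samples), entries 1 (match) or 0 (nonmatch),
    # with L odd.  Returns the majority decision for each sample.
    outputs = np.asarray(outputs)
    votes_for_match = outputs.sum(axis=0)
    return (votes_for_match > outputs.shape[0] / 2).astype(int)

# Example with L = 3 classifiers and 4 samples: yields [1, 1, 1, 0].
decisions = max_wins_vote([[1, 0, 1, 0],
                           [1, 1, 0, 0],
                           [0, 1, 1, 0]])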

4.2. MOACO for Solving Ensemble Classifiers’ Model

A multiobjective optimization problem contains two or more objectives which have no order of priority. It can be stated as follows [29]:
$$\text{optimize} \quad F(x) = \bigl(f_1(x), f_2(x), \ldots, f_m(x)\bigr), \qquad x \in X, \quad (21)$$
where the decision vector $x$ belongs to the nonempty decision space $X$, the objective function vector $F$ consists of $m$ ($m \ge 2$) objectives, and $Y = F(X)$ is the objective space.

The solutions of a multiobjective optimization problem are called Pareto optimal solutions; they cannot be further improved on any objective without harming at least one of the remaining objectives. The decision maker then chooses one solution according to his or her preference. The goal of multiobjective optimization is to approximate the Pareto front, which is composed of the Pareto optimal solutions in the objective space [30].

MOACO is an excellent multiobjective evolutionary algorithm based on the foraging behavior of real ant species. Real ants in a colony communicate indirectly through pheromone trails laid on the ground, which guide other ants to the shortest path between the food source and the nest; this is called a positive information feedback mechanism. As MOACO has this mechanism inherently and adopts the reactive search optimization principle, that is, "learning while optimizing" [31], it searches for Pareto optimal solutions better than other multiobjective evolutionary algorithms, especially for multiobjective combinatorial optimization problems [32]. Besides, it has other advantages, such as robustness, distributed computation, and easy combination with other heuristics, which make it easy to generalize.

As our model is a feature selection problem, which is a typical combinatorial optimization problem, we adopt a modified MOACO to solve it.

In order to solve the model by MOACO, we analyze it as follows:

(i) For a given binary classifier, computing efficiency and classification accuracy are usually better when the number of input features is between 5 and 10. However, in high-dimensional data we need more features to train the classifier and improve its performance, so the number of input features $m$ is allowed to range from 1 to 20 rather than being restricted from below.

(ii) Transform (16) and (17) into two objectives $f_1$ and $f_2$ which must be optimized under a fixed cardinality of features (since MOACO must first fix the number of selected features). The problem solved by MOACO is thus converted into the maximization problem
$$\max \; \bigl(f_1(F_{\mathrm{sub}}),\, f_2(F_{\mathrm{sub}})\bigr) \quad \text{subject to} \quad |F_{\mathrm{sub}}| = m, \quad (22)$$
where $f_1$ is the classification accuracy rate of (16) and $f_2$ is the dissimilarity of (17). For a given binary classifier, we set up an archive which records its Pareto optimal solutions, and this archive is maintained until the cardinality of features reaches the upper bound. MOACO is adopted to search for Pareto optimal solutions (feature subsets) under a fixed feature number $m$, and it maintains an archive of its own. After each cycle, the archive of MOACO is updated with the solutions found by all ants. There are two situations for updating the archive of MOACO. When solving the first classifier, objective $f_2$ does not exist, so solutions are compared by their values of objective $f_1$ and the solution with the highest $f_1$ is recorded. When solving the $i$th classifier ($i > 1$), the Pareto relations between solutions are determined by their values of objectives $f_1$ and $f_2$, and these relations are applied to update the archive of MOACO (a solution replaces another if none of its objective values is smaller than the corresponding value of the other and at least one is larger); a sketch of this dominance-based update appears after the selection rules below. When the iteration count reaches its upper bound, MOACO finishes and the solutions in its archive are used to update the current binary classifier's archive in the same way. After that, the cardinality of features increases by one and a new MOACO run starts.

(iii) When MOACO has finished for a given binary classifier under all values of $m$, we need to choose one solution (feature subset) from the current classifier's archive as its input feature subset. We select the final solution according to the descending order of priority of the three objectives, as follows.

(a) Compare the values of all solutions on objective $f_1$ and choose the solution with the maximum value of $f_1$ as the current classifier's solution.

(b) If several solutions share the same maximum value of $f_1$, compare their values on objective $f_2$ and choose the solution with the maximum value of $f_2$.

(c) If several solutions share the same maximum values of both $f_1$ and $f_2$, choose the solution with the smallest cardinality of features.
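The following Python sketch illustrates both the dominance-based archive update from (ii) and the priority-based selection from (iii) (data layout and function names are ours; each archived solution is stored as a pair of its objective values and its feature subset):

def dominates(u, v):
    # u dominates v if u is no worse on every objective and better on at least one.
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def update_archive(archive, candidates):
    # Keep only the nondominated solutions; a solution is ((f1, f2), subset).
    pool = archive + candidates
    return [s for s in pool
            if not any(dominates(t[0], s[0]) for t in pool if t is not s)]

def choose_final_solution(archive):
    # Descending priority: highest f1, then highest f2, then fewest features.
    return max(archive, key=lambda s: (s[0][0], s[0][1], -len(s[1])))

# Example: the last two solutions tie on f1 and f2, so the smaller subset wins.
archive = [((0.91, 0.62), {3, 17, 42}),
           ((0.93, 0.55), {5, 8, 21, 33}),
           ((0.93, 0.55), {5, 8, 21})]
best = choose_final_solution(update_archive(archive, []))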

4.3. Components of MOACO

Based on the discussion of Section 4.2, the key stage is to solve the maximization problem (22), which is a classical multiobjective subset problem.

Cao et al. [33] proposed a graph-based ant system for solving subset problems. They defined a construction graph and equivalent routes and proposed a new pheromone updating policy based on strengthening the pheromone on equivalent routes. The effectiveness and superiority of the policy were illustrated on the multidimensional knapsack problem. However, it was designed for single-objective optimization, so we generalize it here to the multiobjective subset problem (22).

Lopez-Ibanez and Stutzle [34] compared several state-of-the-art MOACO algorithms and concluded that using more than one pheromone matrix improves solution quality. So we adopt two pheromone matrices in MOACO, one per objective.

In MOACO, each ant selects a route (feature) according to transition probabilities. In the case of the traveling salesman problem, the probability that ant $k$ located at node $i$ chooses to visit node $j$ is given by
$$p_{ij}^{k} = \frac{[\tau_{ij}]^{\alpha}\,[\eta_{ij}]^{\beta}}{\sum_{l \in N_i^k} [\tau_{il}]^{\alpha}\,[\eta_{il}]^{\beta}}, \qquad j \in N_i^k, \quad (23)$$
where $\tau_{ij}$ is the pheromone value of edge $(i, j)$ and $\eta_{ij}$ is the heuristic information, a static greedy measure of the "goodness" of edge $(i, j)$. $N_i^k$ denotes the set of feasible choices available to ant $k$ located at node $i$ given its current partial solution. $\alpha$ and $\beta$ are algorithm parameters representing the importance of the pheromone and of the heuristic information, respectively. When there is more than one pheromone matrix, their values need to be aggregated into a single pheromone value to calculate the transition probabilities. In this paper, we adopt the weighted product method to aggregate the values of the two pheromone matrices [34]:
$$\tau_{ij} = \bigl(\tau_{ij}^{1}\bigr)^{\lambda}\,\bigl(\tau_{ij}^{2}\bigr)^{1-\lambda}, \quad (24)$$
where $\lambda \in [0, 1]$ is a weight which biases the aggregation towards one objective or the other; following [34], $\lambda$ is varied as the iterations proceed according to a schedule (25) that is controlled by an algorithm parameter.
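A minimal sketch of the aggregated feature-selection step (Eqs. (23)–(24)); for simplicity it keeps one pheromone and one heuristic value per candidate feature rather than per edge of the construction graph, and the default values of alpha and beta are assumptions of ours:

import numpy as np

def transition_probabilities(tau1, tau2, eta, lam, alpha=1.0, beta=2.0):
    # tau1, tau2: pheromone values of the two matrices (one per objective),
    # eta: heuristic values (Fisher discriminant rates), lam: weight in [0, 1];
    # all arrays are restricted to the currently feasible features.
    tau = (tau1 ** lam) * (tau2 ** (1.0 - lam))   # weighted product, Eq. (24)
    weights = (tau ** alpha) * (eta ** beta)      # numerator of Eq. (23)
    return weights / weights.sum()

# Example: sample the next feature among three feasible candidates.
rng = np.random.default_rng(0)
p = transition_probabilities(np.array([0.5, 1.2, 0.8]),
                             np.array([1.0, 0.4, 0.9]),
                             np.array([0.3, 0.7, 0.2]), lam=0.5)
next_feature = rng.choice(3, p=p)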

Besides, based on our model's characteristics, the heuristic information of MOACO is defined as the Fisher discriminant rate of the $k$th feature of the records' similarity vectors:
$$\eta_k = \frac{\bigl(\mu_k^1 - \mu_k^2\bigr)^2}{\bigl(\sigma_k^1\bigr)^2 + \bigl(\sigma_k^2\bigr)^2}, \quad (26)$$
where $\mu_k^1$ and $\mu_k^2$ are the averages of the $k$th feature of the similarity vectors in the matching and nonmatching classes, respectively, and $(\sigma_k^1)^2$ and $(\sigma_k^2)^2$ are the corresponding variances. Equation (26) shows that MOACO preferentially selects features with a high Fisher discriminant rate, that is, features on which the two classes are easily separated.
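The heuristic values can be precomputed once per dataset, as in the following sketch (assuming the reconstructed form of (26); the small constant in the denominator is ours and only avoids division by zero):

import numpy as np

def fisher_discriminant_rates(S, labels):
    # S: similarity vectors of shape (n_pairs, n_features);
    # labels: 1 for matching pairs, 0 for nonmatching pairs.
    S = np.asarray(S, dtype=float)
    labels = np.asarray(labels)
    match, nonmatch = S[labels == 1], S[labels == 0]
    mu1, mu2 = match.mean(axis=0), nonmatch.mean(axis=0)
    var1, var2 = match.var(axis=0), nonmatch.var(axis=0)
    return (mu1 - mu2) ** 2 / (var1 + var2 + 1e-12)   # Eq. (26)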

After one cycle, MOACO uses the solutions found by all ants to update its Pareto archive based on their Pareto relations. Then the equivalent routes policy [33] and the solutions in the Pareto archive are used to update the two pheromone matrices ($o = 1, 2$):
$$\tau_{ij}^{o}(t+1) = (1 - \rho)\,\tau_{ij}^{o}(t) + \sum_{p:\,(i,j) \in s_p(t)} Q \cdot f_o\bigl(s_p(t)\bigr), \quad (27)$$
where $\rho$ is the evaporation rate of the pheromone and $Q$ is a constant parameter used to fine-tune the pheromone increment. $s_p(t)$ represents the $p$th Pareto solution in the archive after $t$ iterations, and $f_o(s_p(t))$ denotes the evaluation value of $s_p(t)$ on the corresponding objective.
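A per-feature sketch of this update (one call per pheromone matrix, following the reconstructed form of (27); the per-feature simplification, parameter defaults, and function name are ours):

import numpy as np

def update_pheromone(tau, archive, objective_index, rho=0.1, Q=1.0):
    # tau: per-feature pheromone vector for one objective;
    # archive: list of ((f1, f2), feature_subset) Pareto solutions.
    tau = (1.0 - rho) * np.asarray(tau, dtype=float)      # evaporation
    for values, subset in archive:
        for f in subset:                                   # reinforce used features
            tau[f] += Q * values[objective_index]
    return tau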

In summary, the proposed modified MOACO can be described as Algorithm 1.

Begin
 Initialize parameters, pheromone matrices, and Pareto archive
While not stopping criteria met do
 Generate weight parameter λ by Eq. (25)
 Aggregate the values of the two pheromone matrices with λ by Eq. (24)
For each ant do
 Construct solution by Eq. (23)
End for
 Update Pareto archive
 Update pheromone matrices by solutions of Pareto archive and Eq. (27)
End while
End

We now analyze the time complexity of MOACO in Algorithm 1. For a given value of $m$, one iteration consists of constructing solutions by all ants, updating the Pareto archive, and updating the two pheromone matrices; the cost of solution construction grows with the number of ants, the cardinality $m$, and the number of candidate features, while the cost of archive updating grows with the number of ants and the archive size. The overall time complexity of MOACO is therefore the initialization cost plus NC times the per-iteration cost, where NC is the iteration number of MOACO.

4.4. Pseudocode and Complexity Analysis of the Proposed Method

Based on the above discussion, the pseudocode of our method is shown in Algorithm 2.

Begin
 Initialize parameters and each base binary classifier's Pareto archive
For each base binary classifier do
 For each value of m (cardinality of features) do
  Search for the Pareto optimal solutions (feature subsets) by Algorithm 1 under the current value of m
  Update the current base binary classifier's archive based on analysis (ii) in Section 4.2
 End for
 Choose a solution (feature subset) as the current base binary classifier's input based on analysis (iii) in Section 4.2
 Train the current base binary classifier with the chosen feature subset
End for
 Apply the max-wins voting method to aggregate the classifiers' outputs
End

Now we analyze the time complexity of Algorithm 2. Initializing the parameters and archives takes constant time per classifier. For each classifier and each fixed $m$, the dominant cost is one full run of MOACO (Algorithm 1); updating the classifier's Pareto archive and choosing its final solution are linear in the number of solutions in the classifier's archive. The overall time complexity of our method is therefore the cost of one MOACO run multiplied by the number of values of $m$ and by the number of base binary classifiers.

5. Experiment Settings and Discussions

5.1. Data and Preprocessing

We applied three datasets (Colon, GLIOMA, and GLI_85), which come from a well-known website (available at http://featureselection.asu.edu/datasets.php), to evaluate our method. The characteristics of the three datasets are shown in Table 1.

To make the three datasets suitable for ER, we must translate them into datasets composed of similarity feature vectors. For each dataset, we first normalized its features to the range between zero and one. Then we took two samples from the same class as a matching record pair and two samples from different classes as a nonmatching record pair. We used the absolute difference between two records' corresponding features as the features' similarities, which constitute the records' similarity feature vector. Besides, since there are more nonmatching records than matching records in real-world problems, we adopted a uniform resampling method to keep the number of matches smaller than the number of nonmatches. After the above preprocessing we obtain three new datasets for our experiment, called Colon_ER, GLIOMA_ER, and GLI_85_ER. Their characteristics are shown in Table 2.
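The preprocessing can be sketched as follows (an illustration only: the pair enumeration and the resampling ratio below are assumptions of ours rather than the exact settings used in the experiment):

import numpy as np
from itertools import combinations

def build_er_dataset(X, y, rng, nonmatch_ratio=3):
    # Min-max normalize features, turn sample pairs into similarity vectors
    # (absolute feature differences), and label same-class pairs as matches (1)
    # and different-class pairs as nonmatches (0); nonmatches are resampled so
    # that matches remain the minority class.
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    vectors, labels = [], []
    for i, j in combinations(range(len(X)), 2):
        vectors.append(np.abs(X[i] - X[j]))
        labels.append(1 if y[i] == y[j] else 0)
    vectors, labels = np.array(vectors), np.array(labels)
    match_idx = np.flatnonzero(labels == 1)
    nonmatch_idx = np.flatnonzero(labels == 0)
    keep = rng.choice(nonmatch_idx,
                      size=min(len(nonmatch_idx), nonmatch_ratio * len(match_idx)),
                      replace=False)
    idx = np.concatenate([match_idx, keep])
    return vectors[idx], labels[idx]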

5.2. Experiment Settings

In this section we compare our proposed method with two other methods. The first adopts a single SVM for ER and calculates the records' similarity vectors from all features; this model is called method 1. Our model is called method 2. Cao et al. [33] proposed a model for ER based on feature selection which takes the classification accuracy rate, recall rate, and cardinality of features as optimization objectives, and their experimental results showed good ER performance; we refer to this model as method 3. We used fivefold cross validation in each test and adopted the average classification precision, recall rate, and F measure as evaluation measures.

Method 1 applied an SVM as the classifier, with an RBF kernel function.

The parameters of method 2 were set as follows: the base binary classifier was an SVM whose parameters were set as in method 1; the number of base binary classifiers was five; the cardinality of features ranged from 1 to 20; MOACO used two pheromone matrices; and the initial pheromone values, the importance factors of the pheromone and heuristic values, the evaporation rate, the pheromone increment constant, the number of ants, the weight parameter, the sizes of the classifiers' and MOACO's archives, and the number of iterations used as MOACO's stopping criterion were fixed across all experiments.

The parameters of method 3 were set the same as those in [33].

5.3. Results and Discussions

The results of the three methods on the three datasets are shown in Tables 3, 4, and 5 (partial results).

The results on dataset Colon_ER in Table 3 show that method 1 has lower precision and a higher recall rate than the other two methods: high-dimensional data contains irrelevant features and noise, and adopting all features to train the classifier may lead to overfitting, so that most samples are incorrectly classified as matches. From Tables 4 and 5, we can see that method 1 does not identify any matching records on datasets GLIOMA_ER and GLI_85_ER. This is because those two datasets have higher dimensionality than Colon_ER, which means more irrelevant features and noise that reduce the classifier's ER performance. The results of method 1 demonstrate that using all features of the records may degrade the classifier's performance and even make it unusable for high-dimensional ER.

The results of method 3 on the three datasets demonstrate that it performs better than method 1. Feature selection filters out irrelevant features and noise, which helps to improve the performance of the classifier. Method 3 takes classification precision, recall rate, and cardinality of features as optimization objectives for feature selection, so the selected features allow the classifier to classify most samples correctly, improving both its precision and its recall rate. The values of the F measure also show that method 3 performs better for high-dimensional ER than method 1.

Finally, method 2 outperforms both method 1 and method 3 on the three datasets for high-dimensional ER. There are three reasons for this. First, method 2 takes the classification accuracy rate as an optimization objective, which accounts for both the false alarm rate and the missed match rate of the classifiers; Table 6 also shows that method 2 has a better classification accuracy rate than the other two methods, meaning it classifies most positive (matches) and negative (nonmatches) samples correctly. Second, method 3 applies only one classifier and a few features of the high-dimensional data, which leads to a loss of rich information; the results on GLIOMA_ER show that high-dimensional data causes the local feature relevance problem and that the performance of method 3 deteriorates as the dimensionality increases. In contrast, method 2 adopts ensemble classifiers that take the dissimilarity between classifiers as an optimization objective; it chooses complementary features which maximize the dissimilarity between classifiers, making full use of the rich information in high-dimensional data and overcoming the local feature relevance problem. Third, method 2 uses ensemble learning to combine the base binary classifiers trained on complementary feature subsets, which further improves the ensemble's performance for high-dimensional ER.

Table 7 shows the values of the final solutions of the five base binary classifiers of method 2 on objectives $f_1$ and $f_2$. The $f_2$ values of the solutions (except for the first classifier, since there were no other classifiers when it was trained) are all greater than 0.5, which demonstrates that the five classifiers are dissimilar from each other because they are trained on complementary feature subsets. So method 2 can make full use of the rich information in high-dimensional data to further improve its classification performance. We can also conclude that the solutions found by the modified MOACO allow the classifiers to achieve a tradeoff between classification precision and recall rate.

6. Conclusions

To improve the utilization rate of high-dimensional data features and reduce the impact of irrelevant features, an ensemble learning method based on feature selection is proposed, and the following conclusions can be drawn from the experiments.

(1) Applying feature selection reduces the impact of irrelevant features and noise, which improves classification performance for high-dimensional ER.

(2) Taking the classification accuracy rate as an optimization objective yields a tradeoff between classification precision and recall rate.

(3) The complementary feature subsets obtained by maximizing the dissimilarity between classifiers address the local feature relevance problem efficiently.

(4) Combining classifiers trained on complementary feature subsets through ensemble learning further improves the algorithm's performance for high-dimensional ER.

(5) Note that the proposed method is very generic. It is applicable to ER on any database provided there is a way to calculate the distance between two given records, regardless of the number of dimensions (our method is also suitable for low-dimensional settings, where it can further improve ER performance). To illustrate the effectiveness of our method we used three datasets, but any other datasets could have been used with a suitable distance measure.

(6) The proposed method is also very generic with respect to the solution framework. Instead of MOACO, any other optimization technique could be used, and the SVM could be replaced by other classifiers.

Though our method has many advantages, some problems remain to be solved. First, we adopt an evolutionary algorithm to solve the model, and the running time increases as the dimensionality grows (e.g., selecting 20 features from data with 10000 features takes longer than from data with 5000 features). Second, there are more nonmatches than matches in ER, which is known as the imbalanced data problem, but we do not consider it in our method. Finally, more research is needed on exploiting complementary feature subsets to further improve the utilization rate of high-dimensional data features.

Competing Interests

The authors declare that there is no conflict of interest regarding the publication of this manuscript.

Acknowledgments

This work is supported by the Natural Science Foundation of China under Grant no. 61371196, the China Postdoctoral Science Special Foundation under Grant no. 201003797, the China Postdoctoral Science Foundation under Grant no. 2015M582832 and Grant no. 20090461425, and the Jiangsu Postdoctor Research Fund of China under Grant no. 1402138C.