Abstract

Classifiers are often used in entity resolution to classify record pairs into matches, nonmatches, and possible matches, so the performance of the classifier directly affects the performance of entity resolution. In this paper, we develop a multiple classifier system using resampling and ensemble selection. We make full use of the characteristics of entity resolution to distinguish ambiguous instances before classification, so that the algorithm can focus on the ambiguous instances in parallel. Instead of deriving a single empirically optimal resampling ratio, we vary the ratio over a range to generate multiple resampled datasets. We then use the resampled data to train multiple classifiers and apply ensemble selection to choose the best classifier subset, which also corresponds to the best combination of resampling ratios. An empirical study shows that our method achieves relatively high accuracy compared with other state-of-the-art multiple classifier systems.

1. Introduction

Entity resolution, also called duplicate record detection, is the process of identifying different or multiple records that refer to one unique real-world entity or object [1]. It is widely used in homeland security, customer relationship databases, and fraud and crime detection [2]. Christen summarized the general entity resolution process as three main steps: indexing, where similar records are grouped together; record pair comparison, where each field is compared using a similarity function to generate numeric similarity values; and similarity vector classification, where record pairs are classified into matches, nonmatches, and possible matches [2]. The possible matches are ambiguous records, which often require experts to assess them manually and classify them further into matches or nonmatches. When a classification algorithm is used for similarity vector classification, entity resolution becomes a typical classification problem; for instance, Bilenko et al. used SVM to conduct similarity vector classification [3]. To improve resolution effectiveness, existing classification methods such as multiple classifier systems can also be applied; for example, Tejada et al. used multiple classifiers to detect ambiguous records and asked users for feedback to reach high accuracy [4], which shows the advantage of using a multiple classifier system in entity resolution. However, applications of multiple classifier systems in entity resolution remain rare, and even fewer studies take the characteristics of entity resolution into account when developing a multiple classifier system.

In this paper, we focus on the third step of entity resolution by constructing a multiple classifier system to improve resolution effectiveness; we also make use of the characteristics of entity resolution in developing the multiple classifier system.

Many kinds of multiple classifier systems have been developed, such as Bagging, Boosting, and AdaBoost [5]. AdaBoost increases the weight of ambiguous instances to gain high accuracy, which shows the effectiveness of emphasizing ambiguous data; however, its training is a sequential process.

Instead of using all classifiers in a multiple classifier system, Zhou et al. showed that selecting a proper subset is superior to selecting all; their work also showed that it is better to select from parallel methods like Bagging than from sequential methods like Boosting [6].

The process of selecting a subset from a multiple classifier system is variously called ensemble selection, ensemble pruning, or ensemble thinning. Many researchers have worked on ensemble selection; the diversity among component classifiers is regarded as playing an important role, but how to measure and evaluate diversity remains an open problem [7, 8].

Yu et al. managed diversity in a deterministic mathematical programming framework. Their theoretical analysis in a PAC learning framework showed that diversity can effectively reduce the hypothesis space complexity, implying that diversity control in ensemble selection plays the role of regularization as in statistical learning approaches. The resulting problem is a quadratically constrained quadratic program (QCQP), which they solved with alternating optimization to improve efficiency [8].

Li et al. defined diversity based on the average of pairwise differences and used diversity together with accuracy to conduct ensemble selection, achieving good results. Their theoretical analysis showed that encouraging diversity can reduce generalization error, thus enhancing accuracy. The solution is greedy forward pruning, which is relatively efficient [9].

Rafal et al. defined the competence of classifiers based on the probability of correct classification and pairwise diversity based on conditional probabilities of error, and then constructed an ensemble selection model using competence and diversity. The resulting model is a combinatorial optimization problem, which they solved using simulated annealing [10].

Yin et al. defined a convex diversity measure based on ensemble ambiguity and presented a general ensemble selection framework with diversity and sparsity. With the convex measure, they converted the ensemble selection into a convex optimization problem; they further showed that sparsity will force some weights of classifiers to be zero, realizing selection [11].

In summary, diversity among component classifiers works like regularization in general statistical learning approaches; in addition, measures such as accuracy, sparsity, and competence, which focus on the classification performance of component classifiers and on ensemble size, can also be used in ensemble selection.

3. Resampling and Ensemble Selection

In this section, we develop a highly accurate multiple classifier system, resampling and ensemble selection (RES). We first vary the resampling ratio over a range to resample the ambiguous instances and generate a group of new instance sets, then use these to train multiple SVM classifiers, use diversity and sparsity to select the best classifier subset, and finally use weighted voting to make the final classification decision. The outline of the construction of the multiple classifier system is shown in Figure 1.

3.1. Resampling

In resampling, we make full use of the characteristics of entity resolution to distinguish ambiguous instances before classification, then use the idea of REA [12] to resample the ambiguous ones, and use the resampled data to train a group of SVMs.

The record similarity of each record pair is usually calculated in entity resolution. Pairs with high similarity are likely to be matches, pairs with low similarity are likely to be nonmatches, and pairs whose similarity is neither too high nor too low can be either and are assumed to be ambiguous.

The formal description of distinguishing ambiguous instances is as follows. Let $S_{dup}$ be the record similarity vector of duplicate record pairs, let $S_{dis}$ be the record similarity vector of distinct pairs, let $\mu_{dup}$ and $\mu_{dis}$ be the expectations of $S_{dup}$ and $S_{dis}$, respectively, and let $\sigma_{dup}$ and $\sigma_{dis}$ be the variances of $S_{dup}$ and $S_{dis}$, respectively. The distribution of record similarity is approximately normal, and most values of a normally distributed variable lie within a region around its expectation $\mu$ whose width is determined by its variance $\sigma$. We can therefore assume that record pairs whose similarity lies within this region for $S_{dup}$ are likely to be duplicates, and those within the corresponding region for $S_{dis}$ are likely to be distinct; hence, pairs whose similarity falls between the lower bound of the duplicate region and the upper bound of the distinct region can be regarded as ambiguous.
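The split described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the width multiplier `k`, the use of the population standard deviation as the spread, and the function names `ambiguity_bounds` and `split_ambiguous` are all hypothetical choices, since the exact region is not legible in the text.

```python
import statistics

def ambiguity_bounds(dup_sims, dis_sims, k=1.0):
    """Return (LB, UB): lower bound of the duplicate region and
    upper bound of the distinct region.

    k is a hypothetical width multiplier (assumption), and the spread is
    measured with the population standard deviation (assumption).
    """
    lb = statistics.mean(dup_sims) - k * statistics.pstdev(dup_sims)
    ub = statistics.mean(dis_sims) + k * statistics.pstdev(dis_sims)
    return lb, ub

def split_ambiguous(sims, lb, ub):
    """Split instance indices into ambiguous (LB <= s <= UB) and normal."""
    ambiguous = [i for i, s in enumerate(sims) if lb <= s <= ub]
    normal = [i for i, s in enumerate(sims) if not (lb <= s <= ub)]
    return ambiguous, normal

# Toy record similarities (made-up values for illustration only).
dup_sims = [0.9, 0.8, 0.6, 0.5]   # known duplicate pairs
dis_sims = [0.1, 0.3, 0.5, 0.7]   # known distinct pairs
lb, ub = ambiguity_bounds(dup_sims, dis_sims)
ambiguous, normal = split_ambiguous(dup_sims + dis_sims, lb, ub)
```

When the two similarity distributions do not overlap, LB exceeds UB and the ambiguous set is empty, so a caller would need to handle that case before resampling.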

We give the pseudocode of the resampling algorithm in Algorithm 1.

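Input:
(1) the dataset to be resampled: $D$
    % each instance is the field similarity vector of a record pair, and the class label indicates whether the corresponding record pair is a match or a non-match
(2) the ratio of resampling: $r$
Initialization:
(3) ambiguous set $D_a = \emptyset$, normal set $D_n = \emptyset$, resampled set $D_r = \emptyset$
Splitting the dataset into ambiguous and normal:
(4) get the instance number $n$
(5) Split dataset $D$ into duplicate data and distinct data
(6) Average each instance to get the record similarity vectors $S_{dup}$ and $S_{dis}$
(7) Calculate the expectation and variance of $S_{dup}$ and $S_{dis}$ as $\mu_{dup}$, $\sigma_{dup}$ and $\mu_{dis}$, $\sigma_{dis}$, respectively
(8) Calculate the lower bound LB of $S_{dup}$ from $\mu_{dup}$ and $\sigma_{dup}$, and the upper bound UB of $S_{dis}$ from $\mu_{dis}$ and $\sigma_{dis}$
(9) for $i = 1, \ldots, n$
(10)   If $s_i$ lies between LB and UB    % $s_i$ is the similarity of the $i$th instance
(11)     add the $i$th instance to $D_a$    % ambiguous instances
(12)   Else
(13)     add the $i$th instance to $D_n$    % normal instances
Resampling:
(14) For each draw from the ambiguous set (the number of draws is set by $r$)
(15)   Randomly select an instance from $D_a$
(16)   add it to $D_r$
(17) For each draw from the normal set (the number of draws is likewise set by $r$)
(18)   Randomly select an instance from $D_n$
(19)   add it to $D_r$
(20) Order $D_r$ in random order
Output:
(21) the resampled dataset: $D_r$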

In Algorithm 1, we first calculate the upper bound and the lower bound (6–8), then split the dataset into ambiguous and normal data (9–13), and finally conduct resampling according to the resampling ratio (14–20).

The overall time complexity of resampling is linear.
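The resampling stage (steps 14 to 20) might look like the following sketch. The draw counts are an assumption: here a fraction `ratio` of the output is drawn from the ambiguous set and the rest from the normal set, both with replacement, which may differ from the exact counts the authors used. The function name `resample` and its parameters are hypothetical.

```python
import random

def resample(ambiguous, normal, ratio, size, seed=None):
    """Draw `size` instances with replacement, then shuffle.

    Assumption: round(ratio * size) draws come from the ambiguous set and
    the remaining draws come from the normal set.
    """
    rng = random.Random(seed)
    n_amb = round(ratio * size)
    drawn = [rng.choice(ambiguous) for _ in range(n_amb)]
    drawn += [rng.choice(normal) for _ in range(size - n_amb)]
    rng.shuffle(drawn)  # step (20): random order
    return drawn

sample = resample(ambiguous=["a1", "a2"], normal=["n1", "n2", "n3"],
                  ratio=0.4, size=10, seed=0)
```

Each draw and the final shuffle take time proportional to the output size, which is consistent with the linear complexity stated above. Both input sets must be nonempty, since `random.Random.choice` raises an error on an empty sequence.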

3.2. Ensemble Selection

In ensemble selection, we also use diversity and sparsity. The general formulation (1), taken from [11], combines a loss term with a sparsity term and a diversity term, weighted by a sparsity control parameter and a diversity control parameter, respectively.

For the loss function, we use the least squares error, which is also used in [11]; it is defined over the weight vector $w$, the output vector of all classifiers on the $i$th instance, and the target label of the $i$th instance.

In the case of binary classification with class labels 1 and −1, the per-instance error is bounded, so the loss function can be normalized. The normalized loss is then written in terms of the prediction matrix of all classifiers, whose size is given by the number of classifiers and the number of test instances.
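One way to see why such a normalization is available is sketched below. It is an illustration under assumptions, not the authors' derivation: $N$ (number of instances), $f_i$ (outputs of all classifiers on instance $i$, each in $\{-1, 1\}$), and $y_i$ (target label) are notation chosen here, and the weights are assumed to be nonnegative and to sum to one, which the text does not legibly state. The constant actually used in the paper may differ.

```latex
L(w) = \sum_{i=1}^{N} \left( w^{\top} f_i - y_i \right)^2, \qquad
\left| w^{\top} f_i \right| \le 1 \;\Rightarrow\; \left( w^{\top} f_i - y_i \right)^2 \le 4, \qquad
0 \le \frac{1}{4N} L(w) \le 1 .
```

Under these assumptions the weighted output $w^{\top} f_i$ is a convex combination of values in $\{-1, 1\}$, so it differs from a label in $\{-1, 1\}$ by at most 2.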

For the diversity measure (5), we use the measure defined in [6].

Also, in the case of binary classification with class labels 1 and −1, (5) can be rewritten using an all-ones row vector.

As to sparsity, we just set .

Since sparsity will force some weights to be zero, we extend the diversity measure to include the weights as a parameter, so that only classifiers with weight above zero are included when calculating diversity. The extended diversity measure applies the sign function sgn to the weights and uses the number of classifiers whose weights are above zero.

To solve the problem, we convert formula (1) into formula (10).

Formula (10) has two control parameters; it is a typical nonlinear programming problem and can be solved using an existing optimization tool.

We give the pseudocode of ensemble selection in Algorithm 2.

Input:
(1) the prediction matrix of all classifiers
(2) the target label
(3) the sparsity and diversity control parameters
Initialization:
(4) ,
Ensemble selection:
(5) get the instance number
(6) apply an optimization tool to solve (10) to get the weight vector
(7)
(8)     % sgn is sign function
Output:
(9) the weighted prediction

In Algorithm 2, the main process is to use an optimization tool to solve (10); the overall time complexity is determined by the number of classifiers and also depends on the specific optimization tool used.
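The final step, turning solver weights into a weighted prediction with the sign function, can be sketched as follows. This is an assumed reading of the last steps of Algorithm 2: the weight values are made up, and `weighted_vote` is a hypothetical helper, not a function from the paper or from any optimization toolbox.

```python
def sign(x):
    """Sign function: 1 for positive, -1 for negative, 0 for zero."""
    return (x > 0) - (x < 0)

def weighted_vote(weights, predictions):
    """Combine classifier outputs with weights.

    predictions[j][i] is the label (+1 or -1) that classifier j assigns to
    instance i; a classifier with zero weight does not affect the result.
    """
    n_instances = len(predictions[0])
    return [
        sign(sum(w * row[i] for w, row in zip(weights, predictions)))
        for i in range(n_instances)
    ]

weights = [0.6, 0.0, 0.4]          # hypothetical solver output
predictions = [[1, -1, 1],         # classifier 0
               [-1, -1, -1],       # classifier 1 (weight 0, effectively pruned)
               [1, 1, -1]]         # classifier 2
labels = weighted_vote(weights, predictions)
```

Because the second classifier has weight zero, the result depends only on the two selected classifiers, which is how sparsity realizes selection.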

4. Experiment

4.1. Settings

We use 10 synthetic datasets and 5 real datasets. The synthetic datasets are built from abalone, dermatology, ionosphere, breast cancer, seismic, ILPD, vote, biodeg, glass, and diabetes from the UCI Machine Learning Repository, to which we apply a duplicate generation tool to generate duplicates. For convenience, we only choose numeric and nominal fields.

For the real-world datasets, we use abt_buy, amazon_gp, dblp_acm, and dblp_scholar, previously used in [13], and cora, previously used in [14].

When calculating the similarity of each field, we use Jaccard similarity for character fields and separate similarity measures for numeric and nominal fields.
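As an illustration of the character-field case, Jaccard similarity can be computed over character q-gram sets as below. The choice of q-grams with `q=2` is an assumption, since the text only names Jaccard similarity. The helper `record_similarity` averages field similarities, following step (6) of Algorithm 1; both function names are hypothetical.

```python
def jaccard(a, b, q=2):
    """Jaccard similarity of two strings over their character q-gram sets.

    Tokenizing into q-grams is an assumption made for this sketch.
    """
    grams_a = {a[i:i + q] for i in range(len(a) - q + 1)}
    grams_b = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to tokenize count as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def record_similarity(field_sims):
    """Average the field similarities of one record pair."""
    return sum(field_sims) / len(field_sims)

s = jaccard("night", "nacht")   # the two strings share only the bigram "ht"
```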

For each dataset, we conduct 10 runs of 5-fold cross validation, in which 4/5 of the data are randomly selected as training data and the remaining 1/5 as test data. We use the average and variance over the 10 runs to evaluate classification performance.
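The evaluation protocol can be sketched as an index generator for repeated k-fold cross validation. This is a generic sketch using only the standard library; the function name `repeated_kfold` and the seeding scheme are assumptions, not details from the paper.

```python
import random

def repeated_kfold(n, k=5, runs=10, seed=0):
    """Yield (train_idx, test_idx) for `runs` repetitions of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(runs):
        idx = list(range(n))
        rng.shuffle(idx)                       # new random partition per run
        folds = [idx[f::k] for f in range(k)]  # k roughly equal folds
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, test

splits = list(repeated_kfold(n=20))
```

With 20 instances this yields 50 train/test pairs, each holding out one fifth of the data.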

We compare our algorithm with Gentle AdaBoost [5] (a sequential multiple classifier system), Bagging [1] (neither resampling nor ensemble selection), DREP [6] (only ensemble selection), and REA [12] (only resampling).

For Gentle AdaBoost, we use the MATLAB code implemented by Alexander Vezhnevets; its base classifier is CART and it carries out 100 iterations.

For Bagging, we conduct Bootstrap sampling on the training data to train 21 SVM classifiers and then use majority voting to get the final prediction.
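The two ingredients of this baseline, bootstrap sampling and majority voting, can be sketched as follows. Training the SVMs themselves is omitted, and the helper names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(votes):
    """Return the most common label among the component predictions."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
boot = bootstrap_sample(list(range(5)), rng)   # one bootstrap replicate
decision = majority_vote([1] * 11 + [-1] * 10) # 21 votes on one instance
```

With an odd number of component classifiers such as 21, a binary majority vote cannot end in a tie.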

For DREP, we first conduct 1 run of Bagging and use the output of Bagging as the input of DREP; we vary the tradeoff parameter of DREP from 0.05 to 0.5 with step 0.05 to get 10 results and use the average as the final result.

For REA, we use its resampling equations to calculate the empirical resampling ratio and resample the training data with that ratio to train an SVM classifier.

For RES, we vary the resampling ratio from 0.4 to 0.6 or from 0.4 to 0.8 with step 0.01 to resample the training data and train SVM classifiers, then use the trained classifiers to predict on the test data to generate a prediction matrix, and use ensemble selection to get the final weighted prediction. For simplicity, the two control parameters in (10) are set to 1 and 0.7, respectively.
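The ratio ranges translate into the following grids. This sketch assumes both endpoints are included and that one classifier is trained per ratio, which the text implies but does not state explicitly; `ratio_grid` is a hypothetical helper.

```python
def ratio_grid(start, stop, step=0.01):
    """Resampling ratios from start to stop inclusive.

    Values are rounded to two decimals to avoid floating-point drift,
    which suits the step of 0.01 used here.
    """
    count = round((stop - start) / step) + 1
    return [round(start + i * step, 2) for i in range(count)]

narrow = ratio_grid(0.4, 0.6)   # 21 ratios
wide = ratio_grid(0.4, 0.8)     # 41 ratios
```

Under that assumption the narrow range gives a pool of 21 candidate classifiers, the same size as the Bagging baseline.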

We use the RBF kernel for the SVM, with kernel width 0.4 and tradeoff parameter 100. We use fmincon in the MATLAB Optimization Toolbox to solve (10).

4.2. Results

The overall accuracy comparison is shown in Table 1. On each dataset, an entry is marked with a bullet “•” (or circle “○”) if it is significantly better (or worse) than Bagging based on a t-test at the 0.05 significance level; the win/tie/loss counts are summarized in the last row, and the entry with the best performance on each dataset is marked in bold.

DREP does not perform as well, because its result depends on the Bagging output it receives as input and it requires a properly chosen tradeoff parameter.

Gentle AdaBoost wins 11 times and achieves the best performance on 8 datasets, sometimes with quite good results. On the real datasets, however, it wins only twice and achieves the best performance only once; moreover, it is sometimes inferior to Bagging, losing on 3 real datasets.

REA does not perform well on the synthetic data, but on the real datasets it wins 4 times with slight improvement.

RES wins 14 times and achieves the best performance on 6 datasets; in particular, it achieves the best performance on 3 real datasets and is very close to the best on the remaining 2. This indicates that the proposed RES is superior in accuracy to simple ensemble selection (DREP), simple resampling (REA), and Gentle AdaBoost: it is significantly better than Bagging on 14 of the 15 datasets and achieves the best performance on several of them.

The runtime comparison is shown in Table 2; each entry is the average of the 10 runs, in seconds.

The time complexity of Bagging mainly depends on the number of classifiers and the training cost of a single classifier. It is quite inefficient here because it has 21 component classifiers and SVM training is somewhat time consuming.

It is easy to see that DREP is quite efficient, as it only aims at ensemble selection procedure and uses greedy forward pruning.

Gentle AdaBoost is also efficient, because its base classifier is CART, which is relatively more efficient than SVM in this case.

The resampling step of REA is linear and therefore very efficient; REA nevertheless appears time consuming because most of its time is spent training the SVM.

RES is relatively efficient too: the resampling is linear, and for medium-scale nonlinear programming problems fmincon in the MATLAB Optimization Toolbox uses line search, which is also efficient.

The time consumed by Bagging is almost 21 times that of REA, reflecting the efficiency of resampling. Besides, Bagging is the basis of DREP and RES.

RES is not very sensitive to parameter choice: the key parameter in resampling is the resampling ratio, which is replaced with a range, and the control parameters in (10) do not matter much either.

5. Conclusion

By making full use of the characteristics of entity resolution, we emphasize the ambiguous instances through resampling; in addition, we construct a parallel multiple classifier system by varying the resampling ratio to form multiple classifiers and using ensemble selection to select the best classifier subset. The empirical study shows that our system has relatively high accuracy compared with other state-of-the-art multiple classifier systems.

RES suits situations where accuracy is the priority, and the results demonstrate the effectiveness of resampling and ensemble selection in entity resolution.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.