Journal of Sensors

Research Article

A Type-Based Blocking Technique for Efficient Entity Resolution over Large-Scale Data

Algorithm 1

Splitting attributes into different blocks.

Input: sd: dataset;
δ: the threshold on the ratio of distinct values to sampled values, below which an attribute is treated as an enumeration;
λ: the number of values randomly selected from sd for each attribute;
Output: Map < BT (block type), list of attribute names >, where BT ∈ {NUME, STRING, DATE, ENUM}
(a) Map < attribute, List < v1, v2,…, vλ> > mediateData ← sd; // Use a Map to store each attribute and its sampled values.
(b) blockMap← new HashMap < String, List>; // blockMap is used to store the return value;
(c) For each attribute in mediateData Do
(d) valuesNoRep← Remove duplicate elements from List < v1,v2,…,vλ>;
(e) n← valuesNoRep.size();
(f) If ((double)n/λ < δ) then {//the type of this attribute is enumeration
(g) List enumAttributes ← blockMap.get(“ENUM”);
(h) If (enumAttributes == null) then {enumAttributes ← new ArrayList;
(i) blockMap.put(“ENUM”, enumAttributes);}
(j) enumAttributes.add(attribute name);
(k) }Else if (the elements of valuesNoRep conform to the date type rules) then {
(l) List dateAttributes ← blockMap.get(“DATE”);
(m) If (dateAttributes == null) then {dateAttributes ← new ArrayList;
(n) blockMap.put(“DATE”, dateAttributes);}
(o) dateAttributes.add(attribute name);
(p) }Else if (the elements of valuesNoRep conform to the numerical type rules) then {
(q) List numericAttributes ← blockMap.get(“NUME”);
(r) If (numericAttributes == null) then {numericAttributes ← new ArrayList;
(s) blockMap.put(“NUME”, numericAttributes);}
(t) numericAttributes.add(attribute name);
(u) } Else {
(v) //other attributes will be treated as string type
(w) List stringAttributes ← blockMap.get(“STRING”);
(x) If (stringAttributes == null) then {stringAttributes ← new ArrayList;
(y) blockMap.put(“STRING”, stringAttributes);}
(z) stringAttributes.add(attribute name);
(aa) }
(bb) End For each attribute in mediateData
(cc) return blockMap;
(Note: δ and λ should be adjusted according to the size of the dataset.)
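
The steps of Algorithm 1 can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The class name TypeBlocker, the method splitAttributes, and the helpers isDate and isNumeric are hypothetical. The source does not specify the date or numerical type rules, so the sketch assumes ISO-8601 dates (yyyy-MM-dd) and treats any string accepted by Double.parseDouble as numeric. The input map is assumed to already hold the sampled values of each attribute, so the size of each list plays the role of λ.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TypeBlocker {

    /**
     * Assigns each attribute to one of the blocks ENUM, DATE, NUME, or STRING.
     *
     * @param sampledValues attribute name mapped to its sampled values (hypothetical input shape)
     * @param delta         threshold on the ratio of distinct values to sampled values
     */
    static Map<String, List<String>> splitAttributes(
            Map<String, List<String>> sampledValues, double delta) {
        Map<String, List<String>> blockMap = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : sampledValues.entrySet()) {
            List<String> values = entry.getValue();
            if (values.isEmpty()) {
                continue; // no sample, so no type can be inferred
            }
            // Steps (d)-(e): remove duplicates and count the distinct values.
            Set<String> valuesNoRep = new HashSet<>(values);
            double distinctRatio = (double) valuesNoRep.size() / values.size();

            String blockType;
            if (distinctRatio < delta) {
                blockType = "ENUM";   // few distinct values: enumeration
            } else if (valuesNoRep.stream().allMatch(TypeBlocker::isDate)) {
                blockType = "DATE";
            } else if (valuesNoRep.stream().allMatch(TypeBlocker::isNumeric)) {
                blockType = "NUME";
            } else {
                blockType = "STRING"; // everything else is treated as a string
            }
            // Create the block's list on first use, then append the attribute name.
            blockMap.computeIfAbsent(blockType, k -> new ArrayList<>()).add(entry.getKey());
        }
        return blockMap;
    }

    // Assumed date rule: ISO-8601 calendar dates such as 2001-12-31.
    static boolean isDate(String value) {
        if (value == null) {
            return false;
        }
        try {
            LocalDate.parse(value.trim());
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Assumed numeric rule: anything Double.parseDouble accepts.
    static boolean isNumeric(String value) {
        if (value == null) {
            return false;
        }
        try {
            Double.parseDouble(value.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

With δ = 0.5, for instance, an attribute whose ten sampled values contain only two distinct entries has a ratio of 0.2 and lands in the ENUM block. Because the enumeration test comes first, as in steps (f) to (z), a low-cardinality date or numeric attribute is also classified as ENUM.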