Research Article

A Type-Based Blocking Technique for Efficient Entity Resolution over Large-Scale Data

Algorithm 1

Splitting attributes into different blocks.
Input: sd: dataset;
     δ: the threshold on the ratio of distinct values to sampled values, below which an attribute is treated as an enumeration;
     λ: the number of values randomly sampled from sd for each attribute;
Output: Map < BT (block type), List of attribute names >, where BT ∈ {NUME, STRING, DATE, ENUM}
(a) Map < attribute, List < v1, v2,…, vλ> > mediateData ← sd; // Use a Map to store each attribute and its values.
(b) blockMap← new HashMap < String, List>; // blockMap is used to store the return value;
(c) For each attribute in mediateData Do
(d)   valuesNoRep← Remove duplicate elements from List < v1,v2,…,vλ>;
(e)    n← valuesNoRep.size();
(f)    If ((double)n/λ < δ) then {//the type of this attribute is enumeration
(g)     List enumAttributes ← blockMap.get(“ENUM”);
(h)     If (enumAttributes == null) then {enumAttributes ← new ArrayList;
(i)        blockMap.put(“ENUM”, enumAttributes);}
(j)      enumAttributes.add(attribute name);
(k)   }Else if (the elements of valuesNoRep conform to the date type rules) then {
(l)      List dateAttributes ← blockMap.get(“DATE”);
(m)        If (dateAttributes == null) then {dateAttributes ← new ArrayList;
(n)             blockMap.put(“DATE”, dateAttributes);}
(o)        dateAttributes.add(attribute name);
(p)   }Else if (the elements of valuesNoRep conform to the numerical type rules) then {
(q)        List numericAttributes ← blockMap.get(“NUME”);
(r)         If (numericAttributes == null) then {numericAttributes ← new ArrayList;
(s)         blockMap.put(“NUME”, numericAttributes);}
(t)     numericAttributes.add(attribute name);
(u)   } Else {
(v)       //other attributes will be treated as string type
(w)      List stringAttributes ← blockMap.get(“STRING”);
(x)       If (stringAttributes == null) then {stringAttributes ← new ArrayList;
(y)       blockMap.put(“STRING”, stringAttributes);}
(z)       stringAttributes.add(attribute name);
(aa)  }
(bb)  End For each attribute in mediateData
(cc)   return blockMap;
(Note: δ and λ should be adjusted according to the size of the dataset.)
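
For illustration, Algorithm 1 might be realized in Java roughly as follows. This is a minimal sketch under several assumptions that the pseudocode leaves open: the class name TypeBlocking and the helpers isDate and isNumeric are hypothetical, the "date type rules" are approximated by ISO-8601 parsing (yyyy-MM-dd), the "numerical type rules" by Double.parseDouble, and λ is taken to be the length of each sampled value list.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TypeBlocking {

    // Assumption: a value is a date if it parses as ISO-8601 (yyyy-MM-dd).
    static boolean isDate(String v) {
        try {
            LocalDate.parse(v.trim());
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Assumption: a value is numeric if Double.parseDouble accepts it.
    static boolean isNumeric(String v) {
        try {
            Double.parseDouble(v.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // mediateData maps each attribute name to its sampled values (step a).
    static Map<String, List<String>> splitAttributes(
            Map<String, List<String>> mediateData, double delta) {
        Map<String, List<String>> blockMap = new HashMap<>();
        for (Map.Entry<String, List<String>> e : mediateData.entrySet()) {
            List<String> values = e.getValue();
            Set<String> valuesNoRep = new HashSet<>(values); // step d
            int lambda = values.size();
            String type;
            if (lambda > 0 && (double) valuesNoRep.size() / lambda < delta) {
                type = "ENUM";   // few distinct values relative to the sample
            } else if (!valuesNoRep.isEmpty()
                    && valuesNoRep.stream().allMatch(TypeBlocking::isDate)) {
                type = "DATE";
            } else if (!valuesNoRep.isEmpty()
                    && valuesNoRep.stream().allMatch(TypeBlocking::isNumeric)) {
                type = "NUME";
            } else {
                type = "STRING"; // everything else is treated as a string
            }
            // Create the list for this block type on first use, then append.
            blockMap.computeIfAbsent(type, k -> new ArrayList<>()).add(e.getKey());
        }
        return blockMap;
    }

    public static void main(String[] args) {
        Map<String, List<String>> sample = new LinkedHashMap<>();
        sample.put("gender", List.of("M", "F", "M", "F", "M"));
        sample.put("birthDate", List.of("1990-01-15", "1985-07-02",
                "2001-12-30", "1978-03-09", "1995-10-21"));
        sample.put("salary", List.of("1000", "2500.5", "3100", "42", "780"));
        sample.put("name", List.of("Ann", "Bob", "Cid", "Dee", "Eve"));
        System.out.println(splitAttributes(sample, 0.5));
    }
}
```

With δ = 0.5 in this toy sample, gender has 2 distinct values out of 5 (ratio 0.4) and falls into ENUM, while birthDate, salary, and name fall into DATE, NUME, and STRING respectively. Because the enumeration test runs first, as in the pseudocode, a date or numeric attribute with few distinct values would also be classified as ENUM.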