Research Article

Query Execution Optimization in Spark SQL

Algorithm 2

Calculating the number of tuples produced by a join operation using the histogram method.
Estimate the size of a join result by the histogram method
Input: HR = {h1r, h2r, …, hnr}, HS = {h1s, h2s, …, hms} (below, hi denotes bucket hir of HR and hj denotes bucket hjs of HS)
Output: Total number of tuples Sum after the join
procedure
i ⟵ 1; j ⟵ 1; Sum ⟵ 0;
while i ≤ n and j ≤ m do
  if hi and hj overlap then
   Overlap ⟵ width of the overlapping range of hi and hj
   templeft ⟵ hi.times ∗ Overlap/(hi.end-hi.start)
   tempright ⟵ hj.times ∗ Overlap/(hj.end-hj.start)
   Sum ⟵ Sum + templeft ∗ tempright/Overlap
   if hi.end < hj.end then
    i ⟵ i + 1
   else
    j ⟵ j + 1
   end if
  else
   if hi.end < hj.start then
    i ⟵ i + 1
   else
    j ⟵ j + 1
   end if
  end if
end while
end procedure
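
The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Bucket` class and the function name `estimate_join_size` are hypothetical, both histograms are assumed to be sorted by `start` with non-empty, non-overlapping buckets, and values are assumed to be uniformly distributed inside each bucket. The two pointer-advancing branches of the pseudocode are folded into a single comparison of bucket end points, which is a deliberate simplification so that buckets that merely touch (zero-width overlap) are skipped without losing later matches.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bucket:
    """One histogram bucket (hypothetical helper type for this sketch)."""
    start: float  # lower bound of the bucket's value range
    end: float    # upper bound of the bucket's value range
    times: float  # number of tuples whose join key falls in the bucket


def estimate_join_size(hr: List[Bucket], hs: List[Bucket]) -> float:
    """Estimate the join cardinality from two histograms sorted by start."""
    i, j, total = 0, 0, 0.0
    while i < len(hr) and j < len(hs):
        hi, hj = hr[i], hs[j]
        # Width of the range shared by the two buckets (<= 0 means no overlap).
        overlap = min(hi.end, hj.end) - max(hi.start, hj.start)
        if overlap > 0:
            # Tuples of each side that fall into the shared range,
            # assuming a uniform distribution inside each bucket.
            left = hi.times * overlap / (hi.end - hi.start)
            right = hj.times * overlap / (hj.end - hj.start)
            # Same update as in the pseudocode: Sum + templeft * tempright / Overlap.
            total += left * right / overlap
        # Advance the pointer of the bucket that ends first.
        if hi.end < hj.end:
            i += 1
        else:
            j += 1
    return total


if __name__ == "__main__":
    r_hist = [Bucket(0, 10, 100)]
    s_hist = [Bucket(5, 15, 50)]
    print(estimate_join_size(r_hist, s_hist))  # 250.0
```

In the usage example, the buckets share the range from 5 to 10 (width 5), so 50 tuples of the first histogram and 25 tuples of the second fall into it, and the estimate is 50 ∗ 25 / 5 = 250.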