Scientific Programming

Research Article

Query Execution Optimization in Spark SQL

Calculation of the number of tuples produced by a join operation using the histogram method.

	Estimate the size of a join result by the histogram method
	Input: H_R = {h₁r, h₂r, …, h_nr}, H_S = {h₁s, h₂s, …, h_ms}
	Output: Total number of tuples Sum after the join
	procedure
	  i ⟵ 1; j ⟵ 1; Sum ⟵ 0
	  while i ≤ n and j ≤ m do
	    if h_ir and h_js overlap (over a range of nonzero width) then
	      Overlap ⟵ width of the range shared by the two histogram buckets
	      templeft ⟵ h_ir.times ∗ Overlap/(h_ir.end − h_ir.start)
	      tempright ⟵ h_js.times ∗ Overlap/(h_js.end − h_js.start)
	      Sum ⟵ Sum + templeft ∗ tempright/Overlap
	      if h_ir.end < h_js.end then
	        i ⟵ i + 1
	      else
	        j ⟵ j + 1
	      end if
	    else
	      if h_ir.end ≤ h_js.start then
	        i ⟵ i + 1
	      else
	        j ⟵ j + 1
	      end if
	    end if
	  end while
	end procedure
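
The procedure above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation or Spark SQL code: the `Bucket` class and the function name `estimate_join_size` are hypothetical names introduced here, buckets are assumed to be sorted by `start`, non-overlapping within each histogram, and of positive width, and values are assumed to be uniformly distributed inside each bucket. The pointer update is collapsed into a single rule (advance whichever bucket ends first), which is a simplification relative to the two branches in the listing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bucket:
    """One histogram bucket (hypothetical helper type, not from the paper)."""
    start: float   # lower bound of the bucket range
    end: float     # upper bound of the bucket range (assumed > start)
    times: float   # number of tuples whose join key falls in the bucket


def estimate_join_size(hr: List[Bucket], hs: List[Bucket]) -> float:
    """Estimate the join cardinality from two sorted histograms."""
    i, j, total = 0, 0, 0.0
    while i < len(hr) and j < len(hs):
        a, b = hr[i], hs[j]
        # Width of the range shared by the two buckets (<= 0 means no overlap).
        overlap = min(a.end, b.end) - max(a.start, b.start)
        if overlap > 0:
            # Tuples of each side expected to fall inside the shared range.
            left = a.times * overlap / (a.end - a.start)
            right = b.times * overlap / (b.end - b.start)
            # Same update as in the listing: Sum + templeft * tempright / Overlap.
            total += left * right / overlap
        # Advance the bucket that ends first; the other may still overlap
        # with the next bucket on this side.
        if a.end < b.end:
            i += 1
        else:
            j += 1
    return total


if __name__ == "__main__":
    r = [Bucket(0, 10, 100)]
    s = [Bucket(5, 15, 40)]
    print(estimate_join_size(r, s))
```

In the usage example, the two buckets share the range [5, 15) ∩ [0, 10), which has width 5, so 50 tuples of the left input and 20 tuples of the right input are expected to fall in it, and the estimate is 50 ∗ 20/5 = 200 tuples.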