Complexity

Research Article

HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition

HSDP Algorithm.

	Input: imbalanced dataset S
	Output: balanced dataset S
	Process:
Step 1:, , , can be obtained by DP algorithm.
Step 2: count the number (m) of samples in the and . Count the number (n) of samples in the and . Meanwhile, count the number (s) of samples in the .
Step 3: calculate the number of synthetic data samples that need to be generated for the minority class: $G = (m - n) \times b$, where $b$ is the synthesis scaling factor. $b = 1$ means a balanced dataset is obtained after the oversampling process.
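Step 4: for each sample x_i, calculate the ratio of majority class samples among the k nearest neighbors of x_i. This ratio is defined as $r_i = \Delta_i / k$, where $\Delta_i$ is the number of majority class samples among the k nearest neighbors of x_i.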
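Step 5: the weight of each sample x_i is determined by normalizing its Step 4 ratio $r_i$ over the $s$ samples counted in Step 2: $w_i = r_i / \sum_{j=1}^{s} r_j$.
Step 6: calculate the number of synthetic data samples for each sample x_i in the boundary minority samples region: $g_i = w_i \times G$, where $G$ is the total number of synthetic data samples calculated in Step 3.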
Step 7: for each sample x_i in the boundary minority samples region, generate synthetic data samples according to the following steps:
	Do the loop from 1 to $g_i$ (the number of synthetic data samples calculated for x_i in Step 6):
(a)	Randomly select another sample y from the same cluster of x_i
(b)	Generate a synthetic data sample:
	$x_{\mathrm{new}} = x_i + (y - x_i) \times \lambda$,
	where $\lambda$ is a random number in $[0, 1]$
	End loop
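
Steps 3 to 7 can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions. The formulas are taken to be the ADASYN-style ones: G = (m - n) × b, r_i = Δ_i / k, w_i = r_i / Σ r_j, g_i = w_i × G, and linear interpolation between x_i and y. Here m and n are taken to be the majority and minority class counts, and g_i is rounded to the nearest integer. The names `majority_ratio`, `oversample_boundary`, `boundary`, and `cluster_of` are hypothetical. The boundary minority indices and the cluster memberships are supplied by hand, standing in for the output of the DP algorithm, which is not reproduced.

```python
import math
import random


def majority_ratio(X, y, i, k):
    """Step 4 (assumed form): share of majority samples (label 0)
    among the k nearest neighbors of X[i], by Euclidean distance."""
    others = sorted((j for j in range(len(X)) if j != i),
                    key=lambda j: math.dist(X[i], X[j]))
    return sum(1 for j in others[:k] if y[j] == 0) / k


def oversample_boundary(X, y, boundary, cluster_of, b=1.0, k=5, seed=0):
    """Generate synthetic minority samples around boundary minority samples.

    boundary   -- indices of boundary minority samples (assumed DP output)
    cluster_of -- maps each boundary index to the indices in its cluster
    """
    rng = random.Random(seed)
    m = sum(1 for label in y if label == 0)  # assumption: majority count
    n = sum(1 for label in y if label == 1)  # assumption: minority count
    G = (m - n) * b                          # Step 3

    ratios = [majority_ratio(X, y, i, k) for i in boundary]  # Step 4
    total = sum(ratios)
    if total == 0:
        return []  # assumption: no majority neighbors, nothing to generate

    synthetic = []
    for i, r in zip(boundary, ratios):
        w = r / total                        # Step 5: normalized weight
        g = round(w * G)                     # Step 6: rounding is an assumption
        partners = [j for j in cluster_of[i] if j != i]
        if not partners:
            continue                         # no other sample in the cluster
        for _ in range(g):                   # Step 7
            partner = X[rng.choice(partners)]    # (a) sample y from the cluster
            lam = rng.random()                   # lambda in [0, 1)
            synthetic.append(                    # (b) x_i + (y - x_i) * lambda
                tuple(a + (c - a) * lam for a, c in zip(X[i], partner)))
    return synthetic


# Toy data: two minority samples (label 1) and six majority samples (label 0).
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
     (5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]
y = [1, 1, 0, 0, 0, 0, 0, 0]
new_samples = oversample_boundary(
    X, y, boundary=[0, 1], cluster_of={0: [0, 1], 1: [0, 1]}, k=2)
print(len(new_samples))  # 4: with b = 1, G = 6 - 2
```

In this toy case both minority samples have one majority sample among their two nearest neighbors, so the weights are equal and each receives two synthetic samples, all lying on the segment between the two original minority points.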