Research Article

HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition

Algorithm 3: HSDP algorithm.
Input: imbalanced dataset S
Output: balanced dataset S
Process:
Step 1: partition S into its sample subsets using the DP algorithm.
Step 2: count the number (m) of samples in the and . Count the number (n) of samples in the and . Meanwhile, count the number (s) of samples in the .
Step 3: calculate the number of synthetic data samples that need to be generated for the minority class, using the synthesis scaling factor b. Setting b = 1 means that a balanced dataset is obtained after the oversampling process.
Step 4: for each sample xi, calculate the ratio of majority class samples among the k nearest neighbors of xi, that is, the number of majority class samples among those neighbors divided by k.
Step 5: determine the weight of each sample xi from this ratio.
Step 6: calculate the number of synthetic data samples to generate for each sample xi in the boundary minority samples region.
Step 7: for each sample xi in the boundary minority samples region, generate synthetic data samples according to the following steps:
 Do the loop from 1 to the number of synthetic data samples calculated for xi in Step 6:
(a)  Randomly select another sample y from the same cluster as xi
(b)  Generate a synthetic data sample from xi and y
 End loop
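
Steps 3 to 7 can be sketched in Python as follows. The listing above does not give the exact formulas, so this sketch fills them in with ADASYN-style forms. That choice is an assumption and may differ from the authors' definitions: the total is taken as (majority count - minority count) x b, the weights are the ratios normalized to sum to 1, the per-sample count is the rounded weight times the total, and a synthetic sample is a random linear interpolation between xi and y. The function name, its parameters, and the neighbor search over the union of boundary minority and majority samples are all hypothetical. The DP partition and the cluster labels are taken as given inputs.

```python
import numpy as np


def hsdp_oversample_sketch(boundary_min, clusters, majority,
                           n_majority, n_minority, b=1.0, k=5, seed=0):
    """ADASYN-style sketch of Steps 3-7; every formula here is an assumption."""
    rng = np.random.default_rng(seed)
    boundary_min = np.asarray(boundary_min, dtype=float)
    majority = np.asarray(majority, dtype=float)
    clusters = np.asarray(clusters)

    # Step 3 (assumed form): total number of synthetic samples to generate.
    total = max(0, int(round((n_majority - n_minority) * b)))

    # Step 4: ratio of majority samples among the k nearest neighbors of xi.
    # Assumption: neighbors are searched in boundary minority + majority samples.
    pool = np.vstack([boundary_min, majority])
    is_major = np.r_[np.zeros(len(boundary_min), dtype=bool),
                     np.ones(len(majority), dtype=bool)]
    k_eff = min(k, len(pool) - 1)
    ratios = np.empty(len(boundary_min))
    for i, x in enumerate(boundary_min):
        dist = np.linalg.norm(pool - x, axis=1)
        dist[i] = np.inf  # exclude xi itself from its own neighborhood
        nearest = np.argsort(dist)[:k_eff]
        ratios[i] = is_major[nearest].mean()

    # Step 5 (assumed form): normalize the ratios so the weights sum to 1.
    ratio_sum = ratios.sum()
    if ratio_sum > 0:
        weights = ratios / ratio_sum
    else:
        weights = np.full(len(ratios), 1.0 / len(ratios))

    # Step 6 (assumed form): per-sample count is the rounded share of the total.
    counts = np.rint(weights * total).astype(int)

    # Step 7: interpolate between xi and a random sample y of the same cluster.
    synthetic = []
    for i, x in enumerate(boundary_min):
        mates = np.flatnonzero(clusters == clusters[i])
        mates = mates[mates != i]
        if mates.size == 0:
            continue  # no other sample in this cluster to pair with
        for _ in range(counts[i]):
            y = boundary_min[rng.choice(mates)]
            lam = rng.random()  # assumed: uniform random number in [0, 1)
            synthetic.append(x + lam * (y - x))

    synthetic = np.array(synthetic).reshape(-1, boundary_min.shape[1])
    return synthetic, ratios, weights, counts
```

Because each count is rounded independently, the counts may not sum exactly to the total from Step 3. Each synthetic sample lies on the line segment between two boundary minority samples of the same cluster, so it cannot fall between distant clusters.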