Research on Cross-Company Defect Prediction Method to Improve Software Security
Algorithm 3
Metric matching and sample weight setting based CCDP algorithm.
Input: source company project datasets S as training datasets;
target company project datasets T as test datasets;
Output: trained defect prediction model
(1)
Clustering-based metric matching:
(a)
Use the MaxMinNormalization (min-max) method to normalize S and T, obtaining SNorm and TNorm;
(b)
Extract the multigranularity metric feature vectors of S and T;
(c)
Apply the K-means clustering algorithm to cluster the source and target metric feature vectors into K clusters each;
(d)
Extract the principal component of each cluster through PCA as that cluster's representative vector, for the source and the target respectively;
(e)
Perform one-to-one metric matching between the source and target representative vectors;
(f)
Redistribute SNorm and TNorm based on the above steps.
(2)
Sample selection-based weight setting:
(a)
Use NNFilter to select, for each target sample, the N source samples most similar to it by Euclidean distance as the candidate training data samples SCandidate;
(b)
Count how often each selected source sample occurs in SCandidate;
(c)
Use the sample frequency information in SCandidate as the basis for setting sample weights.
(3)
CCDP model construction and verification:
(a)
Use the weighted source samples as the training dataset and apply common machine learning methods, including logistic regression (LR), naive Bayes (NB), and k-nearest neighbors (KNN), to construct the prediction model;
(b)
Perform experiments on multiple defect datasets and evaluate the performance of the proposed method.
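The normalization in step (1)(a) and the sample weighting in step (2) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (min_max_normalize, nn_filter_frequencies, frequency_weights) and the toy data are hypothetical, the metrics are assumed to be already matched, and the rule "weight = selection count" is only one plausible reading of step (2)(c), since the listing does not give the exact weighting formula.

```python
from collections import Counter
from math import dist


def min_max_normalize(rows):
    """Step (1)(a): scale each metric (column) to [0, 1].
    Constant columns are mapped to 0.0 to avoid division by zero."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        [(v - lo) / sp if sp else 0.0 for v, lo, sp in zip(row, lows, spans)]
        for row in rows
    ]


def nn_filter_frequencies(source, target, n_neighbors):
    """Steps (2)(a)-(b): for every target sample, select its n nearest
    source samples by Euclidean distance, and count how often each
    source index is selected across all target samples."""
    freq = Counter()
    for t in target:
        ranked = sorted(range(len(source)), key=lambda i: dist(source[i], t))
        freq.update(ranked[:n_neighbors])
    return freq


def frequency_weights(freq, n_source):
    """Step (2)(c), ASSUMED rule: weight equals the selection count;
    source samples that were never selected get weight 0."""
    return [float(freq.get(i, 0)) for i in range(n_source)]


# Hypothetical toy data: two already-matched, already-normalized metrics.
source = [[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8], [0.5, 0.5]]
target = [[0.0, 0.1], [0.2, 0.1], [1.0, 1.0]]

freq = nn_filter_frequencies(source, target, n_neighbors=2)
weights = frequency_weights(freq, len(source))
print(weights)  # -> [2.0, 2.0, 1.0, 1.0, 0.0]
```

For step (3), weights of this kind could be passed to a learner that supports per-sample weighting. In scikit-learn, for example, the fit methods of LogisticRegression and GaussianNB accept a sample_weight argument, whereas KNeighborsClassifier does not, so KNN would need another mechanism such as resampling source samples in proportion to their weights.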