Abstract

Software testing identifies defects in software products whose impact multiplies with their severity levels and which demand prompt rectification, hence the steady stream of research on the topic in the software engineering domain. In this paper, a systematic literature review (SLR) of machine learning-based software defect severity prediction over the last decade is presented. The SLR is aimed at detecting areas central to efficient predictive analytics that are seldom captured in existing software defect severity prediction reviews. These areas include the analysis of techniques or approaches that significantly influence the threats to the validity of proposed models, as well as bias-variance tradeoff considerations in data science-based approaches. A population, intervention, and outcome model was adopted to derive better search terms during the literature selection process, and subsequent quality assessment yielded fifty-two primary studies. A thorough systematic review of the final selection was then conducted to answer eleven research questions, uncovering approaches that speak to the aforementioned areas of interest. The results indicate that while machine learning is ubiquitous for predicting software defect severity, techniques central to better predictive analytics remain infrequent in the literature. The study concludes by summarizing prominent study trends in a mind map to stimulate future research in the software engineering industry.

1. Introduction

Software defect prediction (SDP) aims to determine faulty parts of a software system proactively. It is one of the most time-consuming and important stages of the software development life cycle [1, 2], with different testing strategies including cloud-based options [3, 4]. Early prediction and identification of defective parts are expected to warrant immediate debugging [5], which is a function of the severity level of the identified defect(s). Moreover, software requirement gathering is a critical introductory phase of the software development process [68]; hence, high priority is given to software testing during development [9] with the aim of ensuring compliance with the requirement specifications. A software defect has been broadly described as a fault, blunder, or bug in a software product, with a corresponding unexpected output and/or an unpremeditated behavioral outcome [10] contrary to the quality intentions of software engineers and the expectations of end-users. Whether software is outsourced or developed in-house, critical success factors are essential for its development [3, 11–13] in order to avoid costly defects. Advances in data science, automation, and information technology are rapidly improving assurance across the software development value chain [14], as such technological advances are tailor-made for mining massive data ahead of predictive analytics. Nonetheless, a review of such problem-solving approaches is needed to guide practitioners and the research community toward continuous improvement across the pipeline, besides the dire need for an up-to-date understanding of trends, research directions, lapses, and potentials in the deployment of data science for software defect severity tasks in particular.

While reiterating the applicability of data science in virtually all fields of human endeavor, the assertions of Olaleye et al. [15] underscore the global research community's interest in adopting predictive analytics for resolving industry problems, as is done for software defect predictive modeling [16, 17] and as is evident in the literature across the digital libraries consulted for this work. Data science therefore offers reliable functionality for scientific studies, which affords review studies a wealth of sources of investigation [18]. In existing systematic literature reviews on software defect severity prediction, the bias-variance tradeoff is seldom factored into research questions, despite its dire consequences for the eventual performance metrics of machine learning models; if unaddressed, it is a major threat to the internal validity of data science-based studies. Furthermore, only a few review works focus on text analytics techniques for software defect severity prediction. Text analytics falls within the natural language processing (NLP) use case of artificial intelligence, which comes with diverse functionalities that would make for a robust literature review with encompassing research questions. For instance, the basic but all-important preprocessing cycle of NLP tasks, including tokenization, stemming, and other feature engineering techniques such as feature selection, was not investigated in existing studies, nor were parameter optimization considerations. These further constitute internal threats to the validity of proposed frameworks. Recommendations on future work are also left out of most of the existing reviews. The need to examine the level of multicollinearity inherent in training datasets, which is a hallmark concern of data science-based predictive analytics [19], is likewise not investigated. In addition, the nature of the datasets employed is very important in mitigating threats to the external validity of proposed frameworks, a consideration hardly investigated by existing review studies.
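As a minimal sketch of the basic NLP preprocessing cycle referenced above (tokenization followed by stemming), and assuming NLTK is available and a bug-report summary arrives as a plain string, the pipeline could look like the following; this is an illustration, not a method taken from any primary study.

# Minimal sketch of the NLP preprocessing cycle (tokenization, stemming) on a hypothetical bug report.
# Assumption: NLTK is installed; nltk.download('punkt') may be required the first time.
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

report = "Application crashes when the save button is clicked twice"
tokens = word_tokenize(report.lower())      # tokenization
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]   # stemming
print(stems)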

This study is therefore motivated by the need to address the aforementioned gaps, which all constitute internal and external threats to the validity of machine learning-based studies.

The objective of this study is to address these lapses while carrying out a critical review of the identified primary studies, targeting techniques that speak to the internal and external validity of their approaches. These expositions are paramount for researchers intent on unraveling novel methodological concepts for future studies in the domain of software quality assurance using text analytics.

The main contribution of this study, therefore, is the analysis of studies with respect to techniques that have a direct bearing on threats to the internal and external validity of their work, as well as their bias-variance tradeoff considerations. This partly determined the research questions formulated for this study and unraveled the research agenda for future studies. As recommended by the authors of ref. [8], the guidelines for conducting a systematic literature review in software engineering in ref. [20] are employed in this study. In all, eleven research questions are formulated, inspired by the objectives of the study, and are presented in Table 1. The mind map in Figure 1 captures the focus of the paper and the internal validity considerations to be investigated in primary studies.

To fill the aforementioned gaps, among many others, an SLR on the deployment of machine learning techniques for software defect prediction is hereby presented based on articles published within the last decade (2011 to 2022). Other aims of the study include the following:(i)investigating limitations to inspire future research;(ii)unraveling success variables adopted in the preprocessing phase of text analytics (if any);(iii)contrasting current research trends with future works;(iv)examining the bias-variance tradeoff considerations proposed in methodologies. Besides a review of existing methodologies, this work equally aims to unravel and present the confines of current approaches, including data sampling considerations, training sets, training strategies, and compliance with recommendations in the literature.

1.1. Research Questions and Study Mind Map

Research questions and objectives for this SLR are listed in Table 1. The data sampling state, the most prevalent public datasets, and the most popular learner algorithms in the literature are covered by research questions R-Q1, R-Q2, and R-Q3. The learning approaches employed in the literature, including cross-validation or hold-out and the popularity of parameter tuning, are analyzed in R-Q4 to R-Q6, while R-Q7 addresses the most adopted modeling tools. The significant area of feature engineering is covered in R-Q8, while R-Q9 answers the all-important question of the nature of software metrics deployed. To inspire future studies, threats to validity and proposed future works are determined in R-Q10 and R-Q11. Figure 1 shows the simple mind map of the SLR used to identify the software defect severity prediction models, approaches, frameworks, datasets, and natural language processing approaches used in the literature. The mind map is eventually used to gauge observations from primary studies at the end of this review.

2. Literature Review

Systematic literature reviews (SLRs) on software defect prediction have majorly concentrated on the choice of algorithms, methodologies, performance analysis, and so on, without considering data-based threats to the internal and external validity of primary studies. Hence, the primary studies considered for mapping here are chosen carefully to ensure the research questions are efficiently addressed, which necessitated a clear-cut search criterion; a limited number of primary studies is therefore considered, since this study is only concerned with literature that offers adequate answers to the set-out research questions. Azeem et al. [5] focused on literature between 2000 and 2017 detecting code smells through machine learning techniques, targeting the setups of machine learning approaches, how evaluation strategies are conducted, and a meta-analysis of performance metrics from the models under study. Their analysis revealed challenges not yet addressed by the research community. The work of Son et al. [21] focuses on research studies from 1995 to 2018 with a systematic mapping of literature in the software quality domain, adopting a multistage process. It focused on models that could classify software metrics from different projects to help organizations with little data, just as ref. [10] aimed to identify research trends, methods, frameworks, and datasets in SDP deployed between 2000 and 2013.

While reviewing papers between 2004 and 2012, the authors in ref. [9] categorized primary studies into several groups: those executed within the frameworks of classification, clustering, and ensembles, with a profound investigation of the software metrics deployed for software defect prediction in the works, including ISO standards, CMM, software testing, and unique software metrics. Quality improvements were central to the work of ref. [22], which identifies contributory factors with consequent remedial courses to improve software productivity and quality. Models deployed in the literature were evaluated based on stated criteria, with personal observations related to the models discussed.

In ref. [23], the authors reviewed studies between 1991 and 2013, identified categories of machine learning models, analyzed performance accuracy, and reviewed statistical approaches while studying the strengths of machine learning models. Similarly, the authors of ref. [24] reviewed data mining techniques deployed for software defect prediction in the literature under review; they likewise thoroughly reviewed the datasets used, the tools deployed for predictive analytics, and the performance measures used in the literature.

While working on an empirical study of literature between 1995 and 2018 for systematic mapping, the authors in ref. [25] reviewed 98 primary papers out of an initial 156 accessed from reputable digital libraries to address nine research questions centered on various aspects of defect prediction through predictive analytics. Unlike other SLRs, the authors factored threats to validity into their study. In ref. [26], only three research questions were investigated across 208 studies published between 2000 and 2010, examining how the performance of models is affected by the context in which they are developed, the techniques upon which they stand, and the independent variables deployed for them.

A deliberate attempt to identify threats to the internal and external validity of existing studies is missing from the reviewed works. In particular, studies make no attempt to relate the aims of primary studies to their future recommendations, which would readily reveal lapses to address in future directions. Notwithstanding the performance metrics of primary studies, identifying threats to their internal and external validity would yield optimized conceptual frameworks in future studies. These lapses are investigated in this study for the benefit of future SDSP studies.

2.1. Contributions

The relevance of this SLR is stated as follows:(1)identification of fifty-two primary studies on software defect predictive analytics within the last decade;(2)analysis of the feature engineering techniques and the unsupervised, supervised, semisupervised, and clustering learning approaches of the last decade to capture the past while inspiring the future of the domain use case;(3)underscoring the direct correlation between threats to validity and future research recommendations as observed in the reviewed literature;(4)a review of the data preprocessing and other feature engineering techniques employed.

3. Research Methodology

The research methodology adopted in this work is a systematic literature review (SLR) within the scope of software defect prediction, aimed at addressing the research questions set out earlier in this work. As is common with other SLRs in the area of software engineering, the guidelines stipulated in ref. [27] are adopted in this work as best practices, while "snowballing" as defined by Wohlin [28] is likewise employed for the systematic inclusion of some references. Specifically, snowballing involves studying the reference list of a particular paper, or the papers that mention it, to detect extra sources. The process followed is described in the subsections below.

3.1. Search Strategy

The search process consists of activities including selecting the digital libraries, choosing preferred search strings, running the initial search, retrieving the initial list, and refining the search string to obtain an initial list of primary studies from digital libraries. The following digital libraries, reputable for software engineering resources [29] and recommended in several software engineering-based reviews, were consulted:(1)IEEE digital library (@ieeexplore.ieee.org)(2)ACM digital library (@dl.acm.org)(3)SpringerLink library (@link.springer.com)(4)Google Scholar (@scholar.google.com)(5)Elsevier (@elsevier.com)

To identify search items relevant to the course of action:(1)Search terms were influenced by the research questions by identifying population, intervention, and outcome(2)Synonyms were identified for major search terms for an inclusive result(3)Keywords were verified and ascertained in all the listed literature(4)Boolean operators, where supported by the database, were used, including OR to concatenate alternative spellings and synonyms and AND to combine major keywords(5)The search string was at times summarized for a more compact and specific search outcome

For (1), population, intervention, and outcome were adopted so that better search terms could be obtained.(i)Population: prediction of software defect severity(ii)Intervention: machine learning techniques(iii)Outcomes: severity level prediction

A sample research question with the detail mentioned above is given below:

R-Q5: does the choice of learner algorithm/ensemble (INTERVENTION) impact performance (OUTCOME) of defect severity prediction? (POPULATION)

For (2) and (5), alternative synonyms and spellings, as the case may be, included:(i)Software defect: (“software bug” OR “software smell” OR “software fault”)(ii)Machine learning: (“machine learning” OR “predictive analytics” OR “text analytics” OR “supervised learning” OR “unsupervised learning” OR “classification” OR “clustering”)(iii)Prediction: (“prediction” OR “detection” OR “identification”)

For (3), keywords in the literature were conventional; hence, no suitable alternative was found.

For (4), Boolean operators were deployed, where acceptable, to formulate the search queries.
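To illustrate how the synonym groups and Boolean operators described above combine into a single search string, a small sketch is given below; the grouping of terms is an assumption for illustration and is not the exact query submitted to any particular library.

# Illustrative assembly of a Boolean search string from the synonym groups listed above.
defect = ["software defect", "software bug", "software smell", "software fault"]
ml = ["machine learning", "predictive analytics", "text analytics",
      "supervised learning", "unsupervised learning", "classification", "clustering"]
task = ["prediction", "detection", "identification"]

def or_group(terms):
    # join alternative spellings/synonyms with OR, as in criterion (4)
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

query = " AND ".join(or_group(g) for g in (defect, ml, task))
print(query)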

3.2. Process of Selecting Primary Study

The selection process overview in Figure 2 shows the paths followed for capturing research articles in line with the study's defined objectives. The emerging search results are shown in the second column of Table 2 with their corresponding libraries, with an initial 2641 papers returned by the query. Table 3 indicates the inclusion and exclusion criteria set out for this study. The titles, abstracts, and keywords of the 2641 retrieved articles were screened; articles that did not meet the criteria were removed, while the remainder were taken to the next screening stage together with new additions from snowballing (the process of identifying relevant literature from the reference list of a paper under study), as depicted in Figure 2. Hence, 65 papers, 2.46% of the initial set, made the selection.

Specifically, the title is considered first: if it is out of scope (i.e., not about the predictive analytics of the use case), the article is skipped; otherwise, if considered possibly useful, the abstract, introduction, methodology, and conclusion suffice to further ascertain appropriateness for the study. Thus, the eventual sixty-five papers passed all these steps. Once the papers to be considered for the SLR were established, the quality assessment followed, primarily to confirm that all final papers had the requisite information to answer the research questions (see Table 1), culminating in the final selection analyzed for this study. The quality assessment turned out positive for each of the fifty-two selected studies; thus, the SLR is based on fifty-two papers. The information extraction process and form are presented later in this section.

3.3. Inclusion and Exclusion Principles for Study Selection

The inclusion and exclusion criteria stipulated for primary study selection or rejection are shown in Table 3. Table 4 shows the data extraction form designed to acquire valuable insights from the primary studies, which speak to the internal validity observations inherent in their methodologies. The eventual list of primary studies, after the inclusion and exclusion criteria have been applied, is presented in Table 5.

3.4. Quality Valuation of Papers

The selected primary studies were subjected to quality assurance after the final selection process. The following checklist was adopted for credibility checks on the selected publications:(i)Q1: are the learning approach, cross-validation, and class labeling presented?(ii)Q2: is the choice of learner algorithms and ensembles stipulated?(iii)Q3: is the choice (or otherwise) of (i) feature engineering and (ii) parameter optimization stated with reasons?(iv)Q4: are the choice of public data, sampling or resampling attempts, and metrics characterization stipulated?(v)Q5: does the study show its choice of software metrics domain between cross-project and within-project?(vi)Q6: are there clearly stipulated aims, threats to validity, and clearly defined future work suggestions?

Each question above attracts a “Yes,” “Partly,” or “No,” and a study is considered partial where the questions not adequately answered are not among Q3–Q5. These answers are graded as 1, 0.5, and 0 for “Yes,” “Partly,” and “No,” respectively. For each study, the quality grade was computed as the sum of the graded answers over the six questions. The quality level was regarded as good (grade ≥4), average (3 ≤ grade <4), or low (grade <3). Fifty-two studies fell in the good and average sets, which form the final selection.
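For clarity, the grading scheme just described can be expressed as a short computation; the answer list below is a made-up example, not taken from any assessed study.

# Minimal sketch of the quality grading: Yes = 1, Partly = 0.5, No = 0;
# good >= 4, average in [3, 4), low < 3.
SCORES = {"Yes": 1.0, "Partly": 0.5, "No": 0.0}

def quality_level(answers):  # answers: six strings, one per Q1-Q6
    grade = sum(SCORES[a] for a in answers)
    if grade >= 4:
        return grade, "good"
    if grade >= 3:
        return grade, "average"
    return grade, "low"

print(quality_level(["Yes", "Yes", "Partly", "Yes", "No", "Partly"]))  # (4.0, 'good')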

As observed from Table 5, the studies that made the final list with average and good quality comprise papers presented at conferences and published in international journals. The aims of the papers are clearly defined in the second column of the table, and the years of publication span the period from 2011 to 2022 in the software defect severity prediction research domain.

3.5. Data Extraction

Data extraction commenced, answering the stipulated research questions, once the final papers selected for the review were ascertained. Precisely, the data extraction form presented earlier in Table 4 is used for grading.

4. Result Analysis

In this section, the results of the review are discussed with a view to addressing the research questions.

4.1. Study Demographics

Table 5 lists the 52 primary studies reviewed in this SLR with their corresponding years of publication and the type of publication (journal or conference proceedings). The works were executed between 2011 and 2022, spanning over a decade of software defect severity level prediction tasks using machine learning. As may be observed in Figure 3, 64% of the studies were published in the last five years, indicating a rising profile of interest in software defect severity prediction among researchers in the engineering genre, while over 58% of the publications are journal articles.

4.1.1. R-Q1: Data Sampling Approach in Literature

The outcome of this SLR concerning R-Q1 is in no way different from trends observed in previous works: the majority of the datasets deployed for training learner algorithms have a grossly imbalanced class distribution, which is characteristic of public software defect data [67] and is hence highly biased. This is attributed to the fact that software metrics are more often defect-free than defect-prone; defect instances are seldom as numerous as instances without defects [48], which is why the dependent variables are heavily skewed towards non-defect instances, especially for binary classification (unsupervised learning being devoid of labeling). The result presented in Table 6 speaks volumes about the data sampling and resampling pattern in primary studies, as only a few altered the training dataset distribution by resampling to address the bias, which is essential for better predictive accuracy [31]. Among the literature with resampling efforts, most of those adopting minority oversampling deployed the synthetic minority oversampling technique (SMOTE) for class distribution resampling amidst several other options. Ensemble approaches were likewise implemented in other studies, especially to limit the adverse effects of class imbalance. A few studies adopted the strategy of multivariate class labeling, further decomposing class labels into various severity levels to reduce bias.
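As an illustration of the minority oversampling that some primary studies report, the following is a minimal sketch using the imbalanced-learn implementation of SMOTE on a synthetic, hypothetical defect dataset; it is not the pipeline of any particular study.

# Minimal SMOTE sketch on a hypothetical imbalanced defect dataset.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))                       # heavily skewed toward non-defect class
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))                    # classes balanced by synthetic minority samples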

(1) Summary for R-Q1. 72.5% of primary papers deployed an imbalanced training set for SDP, while 17.6% adopted oversampling and 1.96% adopted undersampling in their attempts to resample. Hybrid approaches and other resampling methods constitute the remaining 3.9%.

4.1.2. R-Q2: Which Public Data Are Often Deployed?

Public data are often used for predictive analytics in the SDP literature [10]. As observed from the primary studies of this SLR, both within-project and cross-project studies adopt the same approach, with varying choices across publicly available sets. Researchers' targets in the choice of dataset likewise vary, but of highest consideration is the nature of the metrics serving as independent variables, as noticed from this study. In the graph plot presented in Figure 4, twenty-two different datasets are under consideration, with each study drawing on one, two, or more of the presented datasets.

(1) Summary for R-Q2. The NASA dataset is the most adopted of all 22 datasets used in the literature, with a 49% deployment rate, while Promise follows with 29.41%. Eclipse, Mozilla, Apache, and Azeem likewise enjoyed considerable adoption in the literature.

4.1.3. R-Q3: Most Adopted Machine Learning Variant

The machine learning variants adopted in the literature vary, and the choice of variant rests with the researchers conducting an SDP study. As observed in the sunburst plotted in Figure 5, the SDP research community overwhelmingly deploys supervised learning, with unsupervised learning coming second. [L09] and [L43] deployed a semisupervised approach in their studies, while [L34] adopted a dictionary learning approach. The adoption of both unsupervised and supervised learning variants is encapsulated in the work of [L36], where training data are clustered for subsequent per-category supervised training, while the work of [L44] compares the predictive accuracy of supervised and unsupervised variants of machine learning.

(1) Summary for R-Q3. Supervised learning is widely adopted despite reports of better performance by unsupervised learning. It is recommended that the unsupervised learning approach be given more attention in SDP to further improve the predictive accuracies of software defects.

4.1.4. R-Q4: Preferred Choice of Learner Algorithm and/or Ensemble

With several machine learning algorithms at the disposal of the software engineering research community, Figure 6 captures those most deployed in the primary studies across the various proposed models. While Naïve Bayes enjoyed the biggest patronage among base learners in the supervised learning subcategory, K-means was the most adopted in the unsupervised learning category for clustering. Bagging was the most deployed across the literature in the ensemble set, as evident in Table 7, followed closely by stacking and boosting with five (5) representations each.
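To make the two most adopted choices concrete, the sketch below pairs a Naïve Bayes base learner with a bagging ensemble built on top of it using scikit-learn; the data are synthetic and the setup is a hypothetical illustration rather than the configuration of any primary study.

# Naive Bayes base learner vs. a bagged Naive Bayes ensemble (hypothetical data).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_informative=8, random_state=0)
nb = GaussianNB()
bagged_nb = BaggingClassifier(GaussianNB(), n_estimators=10, random_state=0)
print("NB        :", cross_val_score(nb, X, y, cv=5).mean())
print("Bagged NB :", cross_val_score(bagged_nb, X, y, cv=5).mean())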

(1) Summary for R-Q4. Naïve Bayes is widely adopted in the literature, accounting for over 57% of reported use, and it is the algorithm most recommended in the conclusions of the literature. Bagging tops the ensemble category.

4.1.5. R-Q5: What Training Strategy is Mostly Deployed?

Studies mostly employed 10-fold cross-validation, with the 5-fold cross-validation methodology coming next, in various attempts to control the variance in the estimates of predictive models. The stratified approach employed by some studies guarantees that each fold samples the minority and majority classes in the same proportion as the original data, avoiding folds drawn only from the popular class. Table 8 shows the close spread of the training strategies adopted in the literature.
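The stratified 10-fold strategy described above can be sketched as follows on hypothetical imbalanced data; the learner and scoring metric are arbitrary choices for illustration.

# Stratified 10-fold cross-validation that preserves the class ratio in every fold.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=600, weights=[0.85, 0.15], random_state=1)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="f1")
print(scores.mean(), scores.std())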

(1) Summary for R-Q5. About 47% of primary studies deployed n-fold cross-validation while 52.9% employed the hold-out strategy, showing an almost balanced representation in the population.

4.1.6. R-Q6: Is Parameter Optimization Popularly Factored into SD Predictive Analytics?

Parameter tuning has been highly recommended in the literature for any predictive analytics study [10] to improve accuracy, in an attempt to optimize performance by changing parameter settings. [L22] created two versions of the K-nearest neighbor classifier, 2NN and 5NN, by tuning the value of k (2, 5), and [L04] likewise tuned two parameters of the K-NN learner, including the k-value (denoting how many neighbors are taken into consideration) and the nearest-neighbor searching algorithm. As evident in Table 9, few studies reported parameter optimization measures in their research.
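A tuning setup of the kind just described (varying k and the neighbor-search algorithm) can be sketched with a grid search as below; the data and the exact parameter grid are assumptions for illustration, not the settings of [L22] or [L04].

# Grid search over k and the nearest-neighbour search algorithm for K-NN (hypothetical data).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=7)
param_grid = {"n_neighbors": [2, 5], "algorithm": ["ball_tree", "kd_tree", "brute"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)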

(1) Summary for R-Q6. To the best of our knowledge, only 11% of the study population (6 papers) reported cases of parameter optimization to enhance prediction performance.

4.1.7. R-Q7: Which Is the Most Adopted ML Tool?

Some of the models presented in the literature are simulations of the SDP analytics, while other studies developed software code to implement the proposed models. To fully grasp the different approaches in primary studies, the inclusion of this research question becomes imperative. As observed from Figure 7, few studies explicitly mentioned the specific tools employed to model their proposed architectures. However, the Waikato Environment for Knowledge Analysis (WEKA) tool was the most deployed in the reported cases, and Python and RapidMiner were also used for modeling.

(1) Summary for R-Q7. Integrated development environments are mostly preferred, especially for simulation purposes in analyzed primary studies.

4.1.8. R-Q8: Which Feature Selection Option Is Widely Adopted for Dimensionality Reduction?

Figure 8 shows the adoption rate and choice of feature engineering algorithms in primary studies, with their respective rankers, as a way of reducing dimensionality in training sets. While machine learning thrives on huge datasets, feature engineering in predictive analytics optimizes performance [10] by reducing redundancy, just as multicollinearity is topical for better performance in software defect predictive analytics. [L01] deployed the chi-square and information gain algorithms for removing irrelevant features by measuring the dependence between variables, ranking the features, and retaining the top-ranking ones for SDP modeling.
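The ranking-based selection just described can be sketched as follows, with chi-square and information gain (mutual information) scores used to retain the top-ranked features; the dataset and the choice of keeping ten features are illustrative assumptions.

# Chi-square and information-gain feature selection on hypothetical metric data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=3)
X_pos = MinMaxScaler().fit_transform(X)             # chi2 requires non-negative features
X_chi = SelectKBest(chi2, k=10).fit_transform(X_pos, y)
X_ig = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)
print(X_chi.shape, X_ig.shape)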

(1) Summary for R-Q8. Feature engineering does not feature regularly in primary studies, as only a few works specifically aimed at deploying feature selection.

4.1.9. R-Q9: What Is the Course of Action between Inter- and Intraproject Research Direction?

Figure 9 shows that within-project metrics are mostly deployed in primary studies with respect to the choice of data, even as a few studies reported inadequate defect metrics in the within-project data used for both the training and testing phases of their work. [L37] and a few others employed a cross-project strategy between metrics from Eclipse and Mozilla, with either deployed for training and the other for testing, for a better perspective on the model's predictive accuracy. In contrast, [L38] adopted a cross-project approach to address data imbalance, especially for assessing the efficiency of ensemble learning under class disparity.

(1) Summary for R-Q9. Cross-project defect prediction needs further exploration in software defect severity prediction studies, as few studies adopted the approach in their works.

4.1.10. R-Q10: Prominent Threats to Models’ Validity Across Primary Study

Threats to validity (TTV) are germane in the field of software engineering for ascertaining the levels of threats to the internal and external validity of software defect severity prediction studies. As observed from Table 10, a sizeable number of studies reported cases of TTV in their proposed models, uncovering grey areas that may have impeded better performance of the models. Threats to external validity in an experimental software engineering study are circumstances that limit the generality of case study outcomes [71], while threats to internal validity are regarded as errors in empirical metrics and tool adoption [50]. The level of compliance with reporting the TTV inherent in studies is encouraging; however, studies like ref. [47] claim compliance with best practices while asserting the immunity of their study to threats. Table 10 clearly shows the prominent areas of research studies where the different categories of threats are reported.

4.1.11. R-Q11: Does TTV Inspire the Direction of Future Works in SDP?

Mapping TTV to future work is an attempt to gauge the consistency of current studies with proposed future studies, as a continuous evaluation and improvement mechanism towards ensuring informed decisions; future study directions in SDP can then remain consistent with current realities. Table 11 shows the link between the threats identified in primary studies and their respective future directions, as observed from the concluding section of each study. All twenty-two studies that reported threats to validity indeed expressed intent to address, in their future studies, the grey areas enumerated in their TTVs.

(1) Summary for R-Q11. It is observed from the study that TTV indeed inspires future work direction in the SDP industry.

5. Discussion and Implications

Further discussion of the main findings, covering the contribution to knowledge, is presented in this section with respect to the research questions set out for this study.

5.1. R-Q1: Limited Awareness of the Various Implications of an Imbalanced Training Set

Following the foregoing evaluation of the class sampling status of the datasets deployed in the literature, it is evident that studies are not yet fully abreast of the varying implications of a biased training set for SDP studies. Whereas some studies noted the implications and thereby deployed the ensemble learner approach or other techniques, and others employed resampling techniques as part of their methodologies, the remaining studies failed to acknowledge the implications at all. The first recommendation is for future studies to proactively incorporate a resampling component in the data preprocessing phase of software defect predictive analytics. To ease the choice of resampling approach, the analysis of the related research question shows that oversampling is the most adopted in the literature. Furthermore, an exploratory data analysis is highly recommended to reveal actionable insights that could inspire future conceptual frameworks.

5.2. R-Q2: Need for Inclusiveness in Software Metrics Repository Choice

The majority of studies have concentrated on either the NASA or Promise repositories, with others trailing far behind in terms of application and analysis, although some studies trained on the two most prominent repositories alongside other less popular ones. There may be a need to spread studies across repositories for an inclusive and robust generalization of results, which will situate conclusions reliably and also eliminate a possible threat to the external validity of experimental results.

5.3. R-Q3: Supervision of the Learner Algorithm Takes Center Stage in SDP Studies

The review in this work shows the towering adoption of the supervised learning approach in the SDP literature and the subtle adoption of semisupervised alongside unsupervised learning. The combination of unsupervised and supervised approaches, for training set clustering and classification respectively, is likewise gaining interest in the software engineering genre. However, it is pertinent to fully consider unsupervised learning as a way of clustering software metrics as a precursor to severity level prediction.
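One simple way to combine the two paradigms, sketched below on hypothetical data, is to cluster the software metrics first and then let a supervised classifier see the cluster assignment as an extra feature; this is a simplified illustrative variant, not the specific combination used in any primary study.

# Cluster the metrics with K-means, then classify with the cluster label appended as a feature.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_informative=6, random_state=5)
clusters = KMeans(n_clusters=3, n_init=10, random_state=5).fit_predict(X)
X_aug = np.column_stack([X, clusters])        # append cluster label as an additional feature
print(cross_val_score(GaussianNB(), X_aug, y, cv=5).mean())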

5.4. R-Q4: Towering Posture of Naïve Bayes

As witnessed in the study, the adoption of Naïve Bayes spans the eleven years under study, despite the varying choices of software metrics in the literature. It is noteworthy that studies that experimented with more than two base learners likewise recommended the algorithm for its performance metrics, hence the need to factor its relevance into subsequent studies while studying ways of optimizing its parameters for better performance.

5.5. R-Q5: Widespread Adoption of n-Fold Cross-Validation

This study shows the widespread adoption of cross-validation in SDP predictive analytics alongside the traditional hold-out approach. The reduction of overfitting has been attributed to the trend, but either way, both are considered efficient for SDP in primary studies.

5.6. R-Q6: Considerable Low Deployment of Parameter Optimization

The findings show a low rate of parameter optimization in primary studies despite its efficiency in natural language processing with respect to the data preprocessing phase of software defect severity prediction. There is a dire need for its deployment for better performance metrics in subsequent studies, as some prominent learner algorithms will be more efficient with parameter tuning.

5.7. R-Q7: Widespread Adoption of Simulation in SDP Modeling

As noted, integrated development environments for simulation are widely deployed in the literature, with WEKA leading the pack as the most widely used environment for SDP predictive analytics.

5.8. R-Q8: Information Gain and Correlation Coefficient Algorithms for Feature Selection

An appreciable number of primary studies adopted feature engineering as part of their preprocessing techniques to reduce dimensionality, with information gain and the correlation coefficient enjoying the most widespread application. As observed across studies, feature selection is recommended as an integral part of the preprocessing phase of software defect severity level prediction for enhanced performance metrics.

5.9. R-Q9: Need for a Cross-Project Approach in Future Studies

Few studies gave consideration to a cross-project approach, which is highly recommended to avail models of the opportunity of cross-fertilization of software metrics, that is, testing the efficiency of training by deploying an industry-based test set for prediction. Restricting data metrics to within-project or closed-project metrics may constitute a threat to the external validity of proposed models, especially at test time with a cross-project test set.
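The cross-project setting recommended here can be sketched as follows: train on the metrics of one project and evaluate on the metrics of another, instead of splitting a single project. The two "projects" below are synthetic placeholders named after Eclipse and Mozilla purely for illustration.

# Cross-project evaluation sketch: train on one project's metrics, test on another's.
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.naive_bayes import GaussianNB

X_eclipse, y_eclipse = make_classification(n_samples=500, random_state=11)  # "training project"
X_mozilla, y_mozilla = make_classification(n_samples=300, random_state=12)  # "testing project"

model = GaussianNB().fit(X_eclipse, y_eclipse)
print("cross-project F1:", f1_score(y_mozilla, model.predict(X_mozilla)))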

5.10. R-Q10: Software Metrics Posing Major Threats to SDP Models’ Validity

As noted in the studies that included TTV in their structure, dataset-related threats pose a major challenge to the research works under reference, hence the need for future studies to undertake an in-depth analysis of the software metrics available in repositories while ensuring an inclusive, wider scope of databases for the modeling of SDP studies.

5.11. R-Q11: Mapping of TTV with Future Works in Literature

The future direction of SDP studies is clear-cut for researchers in the industry, as the consensus appears to be in the direction of adopting appropriate feature selection techniques, ensuring data resampling, deploying parameter optimization, and concatenating software metrics from a wide range of repositories to serve as training sets for building SDP models.

Figure 10 shows a graphical representation of the final mind map of contributions in this study. Read against the initial mind map presented in Figure 1, it clarifies the contributions of the primary studies with respect to the research questions presented in Table 1. With this, future studies on SDP can be guided by the various observations from this study.

5.12. Future Agenda

Consequent upon the foregoing, a descriptive statistical technique of exploratory data analysis (EDA) is highly recommended in future conceptual frameworks for data science-based software defect severity prediction. This will yield actionable insights from the dataset that should inspire conceptual frameworks and will address the various gaps identified from research questions R-Q10, R-Q9, R-Q8, R-Q6, R-Q5, R-Q2, and R-Q1, which all constitute threats to the internal and external validity of primary studies. As observed from the final mind map of Figure 10, an EDA prior to predictive analytics would uncover the limitations of predictor attributes in an adopted dataset (R-Q10), which are caused by the prevalence of within-project choices (R-Q9). The threat to the external validity of studies is observed to be majorly caused by the adoption of within-project datasets, whose results cannot be generalized when tested on a cross-project set. If multicollinearity is discovered through the correlation test of an EDA, this could influence the choice of feature selection technique (R-Q8), depending on the degree of the identified correlation. The usual loss function problem arising without parameter tuning (R-Q6) will likewise be averted with an in-depth understanding of the training set at hand. While cross-fold validation (R-Q5) helps reduce variance, an EDA would further clarify the tradeoff that must be struck between bias and variance and would clearly influence the choice of tradeoff to be made. Furthermore, the within- or closed-project choices in the literature indeed inspire the choice of public repository (R-Q2) to adopt for historical data. The interquartile range (IQR) (through a box plot), the correlation coefficient (through a heat map), and other multivariate plots of an EDA would be needed to inspire the required mitigating techniques to adopt, especially with respect to R-Q1, which reveals the adoption of highly imbalanced training sets across studies.
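The EDA steps recommended above (correlation heat map for multicollinearity, box plot/IQR for outliers, and a class-balance check) can be sketched as follows on a hypothetical metric table; column names and data are assumptions for illustration only.

# EDA sketch: correlation heat map, IQR/box plot, and class-balance check on a hypothetical metric table.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=8, random_state=21)
df = pd.DataFrame(X, columns=[f"metric_{i}" for i in range(8)])
df["defective"] = y

sns.heatmap(df.drop(columns="defective").corr(), cmap="coolwarm")   # multicollinearity check
plt.figure()
df.drop(columns="defective").boxplot()                              # IQR / outlier check
print(df["defective"].value_counts(normalize=True))                 # class-balance check
plt.show()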

5.13. Threats to Validity

A prominent threat to the validity of this SLR lies in the selection of primary studies, especially the quest to efficiently rate previous studies to arrive at the final fifty-two that constitute the primary studies. To mitigate the threat, the painstaking study of each of the 52 works, though time-consuming, gave a robust and in-depth awareness of the subject of discussion and of the elements that make up each methodology. No step was spared in the process, and the exercise was repeated three times consecutively to verify the noted claims. In addition to the automated search of libraries, snowballing was likewise adopted as an offshoot of the thorough study of each paper to find other relevant papers. The exclusion and inclusion criteria adopted earlier for this SLR shaped the outcome in commendable ways, as most papers adequately answered the research questions, just as the data extraction form played a significant role in the whole quality assessment process.

Another threat is the reporting style of authors, which often keeps vital elements of their research tucked away from their model designs. It takes repeated readings to identify some germane elements central to the quality assessment of their studies: some of these elements are not included in the abstracts and keywords, and glancing through the introduction and conclusion will still not reveal them until the entire paper has been studied in full.

6. Conclusion

This study described an SLR on the deployment of machine learning techniques for software defect severity prediction in the software engineering genre. Eleven specific areas were targeted to give an overview of how research conducted in the eleven years up to 2022 fared with respect to (i) data sampling approach, (ii) choice of software repositories, (iii) most preferred machine learning variant, (iv) prominent learner algorithm and/or ensemble, (v) most deployed training strategy, (vi) variant of parameter optimization adopted, (vii) most popular machine learning tool, (viii) prominent feature selection algorithm, (ix) choice between within- and cross-project, (x) prominent threats to validity, and (xi) future direction of SDP with respect to the aforementioned threats, all of which speak to the bias-variance tradeoff considerations and likely threats to the validity of proposed methodologies. The study was conducted on papers from 2011 to 2022, comprising an initial study population of 2653 that eventually resulted in a subset of 52 primary studies after thorough analysis. The analysis highlighted a handful of observations and limitations in primary studies that are essential to shaping future studies in the SDP industry. The following are observed from the primary studies: (i) limited deployment of unsupervised learning for SDP studies, (ii) the use of imbalanced training data without resampling attempts by the majority of studies, (iii) limited reportage of the natural language processing techniques applied to software defect reports before classification, (iv) non-consideration of multicollinearity problems, and (v) little consideration for the cross-project approach, all of which are essential for better software defect severity predictive analytics. It is therefore believed that this study will be a reference for future works and that the research community will find it a signpost for better research quality in the software defect severity prediction case.

Data Availability

No data were used to support the study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

APC/Open Access funding was provided by Østfold University College, Halden, Norway.