A Tool-Based Perspective on Software Code Maintainability Metrics: A Systematic Literature Review
Software maintainability is a crucial property of software projects. It can be defined as the ease with which a software system or component can be modified to be corrected, improved, or adapted to its environment. The software engineering literature proposes many models and metrics to predict the maintainability of a software project statically. However, there is no common agreement on the most dependable metrics or metric suites to evaluate this nonfunctional property. The goals of the present manuscript are as follows: (i) providing an overview of the most popular maintainability metrics according to the related literature; (ii) finding what tools are available to evaluate software maintainability; and (iii) linking the most popular metrics with the available tools and the most common programming languages. To this end, we performed a systematic literature review, following Kitchenham’s SLR guidelines, on the most relevant scientific digital libraries. The SLR outcome provided us with 174 software metrics, among which we identified a set of the 15 most commonly mentioned ones, and 19 metric computation tools available to practitioners. We found optimal sets of at most five tools to cover all the most commonly mentioned metrics. The results also highlight missing tool coverage for some metrics on commonly used programming languages and minimal coverage of metrics for newer or less popular programming languages. We consider these results valuable for researchers and practitioners who want to find the best selection of tools to evaluate the maintainability of their projects or to bridge the discussed coverage gaps for newer programming languages.
Nowadays, software security and resilience have become increasingly important, given how pervasive software is. Effective tools and programming languages can (i) discover mistakes earlier, (ii) reduce the odds of their occurrence, and (iii) make a large class of common errors impossible by restricting at compile time what the programmer can do.
Several best practices are consolidated in software engineering, e.g., continuous integration, testing with code coverage measurement, and language sanitization. All these techniques enable the automatic application of code analysis tools, which can significantly enhance source code quality and allow software developers to efficiently detect vulnerabilities and faults. However, the lack of comprehensive tooling may make it challenging to apply the same code analysis strategies to software projects developed in different languages or for different domains.
The literature defines software maintainability as the ease with which a software system or component can be modified to correct faults, improve performance or other attributes, or adapt to a changing environment. Thus, maintainability is a highly significant factor in the economic success of software products. Several studies have described models and frameworks, based on software metrics, to predict or infer the maintainability of a software project [3–5]. However, although many different metrics have been proposed in the scientific literature over the course of the last 40 years, the available models are very language- and domain-specific, and there is still no agreement in industry and academia on a universal set of metrics to adopt for evaluating software maintainability.
This work aims at answering the primary need of identifying evaluation frameworks for different programming languages, either established or newly emerged, e.g., the Rust programming language, developed by Mozilla Research as a language similar in characteristics to C++, but with better code maintainability, memory safety, and performance [7, 8].
Thus, the first goal of this paper is to identify the most commonly mentioned metrics in the state-of-the-art literature. We focused on static metrics, since the analysis of dynamic metrics (i.e., metrics collected during the execution of adequately instrumented software) was out of the scope of this work.
The second goal of the paper is to determine which tools are most commonly used in the literature to calculate source code metrics. Based on the most widely used tools, we then define optimal selections of tools able to compute the most popular metrics for a set of programming languages.
To pursue both goals, we (i) applied the systematic literature review (SLR) methodology on a set of scientific libraries and (ii) performed a thorough analysis of all the primary studies available in the literature about the topic of software metrics for maintainability.
Hence, this manuscript provides the following contributions to researchers and practitioners: (i) the identification of the most mentioned metrics that can be used to measure the maintainability of software projects; (ii) details about closed-source and open-source tools that practitioners can leverage to evaluate the quality of their software projects; (iii) optimal sets of open-source tools that can be leveraged to investigate the computation of software metrics for maintainability, adopt them in evaluation frameworks, and adapt them to other programming languages that are currently not supported.
The remainder of the manuscript is structured as follows: (i) Section 2 describes the approach we adopted to conduct our SLR; (ii) Section 3 presents a discussion of the results obtained by applying such an approach; (iii) Section 4 discusses the threats to the validity of the present study; (iv) Section 5 provides a comparison of this study with existing related work in the literature; (v) Section 6 concludes the paper and provides directions for future research.
2. Research Method
In this section, we outline the method that we used to conduct this study. We performed a systematic literature review (from now on, SLR), following the guidelines provided by Kitchenham and Charters to structure the work and report it in an organized and replicable manner.
An SLR is considered one of the key research methodologies of evidence-based software engineering (EBSE). The methodology has gained significant attention from software engineering researchers in recent years. All SLRs include three fundamental phases: (i) planning the review (which includes specifying its goals and research questions); (ii) conducting the review (which includes querying article repositories, selecting the studies, and performing data extraction); and (iii) reporting the review.
All those steps have been undertaken during this research and are detailed in the following sections of this paper.
According to Kitchenham and Charters' guidelines, the planning phase of an SLR involves the identification of the need for the review (hence the definition of its goals), the definition of the research questions that will guide the review, and the development of the review protocol.
As stated in the introduction, the review was motivated by the need to improve software maintainability, in terms of the clarity of source code, while implementing complex algorithms. Our primary objective was to identify a dependable set of metrics that are widely used in the literature and computable with available tools.
The objectives of our research are defined by using the Goal-Question-Metric paradigm by van Solingen et al. Specifically, we based our research on the following goals: (i) Goal 1: obtain an overview of the most used metrics in the literature in the last few years; (ii) Goal 2: find what tools have been used in (or described by) the literature about maintainability metrics; (iii) Goal 3: find a mapping between the most common metrics and the tools able to compute them.
2.1.2. Research Questions
Based on the goals defined above, our study entailed answering the research questions defined in the following: (i) RQ1.1: what are the metrics used to evaluate code maintainability available in the literature? Our aim for this research question is to determine what metrics are present in the literature and how popular they are in manuscripts about code maintainability. (ii) RQ1.2: which of the metrics we found are the most popular in the literature? This research question aims at characterizing the different metrics obtained from answering RQ1.1 based on their popularity and adoption. (iii) RQ2.1: what tools are available to perform code evaluation? The expected result of this research question is a list of tools, both closed source and open source, along with the metrics they can calculate. (iv) RQ2.2: what is the ideal selection of tools able to apply the most popular metrics for the most supported programming languages? This research question entails measuring the coverage provided by the set of the most popular metrics for each language and providing the optimal set of tools that can compute those metrics.
2.1.3. Selected Digital Libraries
The search strategy involves the selection of the search resources and the identification of the search terms. For this SLR, we used the following digital libraries: (i) ACM Digital Library; (ii) IEEE Xplore; (iii) Scopus; (iv) Web of Science.
2.1.4. Search Strings
The formulation of the search strings is crucial for the definition of the search strategy of the SLR. According to the guidelines defined by Kitchenham et al., the first operation in defining the search string involved an analysis of the main keywords used in the RQs, their synonyms, and other possible spellings of such words.
In this phase, all the researchers collaboratively selected several pilot studies. The selected pilot studies are presented in Table 1 and are related to the target research domain.
These studies were selected to verify the adequacy of the search queries: if the pilot studies are not present in the results after the refinement phase, the researchers should revise the queries.
The starting keywords identified were software, maintainability, and metrics. The search string “software maintainability metric” was hence used to perform the first search on the selected digital libraries. Our results include articles published between 2000 and 2019.
This first search pointed out that adding code as a synonym of the keyword software added a large number of papers to the results.
Also, the following keywords were excluded from the search to reduce the number of unfitting papers in the results: (i) defect and fault, to avoid considering manuscripts more related to the topics of verification and validation, error-proneness, and software reliability prediction than to code maintainability; (ii) co-change, to avoid considering manuscripts more related to the topic of code evolution; (iii) policy-driven and design, to avoid considering manuscripts more related to the definition and usage of metrics used to design software, instead of evaluating existing code.
Table 2 reports the search queries before and after excluding the keywords listed above, for each of the chosen digital libraries.
2.1.5. Inclusion and Exclusion Criteria
The final phase of the study selection uses the studies obtained by applying the final search queries reported in Table 2.
The following inclusion criteria were used for the study selection: IC1: studies written in a language comprehensible by the authors; IC2: studies that accurately present a new metric; IC3: studies that present, analyze, or compare known metrics or tools; IC4: detailed primary studies.
On the other hand, the following exclusion criteria were defined: EC1: studies written in a language not directly comprehensible by the authors, i.e., not written in English, Italian, Spanish, or Portuguese; EC2: studies that present a novel metric but do not describe it accurately; EC3: studies that do not describe or use metrics or tools; EC4: secondary studies (e.g., systematic literature reviews, surveys, and mappings).
After defining the review protocol in the planning phase, the conducting phase involves its actual application, the selection of papers by application of the search strategy, and the extraction of relevant data from the selected primary studies.
2.2.1. Study Search
This phase consisted of gathering all the studies by applying the search strings formulated and discussed in Section 2.1.4 to the selected digital libraries. To this end, we leveraged the Publish or Perish (PoP) tool. To aid the replicability of the study, we report that we performed the last search iterations at the end of October 2019. After applying the queries and removing duplicate papers across the four considered digital libraries, 801 unique papers were gathered (see Table 3). The result of this phase is a list of candidate papers that must be subjected to the application of the inclusion and exclusion criteria. This step yields a final verdict on their selection as primary studies for our SLR. We exported the mined papers in a CSV file with basic information about each extracted manuscript.
2.2.2. Study Selection
The authors of this SLR carried out the paper selection process independently. To analyze the papers, we used a 5-point Likert scale instead of simply dividing them into fitting and unfitting. We assigned points as follows: (i) one point to papers that matched exclusion criteria and did not match any inclusion criteria; (ii) two points to papers that matched some exclusion criteria and some inclusion criteria; (iii) three points to papers that did not match any criteria (neither exclusion nor inclusion); (iv) four points to papers that matched some, but not all, inclusion criteria; (v) five points to papers that matched all inclusion criteria.
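The point-assignment rule above can be sketched as a small function; this is an illustrative encoding of the rule, not the authors' actual selection script:

```python
def selection_score(matched_inclusion: int, total_inclusion: int,
                    matched_exclusion: int) -> int:
    """Map a paper's matched criteria to the 5-point Likert scale.

    matched_inclusion/matched_exclusion count how many inclusion and
    exclusion criteria the paper matches; total_inclusion is 4 (IC1-IC4).
    """
    if matched_exclusion > 0:
        # Exclusion criteria matched: 1 point, or 2 if some inclusion
        # criteria are matched as well.
        return 2 if matched_inclusion > 0 else 1
    if matched_inclusion == 0:
        return 3  # no criteria matched at all
    if matched_inclusion < total_inclusion:
        return 4  # some, but not all, inclusion criteria matched
    return 5      # all inclusion criteria matched
```

Papers scoring 3 then undergo the full-text reading described below, while scores of 4 and 5 indicate candidate primary studies.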
We analyzed the studies in two different steps: first, we read the title and abstract of each paper to check its immediate compliance with the inclusion and exclusion criteria. For papers that received 3 points after this first step, the full text was read, with particular attention to the possible usage or definition of metrics throughout the body of the article. At the end of the second reading, none of the uncertain studies was evaluated as fitting our research needs; hence, no other primary study was added to our final pool.
During this phase, we also applied the process of snowballing. Snowballing refers to using the reference list of the included papers to identify additional papers. The application of snowballing, for this specific SLR, did not lead to any additional papers to take into consideration.
2.2.3. Data Extraction
In this phase, we read each identified primary study again to mine relevant data for addressing the formulated RQs. We created a spreadsheet form to be filled in for each of the considered papers, containing the data of interest subdivided by the RQ they contributed to answering. The data extraction phase, again, was performed by all the authors of this paper independently.
For each paper, we collected some basic context information: (i) year of publication; (ii) number of times the paper was viewed fully and number of citations; (iii) authors and location of the authors.
To answer RQ1.1, we needed to inspect the set of primary studies to understand which metrics they defined or mentioned. Hence, for each paper, we extracted the following data: (i) the list of metrics and metric suites utilized in each paper; (ii) the programming languages and the family of programming languages (e.g., C-like and object oriented) for which the used or proposed metrics can be computed.
To answer RQ1.2, we wanted to give an additional classification of the metrics beyond the number of mentions. We took into consideration the opinion of the authors on each of the metrics studied in their papers. This allowed us to evaluate whether a metric is considered useful in most papers, and to quantify the popularity of each metric by counting the difference between positive and negative mentions by the authors.
To answer RQ2.1, we needed to inspect the primary studies to understand which tools they presented or used to compute the adopted metrics. For each paper that mentioned tools, we hence gathered the following information: (i) the list of tools described, used, or cited by each paper; (ii) when possible, the list of metrics that each tool can calculate; (iii) the list of programming languages on which the tool can operate; (iv) the type of the tool, i.e., whether it is open source or not.
Finally, to answer RQ2.2, we had to correlate the information gathered for the previous research questions. We achieved this by finding the tool or tools covering the metrics that proved to be the most popular among selected primary studies.
2.2.4. Data Synthesis and Reporting
In this phase, we elaborated on the data extracted previously to obtain a response for each of our research questions. Having all the data we needed, in the shape of one form per analyzed paper, we proceeded with the data synthesis.
We gathered all the metric suites and the metrics we found in tables, keeping track of the papers mentioning them. We computed aggregate measures on the popularity value assigned to each metric.
This section describes the results obtained to answer the research questions described in Section 2.1.2. The appendices of this paper report the complete tables with the extracted data to improve the readability of this manuscript.
At the end of this phase, we collected a final set of 43 primary studies for the subsequent phases of our SLR. Figure 1 reports the distribution of the selected papers over the considered time frame, and Figure 2 indicates the distribution of the authors of the selected studies over the world. We report the selected papers in Table 4. The statistics suggest that interest in software maintainability metrics has grown since 2008, with a further increase since 2016 (see the bar plot in Figure 1).
3.1. RQ1.1: Available Metrics
The papers selected as primary studies for our SLR cited a total of 174 different metrics. We report all the metrics in Table 5 in the appendix. The table reports (i) the metric suite (empty if the metric is not part of any specific suite); (ii) the metric name (acronym, if existing, and a full explanation, if available); (iii) the list of papers that mention the metric. The last two columns, respectively, report (iv) the total number of papers mentioning the metric (i.e., the number of studies in the third column) and (v) the score we gave to each metric.
We computed the score in the following way: (i) +1 if the study used (or defined) the metric or the authors of the study expressed a positive opinion about it; (ii) −1 if the paper criticized the metric.
The last two columns of the metrics table are identical most of the time. This is because the majority of the papers we found simply utilize the metrics without commenting on them, either positively or negatively.
It is immediately evident that some suites and metrics are taken into consideration much more often than others. More than 75% of the metrics are mentioned by just a single paper. The boxplots in Figure 3 show, in red, the distributions of the total number of mentions and of the score for all the considered metrics. It is evident from the boxplots that the difference between the two distributions is rather limited, confirming the vast majority of neutral or positive opinions when the metrics are referenced in a research paper. Since only 24.7% of the metrics are used by more than one of our selected studies, the median values of both measured indicators, “TOT” and “Score”, are equal to 1 when the whole set of metrics is considered.
In general, however, it is worth underlining that a low score does not necessarily mean that the metric is of lesser quality but instead that it is less known in the related literature. Another interesting thing to point out is that we did not find a particular metric that received many negative scores.
3.2. RQ1.2: Most Mentioned Metrics
Since our analysis was aimed at finding the most popular metrics, to extract a set of them to be adapted to different languages, we were interested in finding metrics mentioned by multiple papers. In Table 6, we report the metrics that were used by at least two papers among the selected primary studies. This operation allowed us to reduce the noise caused by metrics that were mentioned only once (possibly in the papers where they were originally defined). After applying this filter, only 43 metrics (24.7% of the original set of 174) remained. The boxplots in Figure 3 show, in green, the distributions of the total number of mentions and of the measured score for this set of metrics. On these distributions, the rounded median value is 3 for both the total number of mentions and the score.
Since our final aim in answering RQ1.2 was to find a set of the most popular metrics for the maintainability of source code, we selected, from the complete set of 43 metrics mentioned in at least two papers, those whose score was above the median.
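The two-step filter can be sketched as follows; the metric names and counts below are made-up placeholders, not the actual rows of Table 5:

```python
from statistics import median

# Hypothetical (metric, number of mentioning papers, score) records
# standing in for the real rows of Table 5.
metrics = [("CC", 12, 11), ("LOC", 10, 8), ("WMC", 4, 4),
           ("LCOM", 3, 3), ("OnceOnly", 1, 1)]

# Step 1 (noise reduction): keep only metrics mentioned by at least two papers.
multi = [m for m in metrics if m[1] >= 2]

# Step 2 (RQ1.2): of those, keep the metrics whose score is above the
# median score of the filtered set.
med = median(score for _, _, score in multi)
popular = [name for name, _, score in multi if score > med]
```

With these placeholder numbers, the median score of the filtered set is 6, so only "CC" and "LOC" survive the second step.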
3.3. RQ2.1: Available Tools
In Table 8, we report all the tools that were identified while reading the papers. The columns report, respectively, the name of the tool, as presented in the studies; the studies using it; and a web source where the tool can be downloaded. In the uppermost section of the table, we report the papers for which we could not find the used tool (i.e., a tool was mentioned but no download pointer was provided, indicating that the tool was never made public and/or had been discontinued), or for which no information about the used tool was provided. For the latter, we have indicated the studies in the table with the respective authors' names.
In the second and third sections of the table, we have divided the tools according to their release nature, i.e., we discriminated between open-source and commercial tools. The table reports information about a total of 38 tools: 19 were not found, 6 were closed source, and 13 were open source.
The majority of the tools we found are mentioned by only one study; three are cited by two studies, and only one, CKJM, is quoted by five papers.
It is immediately evident that the open-source tools are more than twice as numerous as the closed-source ones. This result may be unrelated to the quality of the tools themselves and instead be justified by the fact that open-source tools are better suited to academic usage, since they provide the possibility of checking the algorithms and possibly modifying or integrating them to analyze their performance.
For each of the tools that we were able to identify, we give a brief description in the following; the details about their supported languages and metrics can be found after the descriptions of the tools.
3.3.1. Closed-Source Tools
Six closed-source tools can be found in the analyzed primary studies, three of which are mentioned in the same paper. The tools described hereafter are listed in alphabetical order and not in any order of importance. (i) CAST's Application Intelligence Platform. This tool analyzes all the source code of an application to measure a set of nonfunctional properties such as performance, robustness, security, transferability, and changeability (which is strictly tied to maintainability). This last nonfunctional property is measured based on cyclomatic complexity, coupling, duplicated code, and modification of indexes in groups. The tool produces as output a set of violations of typical architectural and design patterns and best practices, which are aggregated in formats specific to both management and developers. (ii) CMT++/CMTJava. CMT is a tool specifically made to estimate the overall maintainability of code written in C, C++, C#, or Java, and to identify its less maintainable parts. Many of the discussed metrics can be computed with the tool: McCabe's cyclomatic number, Halstead's software science metrics, lines of code, and others. CMT also allows computing the maintainability index (MI). The tool can work in command-line mode or with a GUI. (iii) Codacy. It is a free tool for open-source projects and can be self-hosted; otherwise, a license must be purchased to use it. This tool aims at improving code quality, increasing code coverage, and preventing security issues. Its main focus is on identifying bugs and undefined behaviours rather than calculating metrics. It provides a set of statistics about the analyzed code: error-proneness, code style, code complexity, unused code, and security. (iv) JHawk. The tool is tailored to analyze only code written in Java, but it can calculate a vast variety of different metrics. JHawk is not new on the market, since its first release was introduced more than ten years ago.
At the time of writing this article, the last available version is 6.1.3, from 2017. It is used and cited in more than twenty of the selected primary studies. JHawk aids the empirical evaluation of software metrics with the possibility of reporting the computed measures in various formats, including XML and CSV, and it supports a CLI interface. (v) Understand. Developed by SciTools, it can calculate several metrics, and the results can be extracted automatically via the command line, the graphical interface, or its API. Most of the metrics supported by this program are complexity metrics (e.g., McCabe's CC), volume metrics (e.g., LOC), and object-oriented metrics. The correlation between the supported metrics and the inferred maintainability of software projects is not explicitly mentioned in the tool's documentation. (vi) Visual Studio. It is a very well-known IDE developed by Microsoft. In addition to all its other functions, it comes embedded with modules for the computation of code quality metrics. Among the maintainability metrics listed in the previous section, it supports MI, CC, DIT, class coupling, and LOC. The main limitation of the Visual Studio tool is that these metrics can be computed only for projects written in the C and C++ languages, and not for projects in any of the many other languages supported by the IDE. Also, from the Visual Studio documentation, it can be seen that the IDE makes some assumptions about the metrics that differ from the standard ones. As an example, the MI metric used in Visual Studio is an integer between 0 and 100, with different thresholds from the standard ones defined for MI (an MI of at least 20 indicates code that is easy to maintain, a rating from 10 to 19 indicates that the code is relatively maintainable, and a value below 10 indicates low maintainability).
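For reference, the classic three-factor formulation of MI and the rescaled 0–100 variant described in Visual Studio's documentation can be sketched as follows. This is a simplified sketch: real implementations differ in details such as rounding and the exact computation of the Halstead volume.

```python
import math

def maintainability_index(halstead_volume: float,
                          cyclomatic_complexity: float,
                          loc: float) -> float:
    """Classic three-factor MI (Oman/Hagemeister formulation)."""
    return (171.0
            - 5.2 * math.log(halstead_volume)
            - 0.23 * cyclomatic_complexity
            - 16.2 * math.log(loc))

def vs_maintainability_index(hv: float, cc: float, loc: float) -> float:
    """Visual Studio variant: rescaled to the 0..100 range, clamped at 0."""
    return max(0.0, maintainability_index(hv, cc, loc) * 100.0 / 171.0)

def vs_rating(mi: float) -> str:
    """The Visual Studio thresholds quoted in the text above."""
    if mi >= 20:
        return "easy to maintain"
    if mi >= 10:
        return "relatively maintainable"
    return "low maintainability"
```

For example, a module with Halstead volume 100, cyclomatic complexity 5, and 30 lines of code would land around 53 on the rescaled scale, well inside the "easy to maintain" band.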
3.3.2. Open-Source Tools
3.3.3. Correspondence between Tools and Languages
From the table, it is evident that the closed-source tools support more programming languages (an average of 10.5) than the open-source tools (an average of 4.85). From the primary studies selected for this SLR, it also emerges that closed-source tools tend to support some metrics better than their open-source counterparts: for instance, a comparative study between different tools capable of computing MI reports a higher dependability of this metric when computed using closed-source tools rather than open-source alternatives.
3.3.4. Correspondence between Tools and Metrics
Figure 6 (CS tools and metrics) shows what metrics are calculated by each of the considered tools. For conciseness, only the metrics that are computed by at least one tool are reported in the table. The upper section of the table reports the most popular metrics identified in the answer to RQ1, while the lower section includes the other metrics belonging to the complete set found in the primary studies mined from the literature. The table features a mark for a tool and a metric only when an explicit reference to that metric was found in the documentation of the tool.
Also, a suite was considered as supported if at least one of its metrics was supported by a given tool.
In the case of the closed-source tools, the metrics mostly had to be inferred from limited documentation. In fact, closed-source tools typically provide dashboards with custom-defined evaluations of the code, whose linkage with widespread software metrics is unclear. For instance, the Codacy tool provides a single, overall grade for a software project, between A and F. This grade depends on a set of tool-specific parameters: error-proneness, code complexity, code style, unused code, security, compatibility, documentation, and performance. Apart from some metrics whose usage was explicitly mentioned by the tool's creators (e.g., the number of comments and JavaDoc lines for the documentation property and McCabe's CC for the code complexity property), it was not possible to find the complete set of metrics used internally by the tool.
In many cases, the tools also compute compound metrics (i.e., metrics built on top of others reported in the literature) or metrics that were not previously found in the analysis of the literature performed to answer RQ1. In these cases, the tools were labelled as featuring other metrics: this information is reported in the last row of the table.
As evident from the table, no tool supported all the most popular metrics previously identified. The number of supported metrics among the most popular ones ranged from 1 to 10. Two tools featured just one suite/metric from the set of the most popular ones: the Halstead Metrics Tool, as its name suggests, is an open-source tool whose only purpose is computing the entire set of metrics of the Halstead suite; likewise, the CodeMetrics plugin is a basic tool capable of computing only the McCabe cyclomatic complexity (for each method and in total for each class of the project). Quamoco is not just a tool but rather a quality metamodel, based on a set of metrics that are defined, in the scope of the paper presenting the approach, as base measures. The metamodel is theoretically applicable to any kind of base measure that can be computed through static analysis of source code; however, the literature presenting the tool explicitly mentions only the LOC metric. Some other tools, such as JSInspect, CCFinderX, and Ref-Finder, featured a limited set of the maintainability metrics previously identified, since they were mainly focused on other aspects of code quality, e.g., detecting code duplicates and code smells.
Tools such as MetricsReloaded, Squale, and SonarQube featured large sets of derived metrics, which were obtained as specializations, sums, or averages of basic metrics such as the McCabe cyclomatic complexity or the coupling between classes.
The bar graph in Figure 7 reports the number of tools that featured each of the considered metrics. Also in this case, the metrics were divided into three sections on the x-axis: the 15 metrics/suites deemed most popular in the answer to RQ1, other metrics from the full set, and other metrics not in the set mined from the literature. Two metrics stood out in terms of the number of tools supporting them. The LOC metric, although many papers in the literature question its usefulness as a maintainability metric, was supported by 14 out of 19 tools. The metric is closely followed by the cyclomatic complexity (CC), which was supported by 13 tools. These numbers were expected, since both metrics are simple to compute and are needed by many other derived metrics. On the other hand, three of the most popular metrics were used by only two of the selected tools. The CHANGE metric refers to the changed lines of code between different releases of the same application and was not computed by most of the tools that performed static analysis on single versions of the application; it was instead computed by two tools specifically aimed at measuring code refactorings and smells. The LCOM2 metric is an extension of the LCOM metric, which is part of the C&K suite; several tools just mentioned the adoption of the suite without explicitly mentioning possible adoptions of enhanced versions of its metrics. Finally, the message passing coupling was adopted by two tools and, in both cases, defined with the synonym fan-out.
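Both LOC and CC are indeed cheap to compute: a physical LOC count is a simple line filter, and McCabe's CC can be approximated as one plus the number of decision points. A minimal sketch for Python source follows; it is illustrative only, as real tools also handle boolean operators, comprehensions, and other language-specific constructs:

```python
import ast

def physical_loc(source: str) -> int:
    """Count non-blank lines that are not comment-only."""
    return sum(1 for line in source.splitlines()
               if line.strip() and not line.strip().startswith("#"))

def cyclomatic_complexity(source: str) -> int:
    """1 + number of branching constructs found in the parsed source."""
    decision_nodes = (ast.If, ast.For, ast.While, ast.ExceptHandler)
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, decision_nodes)
                   for node in ast.walk(tree))

snippet = """
def parity_sum(n):
    total = 0            # running sum
    for i in range(n):   # +1 decision point
        if i % 2:        # +1 decision point
            total += i
    return total
"""
```

On this snippet, the sketch reports 6 physical lines of code and a cyclomatic complexity of 3 (one for the function itself, plus the loop and the branch).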
In general, closed-source tools featured a higher number of metrics than their open-source counterparts. Open-source tools were, in fact, often plugins of limited size, tailored to compute just a single metric or suite. Considering only the measures mined from the primary studies, the closed-source tools were able to compute an average of slightly fewer than 8 metrics, while the open-source tools were able to compute an average of 5 metrics. Of the set of 15 most popular metrics, on average 6 could be computed by the closed-source tools and 3 by the open-source tools.
3.3.5. Correspondence between Tools and Languages
3.4. RQ2.2: Ideal Selection of Tools
Tables 10 and 11 show the optimal sets of tools to cover all the most popular metrics shown in Table 5. The former takes into account both closed-source and open-source tools; the latter considers only open-source tools. We define an optimal set of tools as the minimal set of tools that covers the highest possible number of metrics (or suites) out of the set of 14 most mentioned ones (15 for Java, for which the JLOC metric can also be computed). In round brackets, we indicate alternative tools that could be selected without influencing the number of tools in the optimal set or the number of metrics covered.
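Selecting such a set is an instance of the minimum set-cover problem, which is NP-hard in general; with only 19 tools it can be solved exactly, but a greedy heuristic already illustrates the idea. The sketch below uses a hypothetical tool-to-metrics coverage table, not our actual data, and the greedy choice is not guaranteed to yield a minimal set:

```python
def greedy_tool_selection(tool_metrics, target_metrics):
    """Greedy set-cover heuristic: repeatedly pick the tool that covers the
    most still-uncovered metrics. Returns the chosen tools and any metrics
    left uncovered."""
    uncovered = set(target_metrics)
    chosen = []
    while uncovered:
        best = max(tool_metrics, key=lambda t: len(tool_metrics[t] & uncovered))
        gained = tool_metrics[best] & uncovered
        if not gained:          # no remaining tool covers anything further
            break
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered

# Hypothetical coverage table (tool -> metrics it computes):
coverage = {
    "ToolA": {"LOC", "CC", "Halstead"},
    "ToolB": {"CC", "LCOM", "CBO"},
    "ToolC": {"LCOM2", "MPC"},
    "ToolD": {"LOC", "CHANGE"},
}
tools, missing = greedy_tool_selection(
    coverage,
    {"LOC", "CC", "Halstead", "LCOM", "CBO", "LCOM2", "MPC", "CHANGE"})
```

In this toy example all four tools are needed; reporting the leftover `missing` set mirrors how our tables flag metrics that no tool combination covers for a given language.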
4. Threats to Validity
Threats to construct validity, for an SLR, are related to failures in the claim of covering all the possible studies related to the topic of the review. In this study, this threat was mitigated through a thorough and reproducible definition of the search strategy and through the use of synonyms in the search strings. In addition, all the principal sources of scientific literature were taken into consideration for the extraction of the primary studies.
Threats to internal validity are related to the data extraction phase of the SLR. The authors of this paper evaluated the papers manually, according to the defined inclusion and exclusion criteria, and limited biases in the inclusion and exclusion of papers by discussing disagreements. The metric selection phase was performed based on the opinions extracted from the examined primary studies (classified as adverse, neutral, or positive). Again, the reading of the papers and the subsequent opinion assignments are based on the judgment of the authors and may suffer from misinterpretation of the original opinions. It is, however, worth mentioning that none of the authors of this paper were biased towards demonstrating a preference for any of the available metrics.
Threats to external validity are related to the inability to draw generalized conclusions from the conducted study. This threat is limited in this study, since its main results, i.e., the sets of most popular metrics, were formulated with respect to a specific set of programming languages. The results do not generalize to programming languages that were not discussed in the primary studies examined in the SLR.
5. Related Works
The literature offers several secondary studies regarding code metrics and tools. Usually, however, those studies analyze or present a set of tools and describe the metrics based on the features of each tool. Our review instead started from an analysis of the literature aimed at finding all metrics reported in relevant studies, and only then moved its focus to tools, to understand whether those tools supported the metrics we found.
For example, in a literature review published in 2008, Lincke et al. compared different software metric tools, showing that, in some cases, different tools provided incompatible results; the authors also defined a simple universal software quality model, based on a set of metrics extracted from the examined tools. Dias Canedo et al. performed a systematic literature review to find tools that can perform software measures. Starting from the tools, the authors analyzed the tool features and described the metrics each tool could compute. For their secondary study, the authors analyzed papers from 2007 to 2018.
On the other hand, there are also secondary studies explicitly focused on metrics, such as the comparative case study published in 2012 by Sjøberg et al., which focuses on code maintainability metrics but considers only a subset of 11 metrics for the Java language. The work primarily aimed at questioning the consistency between different metrics in the evaluation of the maintainability of software projects.
The systematic mapping study published in 2017 by Nuñez-Varela et al. is one of the most complete works on this topic. The authors discovered 300 source code metrics by analyzing papers published from 2010 to 2015, and mapped those metrics to the tools that can compute them. This work, however, covers a limited time window and does not focus on a specific family of software metrics, gathering dynamic and change metrics along with static ones.
In a recent systematic mapping and review, Elmidaoui et al. identified 82 empirical studies about software product maintainability prediction. The paper focuses on analyzing the different methods available for maintainability estimation, including fuzzy logic, neuro-fuzzy approaches, artificial neural networks (ANNs), support vector machines (SVMs), and the group method of data handling (GMDH). The paper concludes that, although many techniques are available, the prediction of software maintainability is still limited in industrial practice.
Our work differs from the secondary studies presented above. Our focus is on finding the most common maintainability metrics and tools so that they can be applied to new programming languages. To do so, we analyzed papers in a 20-year time window (2000–2019). We also distinguished open-source tools from closed-source tools and, for each of them, mapped the maintainability metrics they support. The output of this work is actionable for practitioners who want to create new tools for applying maintainability metrics to new programming languages.
Other primary studies in the literature presented (or used) popular software metric tools that were not extracted during our study selection phase, since their primary purpose was not analyzing code from a maintenance point of view, and hence the manuscripts could not be found by searching for the maintainability keyword. A relevant example of those tools is CCCC, a widespread tool to evaluate code written in object-oriented languages [72, 73].
6. Conclusions
Maintainability is a fundamental property of software projects, and the scientific literature has proposed several approaches, metrics, and tools to evaluate it in real-world scenarios. With this systematic literature review, we set out to obtain an overview of the maintainability metrics used in the literature in the last twenty years, to find the most commonly used ones, which can be applied to evaluate existing software and adapted to measure the maintainability of new programming languages. In doing so, we aimed to provide readers with actionable results by identifying sets of (closed- and open-source) tools that can be adopted to compute all the most popular metrics for a specific programming language.
This manuscript provides actionable guidelines for practitioners who want to measure the maintainability of their software, by providing a mapping between popular metrics and the tools able to compute them. It also provides actionable guidelines for practitioners and researchers who may want to implement tools to measure software metrics for newer programming languages. Our work identifies which tools can compute the most popular maintainability metrics and which of the most common programming languages they support. It also provides pointers to existing open-source tools for computing the metrics, which tool developers can use as references when building counterparts for source code written in other languages.
As future work, we aim at implementing a tool that uses the set of metrics we found in RQ1.2 to analyze code written in the Rust programming language, for which we identified no tool capable of computing the most popular maintainability metrics mentioned in the literature. We plan to extend a tool named Tokei, which offers compatibility with many modern programming languages. We believe these results will ease the work of other researchers creating tools for measuring the maintainability of modern programming languages and will encourage new comparisons between programming languages.
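Tokei, like most line counters, classifies each line of a file as code, comment, or blank. The sketch below reproduces that classification in its most naive form, handling only single-line `//` comments; the fixed comment marker and the omission of block comments and string-literal handling are simplifying assumptions, and a real counter such as Tokei does considerably more per language:

```python
def count_lines(source: str, line_comment: str = "//"):
    """Classify each line as blank, comment, or code. Only single-line
    comments are handled; block comments and comment markers inside string
    literals are ignored for brevity, unlike in a real counter."""
    counts = {"blank": 0, "comment": 0, "code": 0}
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            counts["blank"] += 1
        elif stripped.startswith(line_comment):
            counts["comment"] += 1
        else:
            counts["code"] += 1
    return counts

rust_snippet = """\
// add two numbers
fn add(a: i32, b: i32) -> i32 {

    a + b
}
"""
stats = count_lines(rust_snippet)  # 1 blank, 1 comment, 3 code lines
```

Extending such a counter with the popular metrics identified in RQ1.2 (e.g., cyclomatic complexity or the Halstead suite) is precisely the gap we plan to address for Rust.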
Data Availability
The data used to support the findings of this study are included within the article in the form of references linking to resources available on the FigShare public open repository.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
Mozilla Research funded this project with the research grant 2018 H2. The project title is “Algorithms clarity in Rust: advanced rate control and multithread support in rav1e.” The project aims to understand how the Rust programming language improves the maintainability of code while implementing complex algorithms.
References
F. Zampetti, S. Scalabrino, R. Oliveto, G. Canfora, and M. Di Penta, “How open source projects use static code analysis tools in continuous integration pipelines,” in Proceedings of the 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), IEEE, Buenos Aires, Argentina, May 2017.
IEEE Standards Association, IEEE Standard Glossary of Software Engineering Terminology, IEEE Standards Association, Piscataway, NJ, USA, 1990.
C. Van Koten and A. Gray, “An application of Bayesian network for predicting object-oriented software maintainability,” Information and Software Technology, vol. 48, no. 1, pp. 59–67, 2006.
A. Kaur, K. Kaur, and K. Pathak, “Software maintainability prediction by data mining of software code metrics,” in Proceedings of the 2014 International Conference on Data Mining and Intelligent Computing (ICDMIC), Delhi, India, September 2014.
M. I. Sarwar, W. Tanveer, I. Sarwar, and W. Mahmood, “A comparative study of MI tools: defining the roadmap to MI tools standardization,” in Proceedings of the 2008 IEEE International Multitopic Conference, Karachi, Pakistan, December 2008.
S. Klabnik and C. Nichols, The Rust Programming Language, No Starch Press, San Francisco, CA, USA, 2018.
A. Tahir and R. Ahmad, “An AOP-based approach for collecting software maintainability dynamic metrics,” in Proceedings of the 2010 Second International Conference on Computer Research and Development, Beijing, China, May 2010.
B. A. Kitchenham and S. Charters, “Guidelines for performing systematic literature reviews in software engineering,” Tech. Rep., Durham University, Durham, England, 2007.
B. A. Kitchenham, T. Dyba, and M. Jorgensen, “Evidence-based software engineering,” in Proceedings of the 26th International Conference on Software Engineering, pp. 273–281, IEEE Computer Society, New York, NY, USA, May 2004.
R. van Solingen, V. Basili, G. Caldiera, and H. D. Rombach, “Goal question metric (GQM) approach,” in Encyclopedia of Software Engineering, Wiley, Hoboken, NJ, USA, 2002.
J. Ostberg and S. Wagner, “On automatically collectable metrics for software maintainability evaluation,” in Proceedings of the 2014 Joint Conference of the International Workshop on Software Measurement and the International Conference on Software Process and Product Measurement, Rotterdam, The Netherlands, October 2014.
J. Ludwig, S. Xu, and F. Webber, “Compiling static software metrics for reliability and maintainability from GitHub repositories,” in Proceedings of the 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff, AB, Canada, October 2017.
H. Liu, X. Gong, L. Liao, and B. Li, “Evaluate how cyclomatic complexity changes in the context of software evolution,” in Proceedings of the 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), Tokyo, Japan, July 2018.
C. Wohlin, “Guidelines for snowballing in systematic literature studies and a replication in software engineering,” in Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, Ciudad Real, Spain, May 2014.
I. Kádár, P. Hegedűs, R. Ferenc, and T. Gyimóthy, “A code refactoring dataset and its assessment regarding software maintainability,” in Proceedings of the 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), Osaka, Japan, March 2016.
J. Gil, M. Goldstein, and D. Moshkovich, “An empirical investigation of changes in some software properties over time,” in Proceedings of the 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), Zurich, Switzerland, June 2012.
A. Jain, S. Tarwani, and A. Chug, “An empirical investigation of evolutionary algorithm for software maintainability prediction,” in Proceedings of the 2016 IEEE Students’ Conference on Electrical, Electronics and Computer Science (SCEECS), Bhopal, India, March 2016.
B. Curtis, J. Sappidi, and J. Subramanyam, “An evaluation of the internal quality of business applications: does size matter?” in Proceedings of the 33rd International Conference on Software Engineering, ICSE ’11, New York, NY, USA, May 2011.
R. S. Chhillar and S. Gahlot, “An evolution of software metrics: a review,” in Proceedings of the International Conference on Advances in Image Processing, ICAIP 2017, New York, NY, USA, 2017.
Y. Tian, C. Chen, and C. Zhang, “AODE for source code metrics for improved software maintainability,” in Proceedings of the 2008 Fourth International Conference on Semantics, Knowledge and Grid, Beijing, China, December 2008.
A. Kaur, K. Kaur, and K. Pathak, “A proposed new model for maintainability index of open source software,” in Proceedings of the 3rd International Conference on Reliability, Infocom Technologies and Optimization, Noida, India, October 2014.
S. Rongviriyapanish, T. Wisuttikul, B. Charoendouysil, P. Pitakket, P. Anancharoenpakorn, and P. Meananeatra, “Changeability prediction model for java class based on multiple layer perceptron neural network,” in Proceedings of the 2016 13th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), Chiang Mai, Thailand, June 2016.
S. Arshad and C. Tjortjis, “Clustering software metric values extracted from C# code for maintainability assessment,” in Proceedings of the 9th Hellenic Conference on Artificial Intelligence, SETN ’16, New York, NY, USA, May 2016.
M. Pizka, “Code normal forms,” in Proceedings of the 29th Annual IEEE/NASA Software Engineering Workshop, Greenbelt, MD, USA, April 2005.
M. A. A. Mamun, C. Berger, and J. Hansson, “Correlations of software code metrics: an empirical study,” in Proceedings of the 27th International Workshop on Software Measurement and 12th International Conference on Software Process and Product Measurement, IWSM Mensura ’17, New York, NY, USA, May 2017.
T. L. Alves, C. Ypma, and J. Visser, “Deriving metric thresholds from benchmark data,” in Proceedings of the 2010 IEEE International Conference on Software Maintenance, Timisoara, Romania, September 2010.
T. Matsushita and I. Sasano, “Detecting code clones with gaps by function applications,” in Proceedings of the 2017 ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation, PEPM 2017, New York, NY, USA, May 2017.
L. M. d. Silva, F. Dantas, G. Honorato, A. Garcia, and C. Lucena, “Detecting modularity flaws of evolving code: what the history can reveal?” in Proceedings of the 2010 Fourth Brazilian Symposium on Software Components, Architectures and Reuse, Bahia, Brazil, September 2010.
A. Chávez, I. Ferreira, E. Fernandes, D. Cedrim, and A. Garcia, “How does refactoring affect internal quality attributes?: a multi-project study,” in Proceedings of the 31st Brazilian Symposium on Software Engineering, SBES ’17, New York, NY, USA, May 2017.
Y. Ma, K. He, B. Li, and X. Zhou, “How multiple-dependency structure of classes affects their functions: a statistical perspective,” in Proceedings of the 2010 2nd International Conference on Software Technology and Engineering, San Juan, PR, USA, October 2010.
M. Wahler, U. Drofenik, and W. Snipes, “Improving code maintainability: a case study on the impact of refactoring,” in Proceedings of the 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), North Carolina, USA, October 2016.
G. Kaur and B. Singh, “Improving the quality of software by refactoring,” in Proceedings of the 2017 International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, June 2017.
M. Yan, X. Zhang, C. Liu, J. Zou, L. Xu, and X. Xia, “Learning to aggregate: an automated aggregation method for software quality model,” in Proceedings of the 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C), Buenos Aires, Argentina, May 2017.
K. Chatzidimitriou, M. Papamichail, T. Diamantopoulos, M. Tsapanos, and A. Symeonidis, “npm-miner: an infrastructure for measuring the quality of the npm registry,” in Proceedings of the 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR), Gothenburg, Sweden, May 2018.
J. Bohnet and J. Döllner, “Monitoring code quality and development activity by software maps,” in Proceedings of the 2nd Workshop on Managing Technical Debt, MTD ’11, New York, NY, USA, May 2011.
N. Narayanan Prasanth, S. Ganesh, and G. Arul Dalton, “Prediction of maintainability using software complexity analysis: an extended FRT,” in Proceedings of the 2008 International Conference on Computing, Communication and Networking, Karur, Tamil Nadu, India, December 2008.
L. Wang, X. Hu, Z. Ning, and W. Ke, “Predicting object-oriented software maintainability using projection pursuit regression,” in Proceedings of the 2009 First International Conference on Information Science and Engineering, Nanjing, China, December 2009.
D. I. Sjøberg, B. Anda, and A. Mockus, “Questioning software maintenance metrics: a comparative case study,” in Proceedings of the ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM ’12, ACM, New York, NY, USA, September 2012.
A. Hindle, M. W. Godfrey, and R. C. Holt, “Reading beside the lines: indentation as a proxy for complexity metric,” in Proceedings of the 2008 16th IEEE International Conference on Program Comprehension, Amsterdam, The Netherlands, June 2008.
Y. Lee and K. H. Chang, “Reusability and maintainability metrics for object-oriented software,” in Proceedings of the 38th Annual Southeast Regional Conference, ACM-SE 38, New York, NY, USA, May 2000.
B. R. Sinha, P. P. Dey, M. Amin, and H. Badkoobehi, “Software complexity measurement using multiple criteria,” Journal of Computing Sciences in Colleges, vol. 28, pp. 155–162, April 2013.
P. Vytovtov and E. Markov, “Source code quality classification based on software metrics,” in Proceedings of the 2017 20th Conference of Open Innovations Association (FRUCT), Saint Petersburg, Russia, April 2017.
J. Ludwig, S. Xu, and F. Webber, “Static software metrics for reliability and maintainability,” in Proceedings of the 2018 International Conference on Technical Debt, TechDebt ’18, pp. 53–54, New York, NY, USA, May 2018.
M. Saboe, “The use of software quality metrics in the materiel release process experience report,” in Proceedings of the Second Asia-Pacific Conference on Quality Software, Brisbane, Queensland, Australia, December 2001.
A. F. Yamashita, H. C. Benestad, B. Anda, P. E. Arnstad, D. I. K. Sjoberg, and L. Moonen, “Using concept mapping for maintainability assessments,” in Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement, Lake Buena Vista, FL, USA, October 2009.
D. Threm, L. Yu, S. Ramaswamy, and S. D. Sudarsan, “Using normalized compression distance to measure the evolutionary stability of software systems,” in Proceedings of the 2015 IEEE 26th International Symposium on Software Reliability Engineering (ISSRE), Gaithersburg, MD, USA, November 2015.
R. Gonçalves, I. Lima, and H. Costa, “Using TDD for developing object-oriented software — a case study,” in Proceedings of the 2015 Latin American Computing Conference (CLEI), Arequipa, Peru, October 2015.
A. Jermakovics, R. Moser, A. Sillitti, and G. Succi, “Visualizing software evolution with lagrein,” in Proceedings of the Companion to the 23rd ACM SIGPLAN Conference on Object-Oriented Programming Systems Languages and Applications, OOPSLA Companion ’08, New York, NY, USA, May 2008.
G. Jay, J. E. Hale, R. K. Smith, D. Hale, N. A. Kraft, and C. Ward, “Cyclomatic complexity and lines of code: empirical evidence of a stable linear relationship,” Journal of Software Engineering and Applications, vol. 2, pp. 137–143, 2009.
M. H. Halstead, Elements of Software Science (Operating and Programming Systems Series), Elsevier Science Inc., New York, NY, USA, 1977.
I. Herraiz, J. Gonzalez-Barahona, and G. Robles, “Towards a theoretical model for software growth,” in Proceedings of the Fourth International Workshop on Mining Software Repositories (MSR ’07: ICSE Workshops 2007), Minneapolis, MN, USA, May 2007.
P. Oman and J. Hagemeister, “Metrics for assessing a software system’s maintainability,” in Proceedings of the Conference on Software Maintenance, Victoria, British Columbia, Canada, November 1992.
S. Wagner, K. Lochmann, L. Heinemann et al., “The quamoco product quality modelling and assessment approach,” in Proceedings of the 34th International Conference on Software Engineering, IEEE Press, Zurich, Switzerland, June 2012.
M. Fowler, Refactoring: Improving the Design of Existing Code, Addison-Wesley Professional, Boston, MA, USA, 2018.
R. Lincke, J. Lundberg, and W. Löwe, “Comparing software metrics tools,” in Proceedings of the 2008 International Symposium on Software Testing and Analysis, Seattle, WA, USA, July 2008.
E. Dias Canedo, K. Valença, and G. A. Santos, “An analysis of measurement and metrics tools: a systematic literature review,” in Proceedings of the 52nd Hawaii International Conference on System Sciences, Maui, HI, USA, January 2019.
D. I. Sjøberg, B. Anda, and A. Mockus, “Questioning software maintenance metrics: a comparative case study,” in Proceedings of the 2012 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, IEEE, Lund, Sweden, September 2012.
A. S. Nuñez-Varela, H. G. Pérez-Gonzalez, F. E. Martínez-Perez, and C. Soubervielle-Montalvo, “Source code metrics: a systematic mapping study,” Journal of Systems and Software, vol. 128, pp. 164–197, 2017.
S. Elmidaoui, L. Cheikhi, A. Idri, and A. Abran, “Empirical studies on software product maintainability prediction: a systematic mapping and review,” E-Informatica Software Engineering Journal, vol. 13, no. 1, 2019.
C. Thirumalai, P. A. Reddy, and Y. J. Kishore, “Evaluating software metrics of gaming applications using code counter tool for C and C++ (CCCC),” in Proceedings of the 2017 International Conference of Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, April 2017.
U. Poornima, “Unified design quality metric tool for object-oriented approach including other principles,” International Journal of Computer Applications in Technology, vol. 26, pp. 1–4, 2011.