Software process improvement programs are partly founded on software measurement. However, despite its importance, the literature points out that many students leave the academic world without the skills needed to conduct this process. One explanation is that the measurement process is regarded as time-consuming and difficult to understand, which helps explain the lack of interest in it during a student’s academic life. In light of this, serious games and gamification offer useful alternative ways of meeting this need, because the strategies they involve are well accepted by students and have a motivational and engaging effect on them. The objective of this work is to discover different approaches to the teaching of software measurement and software process improvement through gamification projects and serious games, characterizing the state-of-the-art on the use of these methods in the abovementioned subjects. We conducted a systematic review of the literature to identify primary studies that address the use, planning, or evaluation of gamification, serious games, their features, and game mechanics in software engineering. We located 137 primary studies, published between 2000 and 2019. Although the use of serious games and gamification in software engineering is not recent, there still remains a large area to be explored, especially in software process improvement and software measurement. The study expands and advances the research on how serious games and gamification proposals can be used for teaching software measurement in the context of software process improvement programs.

1. Introduction

Software engineering is directly related to the generation of high-quality software products. This quality reduces the need for rework, and less rework results in a faster delivery time [1]. In other words, software engineering seeks to ensure the quality of the processes involved in software development. Among the approaches adopted to achieve this goal are software process improvement (SPI) programs, which are based on measurement practices.

The software measurement process entails collecting, storing, analysing, and reporting the data on the products developed, as well as the implemented processes of a given organization, to further its organizational objectives [2]. This process is a key strategy in software process improvement programs; however, the software industry has been hesitant to apply efficient measurement programs [3, 4]. This is because many software managers and professionals, including academics in software engineering and computer science, are not fully aware of the application of this subject [5]. Although this work investigated only these two courses, it is understood that other computing-related courses, such as Information Systems and Computer Degree, share the same fundamental problem of not exploring software measurement in their curricula, and consequently train professionals with little knowledge in this area.

People’s attitudes are based on the assumption that the measurement process is difficult to master and time-consuming [6–9]. A first step towards understanding this problem is to ask how this subject should be taught [10], since it does not feature prominently in the undergraduate curriculum and is often relegated to the background; as a result, students receive little incentive to learn this practice. Another factor is the absence of guidelines for assisting students in the practice of measurement [11–13].

In general, the human factor is decisive for the success of every measurement program: without a suitable degree of motivation and commitment to the measurement program, it is unlikely to achieve the desired result, namely the control of software metrics to assist decision-making. Among the alternative means of ensuring that people involved in the SPI program are fully engaged is the adoption of the gamification concept [14].

Gamification can be defined as the use of game elements and game design techniques outside the context of games [15]. According to Breuer and Bente [16], games represent an intersection between different learning strategies, which allows serious games to serve as a subset of e-learning (electronic learning), educational entertainment, and game-based learning. According to Zyda [17], a serious game is a form of game that uses computer games and simulation approaches and/or technologies for primarily nonentertainment purposes. The term “serious” refers to the game being aimed more at educational than entertainment purposes. These approaches seek to improve the engagement, motivation, and performance of a user in carrying out or learning some task or subject by incorporating game mechanics and features, which makes the task more attractive [18].

The objective of this work is to find teaching solutions for the subjects of software process improvement and software measurement by making use of gamification or a serious game, with a view to devising good practices, a suitable framework, validation methods, and in particular, the features, mechanics, and dynamics of games that can be more effectively employed for teaching purposes. As different authors in the literature have advocated numerous approaches for making use of games as a teaching tool, it has become necessary to find a mechanism that can allow a more suitable choice to be made from among different solutions. Mafra and Travassos [19] argue that the desired solutions should be found in an intensive and systematic adoption of an evidence-based approach.

A systematic review of the literature can be carried out as a reporting mechanism, by which researchers can determine what expert knowledge is required in a given area to plan their research, while avoiding unnecessary duplication of effort and repetition of past errors [19]. A preestablished protocol is essential to mitigate errors related to the validity of the review and to ensure that the review does, in fact, have scientific value and can be repeated. Unless this occurs, there is a risk that reviews become dependent on the researchers, which reduces their reliability. Thus, a systematic review of the literature was carried out in which 137 primary studies were investigated; these were analysed to find solutions for the teaching of software process improvement and software measurement based on the use of serious games and gamification.

The next sections of this article are structured as follows: Section 2 provides an overview of software process improvement, software measurement, gamification, and serious games; Section 3 sets out the methodological procedures of the systematic review of the literature (SRL) adopted in this work; Section 4 examines the results obtained from this SRL and attempts to answer the bibliometric and research questions defined in this work; Section 5 discusses the research carried out in the literature and the findings; Section 6 discusses different threats to the validity of this systematic review; Section 7 investigates some related work; and, finally, Section 8 summarizes the research undertaken and makes some recommendations for future work in the field.

2. Background

In this section, we define the underlying concepts that are needed for an understanding of this article, which are software process improvement, software measurement, gamification, and serious games.

2.1. Overview of Software Process Improvement and Software Measurement

A process works like a “glue” that keeps people, technologies, and procedures tightly bound together and is used by software engineers to design and develop computer programs [20]. Organizations involved in software development have their own processes, and the standard of these processes tends to influence the quality of the developed product. Thus, it is of great value for organizations to remain competitive by investing in software process improvement programs. These programs are aligned with process improvement goals, which are a set of desired and defined objectives that can guide the process improvement in a practical and measurable way [21]. The goals should give added value to a company’s business and improve the quality of the goods produced.

Thus, it is necessary to have mechanisms capable of evidencing problems in the processes and to support the identification of improvement objectives [22]. The mechanism used as a thermometer to verify the health of a process is the measurement process, as it is the basis for the control, improvement, and understanding of the behaviour of a product or process from a quantitative evaluation [5]. This process serves as an aid in decision-making, because “you can only control what you can measure” [23], and “you can only predict what you can measure” [24].

Despite the importance of the measurement process and software improvement programs, the way they are taught has proved to be inefficient, as is pointed out by Jones [25]. This author lists 28 problems related to the area of software measurement, one of the most recurrent being the lack of training of those involved. The findings of this study are corroborated by a survey conducted by the Brazilian-American Chamber of Commerce [26], in which 44 IT executives were interviewed. It was reported that 86% of them were not satisfied with the way the measurement process was being conducted in their companies, since most of the problems arose from the inability of the professionals involved to solve them; this was because most of them lacked the necessary skills to conduct the process efficiently.

A striking feature, which should be noted for understanding this problem, is the difference between what is taught in educational institutions and what is required by industry [27]. The needs of industry can only be met by adopting innovative practices that go beyond traditional lectures. Among the various approaches available, the use of games is believed to be a powerful tool and has a high acceptance rate among students of different ages from different backgrounds [28]. In addition, according to Bjork and Holopainen [29], computer games can help create a more attractive and stimulating environment for the contemporary generation of students than “paper versions.”

2.2. Gamification

Gamification is one of the different teaching techniques that seek to improve user engagement and motivation in carrying out or learning tasks [18]. Gamification involves using the elements, mechanics, and dynamics of games outside the context of traditional games [15]. In addition, Werbach and Hunter [15] outline three reasons why gamification can serve companies; these reasons can be easily adapted to different situations, and they are outlined here in the teaching context. The reasons for the success of gamification are based on three cornerstones: engagement, experimentation, and results.

With regard to engagement, as Koster [28] maintains: “With games, learning is the drug.” Gamification acts as a form of extrinsic motivation, as well as a reinforcing mechanism. It responds to one of the intrinsic needs of humans, that is, to seek the chemical rewards released by the brain as a motivating “engine” for the execution of tasks. Thus, the stimulus created by this feedback strengthens engagement with the class and the learning process and keeps the students motivated and hence eager to be engaged.

With regard to experimentation, games do not usually have permanent punishments for those who fail them. As a result, they create a safe and often competitive or cooperative environment, which tends to stimulate participation by trial and error. This safe environment can be characterized as one of the most valuable contributions of serious games. In addition, according to Werbach and Hunter [15], serious games can be seen as a special type of gamification, as they make use of games that are not focused on entertainment.

Finally, results, as depicted in the studies carried out by Hamari et al. [30] and Pedreira et al. [18], show that the adoption of gamification in organizations has had positive effects, depending on their application in a given context. In addition, large software organizations have employed gamification to encourage users to carry out ordinary tasks, since they know it achieves results.

2.3. Serious Games

According to Zyda [17], serious games raise the challenge of producing a set of rules which is aimed at training or teaching in a playful way. One of the distinguishing features of this type of game is that it is geared towards training and teaching; that is, its focus is not on entertainment, even though the fun engendered is usually a part of the user experience. In addition to this, serious games simulate real-life situations in such a way as to create a safe environment for users to experiment with different solutions and learn by trial and error or cause and effect. The main benefits of serious games are as follows [31]: (i) to derive pleasure from learning; (ii) to create an environment where the students construct their knowledge in a dynamic way; (iii) to formulate concepts that are difficult to understand, in a playful way; (iv) to enable students to make decisions and then assess them; (v) to foster socialization among the students; (vi) to allow the teacher to diagnose learning difficulties in what has been taught.

Dale’s Cone of Learning [32] (see Figure 1) has been used as a reference point in planning instructional strategies in higher education [33]. This same study points out that simulating the real experience can improve the understanding of what is being taught more effectively than learning by just reading or listening, that is, passively. Dale states that people remember 90% of what they have learned through simulation. The study by Aydan et al. [34] corroborates this result by suggesting that there is a significant difference between students who have learned ISO 12207 by simulation and those who have learned it only by the traditional means, i.e., by reading sections of texts. This prompted the authors of this study to say: “we recommend the use of serious games that seem to be superior to a traditional paper-based approach.”

3. Materials and Methods

This section outlines the following: the objectives and research questions, the method and the search strategy used to mine the papers/articles relevant to this study, the procedure for selecting and classifying a primary study, and, finally, the method of conducting data extraction.

3.1. Goals and Research Questions

This systematic review of the literature (SRL) seeks to find different approaches for teaching the process of measurement and teaching processes related to SPI by using gamification systems and serious games. By determining these approaches, this SRL will highlight the dynamics, mechanics, and game components and show how they can assist in the development of an educational tool for the teaching of software measurement. A set of research questions was prepared to meet the planned objectives. Owing to the complexity of this SRL, the questions were divided into two groups: general questions and specific questions.

3.1.1. General Questions

General questions are pertinent to both areas of this research, namely software process improvement and software measurement. The general questions are listed below:
(i) GQ1. In what contexts (i.e., academic or professional) did the gamification or serious game projects take place?
(ii) GQ2. What limitations have been reported in the use of gamification or serious games for teaching?
(iii) GQ3. What research methods were employed in the validation of the gamification or serious game projects?
(iv) GQ4. What game elements were included in the gamification or serious game projects?
(v) GQ5. What game mechanics were used in the gamification or serious game projects?
(vi) GQ6. What game dynamics were involved in gamification or serious game projects?
(vii) GQ7. What genres were included in the gamification or serious game projects?
(viii) GQ8. How does the effectiveness of learning through gamification or serious games compare with what is achieved by traditional learning?

3.1.2. Specific Questions

Specific questions, as the name implies, are concerned with issues that apply individually to each of the topics of this research. The specific question on software process improvement (SPIQ) is as follows:
(i) SPIQ1. In what processes (measurement and requirements collection, among others) were the gamification system or serious games applied in the area of SPI?

The specific questions about software measurement (MEAQ) are as follows:
(i) MEAQ1. Was the system employed based on a model or standard or paradigm? If so, which?
(ii) MEAQ2. What metrics were covered by the gamification or serious game projects?
(iii) MEAQ3. Which measurement activities (collect, store, analyse, and report) were covered by the systems?
(iv) MEAQ4. How can educators or the industry benefit from teaching or applying software measurement programs through gamification or serious game projects?

3.2. Method

This review lasted for 34 months, starting in February 2017 and continuing until December 2019, and was overseen by four researchers (one doctoral student, two undergraduates, and one supervisor), all of whom were in the area of computer science and carried out the activities of this systematic review of the literature. There were two searches in the selected databases: the first was in early 2018, and the second was conducted in December 2019. This SRL was based on the Kitchenham guidelines [35], and the method used is listed as follows:
(i) Step 1. Check and validate the search strings to ascertain their accuracy in returning the primary papers/articles, and thus be able to create multiple instances of these strings adapted for each database,
(ii) Step 2. Search for possible primary papers/articles in the science citation indexes; these were available from the domain of the Federal University of Pará, which allowed free access to the papers/articles from the selected scientific databases,
(iii) Step 3. Read the titles and abstracts of the papers/articles returned by the search string, to create a list of the possible primary papers/articles,
(iv) Step 4. (a) Read the titles, abstracts, introductions, and conclusions of the papers/articles in the list of possible primary studies; (b) apply the inclusion and exclusion criteria to reject false positives; and (c) create a list of the primary papers/articles included and a list of those excluded,
(v) Step 5. Compare and combine the lists of the different researchers; if there is disagreement among the researchers over the inclusion or exclusion of a paper/article, the paper/article should be included,
(vi) Step 6. Read the papers/articles in the final list in full and apply the quality criteria to grade the remaining ones,
(vii) Step 7. Extract the data of all the papers/articles found in the list compiled previously.
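The combination rule in Step 5 can be read as a set union over the researchers' individual inclusion lists: a paper/article is kept if at least one researcher included it. The following is a minimal sketch of that reading only. The function name `merge_selections`, the researcher labels, and the paper identifiers are hypothetical and are not taken from the review protocol.

```python
def merge_selections(selections):
    """Combine per-researcher inclusion lists as described in Step 5.

    `selections` maps a researcher label to the set of paper identifiers
    that researcher included. Both labels and identifiers are hypothetical.
    Returns (included, disputed):
      included - papers kept after merging (union: disagreement means inclusion)
      disputed - papers on which the researchers did not all agree
    """
    lists = list(selections.values())
    if not lists:
        return set(), set()
    included = set().union(*lists)          # kept if anyone included it
    unanimous = set.intersection(*lists)    # included by every researcher
    disputed = included - unanimous
    return included, disputed


if __name__ == "__main__":
    example = {
        "researcher_1": {"A", "B"},
        "researcher_2": {"B", "C"},
        "researcher_3": {"B"},
    }
    included, disputed = merge_selections(example)
    print(sorted(included))  # all papers carried forward to Step 6
    print(sorted(disputed))  # papers included despite disagreement
```

Tracking the disputed set separately is a design choice of this sketch; it makes it easy to see how many papers reached Step 6 only because of the "include on disagreement" rule.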

In addition, all the documents and procedures were validated in meetings with the supervisor of the SPIDER Project (Software Process Improvement: Development, and Research) [36], Professor Sandro Oliveira. He has practical experience of implementing the measurement process as a consultant to several Brazilian companies and is a credentialed evaluator, consultant/implementer, and official instructor of process improvement and software product models, such as CMMI, MPS.BR, Certics, Medepros, and QPS. For further details, Figure 2 provides an overview of all the phases followed in this work.

3.3. Search Strategy

Two main research questions were raised, one focused on software measurement and the other on software process improvement. Initially, the authors developed only one research question, on software measurement. However, it was found that very few studies with an emphasis on software measurement were returned. Consequently, a second, broader research question was raised on software process improvement, bearing in mind that every software process improvement program uses the software measurement process as a framework. The authors found that many studies that made use of software measurement without emphasizing it were returned because of the research question on software process improvement. From this point, this research question was included in the systematic review of the literature, and both main questions used the PICOC guidelines as a framework that helped to establish the search strings for each main research question.

The research questions that were raised in Section 3.1 were derived from the two main questions, which were arranged in accordance with the framework for Population, Intervention, Context, Outcomes, and Comparison (PICOC) recommended by Kitchenham [35], with the exception of the comparison criterion, which was not used because the search string encompasses the papers/articles referenced in the other systematic review of the literature found by this study. In addition, the rest of the components of the structure were also used to define the following two main questions, namely,
(1) What is the state-of-the-art of research on the application or teaching of software process improvement (SPI) programs through the use of serious games or gamification?
(a) Population (P). Software Organizations and Teaching Institutions,
(b) Intervention (I). Approach used to apply or teach the software improvement process,
(c) Context (C). This article is aimed at making a comparison between papers/articles which are aimed at both the industrial and the academic sectors,
(d) Outcomes (O). To capture the dynamics, mechanics, and game components present in the systems discussed and the efficiency in teaching or practical application of the software improvement process when based on gamification or serious games,
(e) Comparison (C). This does not apply to this study.
(2) What is the state-of-the-art of research on the application or teaching of software measurement by making use of serious games or gamification?
(a) Population (P). Software Organizations and Teaching Institutions,
(b) Intervention (I). Approach used to apply or teach the measurement process,
(c) Context (C). This article is aimed at making a comparison between papers/articles which are aimed at both the industrial and the academic sectors,
(d) Outcomes (O). To capture the dynamics, mechanics, and game components present in the systems discussed and the efficiency in teaching or practical application of the software measurement process when based on gamification or serious games,
(e) Comparison (C). This does not apply to this study.

On the basis of the research questions, keywords were obtained in accordance with the framework for Population, Intervention, Context, and Outcomes for the subsequent formulation of the search string. Here is the list of keywords defined for the first main research question:
(i) Population (P). Project, Development, Organization, Enterprise, Company, Industry, Institute, Research Group, and Technology Center,
(ii) Intervention (I). Process, Improvement, and SPI,
(iii) Context (C). Learning, Teaching, Education, Training, Practice, and Application,
(iv) Outcomes (O). Gamification, Game, Serious Game, Funware, Game Elements, Game Mechanics, Game Component, Game factor, and Game appearance.

The following keywords were defined for the second main research question:
(i) Population (P). Project, Development, Organization, Enterprise, Company, Industry, Institute, Research Group, and Technology Center,
(ii) Intervention (I). Process, Measuring, Software, Measurement, Metrics, and Metrology,
(iii) Context (C). Learning, Teaching, Education, Training, Practice, and Application,
(iv) Outcomes (O). Gamification, Game, Serious Game, Funware, Game Elements, Game Mechanics, Game Component, Game factor, and Game appearance.

Later, the search string was assembled on the basis of the keywords using the AND and OR connectors, as follows: the AND connector was used to integrate the Population, Intervention, Context, and Outcomes, and the OR connector was used between keywords in the same category. After the search string was designed, it underwent a validation process and was applied to the search databases that have the following features: availability of papers/articles in full through the UFPA web domain or through the Google, Google Scholar, or Portal CAPES search engines; availability of papers/articles in English or Portuguese; and being academic libraries that have search engines. Thus, the following databases that comply with these criteria were established: IEEE Xplore, ACM DL, Science Direct, Scopus, ISI Web of Knowledge (Web of Science), and Ei Compendex. Moreover, each database was checked to see whether applying the search strings returned the control papers/articles. Previously, the researchers had collected the following control papers/articles from the selected search databases [14, 18, 37–43]. Each paper/article was chosen in terms of its relevance to this study. By repeating the validation process of the string, it was possible to arrive at more precise strings for the subject of this research. The following are the final strings:
(1) title-abstr-key(Software AND (Project OR Development OR Organization OR Enterprise OR Academy OR Industry OR Learning OR Teaching OR Education OR Training OR Simulation) AND (Process OR Improvement) AND (Gam OR Funware OR Ludification))
(2) title-abstr-key(Software AND (Project OR Development OR Organization OR Enterprise OR Academy OR Industry OR Learning OR Teaching OR Education OR Training OR Simulation) AND (Measu OR Metr) AND (Gam OR Funware OR Ludification))
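The assembly rule described above (OR between keywords of the same PICOC category, AND between categories) can be sketched as follows. This is an illustration of the rule only and does not reproduce the final strings, which were refined through validation. The function name `build_search_string` is hypothetical, quoting multi-word keywords is an assumption of this sketch, and the keyword subsets are taken from the lists for the first main research question.

```python
def build_search_string(groups):
    """Join keywords with OR inside a category and AND between categories.

    `groups` maps a category label (e.g. "Population") to its keywords.
    Multi-word keywords are wrapped in double quotes; this quoting is an
    assumption of the sketch, not something stated in the review.
    """
    clauses = []
    for keywords in groups.values():
        terms = [f'"{k}"' if " " in k else k for k in keywords]
        clauses.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(clauses)


if __name__ == "__main__":
    # Shortened keyword subsets, for illustration only.
    spi_keywords = {
        "Population": ["Project", "Organization", "Industry"],
        "Intervention": ["Process", "Improvement", "SPI"],
        "Context": ["Learning", "Teaching", "Training"],
        "Outcomes": ["Gamification", "Serious Game", "Funware"],
    }
    print(build_search_string(spi_keywords))
```

Each database has its own field codes and operator syntax, so a string built this way would still need to be adapted per database, as Step 1 of the method indicates.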

3.4. Study Selection

The scope of the research complies with the restrictions defined in Table 1, to ensure its viability.

Papers/articles of the following types were also included: experimental studies, experience reports, systematic reviews of the literature, technical reports, bibliographic surveys, systematic mapping studies, and case studies. In addition, papers/articles written in Portuguese and English were considered: the former because it is important to take account of national research, given the relevance of the MPS.BR Program to the study, and the latter to broaden the scope of the research, since English is the standard language of most journals and international conferences. Furthermore, the collected papers/articles were all written in the period 2000-2019. The lower bound was chosen to coincide with the emergence of the term gamification, and the upper bound was the most recent complete year while this research was being conducted.

Additionally, the inclusion and exclusion criteria were employed to analyse the significance of a scientific paper/article while carrying out the systematic review of the literature, and this involved compiling a list of the primary papers/articles and another with the papers/articles that were excluded. The researchers involved in this SRL defined the criteria used in this research, and this is illustrated in Table 2, which shows the inclusion criteria and Table 3, which outlines the exclusion criteria defined for this SRL.

The evaluation of the quality of a paper/article allows works that are closely aligned to the objectives of the projected SRL to make a greater contribution to the research questions. Thus, since the evaluation of the quality of a scientific paper/article is based on an assessment of its significance and content, this evaluative procedure cannot be used as one of the inclusion or exclusion criteria applied to the scientific output during the selection, since it reduces research bias and ensures the internal-external validation [35]. The following are the criteria for assessing the quality of the primary studies, adapted from [44]:
(1) Introduction/planning
(a) Are the objectives or questions of the study clearly defined? And is the problem addressed in the research clearly described (including the justification for conducting the study)?
(b) Is the type of study clearly defined?
(2) Development
(a) Is there a clear description of the context in which the research was conducted?
(b) Is the work suitably referenced (does it refer to related or similar works and is it based on models and theories in the literature)?
(3) Conclusion
(a) Does the study support its results in a clear and unambiguous way?
(b) Have the objectives been achieved and the research questions properly addressed?
(4) Criteria for the research question
(a) Does the study adopt a primary or secondary approach or make use of a tool for teaching or applying software improvement programs or measurement systems through the use of gamification or serious games?
(5) Specific criteria for experimental studies
(a) Is there a method or set of methods described in the study?
(6) Specific criteria for theoretical studies
(a) Is there an unbiased system for choosing studies?
(7) Specific criteria for systematic reviews of the literature
(a) Is there a strict protocol that has been described and followed?
(8) Specific criterion for industrial experience reporting
(a) Is there a description of the organization(s)/company where the study was conducted?

It should be noted that criteria (1) to (4) are generic, that is, they apply to all the primary studies evaluated, whereas criteria (5) to (8) are specific and correspond to the respective study types mentioned.

The studies in the list that were selected on the basis of the application of the inclusion and exclusion criteria were read in their entirety. When applying the quality criteria, the approach recommended by Costa [44] was adopted, in which the different levels of the 5-point Likert scale were used to represent the study’s compliance with the quality criteria. These levels are listed below.
(a) Totally Agree (4). This should apply if the work fully meets the requirements of the criteria of the question,
(b) Partially Agree (3). This applies if the work partially meets the criteria of the question,
(c) Neutral (2). This applies if it is not clear whether or not the question has been answered,
(d) Partially Disagree (1). This must apply if the criteria contained in the question are not met by the evaluated work,
(e) Totally Disagree (0). This should apply if there is nothing in the work that meets the criteria of the question.

An evaluation scale is defined for each quality criterion previously established. Table 4 outlines the scale used for each quality criterion.

The two strings were applied to the search engines in the science citation indexes and returned a total of 19050 papers/articles. The Scopus database had 30.4%, the largest share of the papers/articles returned. Science Direct had 4.8%, IEEE 14.4%, ACM 11.5%, Ei Compendex 22.5%, and Web of Science 16.1% of the papers/articles returned. Table 5 shows the number of studies returned and the remaining work after each of the criteria was applied; there was no occurrence of exclusion criteria 2, 3, and 4. In Table 5, the sum of the numbers in each row gives the total number of studies returned by each scientific indexer.

After that, a score was assigned for each paper/article evaluated that was based on the presence of each criterion in the Likert scale and the calculation was made by using the simple Rule of Three, so the papers/articles can be placed in one of the five quality levels defined by Beecham et al. [37] (as shown in Table 6).

An electronic spreadsheet was used to store the data of the papers/articles to answer the bibliometric questions and also calculate the grade (excellent, very good, good, fair, and poor) for the evaluated paper/article. The grade was calculated based on the attributes evaluated in the quality criteria and the Likert-5 scale, which represented the adherence of these attributes to the quality criteria. Table 6 shows the results of the quality evaluation.
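The grade calculation described above can be sketched as a proportion of the maximum achievable Likert score. The sketch assumes that each study receives one score from 0 to 4 per applicable quality question, and that the Rule of Three scales the sum against the maximum possible sum. The band thresholds of Table 6 are not reproduced in the text, so `HYPOTHETICAL_BANDS` below contains placeholder values only; the function names are likewise hypothetical.

```python
LIKERT_MAX = 4  # "Totally Agree" on the 0-4 scale used in this review

# Placeholder lower bounds (in percent) for each grade. These are NOT the
# thresholds of Table 6, which the text does not reproduce.
HYPOTHETICAL_BANDS = [
    (86.0, "excellent"),
    (66.0, "very good"),
    (46.0, "good"),
    (26.0, "fair"),
    (0.0, "poor"),
]


def quality_percentage(scores):
    """Scale the summed Likert scores to a percentage of the maximum sum."""
    if not scores:
        raise ValueError("at least one criterion score is required")
    if any(s < 0 or s > LIKERT_MAX for s in scores):
        raise ValueError("scores must lie between 0 and 4")
    return 100.0 * sum(scores) / (LIKERT_MAX * len(scores))


def grade(percentage, bands=HYPOTHETICAL_BANDS):
    """Return the first grade whose lower bound the percentage reaches."""
    for lower_bound, label in bands:
        if percentage >= lower_bound:
            return label
    raise ValueError("percentage is below every band")


if __name__ == "__main__":
    # Hypothetical study: seven generic questions plus one specific question.
    scores = [4, 4, 3, 3, 4, 2, 4, 3]
    pct = quality_percentage(scores)
    print(pct, grade(pct))
```

Passing the bands as a parameter keeps the scaling step separate from the thresholds, so the actual values of Table 6 could be substituted without changing the calculation.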

The quality criteria were not exclusive, that is, there was no cut-off score for the evaluated papers/articles, because these papers/articles had already passed through the exclusion and inclusion criteria. The quality criteria only served to categorize the papers/articles and not to exclude them. They did not affect the number of accepted papers/articles, given that their role was not to exclude them but to classify them in five different levels, namely, poor, fair, good, very good, and excellent. Thus, all papers/articles were considered to be important for extracting the data needed to answer the research questions.

The three researchers applied the quality criteria indicated in Table 4 to the 137 primary papers/articles and, whenever there was a conflict between the evaluations made by the different researchers, it was resolved through discussions between them, supervised by the advisor of this study. Thus, a single document was generated containing all the primary papers/articles classified by the quality criteria. Table 6 summarizes the results achieved by the researchers regarding the quality of the papers/articles. The percentages in this table were obtained from a complete reading of all papers/articles and the addition of the scores given for each criterion in Table 4. From the sum of these scores, each paper/article was placed in a quality range, according to Table 6, and a percentage was calculated in relation to the total analysed.

As can be seen, few studies are in the poor range and 20 are in the fair range, while 23 studies (16.78%) are in the good range, 46 studies (33.57%) are in the very good range, and 43 studies (31.38%) are in the excellent range. Therefore, most of the analysed papers/articles are of above-average quality according to the criteria used. Thus, the quality evaluation criteria were used only to qualify the paper/article and not as an exclusion criterion.

3.5. Classification Study and Data Extraction

This stage involves arranging the data extracted for the display of the charts that provide a general overview and form the basis for future analysis. In addition to the analytical charts of the research questions, the following charts were also generated in response to bibliometric questions: (a) the number of papers/articles returned by the search database (see Figure 3), (b) the number of studies returned per year, (c) the number of studies returned per country, (d) the 5 authors with the highest number of publications, (e) the number of studies per type of study, (f) the number of experimental studies by type, (g) the number of studies returned by publication, (h) the number of studies returned by type of project, (i) the frequency of game elements, and (j) the frequency of game mechanics.

4. Results Achieved

In this section, the results of the systematic review of the literature will be examined. Section 4.1 provides an overview of the selected primary studies and Sections 4.2 to 4.14 describe the results of the research questions.

4.1. Overview

The selection of the studies resulted in a total of 137 primary studies published between 2000 and 2019 (see Table 7). Figure 4 plots a histogram displaying the frequency of primary studies per year, with the different colours representing the related papers/articles about serious games and gamification, and it shows a growing pattern until the year 2016 in the use of games and gamification for teaching.

Figure 4 shows a decrease in the use of games and gamification for teaching (2016-2019), but the authors cannot confirm with precision the reason for this event. It is possible to assume that the field has already reached a certain level of maturity and consequently had a reduction in the novelty factor due to already having a range of studies exploring the topic.

With regard to the distribution of papers/articles by type of publication, it was found that most of the primary studies (i.e., 73%) were published in conferences, 5% in workshops, and only 22% in journals, as shown by the chart in Figure 5. The authors consider conferences and workshops as two different events, because some papers point to workshops as a publication venue, for instance, the paper entitled “HALO (Highly Addictive, Socially Optimized) Software Engineering” that came with its DOI linked to the “Proceedings of the 1st International Workshop on Games and Software Engineering.”

The systematic review of the literature is a method used to highlight trends. In spite of this, this work does not identify a reason that explains why conferences and workshops are the main means of publishing the studies on gamification and serious games analysed in this SRL. The authors believe that conferences and workshops have three main advantages over other venues, namely, (a) speed of publication: it usually takes only a few months to have a work published in a conference or workshop, unlike journals, which can take from quarters to semesters to publish the same article; (b) format: conference full papers are generally papers of 8 or more pages that report the results of a piece of research, whereas workshops generally feature training, dynamics, or short papers of up to 4 pages that describe a work in progress; (c) networking, which makes it possible to exchange knowledge with other researchers in the same area during the conference or workshop. These characteristics may explain why conferences and workshops are the main publication venues.

In addition, Table 8 shows the five conferences, journals, and workshops that have had more primary studies published.

The main contribution of Table 8 is to highlight the main publication venues for researchers who are working on the theme proposed by this SRL. In addition, the conferences, journals, and workshops allow inferences to be made about the maturity of the researched field, that is, the higher the quality of the publication venue, the greater the maturity of the study topic.

In addition, the five universities, five authors, and five countries with the most published papers/articles among the primary studies were also analysed. The same counting method was adopted for authors, universities, and countries: a spreadsheet catalogued all the metadata of the selected studies, and every author, university, and country present in a study was counted. For instance, in a study that involved two or more countries, all the countries involved were counted. The same applies to authors and universities. This information can be seen in Table 9.
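The counting rule can be sketched as follows. This is only an illustration under stated assumptions: the metadata rows, identifiers, and names are hypothetical placeholders (the actual spreadsheet is not reproduced here), and `count_field` is a helper name introduced for this example.

```python
from collections import Counter

# HYPOTHETICAL metadata rows standing in for the spreadsheet of selected studies.
studies = [
    {"id": "S1", "authors": ["Author A", "Author B"], "countries": ["Country X", "Country Y"]},
    {"id": "S2", "authors": ["Author A"], "countries": ["Country X"]},
    {"id": "S3", "authors": ["Author C"], "countries": ["Country Y", "Country X"]},
]


def count_field(rows, field):
    """Count every value listed in `field`, so a study with several
    countries (or authors) contributes one count to each of them."""
    counts = Counter()
    for row in rows:
        counts.update(row[field])
    return counts


country_counts = count_field(studies, "countries")
author_counts = count_field(studies, "authors")

# The five most frequent entries, as in a "top five" ranking.
print(country_counts.most_common(5))
print(author_counts.most_common(5))
```

A consequence of this rule is that the per-country or per-author totals can add up to more than the number of studies, since multi-country and multi-author studies are counted once per entry.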

The bibliometric data presented in Table 9 are important because they show a worldwide panorama of the research on the theme established by the SRL. Knowing the main authors, universities, and countries behind the advancement of the research field allows researchers to find possible mentoring or cooperation in this research field. In other words, it enables the creation of a network for the exchange of knowledge and the improvement of ideas related to the researched field.

4.2. GQ1: In Which Contexts (i.e., Academic or Professional) Were the Gamification or Serious Game Projects Applied?

This question was addressed within two domains: academic and professional. In the academic domain, students are the target audience. In contrast, projects that took place in a professional context were applied in software organizations. When the research questions are answered, a code (the letter P followed by a number) will be used to represent each paper/article listed in Table 7. A total of 84 primary studies referred to the academic context. Projects are generally aimed at teaching some process or subject related to software process improvement programs, for instance, P3, P5, P9, P10, P11, P12, P13, P14, P16, P17, P18, P21, P23, P24, P25, P30, P32, P33, P34, P36, P37, and P39.

In the case of the professional context, the approaches fluctuated between (a) schemes that were aimed at encouraging patterns of behaviour within the organization, (b) adopting processes (P1, P2, P4, P6, P7, and P8), (c) making measurements in the development process and for their teams (P19 and P22), and (d) acculturation (P29, P31, and P35). Only paper P126 was applied within a combined academic and professional context, and this had a gamified tool to assist in the code review. Only P27 did not state what kind of environment it was in. Table 10 shows the percentage of papers/articles for each context and their code.

The SRL is a secondary research method that synthesizes information and data from studies aligned with its theme and collected from the selected databases. It is well known that the industrial or professional environment does not report all the practices and approaches it uses to gain a competitive advantage. Consequently, it is plausible to find more studies aimed at the academic environment than studies with industrial reports.

4.3. GQ2: What Are the Limitations Reported in the Use of Gamification or Serious Games for Teaching?

About 22% of the total number of papers/articles related to gamification reported some limitations. Among these, many referred to the need for some improvements to be made in the methodology employed in the work, but these were disregarded in this analysis. Others pointed out limitations that were only found in the gamified project, and these are listed below.

The primary study P1 included an interactive questionnaire as a personality assessment tool, which was specifically designed for software engineers. It was stated that there is a need to improve the aesthetics of the project, both in the graphics and in the sounds used, because these were found to be unattractive and there was a risk that the participants would lose interest in the application. In addition, the same work stated that the participants were bored, as they considered the project repetitive and full of interruptions. A different kind of limitation was reported in the P7 study, which presents the Gamiware tool, a gamification platform aimed at increasing motivation in software projects. Its main limitation was the lack of integration with a consolidated framework for project management, such as Jira or Redmine.

The study P24 describes a learning process that features a quiz-making and quiz-solving game, where students work together in teams to create multiple-choice quiz questions which challenge the knowledge of the other students. However, some of the students complained that there was a lack of balance in the project which discouraged them, because it was very difficult to obtain points. The purpose of study P39 was to allow the students to compete, by using gamification, and play a role, join a team, and choose a project. However, it was found that not all students liked the competitive atmosphere surrounding gamification, and they argued that this great pressure led to a sense of demotivation. In the case of P53, which seeks to encourage code refactoring through the use of gamification, it was found that some students unnecessarily refactored the code just to get more points in the game. As a result, some students admitted that they were cheating and did not deserve their score, and hence, this mistaken assignment of points ended up discouraging those students who had earned the points honestly. In addition, P53 showed that the leaderboard, which was always available on the screen during the sessions of the game, ended up discouraging the students, because it caused some of them to abandon the gamification when they found themselves at the bottom of the leaderboard.

P62 used gamification to check the punctuality of those who attend group meetings and showed that a lack of penalties in the project made it easier for the participants to become disengaged. In study P82, it was argued that the aesthetic features were not shown in a user-friendly way, since the participants were burdened with a lot of textual information arising from the use of the quiz mechanics, which proved to be unattractive to some students. The study P89 suggested that the lack of feedback in the small cycles increased the anxiety of some students and thus hampered their focus, because the feedback from the leaderboard was only provided once, at the end of a week.

Another factor that was regarded as a limitation was the “replay value,” as was the case with P106. The purpose of this study is to create an educational board game that shows the basic concepts of the essence of software engineering in an enjoyable way. However, even though the participants found the game experience fun for a few rounds, it gradually became less interesting, as the replay value diminished. In other cases, like P108, where a gamified tool was used to track the code review, many badges were awarded without any motivational value, and as the leaderboard was easy to handle, only small contributions were made, without much relevance. Thus, this system did not take into account the complexity of the task, but only the number of tasks performed by the same user.

Approximately 26% of all the papers/articles dealing with serious games described their limitations. The following limitations are related to the game itself and not the methodology of the study.

In studies P16, P37, P85, and P119, it was reported that there was no statistical difference between the control group and the experimental group, which suggests there is an equivalence between teaching methods. In addition, most of the participants in the P16 study did not enjoy playing the game and some of them said they would prefer a written exercise to the game itself. This same game used quiz mechanics and had unattractive graphics and no sound effects. Apparently, the aesthetic features and mechanics of the game were not ideally suited to meeting the requirements of the target audience.

In paper P32, it was found that the low learning curve discouraged students from engaging with it. Study P43, in turn, pointed out that the tool that was used (ProDec) should only be regarded as a support tool because it just helps students to apply their knowledge. In light of this, the students had to obtain knowledge by other methods, such as traditional classes. Similarly, study P100 maintained that its PlayScrum tool should not be used alone, but in conjunction with traditional teaching. P110, which introduced Legacy, a board game for teaching project management, reported a serious limitation: the high difficulty of the challenge, which made it hard to engage students who had little or no practical experience.

In the case of papers P49, P50, and P51, the drawback was the degree of realism of the game. Some students complained that the game did not reflect reality and that it had limited content. They were aware that the game was supposed to deal with the teaching of life cycles but in fact only included the Waterfall model and left other life cycle models aside. Moreover, several students thought that the “requirements and design” phases were boring. Another serious drawback is the rejection or scepticism of people who are not familiar with the use of games in the workplace, as mentioned in paper P67, which is concerned with a card game to meet requirements in an organizational environment. Some limitations were also noted in the teaching environment. In work P83, which taught software process modelling through a serious game, it was noticed that the game imposed constraints on the creativity of the students in the answers they put forward and also induced the students only to memorize the answers of the game and remember its predefined models.

4.4. GQ3: What Research Methods Were Employed in the Validation of Gamification or Serious Game Projects?

Two methods were employed for the validation of the projects—validation with users and validation involving experts. Not all the papers/articles validated their projects, and about 41% of the primary studies were purely theoretical and set out schemes without implementing them. However, about 55% of the remaining papers/articles were validated through the application of questionnaires, and 12.5% were validated through experiments involving two groups—one for control and another in the experiment. This meant that the results could be compared and it could be determined if the data had any statistical significance. In addition, the following methods were employed: user interviews and mixed method research (i.e., research that relies on more than one of the previous methods). Each of them had a 2.5% share of all the primary studies. Furthermore, a usability test had a 3.75% share of the total. Finally, the user observation testing technique (used in ethnography) was an evaluation method that comprised 1.25% of the total number of analysed papers/articles.

Two approaches were adopted to ensure a high-quality assessment. The first entailed conducting an interview with experts, which was used in 16.25% of the papers/articles; on this basis, it was possible to conduct a survey or make a presentation of the projects to allow the experts to give their opinions. The second strategy was the application of the Delphi method for the validation of 2.5% of the projects. In this technique, a questionnaire is sent to a team of experts who are not aware of the views of the other experts, and the experts’ opinions are analysed in consecutive rounds until a consensus is reached. Figure 6 shows each of the validation methods found in the papers/articles of this review, and in Table 11, each study is classified by its validation method.

As can be seen in Figure 6, most of the studies analysed used questionnaires or interviews as a validation method. This indicates that the vast majority of studies have not been concerned with verifying the effectiveness of their proposals against the traditional teaching method. In other words, only 12.5% of the studies that reported their validation method used a control group and an experimental group to compare their results and verify the effectiveness of their proposals. This can be related to the difficulty and time needed to perform an experiment. Consequently, few of the 137 primary studies will be used to support the development of a teaching tool on software measurement, which is future work for these authors.

4.5. GQ4: What Game Elements Were Used in the Gamification or Serious Game Projects?

The elements of a game are the components used in its interface and are intrinsically related to dynamics and mechanics. A mechanic makes use of the implementation of one or more game elements, and a dynamic is the composition of one or more game mechanics. Since gaming components form the atomic structure of a gamified project, the central task of gamification is to combine them in order to create motivating and engaging mechanics and dynamics. In other words, when designing gamified proposals, the concern must be with the intelligent concatenation of the elements, mechanics, and dynamics of the games. Otherwise, these proposals are usually doomed to fail. This study was based on the book by [15] to select a set of game elements that would be investigated in the primary studies, and these game elements can be listed and explained as follows:
(a) Avatar. A visual representation of the player in the game world.
(b) Virtual Goods. In-game items that players can collect and use in a virtual rather than real fashion, but which still generate endogenous value to the players. Players can pay for items either with game currency or with real money.
(c) Boss. A generally difficult challenge at the end of a level that has to be overcome before an advance can be made in the game.
(d) Collections. Formed of items accumulated within the game. Badges and medals often form a part of the collections.
(e) Combat. A dispute that occurs which allows the player to defeat opponents in a confrontation.
(f) Achievements. A reward that the player receives for carrying out a set of specific tasks.
(g) Unlockable Content. The ability to unblock and access certain content in the game if a number of prerequisites are fulfilled. The player needs to do something specific to be able to unlock the content.
(h) Badges. Visual emblems of in-game achievements.
(i) Social Graph. A diagram that makes it possible to see friends who are also in the game and interact with them. A social graph makes the game an extension of one’s social networking experience.
(j) Mission. Similar to “achievements.” It is an assignment in which the player must carry out some activities that are specifically defined within the framework of the game.
(k) Levels. Numerical representation of the progress made by the player. The player’s level rises as the player becomes better at the game.
(l) Points. In-game actions that score points. They are often linked to levels.
(m) Gifts. The possibility of providing items or virtual currency to other players.
(n) Leaderboard. A means of listing players who have the highest scores in a game.
(o) Teams. Possibility of playing with other people who have the same goal.

Of all the game elements that were searched for, only the transaction and boss had no correspondence, while the collections and gifts had only one occurrence among the primary studies. As expected, the most widely used game elements in the primary studies were the PBL triad: points with 59.85%, badges with 22.63%, and leaderboards with 25.55%. The other elements of the games, together with their frequencies of use and respective primary studies, can be seen in Table 12. However, many primary studies do not appear in this table because they are often purely theoretical and either do not include game elements or do not report them in their work.

4.6. GQ5: What Game Mechanics Were Used in the Gamification or Serious Game Projects?

Mechanics are the second layer discussed by [15], and these are similar to the rules of the game and how the player should interact with it. That is, they steer the players’ activities in the required direction by setting out what a player can or cannot do during the game. The mechanics also have an intrinsic relation to the genre of the game; for example, RPG or board games usually employ the mechanics of turns. In addition, a dynamic can be implemented by the use of one or more mechanics; for example, the dynamics of progression can be implemented by the use of the feedback and rewards mechanics. The mechanics defined by [15] are listed and explained below.
(a) Resource Acquisition. The player can collect items that might help him achieve his goals.
(b) Feedback. The evaluation allows players to see how they are progressing in the game.
(c) Chance. The results of the player’s activities are random and can thus create a sense of surprise and uncertainty.
(d) Cooperation and Competition. This creates a sense of triumph or disappointment in defeat.
(e) Challenges. The goals that the game defines for the player.
(f) Rewards. The benefits that the player can gain from achievements in the game.
(g) Transactions. The means of buying, selling, or exchanging something with other players in the game.
(h) Turns. Each player in the game has his/her own time and opportunity to play. Traditional games such as card games and board games often rely on turns to maintain a balance in the game, while many modern computer games take place in real time.
(i) Win State. The “state” that defines who is winning the game.

Table 13 lists the mechanics and the primary studies that implemented them. The process used to identify the mechanics involved a complete reading of the paper/article in search of the mechanics used by the author. Even if the author did not supply this information, a search was made to find evidence of the use of the mechanics listed by Werbach and Hunter [15]. There were no occurrences of the transaction mechanic in the primary papers/articles.

As can be seen in Table 13, the most used mechanics were feedback (25.55%) and cooperation and competition (29.20%). Feedback was used to engage the user with the game approach by presenting progress through different game elements, like points, leaderboards, levels, and social graphs, as a constant factor to encourage the user to keep playing. Cooperation and competition used similar game elements to create an environment of player versus player or team versus team. These two mechanics were the most common ones used in games.
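The layering described in Sections 4.5 and 4.6 (elements implement mechanics, and mechanics compose dynamics) can be sketched as a small data structure. This is an illustrative model only, not a structure taken from [15]: the class names are introduced for this example, the elements attached to feedback follow the paragraph above, and the elements attached to rewards are an assumption.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class Mechanic:
    """A mechanic is implemented by one or more game elements."""
    name: str
    elements: Tuple[str, ...]


@dataclass(frozen=True)
class Dynamic:
    """A dynamic is composed of one or more mechanics."""
    name: str
    mechanics: Tuple[Mechanic, ...]

    def elements(self) -> Set[str]:
        """All game elements reachable through the dynamic's mechanics."""
        return {
            element
            for mechanic in self.mechanics
            for element in mechanic.elements
        }


# Elements for feedback follow the text above.
feedback = Mechanic("feedback", ("points", "leaderboard", "levels", "social graph"))
# Elements for rewards are ASSUMED for illustration.
rewards = Mechanic("rewards", ("badges", "achievements"))

# The text gives progression as a dynamic built from feedback and rewards.
progression = Dynamic("progression", (feedback, rewards))

print(sorted(progression.elements()))
```

Modelling the layers this way makes the design concern explicit: a dynamic has no direct link to elements and only reaches them through the mechanics that are chosen to implement it.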

4.7. GQ6: What Game Dynamics Were Used in the Gamification or Serious Game Projects?

Dynamics are the highest level of abstraction of the set of game components outlined by [15]. They are usually related to the sensations that gamification seeks to arouse in users. The dynamics defined by [15] are listed and explained below.
(a) Emotions. Games can induce different types of emotions, especially a sense of fun (an emotional reinforcement that keeps people playing).
(b) Narrative. The structure that makes the game coherent. The narrative does not have to be explicit, like a story in a game. It can also be implicit, in so far as all the experience has a purpose in itself.
(c) Progression. The idea of giving players the feeling of advancing within the game.
(d) Relationships. Refers to the interaction between the players, whether it be between friends, companions, or adversaries.
(e) Restrictions. Refers to limiting the freedom of the players within the game.

The use of dynamics is not mandatory, but is important because of its effects on user engagement within gamification. As an example, in the primary study P3, which makes use of gamification for teaching risk management, it was recommended that restriction, narrative, and progression dynamics should be employed to encourage users to carry out the project in the best possible way. When the dynamics were implemented as a means of restricting the user’s time for decision-making, this helped create a sense of urgency. Moreover, the narrative provided the activities with a context and sense of order, while the progression was implemented with represented graphics that showed the advances made by the participants to ensure a sense of achievement or triumph.

The least frequently used dynamic was “Emotion,” which was only found in article P3 and papers P75 and P112. Its goal is stated in the primary study P3 as follows: “The game must cause any imaginable emotion. During the gamification course, it is expected to arouse emotions in the participants to encourage them to complete identification and analysis tasks, and enable students to earn points, complete all the levels and win the game.” The most widely used dynamic in the primary studies was “Relationship” (as can be seen in Table 14); it was generally implemented by means of cooperation and competition, using game elements such as a social graph, leaderboard, gifts, levels, and avatars.

A dynamic is only achieved through the application of mechanics and game elements. According to Table 14, the dynamic with the greatest occurrence was relationships, which has the cooperation and competition mechanics as one of its bases. In addition, the relationship dynamic is one of the most widely applied in both commercial and educational games, because most games explore multiplayer aspects. So, unsurprisingly, this was the dynamic most explored by the studies analysed in this SRL.

4.8. GQ7: What Genres Were Used in the Gamification or Serious Game Projects?

The game genre offers an established classification of entertainment games that provides a useful way of identifying characteristics that the games have in common. One such classification, which is well accepted by industry, is defined by Herz [45].

Herz distinguishes between the following game genres: action, adventure, fighting, puzzles, RPG (role-playing games), simulation, sport, strategy, cards, and board. In addition to the standard classification of Herz, we add two more categories, namely, “without genre” and “collaborative games,” because some primary studies were classified this way. Table 15 shows the relationship between genres and primary studies.

The most frequent category was “without genre,” which can be understood because gamification is not necessarily a game, and hence, it is not always possible to classify it in terms of a genre. The second most frequent genre was simulation, which allows researchers to create a safe environment for their students to learn by trial and error in a game that simulates the software process in real-life situations.

4.9. GQ8: How Does the Effectiveness of Learning through Gamification or Serious Games Compare with That of Traditional Learning?

By investigating the primary studies found in this SRL, it could be determined that some studies (P36, P39, P70, P71, and P72) showed significant gains with regard to student learning. Studies P36 and P39 show a gain with regard to the students’ awareness of the concepts defined for the studied subject. In article P36, a board game was provided to teach efficient communication during the requirements elicitation process, and the result was an improvement in the users’ perception of the importance of communication in this process. Before the use of the game, only 27% of the students thought this kind of process was important, but after the application of the game, this perception increased to 68%. Similarly, in the P39 study, 67% of the students felt that they learned the concepts of software engineering more easily by using the game-based strategy and 80% agreed that they had gained much more practical knowledge through this teaching method than through the traditional approach. In the traditional approach, teaching tends to be teacher-centered and the students only carry out tasks that are prescribed for them [46].

On the other hand, articles P70, P71, P72, and P135 showed a statistically significant gain over the traditional methods of teaching. P70 reported a Cohen’s d of 1.35, which is well above Cohen’s stipulated value of 0.8 for a large effect. Cohen’s d compares the means of two groups (such as a “control” and an “experimental” group) by subtracting the mean of the control group from that of the experimental group and dividing the result by the average standard deviation of the two groups. By Cohen’s conventional benchmarks, a value of about 0.2 is regarded as a small effect, about 0.5 as a medium effect, and 0.8 or more as a large effect. Thus, in this case, it can be stated that there was a large effect on student learning. Similarly, in P71, when the answers to the knowledge questionnaires were compared between group A and group B, there was an improvement, which suggests that the game actually promoted the acquisition of knowledge about project management. This was also the case in article P135, where statistical process control is taught by means of collaborative games. In that study, the students designed a measurement plan with the aid of Goal Question Metric (GQM) and discussed what might be the most feasible chart to represent it. The results compare the grades of the students in the control and experimental groups and indicate that the planned approach leads to more effective learning, since the average score obtained in the experimental group was 30% higher than that obtained by the control group. Finally, in the P72 study, a positive gain was found when the pre- and post-questionnaires were compared in a class of 42 students, who scored 39.05 before and 61.91 after being taught by an RPG-based method used in the classroom for the teaching of measurement and analysis through estimates of cost, time, and risk.
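The Cohen’s d calculation described above can be illustrated with a short sketch. It assumes that the standard deviation in the denominator is the pooled standard deviation of the two groups, which is one common reading of the description and not necessarily the exact variant used in P70. The function name and the sample scores are hypothetical; only the reference value of 0.8 and the reported value of 1.35 come from the text.

```python
from math import sqrt
from statistics import mean, variance

# Cohen's reference value for a large effect, as cited in the text.
LARGE_EFFECT = 0.8


def cohens_d(experimental, control):
    """Difference between the group means divided by the pooled
    standard deviation (ASSUMED variant of the denominator)."""
    n_exp, n_ctl = len(experimental), len(control)
    pooled_variance = (
        (n_exp - 1) * variance(experimental) + (n_ctl - 1) * variance(control)
    ) / (n_exp + n_ctl - 2)
    return (mean(experimental) - mean(control)) / sqrt(pooled_variance)


# HYPOTHETICAL scores, chosen only to keep the arithmetic easy to follow.
experimental_scores = [2, 4, 6]
control_scores = [1, 3, 5]

d = cohens_d(experimental_scores, control_scores)
print(d, d >= LARGE_EFFECT)

# The value reported for P70 exceeds the reference value for a large effect.
print(1.35 >= LARGE_EFFECT)
```

In the hypothetical data, both groups have a sample standard deviation of 2 and the means differ by 1, so the sketch yields d = 0.5, which is below the 0.8 reference value, whereas the 1.35 reported for P70 is above it.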

However, most studies, such as P16, P37, P44, P53, P83, P85, P104, P113, P115, P119, P121, P126, and P131, showed no statistically significant gain or signs that the game-based schemes were superior to the traditional teaching environment, although the schemes were often reported as having the same effect and being equally effective in teaching. For example, in article P16, where software measurement is taught through the GQM paradigm, no statistical differences were found between the control and experimental groups. As discussed in P44, the results obtained in these studies may show that games or gamification are not more efficient in teaching than traditional classes. At the same time, these studies suggest that these schemes are no less efficient than the traditional medium. That is, even if the superiority of games and gamification for teaching software process improvement and software measurement is questionable, this does not prevent them from being motivational, nor does it mean that they are unable to create a safe, simulated environment for applying practical knowledge. As article P16 makes clear, these schemes can be beneficial when used as a teaching support tool.

4.10. SPIQ1: In What Processes (Measurement and Requirements Collection, among Others) Were the Gamification System or Serious Games Applied in the Area of SPI?

Software process improvement carries out practical activities to optimize the processes in the organization and ensure that they meet the business objectives more effectively [20], i.e., to deliver software faster to the market, improve quality, and reduce waste. The goal is to make the organization more competitive by producing higher-quality software in less time and at a more affordable price.

There are international and national standards and models that are designed to optimize organizational processes and, hence, improve the quality of the software. These include the CMMI model, the MPS.BR model, and ISO 15504. The CMMI model, which was developed by the CMMI Institute (an organization belonging to ISACA), includes best practices for software and systems processes and is an internationally adopted standard. Thus, this model was chosen as a reference point for the processes addressed in this work because it allows a company to improve its processes based on maturity and capability levels, enabling an analysis of the performance of these processes.

During the data extraction, some of the processes listed by CMMI were identified in the primary studies. Table 16 lists these processes and the studies that addressed them. The primary studies include papers/articles that deal directly with the SPI area, as well as papers/articles that address some SPI process (even if indirectly). The largest share of the studies addressed project planning (24%), while the measurement and analysis process was found in only 18% of all the primary studies. It was noted that a considerable proportion of the studies addressed more than one process area.

The process area most relevant to this research was measurement and analysis, with 18.98% of the total papers/articles (see Table 16). It was the second most frequent process, behind only project planning with 24.09%, which can be considered the most important because it allows plans to be developed that describe what is needed to accomplish the work within the standards and constraints of the organization. One of the reasons measurement and analysis takes second place is that this process is used as a basis for other processes, such as project planning. Although the measurement process occurs frequently, the vast majority of primary papers/articles do not address the software measurement process as an exclusive focus; only 4% (P16, P29, P134, P135, P136, and P137) of all primary papers/articles focus exclusively on it, which is a very low percentage and shows that much of this field remains to be explored. In addition, as can be seen in Table 16, the primary papers/articles usually have more than one related software process.

4.11. MEAQ1: Was the Scheme Examined Based on a Model or Standard or Paradigm? If So, Which?

Concepts such as GQM (Goal-Question-Metric) and Practical Software Measurement (PSM) paradigms, norms such as ISO 15939 and ISO 25000, and models such as MPS.BR and CMMI-DEV are commonly found in the subject of software measurement. For this reason, this research question is aimed at identifying the paradigms, norms, and models used in primary studies.

Primary study P16, which presented a serious game for teaching software measurement based on the Goal-Question-Metric (GQM) paradigm, laid exclusive emphasis on the software measurement process and its different stages (collect, store, analyse, and report). Another article that clearly made use of GQM was P135, where it was employed to design a measurement plan with the purpose of teaching statistical process control by means of collaborative games. The only approach that used COSMIC function points was paper P137.
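
As an illustration of what a GQM-based measurement plan looks like in practice, the sketch below models the goal, question, and metric levels as a small tree. All class names, the goal, the questions, and the metrics are hypothetical and chosen only for illustration; they do not reproduce the plans used in P16 or P135.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str
    unit: str


@dataclass
class Question:
    text: str
    metrics: list = field(default_factory=list)


@dataclass
class Goal:
    purpose: str
    questions: list = field(default_factory=list)

    def metrics_to_collect(self):
        """Flatten the tree into the unique metric names the plan requires."""
        names = []
        for question in self.questions:
            for metric in question.metrics:
                if metric.name not in names:
                    names.append(metric.name)
        return names


# Hypothetical plan: one goal refined into two questions, each answered by metrics.
plan = Goal(
    purpose="Improve the schedule predictability of the project",
    questions=[
        Question(
            "Is the work progressing as planned?",
            [Metric("schedule performance index", "ratio")],
        ),
        Question(
            "How much work does the team deliver per sprint?",
            [
                Metric("sprint velocity", "story points"),
                Metric("schedule performance index", "ratio"),
            ],
        ),
    ],
)

print(plan.metrics_to_collect())
```

Deriving the list of metrics from the questions, rather than listing metrics first, mirrors the top-down direction of GQM: every metric that is collected can be traced back to a question and a goal.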

Other studies have addressed software measurement but have not limited themselves to investigating this process. In other words, these studies only mentioned this type of process and did not describe in detail the serious game or the planned gamification for it; examples are article P28 and papers P43 and P93. Study P28 provides an overview of a system to gamify a software process improvement program without implementing it; the system has only been validated by a team of gamification and SPI experts. As measurement is one of the basic requirements for any SPI, this article includes such a process, although it is not restricted to it and treats it as just another process within the same program. As the article provides only an overview of a gamification structure for an SPI program, it does not specify in detail what the measurement process is or how this process should be gamified. It only recommends the use of the GQM paradigm as one of the foundations of this process. Similarly, studies P43 and P93 make use of a serious game for teaching software process models, which involves the use of ISO 12207 and CMMI. As mentioned earlier, these works do not investigate the measurement process, but still refer to it because it is part of the software process models addressed.

Thus, it is clear that making use of serious games and gamification as a teaching or training tool for software measurement has still not been adequately explored, since out of 19050 papers/articles, it was only possible to extract six which concentrated exclusively on this process.

4.12. MEAQ2: What Metrics Were Covered by the Gamification or Serious Games Schemes?

This question highlights which metrics are being explored and which need more attention, as well as how they were used and how they relate to gamification. Gamification, in general, is intended to measure certain desired behaviours; consequently, the metrics were used within gamification as mechanics that verified the behaviour of the user in reaching the milestones stipulated by the game. Therefore, the metrics acted as a basis for gamification, serving as part of the rules of the game and also as an evaluative component.

Since metrics are one of the key components of gamification, they must be correctly selected to encourage the desired behaviour. Thus, it is worth drawing attention to some criticisms that have been made about the use of metrics in gamification. As previously noted, in study P63 the lines of code and the coverage of test cases were adopted both as metrics and as mechanics of its gamification. As a result, the author states that some students tried to exploit the system by adding lines of code or test cases that had no significance to the project. As a countermeasure, the author notified the students that the code produced would be randomly selected for manual or automatic evaluation. The same behaviour was apparent in other works, such as the P53 study, which made use of gamification to encourage the practice of code refactoring but found that some students made use of “insipid refactoring” to increase their score and move up in the ranking. This distorted the balance of the game and allowed students to exploit it.

Table 17 lists the metrics identified in the different primary studies. It should be reiterated that few metrics include an explanation of their calculation.

As can be seen in Table 17, the most widely used metric in the primary papers/articles was lines of code (3.65%). This is understandable because it is one of the simplest metrics that can be applied in a software project. However, its great weakness lies in industrial application, because the metric varies considerably depending on the language used. For example, a language like Java is much more verbose than Python, and since multiple languages are commonly used in the same software project, this metric ends up being inaccurate. Nevertheless, the lines of code metric is an excellent starting point for introducing concepts of software measurement because of its simplicity and the fact that it is easy to understand.

In second place was sprint velocity, with 2.92%. This is understandable since the Scrum method is one of the most widely used in academia and industry [47]. On account of its simplicity in managing teams and its effectiveness as a method, some of its practices are recommended. Among these practices are measuring the development process and dividing tasks into user stories whose difficulty is expressed in points. The method recommends measuring the number of user story points the team can deliver in a time interval (sprint). This relationship between a sprint and the number of user story points is called sprint velocity, and it is one of the ways to measure the productivity of a team in a Scrum environment. However, papers P12 and P93 and article P17, which represent 2.19% of the primary papers/articles, measured productivity differently. P12 was the only one that showed how the productivity metric was calculated. In that paper, productivity was calculated per employee, that is, it was estimated how many tasks a team member was able to perform in a time box. However, since it is a personal metric, care should be taken in how it is used, as a low value can make a team member feel at risk.
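
The two productivity measures described above can be sketched as follows. This is a minimal illustration under stated assumptions: the data layout (tuples of story points and a completion flag, and a list of task owners) and all function names are hypothetical, and the per-member count is only one plausible reading of the calculation described for P12.

```python
from collections import Counter


def sprint_velocity(stories):
    """Sum the story points of the user stories completed in one sprint.

    `stories` is a list of (points, done) tuples; unfinished stories do not count.
    """
    return sum(points for points, done in stories if done)


def average_velocity(velocities):
    """Average velocity over several past sprints, often used for planning."""
    return sum(velocities) / len(velocities)


def tasks_per_member(completed_task_owners):
    """Count how many tasks each team member completed in one time box."""
    return Counter(completed_task_owners)


# Hypothetical sprint: two of the three stories were finished.
sprint = [(3, True), (5, False), (8, True)]
print(sprint_velocity(sprint))

# Hypothetical history of three sprints.
print(average_velocity([11, 13, 12]))

# Hypothetical time box: each entry names the owner of one completed task.
print(tasks_per_member(["ana", "bo", "ana"]))
```

Velocity is a team-level figure, whereas the per-member count is a personal metric, which is why the second function should be used with the caution noted above.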

Other widely used categories of metrics were the estimates of time (2.92%), size (2.19%), and effort (1.46%). These are generally estimated at the beginning of a software project by means of historical data and reflect the team’s previous results, providing stakeholders with estimated values for project completion and price. Furthermore, these metrics serve as milestones that enable the team to adjust their productivity to achieve these goals. It should be noted that these estimation metrics generally use a model as a benchmark, as was the case in paper P137, which used the COSMIC method [48] and taught this reference model through a serious game.

4.13. MEAQ3: Which Areas of Measurement (Collect, Store, Analyse, and Report) Have Been Covered by the Schemes?

Only two of all primary studies, P16 and P135, covered all the stages of the measurement process clearly and in depth.

P16 presented a serious game that simulated the adoption of metrics in a small software company. During the game, an expert asked a number of questions in the form of a quiz designed to analyse the situation and assist the player to choose the best metrics. The metrics that were collected and stored were the Schedule Performance Index (SPI), Cost Performance Index (CPI), level of effort activity vs. planned effort, schedule variation, and SPI variation. Following this, when conducting the analysis, the charts that best suited these metrics were chosen in the game, for example, the Gantt chart was chosen to represent schedule variation. After this analysis of the chart, the results were reported to the development team and stakeholders.
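
For readers unfamiliar with the indices named above, the following sketch shows how the Schedule Performance Index, the Cost Performance Index, and the schedule variance are conventionally derived in earned value management. Note that "SPI" here denotes the Schedule Performance Index, not software process improvement. The formulas are the standard earned value definitions; the input values and the function name are hypothetical and are not taken from the game described in P16.

```python
def earned_value_indices(planned_value, earned_value, actual_cost):
    """Compute standard earned value indicators.

    planned_value: budgeted cost of the work scheduled to date
    earned_value:  budgeted cost of the work actually completed to date
    actual_cost:   real cost incurred for the completed work
    """
    return {
        # Below 1.0 means the project is behind schedule.
        "schedule_performance_index": earned_value / planned_value,
        # Below 1.0 means the project is over budget.
        "cost_performance_index": earned_value / actual_cost,
        # Negative means less work was completed than planned.
        "schedule_variance": earned_value - planned_value,
    }


# Hypothetical status report for a small project.
status = earned_value_indices(planned_value=100.0, earned_value=80.0, actual_cost=90.0)
for name, value in status.items():
    print(f"{name}: {value:.2f}")
```

In this hypothetical case, both indices are below 1.0 and the schedule variance is negative, which is the kind of situation a player would then need to represent in a suitable chart and report to stakeholders.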

On the other hand, P136 proceeded with these stages through a teaching methodology called Dojo Handori, and this process was carried out in the classroom. It involved creating a measurement plan by means of the Goal-Question-Indicator-Metric (GQIM) method, which was used to measure features related to ISO 25010 that deal with the quality of a software product. The software product chosen was the serious game put forward in article P16, which raises problems related to software quality. Once the measurement plan had been created, the established metrics were collected and stored in a spreadsheet during the lesson. Finally, the students created a report based on the analysis of the collected metrics and spoke to the classroom instructor about possible improvements to the chosen game. In other words, all the stages of the measurement process were covered.

4.14. MEAQ4: What Were the Elements, Mechanics, Dynamics, and Genres of the Games Covered by the Schemes?

The purpose of this research question is to find evidence of the customary practices in the teaching or training of the software measurement process. Table 18 lists the genres, dynamics, mechanics, and game elements identified in the primary software measurement studies.

As can be seen in Table 18, the majority of the works (57.78%) dealt with gamification, while the largest proportion of the serious games, which made up the rest of these works, were in the simulation genre. This can be understood as an attempt by the authors to create a safe environment that allows the selected topic to be learned through trial and error. In other words, users are not severely penalized for their failures but, on the contrary, are encouraged to fail until they reach the desired result; this is one of the ways in which serious games surpass other approaches.

As expected, the most widely used game elements in the primary papers/articles were the PBL triad: points with 66.67%, badges with 33.33%, and leaderboards with 39.39%. Most examples of gamification make use of this triad, as there are many benefits to be derived from adopting these elements. One of the main purposes of the point system is to provide immediate feedback to users; in other words, it allows the player’s performance to be monitored. This suggests that awarding points is an excellent way to represent the player’s progress and provides an accurate metric to show the balance of the approach. Similarly, badges visually represent something achieved by the user. In other words, they are a way of visualizing the user’s progress and setting objectives or “milestones” that can be achieved. In addition, a leaderboard publicly shows the progress of the players and can serve as a motivating factor in encouraging players to rise to greater heights. However, as stated in [15], although the PBL triad is an excellent starting point, it is strongly recommended that other elements be used in addition to PBL to achieve more significant results and a greater diversity of gamification.

In addition, it is worth mentioning that several game elements often appeared together in the same scheme. An example of this is study P29, which adopts the RUPGY approach, an application of dynamics, mechanics, and game elements. This is a kind of gamification aimed at motivating a development team by giving visibility to software processes through the metrics used by gamification, and by allowing software development with Scrum to be displayed more attractively in the form of a game. It included achievements and badges based on historical data from a software house where this scheme was applied. The game elements used in this scheme were as follows: avatar, collections, achievements, badges, levels, points, leaderboard, and teams. These elements lead to the mechanics of competition and/or cooperation, achievements, rewards, and feedback, and these mechanics in turn lead to the progression, relationships, and narrative dynamics. Finally, by way of illustration, there was a challenge called Clockwork Developer, which was an achievement based on the number of tasks that the developer completed during a sprint. This achievement had three levels of completion, each with a related badge: bronze (50%), silver (75%), and gold (100%). Thus, this achievement gave visibility to the most productive developers and could be used as a balancing parameter when forming teams.
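
The tiered achievement described above can be sketched as a simple threshold rule. This is an illustrative sketch only: the source does not state what the percentages are measured against, so the code assumes they refer to the share of a developer’s assigned sprint tasks that were completed. The function name and the data are hypothetical.

```python
# Thresholds from the text, checked from the highest tier down.
BADGE_THRESHOLDS = [(1.00, "gold"), (0.75, "silver"), (0.50, "bronze")]


def clockwork_badge(completed_tasks, assigned_tasks):
    """Return the badge earned in a sprint, or None if below the bronze tier.

    Assumption: the completion level is completed_tasks / assigned_tasks.
    """
    if assigned_tasks <= 0:
        return None
    ratio = completed_tasks / assigned_tasks
    for threshold, badge in BADGE_THRESHOLDS:
        if ratio >= threshold:
            return badge
    return None


# Hypothetical developers with four assigned tasks each.
for done in (1, 2, 3, 4):
    print(done, clockwork_badge(done, 4))
```

Because the rule rewards a raw task count, it is open to the same kind of exploitation reported for P63 and P53, so a real scheme would likely need to weight tasks by size or review their quality.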

5. Discussion

Studies that addressed SPI or software measurement through serious games have made a number of contributions in various contexts, including education and industry. Their main achievement is to create a simulated environment in which students can apply their theoretical knowledge in a practical way and thus learn by doing. As can be seen in the Dale cone in Figure 1, learning is theoretically more effective when carried out in an atmosphere that simulates actual experience. As can be seen in article P16, educators can benefit from using this kind of tool as a pedagogical aid for the subject of software measurement because it is very difficult to cover all the software engineering processes. Moreover, by engaging with serious games, students can learn measurement practices in their spare time.

In the case of P44, involvement with serious gameplay through classroom dynamics led students to understand the CFD (Cumulative Flow Diagram), which enabled them to become aware of the importance of measuring a software process. As a result, the students can understand how the CFD reveals the WIP (work in progress) over time and hence the bottlenecks of the process. This approach also taught students how to draw a CFD chart and how to interpret it when searching for information, such as the average lead time.
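
The information that can be read from a CFD can be sketched in a few lines. In the sketch below, WIP is the vertical gap between the cumulative arrival and departure lines, and the average lead time is estimated with Little’s Law (average WIP divided by average throughput). Using Little’s Law is an assumption made here for illustration; P44 does not state how the lead time was obtained, and the data and function names are hypothetical.

```python
def wip_series(cumulative_arrivals, cumulative_departures):
    """WIP at each sample point: items started minus items finished."""
    return [a - d for a, d in zip(cumulative_arrivals, cumulative_departures)]


def approximate_lead_time(cumulative_arrivals, cumulative_departures):
    """Estimate the average lead time with Little's Law.

    Returns a value in the same time unit as the sampling interval.
    """
    wip = wip_series(cumulative_arrivals, cumulative_departures)
    average_wip = sum(wip) / len(wip)
    periods = len(cumulative_departures) - 1
    finished = cumulative_departures[-1] - cumulative_departures[0]
    if periods <= 0 or finished <= 0:
        raise ValueError("need at least two samples and some finished work")
    throughput = finished / periods  # items finished per period
    return average_wip / throughput


# Hypothetical daily samples: two items arrive and two items finish each day.
arrivals = [2, 4, 6, 8, 10]
departures = [0, 2, 4, 6, 8]
print(wip_series(arrivals, departures))
print(approximate_lead_time(arrivals, departures))
```

A widening gap between the two lines in `wip_series` is the numerical counterpart of the bottleneck that students are taught to spot visually on the chart.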

Both article P16 and paper P44 followed a pattern that was evident in GQ8, which was to ask questions about the effectiveness of learning through gamification or serious games when compared with traditional learning. Although the application of serious games theoretically has the capacity to pass on knowledge in a more efficient way, this effect could not be confirmed by statistical data, and neither of the papers/articles provided evidence of statistical gains when contrasted with the traditional method of teaching. Several factors may have had an influence on this result, one of them being the limited attractiveness of the planned games. This is because the aesthetic appeal of the games that were shown is far lower than that of the games offered by the game industry, and the players are already accustomed to a high standard of quality. Another point to consider is that, according to McGonigal [49], games should be a voluntary experience, while in the case of the analysed works, it is evident that they have been introduced in a mandatory way, which can cause the users to lose interest. Moreover, no simulation model can accurately replicate the real world, and this is a factor that should be taken into account. To sum up, the use of games provides solid support for teaching, but they should not be used without being supplemented by other methods, whether traditional methods or otherwise.

With regard to gamification, in P19, the author states that “people love competition, it is the fuel that drives them to follow organizational processes and their daily activities with greater impetus and will power.” If gamification can act as a driving force for employees and encourage them to carry out their ordinary tasks, they will be more willing to comply with an organizational process, and their obligations or duties will be more palatable, that is, more pleasant and subject to less resistance. In fact, gamification acted as a behaviour measurement tool and increased the visibility of different individuals in development teams. This attribute can be seen in paper P22, where developers working on critical areas of projects received more points and rewards than those who only refactored simple programs. A more detailed application of this scenario was shown in paper P99, which transformed the goals inherent in software into real achievements. This encouraged the developers to pursue these goals and helped management to observe the progress of the team through instant feedback.

The intelligent application of gamification in SPI and software measurement processes had a positive effect on those involved in them. Despite this, it should be borne in mind that gamification is not just the use of gaming elements in a nongaming context. Rather, it is the intelligent use of this concept, but as [15] points out, there is a problem that arises because many gamification schemes do not bother to focus on the social, cognitive, and emotional factors that the games address. These schemes tend to be quickly ignored because they seem to be superficial and fail to catch the attention of the player. Table 19 lists the strengths, weaknesses, opportunities, and threats (SWOT) for a better understanding of this work.

Most of the studies analysed were not concerned with verifying the accuracy of their proposals. That is, not all studies used a control group and an experimental group to compare their results and verify the effectiveness of their methods. Even fewer studies showed positive results when compared with the traditional teaching method. It is commonly understood that games are an effective way to learn a new skill, especially if the target audience is the current generation. However, when educational games are compared with commercial games, a great disparity can be seen in graphics, design, sound effects, ambient music, and game production. A commercial game is usually produced by a multidisciplinary team of experts with dozens or even hundreds of members, over a period of months to years. In contrast, most educational games are developed by a small team with little diversity of expertise, generally without experts in game design, and in a short development time. Therefore, even with the undeniable potential of games to teach, far less effort is devoted to developing an educational game than a commercial one and, consequently, the result also falls short of what commercial games achieve. Game-based education, especially the teaching of software measurement and software process improvement, still has a lot of room to mature as a research field.

6. Threat to Validity

This section examines different threats to the validity of this systematic review of the literature, based on the four most common “threat to validity” categories [50]: internal validity, external validity, construct validity, and conclusion validity. There is also a discussion about how these threats were mitigated in this article.

6.1. Internal Validity

Internal validity concerns in particular the independent variables, that is, it questions whether the research was conducted in the correct way. An extensive protocol was designed to mitigate threats to internal validity; it set out the steps that had to be followed and the tasks that needed to be carried out, together with descriptions of each of them. Each new version of this protocol was updated and validated by the software engineering expert who is also the supervisor of this study. Moreover, the researchers continuously consulted the protocol to clarify certain stages of the process and to discuss any possible deviations from it.

6.2. External Validity

External validity is concerned with determining whether this study can be replicated by other researchers and whether their results would be consistent with those found here. The following practices were adopted to reduce threats to external validity: the creation of a protocol and constant meetings with all the researchers to check the degree of conformity to the protocol and to validate the articles that might be selected at each stage of the process. Before an article could be included, it had to be accepted by at least two of the four researchers. Thus, on the basis of these practices, it was possible to reduce the risk of primary articles being excluded.

6.3. Construct Validity

Construct validity concerns the measures taken to ensure they really represent what they are intended to measure, i.e., whether the data collected assists in answering the research questions. The research questions were devised in conjunction with an educational expert, who is also responsible for game-based educational strategies. This researcher has extensive experience in these areas and collects publications, such as those listed by [38, 49, 51]. Frequent consultations were conducted with this researcher with the aim of overcoming construct validity problems.

6.4. Conclusion Validity

The purpose of the conclusion validity is to determine whether the conclusions are correctly supported by the collated data. There is a threat to this validity in the data extraction stage, since many articles found did not answer the research questions directly, and so it was necessary to infer the information. The authors met to address these issues and discussed the reliability of the inference, and if two authors agreed about this, this information was included.

7. Related Works

During the systematic analysis of the collected papers/articles, some systematic reviews of the literature were found that addressed similar topics or that covered topics of interest in this study. In the present systematic review of the literature, the objective was to find and extract information from gamified schemes or serious games that could be used for teaching software process improvement or software measurement. In addition, most of the works found were about the use of games for teaching software processes, and the most significant of the related works are examined below.

The study by [53] provides a review that investigates the method of evaluating serious games that are aimed at training students in the subject of project management. This review has obtained 102 primary studies and summarizes information about the following: methods, processes, assessment techniques, the field of application (education, health, and welfare), different types of serious games, the number of users, and the main features of the games evaluated. The results shown in the article are useful since they assist the process of evaluating serious game projects, especially those aimed at project management, but not restricted to this process.

In the work of [54], a systematic mapping procedure was conducted that involved selecting 173 primary papers/articles with the objective of classifying works that referred to practical experiences in the teaching of software engineering. The systematic mapping sought to answer the following key questions: What are the main approaches used to address practical experiences in software engineering education? Is there an emerging trend in addressing such a need? What software process models are used to support hands-on experience in software engineering courses? Have universities changed the means of conducting these experiments over the years? What are the main forums for seeking information on practical approaches to teaching software engineering?

As a result, the most frequent practical experiences were determined and classified, such as game-based learning, case studies, simulations, inverted classrooms, project maintenance, service-learning, and open source development. Additionally, the authors found that methodologies for guiding the development of projects appeared in only 40% of the studies and mostly involved flexible methods. In conclusion, the authors provided evidence that there is a clear concern about how to adopt practical approaches in the teaching of software engineering, and that there are numerous alternative ways of filling this gap. Among these, it is worth mentioning the current trend for a method of teaching based on games.

In [55] a systematic review of the literature was conducted based on papers/articles written in the period 2000-2015, where a total number of 53 primary papers/articles were analysed relating to games for the teaching of software engineering. These were classified as follows: games for the students to play, games for students to carry out as a project, curricular projects, innovative web design tools, a framework, suggestions, and other factors. In short, this study has shown that software engineering and games are being approached in different ways and that investment in software engineering education will have an influence on future software engineers, by enabling them to achieve the broader goal of software process improvement.

The study by [56] detected 42 primary studies between 1992 and 2013 that made use of Software Process Simulators for teaching software engineering. As a result, the authors confirmed that there had really been a positive impact on the teaching of software processes, and in addition, simulators were provided with their individual capacities and features and their respective evaluations. The studies that addressed the question of software process improvement and software measurement were also analysed by our systematic review of the literature.

The work of [18] carried out a systematic mapping of the field of gamification applied to software engineering in order to characterize its state-of-the-art. As a result, 29 primary studies, published between 2011 and 2014, were identified. These were classified and analysed in terms of the software process area addressed, the gamification elements used, the type of research method followed, and the type of forum in which the paper/article was published. The research found that the most frequent process areas were system implementation, collaboration, project planning, project control, and evaluation. With regard to the last of these, 5 works were found that make use of gamification for project management, and these were included in the present systematic review of the literature. The authors also stated that most of the gamification projects were based on PBL (points, badges, and leaderboards), which for some critics, such as Margaret Robertson [15], shows that gamification is being used for superficial purposes.

Thus, the distinguishing features of this work in relation to the others discussed in this section are as follows. First, this work focuses on a research field (software process improvement and software measurement) not found in other SRLs, which allows an analysis of the application of these fields in serious games and gamification. Second, it analyses and discusses a large number of general and specific questions related to measurement and process improvement, enabling an identification of the current scenario of their application in teaching through games and gamification. Finally, it details all the steps followed for the execution of the SRL, enabling its reapplication in other contexts, since the related works do not detail them.

8. Conclusion and Future Work

This systematic review of the literature was aimed at examining genres, dynamics, mechanics, and game elements present in gamification projects and serious games for teaching software process improvement, with an emphasis on software measurement. A protocol was used for this that allowed 137 primary articles to be analysed from a total of 19050 articles in the IEEE, Scopus, ACM, Ei Compendex, Web of Knowledge, and Science Direct databases. In addition, this study had the following objective: to present the state-of-the-art in the use of gamification and serious games in the teaching of software measurement and software process improvement programs. To achieve this objective, we established the following steps:
(i) To make prior definitions, such as research restrictions, criteria for inclusion and exclusion of primary studies, and quality criteria for these studies, among others, as guidelines for the systematic review of the literature
(ii) To conduct a systematic review of the literature following the specifications previously established in the protocol
(iii) To analyse the results of the review from the characterization of the selected studies

There has been a growing interest in addressing software processes through the use of serious games, and 2016 was the year with the largest output of scientific studies in this area. However, the software measurement process was notably underexplored among the studies reviewed. As reported in this SRL, only 4% of the studies (P16, P29, P134, P135, P136, and P137) concentrated exclusively on this process and its different stages. Thus, more exploratory studies on teaching this process are clearly needed.
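The proportions reported in this section can be checked with a few lines of arithmetic. The sketch below is only illustrative: the counts and study identifiers are those given in the text, while the constant names and the `percentage` helper are assumptions introduced here and are not part of the review protocol.

```python
# Figures taken from the text of this review; the names are illustrative.
TOTAL_RETRIEVED = 19050  # articles returned by the database searches
TOTAL_SELECTED = 137     # primary studies analysed after applying the protocol
MEASUREMENT_ONLY = ["P16", "P29", "P134", "P135", "P136", "P137"]


def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100 * part / whole, 1)


# Share of retrieved articles that were kept as primary studies.
selection_rate = percentage(TOTAL_SELECTED, TOTAL_RETRIEVED)

# Share of primary studies devoted exclusively to software measurement.
measurement_share = percentage(len(MEASUREMENT_ONLY), TOTAL_SELECTED)

print(f"Selected studies: {selection_rate}% of the retrieved articles")
print(f"Measurement-only studies: {measurement_share}% of the selected studies")
```

Under these figures, fewer than 1% of the retrieved articles were retained as primary studies, and the six measurement-only studies correspond to the roughly 4% share reported above.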

In addition, the most widely used genre was simulation, which entails creating a simulated environment for the practical application of the students’ theoretical knowledge. Relationships were the most widely used gamification dynamic, since this category is designed to create a relationship between those involved in competition or cooperation and thus to engage the users through social factors. As a reflection of the relationship dynamic, the most used mechanics were feedback, cooperation, and competition. Finally, the most widely used game elements were points, leaderboards, badges, and avatars.

With regard to future work, this review will serve as input to guide the development of an educational tool for teaching the measurement process in the context of software projects. This tool will be based on the game elements, good practices, and strengths that were found in the primary studies. In addition, for the purposes of validation, the tool will be evaluated with users to determine whether the game-based project (gamification and serious games) can be regarded as appropriate in terms of relevance of content, correctness, degree of difficulty, method of teaching, and duration within the context for which it is intended.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

The authors would like to thank the Coordination for the Improvement of Higher Education Personnel (CAPES) for awarding a doctoral scholarship to the Graduate Program in Computer Science (PPGCC) at the Federal University of Pará (UFPA) in Brazil. This scholarship was used to foster the research that led to the development of the article entitled “Teaching Method for Software Measurement Process based on Gamification or Serious Games: A Systematic Review of the Literature,” which was supervised by Professor and Doctor Sandro Ronaldo Bezerra Oliveira. This research belongs to the SPIDER-UFPA (https://spider.ufpa.br) research group. The authors would also like to thank the Dean of Research and Graduate Studies at the Federal University of Pará (PROPESP/UFPA) for providing financial support through the Qualified Publication Support Program (PAPQ) Public Notice 06/2021.