#### Abstract

This article analyzes the content validity of model examinations for grade 10 mathematics. The study examined whether the model tests were representative of the course content and reflected the syllabus’s learning outcomes. A survey design was used, with six years of mathematics model exams, syllabi, and textbooks serving as the key data sources. Kendall’s coefficient of concordance and the chi-square test were used to analyze the quantitative data. In addition, the qualitative data were evaluated using narration and description. The statistical findings revealed no strong association between the test items and either the learning outcomes in the cognitive domain categories or the main textbook content. As a result, the exam items did not correspond to the syllabus’s objectives and content. Furthermore, the qualitative data revealed that the test items were unclear, poorly laid out, and multidimensional, and that they had low content validity.

#### 1. Introduction

A test or examination is an educational assessment of students’ learning. Examinations are not meant to trick students or confuse them. Examinations should be related to important learning outcomes, objectives, goals, and/or course competencies. Scholars in the field of measurement and evaluation value examinations for three reasons [1]. First, examinations help to evaluate students and determine whether they are learning what a teacher expects. Second, well-designed tests inspire and structure students’ academic endeavors. Students study according to how they expect to be tested. It is easy to memorize facts, but it is difficult to comprehend and apply information. Third, tests can assess a teacher’s presentation skills. In addition to reinforcing learning, tests can help students identify areas of weakness and focus their study efforts [2].

According to Coombe, Folse, and Hubley [5], examinations are systematic procedures administered to get information about students’ performance. The results of examinations not only reflect students’ level of success but also give stakeholders information about the other components of the teaching process. The information provided can be used to make decisions in a variety of educational situations [4]. Brown [3] stated that a well-designed examination or test is a tool that provides an accurate measure of the test-taker’s ability within a particular domain. Accordingly, for tests to help stakeholders make relevant decisions, they must possess two important characteristics, namely, validity and reliability. Teachers should check whether measurement tools function for the purposes that they are intended to serve. When tests do not attain this quality of truthfulness, teachers are advised not to use them for decision-making. Hence, in preparing tests, teachers need to take practical measures to enhance the validity and reliability of their classroom tests [6].

Tests in mathematics are supposed to be valid and reliable measures of ability. The extent to which teachers are able to construct and apply valid assessment instruments is determined by their understanding of validity as a means of ensuring classroom assessment quality. For example, expert judgment is required to decide whether a test is representative of the knowledge and skills that are to be measured. This entails a level of consistency among curriculum content, test objectives, and test content. Content validity is determined by the test’s coverage of essential objectives and content, as well as an adequate sampling of essential curriculum content [7]. Researchers such as Regasa [8], Tamrat [9], and Mulugeta [10] have studied the validity of tests. They concluded that there is widespread inappropriate use of achievement tests that threatens the validity of educational evaluations. To better support quality and relevance, evaluators must devote more attention to the validity of the outcome measures they use.

According to the studies reviewed above, an examination should have at least two essential characteristics. First, it should demonstrate a strong association between the contents or learning experiences and the items included. Second, the items included in the examination should be carefully selected and should represent the learning outcomes. To check for these attributes, the study considered the validity of examinations administered for six consecutive years. These were model examinations based on the grade 9 and 10 mathematics syllabi. The study emphasized period allotments, learning outcomes, and factors that could affect the content validity of the examinations. Accordingly, the study had two objectives in evaluating the content validity of the Grade 10 Mathematics Model Exams administered by the Oromia Region Education Bureau:

(i) To evaluate the strength of the association between the contents of the textbooks and the items on the model exams

(ii) To determine whether each item of the sample model exams matches the expected learning outcomes of the syllabus

#### 2. Review of the Literature

##### 2.1. Mathematics Teaching and Testing

Testing can be a mechanism for recognizing the extent to which our teaching is appropriate to the level of the class, for identifying students’ weaknesses and strengths in the teaching process, and for indicating the general direction of the program. Supporting this view, research has found that mathematics assessment is sensitive to changes in student performance over time [11] and has investigated whether information gathered from the measures could be used to support teachers’ instructional decision-making and thereby enhance the learning of struggling students [12].

According to Shayer and Adhami [13], examinations and tests are excellent tools to assess what students have learned in specific subjects. Examinations reveal which parts of the class each student appears to have remembered and taken the most interest in. Examinations are also good methods for teachers to learn more about their pupils because each student is unique. The test environment adds stress, allowing teachers to see how their students argue and think individually through their work, which is useful to remember for future class activities. Meanwhile, tests have different purposes. They may be constructed primarily as an instrument to reinforce learning and to motivate the student or primarily as a means of assessing the students’ performance in mathematics. Students in the middle classes are at an important crossroads in their mathematical education. They are “forming conclusions about their mathematical abilities, interest, and motivation that will influence how they approach mathematics in later years” [14].

The kind of test used should be appropriate and constructive; otherwise, teachers, students, and administrative workers may be misled. Invalid tests may direct students toward wrong study habits. Item analysis serves to improve items to be used later in other tests, to eliminate ambiguous or misleading items in a single test administration, to increase instructors’ skills in test construction, and to identify specific areas of course content that need greater emphasis or clarity [15]. A test that measures only facts and simple levels of thinking tends to instill in students the habit of studying only facts. Such a test is not an accurate measurement of the intended objectives and contents of the syllabus, which leads decision makers to make false judgments about students’ performance.

To sum up, better tests or examinations mean better teaching; better teaching means better learning. A well-designed testing system can spearhead educational improvements, while a poorly designed system can sabotage the most dedicated efforts to improve instructional quality.

##### 2.2. Maintaining Mathematics Test Validity

Teachers, parents, and students receive a powerful message through examinations and national/regional examinations about what is important to learn and how it should be taught. Teachers must be well versed in the preparation of examinations and tests in order to assure their accuracy and appropriateness. Expert judgment is required to decide whether the examination is representative of the information and abilities that are to be measured. This entails a level of consistency among curricular content, examination objectives, and examination content [16].

It is obvious that one cannot validate a test, but one may validate the conclusions drawn from students’ test scores [17]. Teachers frequently ignore this fact because they are more concerned with the legitimacy of their questions than with the conclusions drawn from them. Instead of developing reliable exams, teachers should focus on developing assessments that provide evidence from which accurate conclusions about students’ learning can be derived. As Killen points out, this is a significant issue for teachers if they are to avoid overlooking one of the most crucial aspects of assessment, which is the effective use of test findings in making instructional decisions [18].

Given the importance of validity in classroom testing, teachers must be familiar with the concept of validity and with how to acquire validity-related evidence for their tests and other forms of assessment in order to draw correct conclusions and make appropriate judgments based on students’ test results. Unfortunately, test construction abilities are lacking among teachers [16, 19, 20] (Onyekuba and Anyichie, 2013). After certification, most instructors receive little or no training or assistance. Although teachers are not expected to be experts in educational measurement and evaluation in order to construct valid and reliable tests, they do need a basic understanding of how to develop and validate classroom tests in order to use the results of their assessments to make informed decisions about their students. The situation may be much worse at the university level, where most professors, with the exception of those in the college of education, lack formal training in educational assessment. Most efforts are usually focused on evaluating instructors’ ability to develop tests at the primary and secondary levels. In order to offer baseline data for capacity building in test development and validation for quality assurance in assessment of learning outcomes, it is necessary to find out what teachers know about the validity of classroom examinations.

##### 2.3. Knowing and Measuring the Content Validity of Tests

Many measurement experts in education refer to validity as a logical process educators follow in testing, in which we define what we measure, construct measures, and seek and analyze data relevant to the validity of interpreting a test score and its future application. This logical process applies to tests as well as the test items that make up the tests. In this regard, Haladyna [21] claims that item development is a primary source of evidence in supporting a test score interpretation or use. A valid assessment, according to Mc Alpine [22], is one which measures what it claims to measure.

All measurements should have particular features, regardless of the type of device used or how the data will be used. Validity, as enunciated by Miller et al. [4], is a spectrum, not an all-or-nothing proposition. As a result, we should refrain from referring to evaluation outcomes as valid or invalid. The best way to think about validity is in terms of degrees, such as high validity, moderate validity, and low validity. Validity is always tied to a certain application or interpretation. There is no such thing as a test that can be used for everything. This is because the validity of evaluation results varies depending on the interpretation [23].

In achievement tests, more emphasis is given to content validity than to other validity types such as predictive validity, construct validity, face validity, and concurrent validity. A test is said to have content validity if it is a representative sample of the contents and objectives in the syllabus. In other words, a test is considered to have content validity only if it includes a proper sample of the relevant test items. The quality of the items decides the quality of the test, which means that revising and improving the items improves the quality of the test [24]. Test items are often assessed by a group of subject matter experts (SMEs) to determine content validity. These SMEs are provided with a list of the content areas specified in the test blueprint, as well as the test items based on each content area. The SMEs are then asked whether they think that each item is adequately matched to the content area specified. Any items identified by the SMEs as being unsuitably matched to the test blueprint, or otherwise defective, are either amended or removed from the test [25].

The sufficiency of a sampling domain of content determines the content validity of an instrument. Content validity, according to Bush (as quoted in [26]), refers to the degree to which the instrument covers the content that it is designed to measure. It also refers to the precision with which the content to be measured was sampled. As a result, content validity assesses the comprehensiveness and representativeness of a scale’s content. The measurable extent of each item for defining the traits and the set of items that represents all features of the traits are both required for content validity. Validity of content can be established in two stages: development and judgment.

As per the development stage, addressing content validity should start with test development. The initial stage in developing a test is determining “which domain of construct” should be measured. There is no comprehensive objective way for determining a test’s content validity [27]. In the process of creating a test, the test maker first determines the widely acknowledged aims of the subject’s instruction and then creates a test blueprint. The test content is derived from the course content and is weighted according to the importance of the course’s objectives and content. In this light, evaluating a test’s content validity entails a thorough and extensive assessment of the actual test tasks. In the same vein, content validity must be considered. The test’s content and items are based on national standards, curriculum benchmarks, mathematics textbooks, and research on best practices in mathematics education [28].

In the judgment stage, content validity is based on quantitative evidence. Professional subjective judgment is necessary to establish the extent to which the scale was created to measure a trait of interest when examining the content validity in the judgment stage. The degree of relevant construct in an assessment instrument is determined by experts’ subjective judgments of content validity. At least five experts in the subject, or five to ten experts, should be included. Meanwhile, rating scales are useful for judging the content areas of a scale. Relevance, clarity, simplicity, and ambiguity are all criteria for determining content validity [29].

In general, content validity refers to the substance of the test as it relates to what was taught or covered in class. A test’s content validity must be appropriate, with a representative sample of the content area covered. Establishing content validity is therefore a process of developing a test through the use of an adequate set of test specifications and item writing standards. In short, content validity assesses the content area’s comprehensiveness and representativeness.

#### 3. Method

##### 3.1. Knowing and Measuring the Content Validity of Tests

In this study, a survey design combining quantitative and qualitative approaches was implemented. The purpose of using both approaches was to build a strong relationship between quantitative and qualitative data collection and to fully understand the issue under investigation. Both primary and secondary data sources were used. Mathematics teachers were the primary source of data. Six consecutive years of grade 10 mathematics model examinations, together with the syllabi and mathematics textbooks of grades 9 and 10, served as secondary data sources.

Based on criteria set by the researchers, 3 mathematics teachers were purposively selected to prepare the content validity forms (coding sheets) with the researchers. Qualification and experience were the selection criteria: at least a bachelor’s degree in mathematics and a minimum of ten years of experience in teaching grade 9 and 10 mathematics. Moreover, 8 mathematics teachers were purposively selected for an interview and as judges to fill out the prepared content validity form. The criteria for selecting the judges were at least a bachelor’s degree in mathematics and a minimum of seven years of experience in teaching grade 9 and 10 mathematics.

In order to obtain relevant data for this study, two data gathering instruments were used, namely, coding sheets (content validity forms) and an interview. The content validity form was drafted by the researchers, and it was coded by the three selected mathematics teachers based on syllabi objectives and contents. Teachers coded the test items into the different content areas in the syllabus. The interview was unstructured. It focused on the factors that reduce content validity, such as the relations between test items and exercises of major topics of the textbooks, multidimensionality of test items, ambiguity, and the layout and arrangement of test items. To check the face validity of the instrument, two experts from the Department of Mathematics and College of Education took part.

##### 3.2. Methods of Data Analysis

The data obtained through the content validity form were analyzed quantitatively using descriptive statistics. Items in the content validity form were summarized in tables and processed by means of frequency and percentage. The analysis was made in line with the research objectives. Agreements or disagreements between judges on the categorization of syllabus objectives were analyzed using Kendall’s coefficient of concordance. Meanwhile, the strength of association between the contents of the textbooks and the number of items of the model exams was analyzed using the chi-square test. The chi-square statistic (test of independence or association) was calculated and compared with the table value of Pearson’s chi-square at the specified degrees of freedom and the 0.05 level of significance to make a decision. The data obtained by interviewing mathematics teachers were analyzed qualitatively using methods of narration and interpretation.
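The decision rule described here, comparing a calculated chi-square with the table value at the 5% level, can be sketched in code. The following is a minimal illustration using only the Python standard library; the helper names `chi2_sf` and `is_significant` are hypothetical and are not part of the study. It relies on the closed-form series for the chi-square upper tail with integer degrees of freedom.

```python
import math

def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    # Odd df: normal tail plus a finite series in odd double factorials.
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2))            # equals 2 * (1 - Phi(sqrt(x)))
    pdf = math.exp(-x / 2) / math.sqrt(2 * math.pi)  # standard normal density at sqrt(x)
    term, total = root, 0.0                          # r = 1 term is x^(1/2) / 1
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)                  # builds x^(r - 1/2) / (1*3*...*(2r-1))
        total += term
    return tail + 2 * pdf * total

def is_significant(statistic: float, df: int, alpha: float = 0.05) -> bool:
    """True when the statistic exceeds the critical value at level alpha."""
    return chi2_sf(statistic, df) < alpha

# Illustrative call: a statistic of 40.0 with 19 degrees of freedom.
print(is_significant(40.0, 19))
```

Comparing the p-value with 0.05 is equivalent to comparing the statistic with the tabulated critical value, for example 30.144 at 19 degrees of freedom.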

#### 4. Results and Discussion

The main functions of a test in the educational system, as quoted from Nwana in Osuji [30], are “to motivate pupils to study, determine how much the pupils have learned, special difficulties, special abilities, the strength and weakness of the teaching method, the adequacy of instructional resources and the extent of achievement of the objectives.” To fulfil all these functions, the focus should be on the quality of the test. There are different ways of measuring the quality of a test. One of these is content validity, which measures the quality of a test in relation to the learning outcomes and contents of the lesson [31].

The grade 9 and 10 syllabi contain 30 and 26 units of instruction and are allotted 167 and 162 periods, respectively.

The syllabi also contain 20 major topics prepared in different cognitive domain categories. The general objective of the syllabi is to develop solid mathematics knowledge, skills, and attitudes. Hence, it is expected that the regional mathematics model examinations be prepared in that manner. In this study, six consecutive years’ mathematics model examinations (2012–2017), prepared from the same curriculum, were analyzed.

##### 4.1. Data on the Classification of Syllabus Objectives into the Taxonomy of Educational Domain

The categorization of grades 9 and 10 mathematics objectives was established by adopting Osuji and Okonkow [30]. The categorization, which was labeled with the help of the judges, is presented as follows:

Table 1 shows the comparative percentages of the syllabus objectives. The affective and psychomotor domains, which comprise feelings, emotions, and values and motor skills, respectively, were neglected compared with the cognitive domain. A large emphasis was given to the cognitive domain, and less attention was given to the psychomotor domain. The data indicate that no syllabus objectives corresponded to the affective domain. However, Marzano and Kendall [32] claim that educational objectives should be designed to address the cognitive, psychomotor, and affective domains in a balanced manner. In relation to such findings, Osuji and Okonkow [30] noted that the instructional objectives usually stated for assessment of behaviour are in the cognitive domain. In test development and planning, test experts are more concerned with how fairly the categories of the cognitive domain are represented in the test items. To this end, the categorization process is given as follows:

##### 4.2. Values of Coefficient of Concordance (*W*) among Judges on All Categorizations

To minimise the effect of the judges on data quality, investigators need to know whether all judges applied the data collection method in a consistent manner. Interrater reliability quantifies the closeness of the scores assigned by a pool of judges to the same study participants. The closer the scores are, the higher the reliability of the data collection method [33]. Kothari [34] defines conditional measures of agreement on a specific classification category and proposes a generalisation of Kendall’s coefficient of concordance (*W*) to the case of multiple judges. This was considered an appropriate measure for studying the degree of association among three or more sets of rankings. It helps to imagine how the given data would look if there were perfect agreement among the several sets, and it gives an instructive account of interrater agreement among multiple judges.
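As a minimal sketch of how the coefficient is obtained, the code below assumes untied rankings and the standard formula $W = 12S / (K^{2}(N^{3} - N))$, where $K$ is the number of judges, $N$ the number of ranked objects, and $S$ the sum of squared deviations of the rank sums from their mean. The function name `kendall_w` and the example rankings are illustrative assumptions, not the judges’ actual data.

```python
def kendall_w(rankings):
    """Return (S, W) for K untied rankings of the same N objects.

    rankings: list of K lists, each a permutation of the ranks 1..N.
    """
    k = len(rankings)
    n = len(rankings[0])
    # Sum of the ranks each object received across all judges.
    rank_sums = [sum(judge[j] for judge in rankings) for j in range(n)]
    mean_sum = k * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    w = 12 * s / (k ** 2 * (n ** 3 - n))
    return s, w

# Hypothetical case: eight judges who rank three objects identically.
perfect = [[1, 2, 3] for _ in range(8)]
s, w = kendall_w(perfect)
print(s, w)  # 128.0 1.0
```

Under perfect agreement among eight judges over three objects, the sketch yields $S = 128$ and $W = 1$, which is the largest value $S$ can take for that $K$ and $N$.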

Row 1 in Table 2 shows that, for $K = 8$ and $N = 3$, the critical value $S^{*}$ at the 5% level was 48.1, whereas the calculated value of $S$ was 128. Because $S$ is greater than $S^{*}$, $W = 1$ is significant; that is, the $K = 8$ sets of rankings are not independent. As the ranks were very close, reliability was high. In row 2, the categorization of the syllabus cognitive domain was tested using the critical value $S^{*}$ at the 5% level for $K = 8$ and $N = 6$, together with the observed value of $S$, to judge the significance of Kendall’s coefficient of concordance ($W$). The critical value $S^{*}$ and the observed value of $S$ at the 0.05 level are 299 and 1076.50, respectively. As the observed value of $S$ is greater than the critical value $S^{*}$, the result $W = 0.9881$ indicates statistically significant agreement among the judges, that is, high interrater agreement. Row 3 shows the calculated value of $S$ and the critical value $S^{*}$ for the classification of test items for the whole year into categories of the cognitive domain, which are 989.50 and 299, respectively. This result shows statistically significant agreement among the judges at the 0.05 level of significance. The ranks were close, so the judges were consistent. Row 4 shows the computed Kendall’s coefficient of concordance for the judges’ categorization of test items into the syllabus contents. As $N$ is larger than 7, the chi-square approximation was used: $\chi^{2} = 151.06$ with $N - 1 = 20 - 1 = 19$ degrees of freedom. The critical value at these degrees of freedom was 30.144 at the 0.05 level. The result indicates statistically significant agreement among the judges and thus high interrater agreement. This homogeneity in rating suggests consistent knowledge and skills among the raters; the judges were experts and careful in their rating. In general, there was statistically significant agreement among the judges. The researchers also used these data to calculate the theoretical (expected) values and compare them with the actual (observed) values.
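For larger $N$, the significance of $W$ is commonly assessed through the approximation $\chi^{2} = K(N - 1)W$ with $N - 1$ degrees of freedom. The sketch below shows how the quantities reported for rows 1 and 4 relate to one another. The helper names are hypothetical, $K = 8$ judges is assumed for both rows, and no tie correction is applied, so this is an illustration rather than the authors’ computation.

```python
def w_from_s(s, k, n):
    """Kendall's W from the sum of squared rank-sum deviations S (no tie correction)."""
    return 12 * s / (k ** 2 * (n ** 3 - n))

def chi2_from_w(w, k, n):
    """Chi-square approximation for W, used when N is larger than 7."""
    return k * (n - 1) * w

def w_from_chi2(chi2, k, n):
    """Invert the approximation to recover W from a reported chi-square."""
    return chi2 / (k * (n - 1))

# Row 1: S = 128 with K = 8 judges and N = 3 objects.
print(round(w_from_s(128, k=8, n=3), 4))        # 1.0

# Row 4: chi-square = 151.06 with N = 20 content areas (K = 8 assumed).
print(round(w_from_chi2(151.06, k=8, n=20), 4))  # 0.9938
```

Read this way, the chi-square of 151.06 in row 4 corresponds to a coefficient of roughly 0.99, in line with the high agreement described for the other rows.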

##### 4.3. Proportion of Major Topics and Exercises of the Textbooks

To determine the strength of association between major topics and exercises of the textbooks, it is important to determine the expected number of test items that could be classified under the exercises of each major topic of the textbooks, which is calculated proportionally, based on the total number of items of the model exams and the number of exercises in each major topic. The succeeding table shows the number of exercises of each major topic in both grade 9 and 10 textbooks.
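The proportional calculation can be sketched as follows. The topic names, exercise counts, and total number of items are hypothetical placeholders chosen only to show the arithmetic; they are not taken from the textbooks or the model exams.

```python
def expected_items(exercise_counts, total_items):
    """Allocate total_items across topics in proportion to their exercise counts."""
    total_exercises = sum(exercise_counts.values())
    return {
        topic: total_items * count / total_exercises
        for topic, count in exercise_counts.items()
    }

# Hypothetical counts of textbook exercises per major topic.
exercises = {"topic A": 50, "topic B": 30, "topic C": 20}

# Hypothetical exam length of 40 items.
print(expected_items(exercises, total_items=40))
# {'topic A': 20.0, 'topic B': 12.0, 'topic C': 8.0}
```

A topic holding half of the exercises is therefore expected to supply half of the test items.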

From Table 3, one can observe that the most exercises were given under major topics such as real number systems (13%), followed by equations and inequalities (11.38%), measurement (9.27%), and polynomial functions (9.11%). Few exercises were given under major topics such as the reciprocal of trigonometric functions (0.98%), simple trigonometric identities and real-life application problems (1.13%), equations and applications of exponents and logarithms (1.79%), and distance and section formulas (1.95%). In practice, however, exercises should be fairly distributed in accordance with the amount of content in the textbook and syllabi. Unequal distribution of content and corresponding exercises will not lead to a complete accomplishment of educational objectives or curriculum ends [35].

##### 4.4. Chi-Square Values of Exercises of the Textbooks and Tests Items

To determine whether the observed test contents fit the exercise contents of the syllabi, the chi-square statistic was employed. It compares observed and expected values. The expected value, also known as the theoretical value, is calculated as the row total multiplied by the column total, divided by the grand total [34]. Table 4 presents the observed number of exercises for each major topic with its expected value, the observed number of items averaged across judges with its expected value, and the chi-square value of both categories.
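The expected-value rule and the resulting statistic can be sketched for a small contingency table. The counts below are hypothetical (three topics by two columns, exercises and test items) and serve only to illustrate the computation; the function names are likewise assumptions.

```python
def expected_table(observed):
    """Expected counts: row total * column total / grand total for each cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square(observed):
    """Pearson chi-square: sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_table(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

# Hypothetical table: rows are topics, columns are (exercises, test items).
observed = [[30, 10], [20, 20], [10, 30]]

print(expected_table(observed))  # every cell is 20.0
print(chi_square(observed))      # 20.0

# Degrees of freedom are (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)
print(df)                        # 2
```

With 20 topics and 2 columns, the same rule gives (20 − 1)(2 − 1) = 19 degrees of freedom.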

Hence, $\chi^{2} = 117.19$.

As shown in Table 4, the calculated chi-square value was 117.19, and the degrees of freedom from the contingency table are (20 − 1)(2 − 1) = 19. The critical value for 19 degrees of freedom at the 5% level of significance is 30.144. The calculated value is greater than the table value, so the distribution of test items across the major topics differs significantly from the distribution of textbook exercises. The implication is that there is no strong association between the model exams and the exercises in the textbooks. In reality, there should be a strong relationship between the exercises included in the mathematics textbooks and the model examination [36]. Unfortunately, the items were not distributed proportionally in the exams studied.

##### 4.5. Major Topics, Period Allotments, and Tests’ Items

To decide whether the topics covered in the exams are proportional to the time allotted to cover them in class, it is necessary to determine the expected number of test items that can be categorized under the major topics of the textbooks. That number is computed proportionally, based on the total number of items in the model exams and the number of periods allotted to each major topic of the textbooks. Therefore, the chi-square statistic was employed to check whether the observed test contents fit the number of periods allotted to the major contents of the syllabi.

###### 4.5.1. Major Topics and Period Allotments

Table 5 shows that the largest share of time was allotted to equations and inequalities (12.77%), followed by the real number system (10.03%), measurement (9.42%), and statistics and probability (8.27%). Few periods are allotted to distance and section formulas (1.22%) and exponential and logarithms (1.82%). The allotment of a large number of periods to equations and inequalities, the real number system, measurement, and statistics and probability reflects the emphasis given to students’ solid mathematics knowledge, skills, and attitudes. Curriculum designers claim that the amount and magnitude of contents or learning opportunities should match the amount of time the study may take [37].

###### 4.5.2. Categorization of Tests Items to the Syllabus Contents by Judges

In Table 6, the percentage for each topic was rounded to two decimal places, and the percentages sum to 100.02%, which is greater than 100%. The excess 0.02% is a rounding error. In the tests, as Table 6 shows, much weight is given to the content areas of equations and inequalities (16.07%), equations and applications of exponential and logarithms (8.97%), relations and functions (8.29%), and the real number systems (7.54%). This shows that the ranking of the major textbook contents by percentage of periods allotted does not match the ranking of the test item categories.
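A small worked example shows how rounding each share to two decimals can push the total slightly above 100%. The counts are hypothetical (six equally weighted topics), and the function name is an assumption; the example only illustrates the rounding effect.

```python
def rounded_shares(counts, ndigits=2):
    """Percentage share of each count, rounded independently to ndigits decimals."""
    total = sum(counts)
    return [round(100 * c / total, ndigits) for c in counts]

# Hypothetical case: six topics with equal numbers of items.
shares = rounded_shares([1, 1, 1, 1, 1, 1])
print(shares)                 # [16.67, 16.67, 16.67, 16.67, 16.67, 16.67]
print(round(sum(shares), 2))  # 100.02
```

Each share of 16.666...% is rounded up to 16.67%, and the six small excesses accumulate to 0.02%.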

###### 4.5.3. Chi-Square Values of Syllabus Contents and Tests Items

The following table shows the observed number of periods allotted to each major topic with its expected value. The observed number of items average across judges with its expected value and the chi-square value of both categories are also presented in the table.

Hence, $\chi^{2} = 53.32$.

Table 7 shows that the computed chi-square value was 53.32 and the degrees of freedom from the contingency table were (*r* − 1)(*c* − 1) = 19. At the 0.05 level of significance, the critical value is 30.144. The calculated value exceeds the table value. This result shows a significant difference between the observed and expected content of the test items in both categorizations. The conclusion is that the contents of the Oromia Regional Mathematics Model examinations are not strongly associated with the contents of the syllabi. Scholars in the field of educational measurement and evaluation suggest that content or learning opportunities should match the number and types of test items drawn from them [38].

###### 4.5.4. Classification of Syllabus Learning Outcomes and Tests Items to Cognitive Domain Subcategories

To determine whether each item of the sample model exams matches the expected learning outcomes of the syllabi, it is essential to find the numbers of observed and expected test items that may be classified under the learning outcomes of the syllabi. The succeeding table (Table 8) shows the difference between the observed and expected values, which is tested using chi-square.

###### 4.5.5. Chi-Square Values of Learning Outcomes of the Syllabus and Tests

Table 8 shows the true difference between the observed and expected values determined and tested by chi-square value. The next table indicates the calculated number of items categorized under the six categories of cognitive domain that are averaged across judges. The table also shows the determined corresponding expected values and the calculated chi-square value.

Hence, $\chi^{2} = 83.10$.

From Table 8, the calculated chi-square value is 83.10, and the degrees of freedom from the contingency table are (*r* − 1)(*c* − 1) = 5. To reach a conclusion about whether the items in the model examinations match the learning outcomes of the syllabus, the calculated chi-square value must be compared with the critical value at a chosen level of significance for the degrees of freedom obtained. The critical and calculated values of chi-square at the 0.05 level of significance are 11.07 and 83.10, respectively. Therefore, the result shows a significant difference between the observed and expected learning outcomes of the test items. From this analysis, one can conclude that the items in the model examinations do not match the learning outcomes of the syllabus. Overall, the main focus areas of the tests and the textbook contents differed significantly. Thus, there was no strong association between the contents of the test items and the content of the textbooks.

##### 4.6. Results Obtained through Interview Extracts

The purpose of the interview was to examine teachers’ views about the validity of the model exams in regard to the relations between test items and exercises covering major topics of the textbooks, the multidimensionality of test items, ambiguity, and the layout and arrangement of test items. Five qualified and experienced mathematics teachers were selected for the interview.

The interviewees emphasized that, in each year, some of the test items did not correspond to the major topics of the syllabus. They said that the model exams did not pay attention to the skills required by the learning outcomes of the syllabus; that is, the test items focused on the lower levels of the cognitive domain rather than the higher ones. Almost all of the interviewees agreed that the number of items related to a particular topic was not in proportion to the number of periods allotted to cover that topic in class. They also confirmed that there was no association between the test items and the textbook exercises.

Another result of the interviews was that, unlike the exercises in the textbooks, the items on the model exams were not multidimensional. This points to a limitation of the exams in covering the contents indicated in the syllabus. Mathematics tests, according to Shayer and Adhami [13], should be multidimensional in order to assess the required skills and cover the topic under consideration. In this regard, the model tests failed to assess the necessary skills and knowledge.

The interviewed teachers confirmed that some of the test items were ambiguous or confusing. When students have difficulty interpreting the questions due to ambiguity, it can result in assessing students’ abilities to decode the questions or guess the answers, instead of assessing their knowledge and skills.

Concerning layout and arrangement, the teachers observed that items on the exams were not organized by topic and did not follow the order in which topics were presented in class. Nor were they arranged in order of difficulty. Ijeom and Idongesit [39] investigated the effect of test item arrangement on performance in mathematics among junior secondary school students and found that arranging test items in ascending order of difficulty has a significant positive effect on performance. Overall, concerning the content validity of the model examinations, almost all interviewees agreed that the model exams were not standardized tests and had relatively low content validity. This supports the statistical results obtained.

#### 5. Conclusions and Implications

The study results showed that the sample model exam items and the exercises for the major contents of the textbooks were not strongly associated. The number of items in the sample model exams was not in proportion to the periods allotted to the major contents of the syllabus. Regarding the emphasis given to the categories of the cognitive domain, there was a mismatch between the sample model exams and the learning outcomes of the syllabus. This may have happened because teachers, curriculum designers, and experts in the education sector have not paid the necessary attention to formulating contents, objectives, and exercises drawn from the textbooks and syllabi. Therefore, further steps should be taken to scrutinize the problem and find an appropriate remedy.

The findings also show that the grade 10 mathematics model exams were deficient in content validity; that is, the exams did not measure the required learning outcomes and did not reflect the main topics on which the textbooks focused. This implies that neglecting the content validity of tests leads students away from the goals of the syllabus, resulting in lower exam scores and weaker development of mathematics knowledge, skills, and attitudes.

The results of the study showed that the items on the model exams were not related to the activities, group work, and exercises given under the major topics of the textbooks. This affects the motivation of students to practice the exercises given in the textbooks. Moreover, the findings indicated that the exams were ambiguous, had many mistakes, were poor in layout, and were not multidimensional. From this, it can be said that there has been a poor examination development trend in the regional state. The implication is that appropriate steps were not taken by teachers and concerned bodies to develop sound and valid tests to measure students’ performance in mathematics.

Finally, the study's findings imply that, as many scholars have stated, poor test quality has a negative impact on students' scores and on the quality of education in the region. In developing test items, attention should be given to validity, reliability, and practical applicability. To develop a mathematics model examination that attains content validity, the office in charge should first prepare a well-developed test plan that appropriately represents the contents and learning outcomes of the syllabus. When exams are prepared at regional and national levels, experts in the field should be consulted, and the items should be reviewed for context and clarity. The implication is that professionals involved in syllabus design, textbook preparation, and exam preparation should be qualified. More importantly, teachers and experts who are responsible for preparing examinations should receive the essential orientation through ongoing training in order to prepare high-quality tests and examinations.

#### Data Availability

All the data and tables used for the analysis are included in the supplemental files (tables).

#### Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

The authors would like to thank Mr. Alemayehu Negash, Haramaya University, for his valuable comments and suggestions given during the preparation of this article.