Applied Computational Intelligence and Soft Computing


Research Article | Open Access

Volume 2021 |Article ID 8899649 | https://doi.org/10.1155/2021/8899649

Thepparit Banditwattanawong, Masawee Masdisornchote, "On Characterization of Norm-Referenced Achievement Grading Schemes toward Explainability and Selectability", Applied Computational Intelligence and Soft Computing, vol. 2021, Article ID 8899649, 14 pages, 2021. https://doi.org/10.1155/2021/8899649

On Characterization of Norm-Referenced Achievement Grading Schemes toward Explainability and Selectability

Academic Editor: Christian Dawson
Received: 07 Aug 2020
Revised: 26 Dec 2020
Accepted: 08 Feb 2021
Published: 20 Feb 2021

Abstract

Grading is the process of interpreting learning competence to inform learners and instructors of current learning ability levels and necessary improvement. For norm-referenced grading, instructors use a conventional statistical method, the z score. It is difficult for such a method to achieve explainable grade discrimination that can resolve disputes between learners and instructors. To overcome this difficulty, this paper proposes a simple and efficient algorithm for explainable norm-referenced grading. Moreover, the rise of artificial intelligence makes machine learning techniques attractive for norm-referenced grading in general. This paper therefore also investigates two popular clustering methods, K-means and partitioning around medoids. The experiment relied on data sets of various score distributions and a metric, namely, the Davies–Bouldin index. The comparative evaluation reveals that our algorithm outperforms the other three methods overall and is appropriate for all kinds of data sets in almost all cases. Our findings moreover lead to a practically useful guideline for selecting an appropriate grading method among the clustering methods and the z score.

1. Introduction

In both formal and informal education, grading is the process of interpreting learning competence to inform learners and instructors of current learning ability levels and necessary improvement. There are basically two types of nonbinary grading systems [1]: criterion-referenced grading and norm-referenced grading. The former normally calculates the percentage of a learning score and maps it to the predefined percent range of a specific grade. This grading system is suitable for an examination that covers all content topics of learning and thus requires long exam-taking as well as answer-checking times. In contrast, large classes and/or large courses widely use the norm-referenced grading system to meet exam-taking time constraints and to save exam-answer-checking resources. Such a system compares the score of each individual to relative criteria defined based on all individuals' scores to determine a proper grade. The criteria are set by conventional statistical means, either without or with conditions (e.g., a class's grade point average (GPA) must be kept below 3.25).

This paper focuses on unconditional norm-referenced grading. The type of problem the paper targets is data clustering whose difficulty is that the reasons behind cluster boundaries must be explainable as the first priority. The concrete instance is norm-referenced grading, where the difficulty lies in making learners whose scores are contiguously ranked accept, without doubt, that they received different grades (i.e., their scores fell in different cluster boundaries). In our experience, this classical problem has long made graders seriously reluctant to resolve disputes with learners. Consider the following example: given a simplified series of ranked scores …, 84, 80, 78, …, performing norm-referenced grading on such a score series with a traditional method may result in grades …, A, B, B, …, respectively. The learner who scores 80 can object to receiving B rather than A. It is difficult not only for the grader to explain the entire steps of the traditional method (which is complicated) but also for the learner to understand them. Our algorithm provides a simple and clear-cut justification based on the widest score gaps: "because 80 is closer to 78 than to 84, 80 should be assigned the same performance level as 78 rather than 84."

The rise of artificial intelligence nowadays makes machine learning techniques attractive for norm-referenced grading. We therefore investigate an opportunity to adopt four methods from the realms of statistics and machine learning: our novel algorithm, a conventional statistical method, and two unsupervised machine learning techniques, namely, K-means and partitioning around medoids (PAM), aka K-medoids. We selected unsupervised learning techniques since norm-referenced grading cannot have a training data set. In particular, we selected K-means and PAM as they are among the few well-known clustering algorithms that allow us to specify the number of output clusters to represent the desired number of grades (as specified by an employed grading policy). Therefore, both K-means and PAM are naturally applicable to norm-referenced grading. The grading results of each approach will be measured and compared based on practical data sets of various distribution characteristics.

The main contributions of this paper are a simple and efficient grading algorithm and a novel insight into the performance of the statistical method, the machine learning methods, and our algorithm in unconditional norm-referenced grading. To the best of our knowledge, we also demonstrate for the first time the applicability of the K-means and PAM clustering techniques to norm-referenced grading. The merit of this paper would help graders worldwide with the selection of the right grading method to meet their objectives.

The rest of this paper is organized as follows. Section 2 explores previously existing research studies. Section 3 explains the z score grading method. Section 4 reviews the machine learning techniques, K-means and PAM, applicable to norm-referenced grading. Section 5 explains our proposed grading algorithm. Section 6 justifies a grading performance metric in terms of clustering quality. Section 7 evaluates our algorithm and the z score, K-means, and PAM methods on normal and asymmetric distribution data sets. Section 8 discusses the main findings. Section 9 draws the conclusion.

2. Related Work

As for applying a machine learning clustering technique to learners' achievement, Arora and Badal [2] analyzed the competency of students by using K-means. The competency is attributed by 10-subject marks. The centroid of each cluster was mapped to one of the grade symbols A to G. The resulting grade of each cluster was the competency indicator of the students belonging to that cluster. Academic planners could use such an indicator to take appropriate action to remedy the students. Similarly, Borgavakar and Shrivastava [3] clustered GPAs and internal class assessments (e.g., class test marks, lab performance, assignments, quizzes, and attendance) separately by using K-means. Each student's competency was therefore associated with several clusters, which were used to create a set of rules for classifying the student. Weak students were identified before the final exam to reduce the ratio of failing students. Research by Parveen et al. [4] employed K-means to create 9 groups of GPAs: exceptional, excellent, superior, very good, above average, good, high pass, pass, and fail. Students whose GPAs belonged to the exceptional and the fail groups were called gifted and dunce, respectively. The gifted students had their knowledge enhanced, whereas the dunce students were remedied through differentiated instruction. Research by Shankar et al. [5] clustered students from different countries based on the following attributes: average grade, the number of events participated in, the number of active days, and the number of chapters attended. An optimal k value of K-means was determined by means of the Silhouette index, resulting in k = 3. Among the 3 clusters, the most compact cluster (i.e., the cluster with the least within-cluster sum of squares) was further analyzed for correlation between the average grade and the other attributes.
Xi [6] utilized K-means to cluster students' test scores into 4 classes (excellent, good, moderate, and underachiever) so that appropriate self-development and teaching strategies could be adopted for treatment. Research by Iqbal et al. [7] explored several machine learning techniques for early grade prediction to allow instructors to improve students' competency in early stages. In that work, the Restricted Boltzmann Machine was found to be the most accurate for students' grade prediction. K-means was also used to cluster students based on technical and nontechnical course performance.

Regarding automated grading and scoring approaches, Raman and Joachims [8] proposed a peer grading method to enable student evaluation at scale by having students assess each other. Since students are not trained in grading, the method enlisted probabilistic models and ordinal peer feedback to solve a rank aggregation problem. Bai and Chen [9] proposed a method to automatically construct grade membership functions (lenient-type, strict-type, and normal-type grades) to perform fuzzy reasoning to infer students' scores.

This paper significantly extends our preliminary work [10] with a full-fledged algorithm, a new practical data set, a newly experimented machine learning method, a set of new findings, and a novel guideline for method selection.

3. Conventional Statistical Grading

The conventional statistical grading method relies on z scores and t scores [1]. A z score is a measure of how many standard deviations below or above the population mean a raw score is. The z score (z) is technically defined in (1) as the signed fractional number of standard deviations (σ) by which the value of an observation or data point x lies above the mean value (μ) of what is being observed or measured:

z = (x − μ)/σ.  (1)

Observed values above the mean have positive z scores; values below the mean have negative z scores.

The t score converts individual scores into a standard form and behaves much like the z score when the sample size is above 30. In psychometrics, the t score (t) is a z score shifted and scaled to have a mean of 50 and a standard deviation of 10, as in (2):

t = 10z + 50.  (2)

The statistical grading method begins by converting raw scores to z scores. The z scores are further converted to t scores to simplify interpretation because t scores normally range from 0 to 100, unlike z scores, which can be negative real numbers. The t scores are then sorted, and the range between the maximum and minimum t scores is divided by the desired number of grades to obtain identical score intervals. The intervals define the t score ranges of all grades. In this way, raw scores can be mapped to z scores, the z scores to t scores, the t scores to t score intervals, and the t score intervals to resulting grades, respectively.
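As a concrete illustration, the whole pipeline above can be sketched in a few lines of Python. This is our sketch, not the paper's code; the function name, the worst-to-best grade list, and the use of the population standard deviation are our assumptions:

```python
import statistics

GRADES = ["F", "D", "C", "B", "A"]          # worst to best (assumed 5-grade policy)

def z_score_grades(scores):
    """Raw scores -> z scores -> t scores (t = 10z + 50) -> equal t intervals -> grades."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)       # population std dev (our assumption)
    t_scores = [10 * (x - mu) / sigma + 50 for x in scores]
    lo, hi = min(t_scores), max(t_scores)
    step = (hi - lo) / len(GRADES)          # identical t score interval per grade
    # clamp the maximum t score into the top interval
    return [GRADES[min(int((t - lo) / step), len(GRADES) - 1)] for t in t_scores]
```

For example, `z_score_grades([88, 86, 84, 79, 78, 60, 50, 40, 38])` assigns A to the top score and F to the bottom one, with the rest falling into the intermediate equal-width t score intervals.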

4. Machine Learning-Based Grading

This section explains how to apply the K-means and PAM clustering algorithms to norm-referenced grading, a task that is natural for unsupervised rather than supervised learning. K-means and PAM were selected since both allow specifying the number of clusters in advance to match the number of eligible grades known a priori.

4.1. K-Means

K-means [11] is an unsupervised machine learning technique for partitioning n objects into k clusters. K-means begins by randomizing k centroids, one for each cluster. It then assigns every object to the cluster whose centroid is nearest to the object and recalculates the mean of the objects assigned to each cluster to serve as the k new centroids, aka barycenters, of the clusters. The object assignment and the centroid recalculation are iterated until no more objects move between clusters. In other words, the K-means algorithm aims at minimizing the objective function J = Σ_{j=1}^{k} Σ_{i=1}^{n_j} |x_i − c_j|², where n_j is the number of objects in cluster j, x_i = <x_i1, x_i2, …, x_im> is an object in cluster j whose centroid is c_j, x_i1 to x_im are the features of x_i, and |x_i − c_j| is the Euclidean distance. Also, note that the initial centroid randomization can result in different final clusters.

When applying the K-means algorithm to higher educational grading, k is set to the number of eligible grades. Graders must decide such a number in advance.
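Since grading operates on one-dimensional score vectors, a compact 1-D sketch of the procedure may clarify it. This is our illustrative code, not the paper's implementation; the seeded sampling of initial centroids from the data is a simplification of the random initialization described above:

```python
import random

def kmeans_1d(scores, k, seed=0, max_iter=100):
    """Minimal 1-D K-means; returns one cluster label per score."""
    rng = random.Random(seed)
    centroids = rng.sample(scores, k)          # random initial centroids
    labels = None
    for _ in range(max_iter):
        # assignment step: each score joins the cluster of its nearest centroid
        new_labels = [min(range(k), key=lambda j: abs(x - centroids[j]))
                      for x in scores]
        if new_labels == labels:               # no object moved between clusters
            break
        labels = new_labels
        # update step: each centroid becomes the mean (barycenter) of its cluster
        for j in range(k):
            members = [x for x, l in zip(scores, labels) if l == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels
```

For grading, k would be set to the number of eligible grades and each resulting cluster mapped to one grade symbol in score order.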

4.2. Partitioning around Medoids

Unlike K-means, which represents each cluster with the mean value of the objects within it, PAM [12] represents each cluster by the one of its objects nearest to the cluster's center. PAM proceeds in two phases. In the first phase, build, select the k objects nearest to the center of all other unselected objects; these k objects, called medoids, are selected one by one. In the second phase, swap, assign all unselected objects to their nearest medoids to obtain k initial clusters. For each cluster, calculate the average dissimilarity (i.e., average distance) between the medoid and the other objects. Then, for such a cluster, search whether any object, if it became the new medoid, would minimize the average dissimilarity. If one does, select that object as the new medoid. Once all clusters have been searched, if at least one medoid has changed, repeat the second phase; otherwise, PAM ends.

Similar to applying K-means, PAM requires that k be set to the number of eligible grade symbols beforehand.
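A naive sketch of the two phases for 1-D scores may make the build/swap loop concrete. This is our illustration under a simplifying assumption (we greedily minimize the total distance to the nearest medoid rather than per-cluster average dissimilarity, which is equivalent for the swap acceptance test); it is not an optimized PAM implementation:

```python
def pam_1d(scores, k):
    """Naive PAM sketch: greedy BUILD phase, then SWAP until no improvement."""
    def total_cost(medoids):
        # each object contributes its distance to the nearest medoid
        return sum(min(abs(x - m) for m in medoids) for x in scores)

    # BUILD: pick, one by one, the object that most reduces the total distance
    medoids = []
    for _ in range(k):
        candidates = [c for c in scores if c not in medoids]
        medoids.append(min(candidates, key=lambda c: total_cost(medoids + [c])))

    # SWAP: try replacing each medoid with each non-medoid object
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for c in scores:
                if c in medoids:
                    continue
                trial = medoids[:i] + [c] + medoids[i + 1:]
                if total_cost(trial) < total_cost(medoids):
                    medoids, improved = trial, True

    # assignment: each score gets the label of its nearest medoid
    return [min(range(k), key=lambda j: abs(x - medoids[j])) for x in scores]
```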

5. Proposed Grading Algorithm

This section proposes a statistical algorithm for norm-referenced unconditional grading. The algorithm works step by step as defined in Algorithm 1.

Input
  S: vector of learners’ scores
  GS: set of ranked eligible grade symbols
Output
  G: vector of learners’ grades
Local variable
  cnt: number of eligible grades
  SG: vector of score gaps
  R: vector of score ranges
Begin
(1)  S ← sort(S);
(2)  cnt ← countEligibleGrades(GS);
(3)  SG ← calculateAllScoreGaps(S);
(4)  SG ← descendingSort(SG);
(5)  SG ← selectWidestGaps(SG, cnt – 1);
(6)  R ← defineScoreRangesFromGaps(SG);
(7)  G ← grades(S, R);
End

The algorithm is explained as follows. In line 1, sort(S) initially ranks the scores of the learners within a group from the best down to the worst. In line 2, countEligibleGrades(GS) counts the number of eligible grades. In line 3, calculateAllScoreGaps(S) sequentially goes through the ranked score list to determine the gap between every two contiguous scores (i.e., their score difference). Line 4 sorts the gaps in descending order. In line 5, selectWidestGaps(SG, cnt − 1) selects the set of maximum gaps whose size equals the number of eligible grades minus one. For instance, four eligible grades require four score ranges; thus, selectWidestGaps(SG, cnt − 1) returns the first three maximum gaps. In case some gaps are identical, the function returns the gaps of the scores that are closest to the middle of the score rank. In line 6, defineScoreRangesFromGaps(SG) creates a series of score ranges, each of which is associated with an eligible grade; for instance, the score range of grade B might be 76 to 82 points. Finally, grades(S, R) in line 7 assigns proper grades to all scores based on the defined score ranges. In this way, our algorithm is simple, while its performance will be proved in Section 7.
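Under our reading of Algorithm 1, the steps above can be sketched as follows. All names are ours, and the middle-preference tie-breaking rule of line 5 for identical gaps is omitted for brevity:

```python
def widest_gap_grades(scores, grade_symbols):
    """Sketch of Algorithm 1: cut the ranked scores at the cnt - 1 widest gaps."""
    ranked = sorted(scores, reverse=True)                        # line 1
    cnt = len(grade_symbols)                                     # line 2
    gaps = [(ranked[i] - ranked[i + 1], i)                       # line 3: gap + position
            for i in range(len(ranked) - 1)]
    widest = sorted(gaps, reverse=True)[:cnt - 1]                # lines 4-5
    cuts = sorted(i for _, i in widest)                          # line 6: range borders
    grade_of = {}
    symbol, prev = iter(grade_symbols), 0
    for cut in cuts + [len(ranked) - 1]:                         # line 7
        g = next(symbol)
        for i in range(prev, cut + 1):
            grade_of[ranked[i]] = g
        prev = cut + 1
    return [grade_of[s] for s in scores]
```

On the introductory example, `widest_gap_grades([84, 80, 78], ["A", "B"])` yields `["A", "B", "B"]`: the widest gap (4 points) lies between 84 and 80, so 80 shares the level of the nearer 78.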

As for the cost effectiveness of the proposed algorithm, we analyze its computational complexity as follows. Let n be the number of scores to be graded (i.e., |S|). In the worst case, sort(S) in line 1 finishes in n log n, countEligibleGrades(GS) takes |GS|, calculateAllScoreGaps(S) takes n to do the subtractions between every two consecutive scores, descendingSort(SG) takes (n − 1) log (n − 1), selectWidestGaps(SG, cnt − 1) takes |GS| − 1, defineScoreRangesFromGaps(SG) takes |GS|, and grades(S, R) takes n. Therefore, the algorithm takes at most n log n + |GS| + n + (n − 1) log (n − 1) + (|GS| − 1) + |GS| + n. Supposing that n is much greater than |GS|, our algorithm is O(n log n), which is relatively tractable.

Remark that our algorithm takes only two input parameters, the learners' scores and the eligible grades, while the local variables of the algorithm are used for temporary value assignment rather than as controlling parameters. Also, all of the functions called in our algorithm perform straightforward tasks, as implied by their names, without any tuning parameters. Therefore, our algorithm keeps users free from any parameter-tuning burden.

6. Grading Performance Measurement

In this paper, the performance of each grading method is represented by clustering quality. The quality of clustering results can be measured by using a well-known metric, namely, the Davies–Bouldin index (DBI). We employed DBI instead of another related metric, Silhouette, because DBI is computationally much less complex and thus easier for practicing graders to follow. Let us denote by δ_j the mean intracluster distance of the n_j points (each of which is expressed as x_i) belonging to cluster C_j to their barycenter c_j: δ_j = (1/n_j) Σ_{x_i ∈ C_j} |x_i − c_j|. Let us also denote the distance between the barycenters c_j and c_j′ of clusters C_j and C_j′ by Δ_jj′ = |c_j′ − c_j|. DBI is computed by using (3) [13]:

DBI = (1/k) Σ_{j=1}^{k} max_{j′ ≠ j} (δ_j + δ_j′)/Δ_jj′.  (3)

The lower the DBI, the better the quality of the clustering results (i.e., low-DBI clusters have low intracluster distances and high intercluster distances).

The underlying reason for using DBI as the grading performance metric in norm-referenced grading is intuitive. Learners with very similar achievement should receive the same grade (equivalent to low intracluster distances), and different grades must discriminate achievements between groups of learners as clearly as possible (equivalent to high intercluster distances). The DBI value will be low (i.e., a better grading performance result) if clusters are compact and far away from one another.
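For one-dimensional scores, the definitions of δ_j, Δ_jj′, and (3) translate directly into a few lines; the sketch below is our code, not part of the paper, and takes cluster labels as produced by any of the four methods:

```python
def davies_bouldin_1d(scores, labels):
    """DBI per (3): mean over clusters of the worst (delta_j + delta_j') / Delta_jj'."""
    clusters = {}
    for x, l in zip(scores, labels):
        clusters.setdefault(l, []).append(x)
    cent = {l: sum(v) / len(v) for l, v in clusters.items()}         # barycenters c_j
    delta = {l: sum(abs(x - cent[l]) for x in v) / len(v)            # intracluster delta_j
             for l, v in clusters.items()}
    keys = list(clusters)
    return sum(max((delta[j] + delta[jp]) / abs(cent[j] - cent[jp])  # Delta_jj'
                   for jp in keys if jp != j)
               for j in keys) / len(keys)
```

For instance, two compact clusters {0, 2} and {10, 12} have δ = 1 each and barycenter distance 10, giving DBI = 0.2.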

7. Evaluation

We evaluated our algorithm and the z score, K-means, and PAM methods in unconditional norm-referenced grading. The experimental configuration and the data sets' characteristics are described first. Then, grading results along with performance metrics are provided.

7.1. Experimental Configuration

A grading policy that evaluated the scores into 5 eligible grades (A, B, C, D, and F) without any class GPA constraint was employed. The grading policy was implemented in 4 ways by using our algorithm and the z score, K-means, and PAM methods. The number of clusters was predefined as 5 (i.e., the 5 eligible grades) for K-means and PAM. Each method had its performance measured with the DBI metric as if the grades represented distinct clusters.

Six data sets of accumulative term scores were used to ensure fair comparison among the grading methods. We characterized the data sets by data distribution in order to verify their coverage of all possible distribution patterns (i.e., the representativeness of various case studies). In particular, the employed data distribution patterns included the normal distribution (the ND data set in Table 1) and positively and negatively skewed distributions (the SD+ and SD− data sets in Tables 2 and 3). The algorithm's effectiveness was also double-checked by using two additional data sets with slightly negatively and positively skewed distributions (the RD− and RD+ data sets in Tables 4 and 5). Last but not least, another rare data set with an exclusively wide score gap (the WD data set in Table 6) was also exploited. The scores relied on a scale of 0.0 to 100.0 points. A one-dimensional vector was used to represent each data set, as shown in Tables 1–6, so that readers could dive deep into the scores to judge the effectiveness of each applied method. Every data set is also described in terms of statistics along with its distribution pattern.


Record#  Score

1  88
2  86
3  84
4  79
5  78
6  77
7  76
8  75
9  74
10  73
11  72
12  67
13  66
14  65
15  64
16  63
17  62
18  61
19  60
20  59
21  54
22  53
23  52
24  51
25  50
26  49
27  48
28  47
29  42
30  40
31  38


Record#  Score

1  92
2  90
3  89
4  86
5  77
6  74
7  73
8  73
9  73
10  65
11  62
12  61
13  60
14  54
15  53
16  53
17  53
18  52
19  52
20  52
21  52
22  52
23  51
24  51
25  51
26  51
27  50
28  50
29  46
30  46
31  45


Record#  Score

1  94
2  93
3  87
4  87
5  87
6  87
7  86
8  86
9  86
10  85
11  85
12  85
13  84
14  84
15  83
16  82
17  77
18  75
19  74
20  73
21  72
22  65
23  64
24  63
25  62
26  61
27  52
28  50
29  38
30  36
31  34


Record#  Score

1  80.8
2  80.2
3  78.7
4  76.8
5  76.1
6  75.2
7  75.1
8  72.5
9  72.1
10  71.6
11  70.8
12  70.6
13  69.1
14  68.7
15  68
16  67.6
17  66.7
18  66.7
19  65.8
20  63.5
21  61.6
22  61.5
23  61.4
24  60.7
25  60.5
26  59.2
27  58.7
28  58.5
29  57.8
30  57.4
31  56.6
32  55.7
33  55.5
34  55.5
35  55.2
36  55.2
37  55.1
38  54.7
39  53.9
40  52.6
41  52.5
42  51.7
43  51.3
44  51
45  50.7
46  50
47  48.8
48  48.7
49  48.6
50  46.7
51  46.4
52  46.2
53  45
54  44.9
55  44.6
56  44.5
57  43.5
58  42
59  35.7
60  28.4
61  28


Record#  Score

1  89.47
2  87.1
3  82.73
4  82.53
5  82.53
6  82.17
7  80.7
8  80.5
9  79.97
10  79.43
11  79.3
12  78.9
13  78.47
14  78.27
15  77.87
16  77.87
17  75.73
18  74.57
19  73.3
20  73.2
21  73.1
22  72.83
23  72.63
24  72.1
25  71.83
26  71.77
27  70.8
28  70.4
29  70.23
30  70.2
31  70.2
32  69.43
33  69.17
34  69.17
35  69.1
36  68.77
37  68.6
38  68.27
39  67.87
40  67.77
41  67.63
42  67.63
43  67.57
44  67.33
45  67.1
46  67
47  66.77
48  66.73
49  66.4
50  66.37
51  66.37
52  66.1
53  65.87
54  65.8
55  64.77
56  64.73
57  64.73
58  64.57
59  64.57
60  64.3
61  64.17
62  64.13
63  63.93
64  63.9
65  63.57
66  63
67  62.83
68  60.63
69  60.33
70  59.83
71  58.93
72  58.87
73  58.53
74  58.47
75  58.27
76  57.53
77  57
78  56.77
79  55
80  54.8
81  54.57
82  54.5
83  54.5
84  54.43
85  54.37
86  53.8
87  53.73
88  53.37
89  53.37
90  52.87
91  52.47
92  52.1
93  52
94  51.97
95  51.8
96  50.9
97  50.7
98  50.2
99  50.1
100  45


Record#  Score

1  98
2  97
3  93
4  92
5  91
6  90
7  89
8  87
9  87
10  86
11  85
12  85
13  84
14  83
15  83
16  82
17  81
18  81
19  79
20  30
21  28
22  27
23  26
24  25
25  23
26  22
27  21
28  21
29  20
30  18
31  17

The first data set, namely, ND, has a normal distribution. Table 1 shows the raw scores of ND. Mean and median are 63. Mode is unavailable as every score has the same frequency of 1. σ is 13.9.

To comprehend the characteristics of ND, Figure 1 projects its normal distribution. The horizontal axis represents the z score. The curve was computed with (4), where x represents a score:

f(x) = (1/(σ√(2π))) e^{−(x − μ)²/(2σ²)}.  (4)

The area under the curve represents a distribution value [1].

The second and the third data sets have positively and negatively skewed distributions, namely, SD+ and SD−, respectively. A positively skewed distribution is an asymmetric bell shape with a long tail to the right, probably caused by overly difficult exam questions from the viewpoint of learners. Table 2 shows the raw scores of the SD+ set. Mode, median, mean, and σ are 52, 53, 60.9, and 14.236, respectively. Figure 2 depicts the distribution of the SD+ set. Its skewness is heavy and equals 1.006.

A negatively skewed distribution is an asymmetric bell shape with a long tail to the left, probably caused by too easy exam questions from the viewpoint of learners. Table 3 shows the raw scores of SD−. Mode, median, mean, and σ equal 87, 82, 73.5, and 16.929, respectively. Figure 3 depicts the distribution of SD−. The skewness is as heavy as −1.078.

These 3 data sets contain the same number of raw scores and were realistically synthesized to clarify the extreme behaviors of the four studied methods.

The fourth data set, RD−, was collected from a group of 61 real anonymized learners taking the same undergraduate course in the academic year 2019. Unlike SD+ and SD−, which are heavily skewed, RD− (and RD+) represents an imperfectly normal (i.e., slightly skewed) distribution. RD− in Table 4 has a slightly negative skew of −0.138, as shown in Figure 4. Mode, median, mean, and σ equal 66.7, 56.6, 57.9, and 12.136, respectively.

The fifth data set, RD+, comprises the real term scores of another group of 100 anonymized learners from another anonymized university. Opposite to RD−, RD+ has a slightly positive skew of 0.155. The characteristics of RD+ are shown in Table 5 and Figure 5. Mode, median, mean, and σ equal 82.5, 66.4, 65.7, and 9.662, respectively.

The last data set, WD, consists of a broad range of scores with a relatively wide gap. Such a score pattern exists in a group of learners with a learning competency divide; as a result, some intermediate grade ought to be skipped. The characteristics of WD are shown in Table 6. A significant gap lies between the scores 79 and 30, as depicted in Figure 6. Mode, median, mean, and σ equal 87, 82, 62.3, and 31.975, respectively. WD has a moderately negative skew of −0.450.

7.2. Grading Result

We graded the ND data set by using the proposed algorithm and the z score, K-means, and PAM methods and reported their results, respectively, in angle brackets, <our-algorithm grade, z score grade, K-means grade, PAM grade>, as shown in Table 7, resulting in an N × 4 matrix whose N rows equal the number of scores. Our algorithm delivered exactly the same results as those of K-means; both methods' DBIs equaled 0.330. The z score method yielded a DBI of 0.443. It might be questionable from the student viewpoint why graders using the z score gave learners who scored 78 and 79 the same grade A as the one who scored 84, and the holder of 47 marks the same grade F as the one with 42 marks. These happen simply because 78 and 79 fell in the same z score interval of A while 47 fell in the z score interval of F. PAM also yielded a DBI of 0.330 despite assigning noticeably many A grades.


Score  Grade

88  <A, A, A, A>
86  <A, A, A, A>
84  <A, A, A, A>
79  <B, A, B, A>
78  <B, A, B, A>
77  <B, B, B, A>
76  <B, B, B, A>
75  <B, B, B, A>
74  <B, B, B, A>
73  <B, B, B, A>
72  <B, B, B, A>
67  <C, C, C, A>
66  <C, C, C, A>
65  <C, C, C, A>
64  <C, C, C, A>
63  <D, D, D, A>
62  <D, D, D, A>
61  <C, C, C, A>
60  <C, C, C, A>
59  <C, C, C, A>
54  <C, C, C, B>
53  <C, C, C, B>
52  <D, D, D, B>
51  <D, D, D, B>
50  <D, D, D, B>
49  <D, D, D, C>
48  <D, D, D, D>
47  <D, F, D, D>
42  <F, F, F, F>
40  <F, F, F, F>
38  <F, F, F, F>

We also graded the SD+ data set with our algorithm and the z score, K-means, and PAM methods, as shown in Table 8. Our algorithm delivered the same results as K-means and PAM; their DBIs were 0.222. The z score method gave a DBI of 0.575 and produced many F grades.


Score  Grade

92  <A, A, A, A>
90  <A, A, A, A>
89  <A, A, A, A>
86  <A, A, A, A>
77  <B, B, B, B>
74  <B, B, B, B>
73  <B, C, B, B>
73  <B, C, B, B>
73  <B, C, B, B>
65  <C, C, C, C>
61  <C, D, C, C>
61  <C, D, C, C>
60  <C, D, C, C>
54  <D, F, D, D>
53  <D, F, D, D>
53  <D, F, D, D>
53  <D, F, D, D>
52  <D, F, D, D>
52  <D, F, D, D>
52  <D, F, D, D>
52  <D, F, D, D>
52  <D, F, D, D>
51  <D, F, D, D>
51  <D, F, D, D>
51  <D, F, D, D>
51  <D, F, D, D>
50  <D, F, D, D>
50  <D, F, D, D>
46  <F, F, F, F>
46  <F, F, F, F>
45  <F, F, F, F>

Next, we graded the SD− data set, as shown in Table 9. Our algorithm delivered a DBI of 0.299, while the DBIs of the z score, K-means, and PAM methods were all 0.233.


Score  Grade

94  <A, A, A, A>
93  <A, A, A, A>
87  <B, A, A, A>
87  <B, A, A, A>
87  <B, A, A, A>
87  <B, A, A, A>
86  <B, A, A, A>
86  <B, A, A, A>
86  <B, A, A, A>
85  <B, A, A, A>
85  <B, A, A, A>
85  <B, A, A, A>
84  <B, A, A, A>
84  <B, A, A, A>
83  <B, A, A, A>
82  <B, A, A, A>
77  <B, B, B, B>
75  <B, B, B, B>
74  <B, B, B, B>
73  <B, B, B, B>
72  <B, B, B, B>
65  <C, C, C, C>
64  <C, C, C, C>
63  <C, C, C, C>
62  <C, C, C, C>
61  <C, C, C, C>
52  <D, D, D, D>
50  <D, D, D, D>
38  <F, F, F, F>
36  <F, F, F, F>
34  <F, F, F, F>

In practice, there is no perfectly normal distribution with respect to learners' achievement. Now, experimental results based on data sets having slightly skewed distributions are described. We graded the RD− data set with our algorithm and the z score, K-means, and PAM methods, as shown in Table 10. The gap column shows the difference between every two consecutive scores (i.e., the results of the calculateAllScoreGaps() function in Algorithm 1) as utilized by our algorithm, where the 4 widest gaps were used as grading steps.


Score  Gap  Grade

80.8  -  <A,A,A,A>
80.2  0.6  <A,A,A,A>
78.7  1.5  <A,A,A,A>
76.8  1.9  <A,A,A,A>
76.1  0.7  <A,A,A,A>
75.2  0.9  <A,A,A,A>
75.1  0.1  <A,A,A,A>
72.5  2.6  <B,A,B,A>
72.1  0.4  <B,A,B,A>
71.6  0.5  <B,A,B,A>
70.8  0.8  <B,A,B,A>
70.6  0.2  <B,A,B,A>
69.1  1.5  <B,B,B,A>
68.7  0.4  <B,B,B,A>
68  0.7  <B,B,B,A>
67.6  0.4  <B,B,B,A>
66.7  0.9  <B,B,B,A>
66.7  0  <B,B,B,A>
65.8  0.9  <B,B,B,A>
63.5  2.3  <C,B,B,A>
61.6  1.9  <C,B,C,A>
61.5  0.1  <C,B,C,A>
61.4  0.1  <C,B,C,A>
60.7  0.7  <C,B,C,A>
60.5  0.2  <C,B,C,A>
59.2  1.3  <C,C,C,A>
58.7  0.5  <C,C,C,A>
58.5  0.2  <C,C,C,A>
57.8  0.7  <C,C,C,A>
57.4  0.4  <C,C,C,A>
56.6  0.8  <C,C,C,A>
55.7  0.9  <C,C,C,A>
55.5  0.2  <C,C,C,A>
55.5  0  <C,C,C,A>
55.2  0.3  <C,C,C,A>
55.2  0  <C,C,C,A>
55.1  0.1  <C,C,C,A>
54.7  0.4  <C,C,C,A>
53.9  0.8  <C,C,C,A>
52.6  1.3  <C,C,C,A>
52.5  0.1  <C,C,C,A>
51.7  0.8  <C,C,D,A>
51.3  0.4  <C,C,D,A>
51  0.3  <C,C,D,A>
50.7  0.3  <C,C,D,A>
50  0.7  <C,C,D,A>
48.8  1.2  <C,C,D,A>
48.7  0.1  <C,C,D,A>
48.6  0.1  <C,D,D,A>
46.7  1.9  <C,D,D,A>
46.4  0.3  <C,D,D,A>
46.2  0.2  <C,D,D,A>
45  1.2  <C,D,D,A>
44.9  0.1  <C,D,D,A>
44.6  0.3  <C,D,D,A>
44.5  0.1  <C,D,D,A>
43.5  1  <C,D,D,A>
42  1.5  <C,D,D,B>
35.7  6.3  <D,F,F,C>
28.4  7.3  <F,F,F,F>
28  0.4  <F,F,F,F>

All four methods produced different grading results. In particular, our algorithm and K-means assigned A to the same group of learners, whereas the z score and K-means methods gave F to the same group of learners. Our algorithm had a DBI of 0.375, whereas the K-means, PAM, and z score methods gave DBIs of 0.469, 0.474, and 0.492, respectively. Therefore, our algorithm delivered the best grading results for RD−. Our algorithm accomplished the lowest DBI partly because grade D has only one member score, comparable to the smallest possible cluster, which DBI favors.

We graded the RD+ data set as shown in Table 11. With this large data set, the grading results of all methods are totally different. Our algorithm and the z score, K-means, and PAM methods yielded DBIs of 0.345, 0.529, 0.486, and 0.487, respectively, meaning that our algorithm defeated the others.


Score  Gap  Grade

89.47  -  <A,A,A,A>
87.1  2.37  <A,A,A,A>
82.73  4.37  <B,A,A,A>
82.53  0.2  <B,A,A,A>
82.53  0  <B,A,A,A>
82.17  0.36  <B,A,A,A>
80.7  1.47  <B,A,A,A>
80.5  0.2  <B,B,A,A>
79.97  0.53  <B,B,A,A>
79.43  0.54  <B,B,A,A>
79.3  0.13  <B,B,A,A>
78.9  0.4  <B,B,A,A>
78.47  0.43  <B,B,A,A>
78.27  0.2  <B,B,A,A>
77.87  0.4  <B,B,A,A>
77.87  0  <B,B,A,A>
75.73  2.14  <C,B,B,A>
74.57  1.16  <C,B,B,A>
73.3  1.27  <C,B,B,A>
73.2  0.1  <C,B,B,A>
73.1  0.1  <C,B,B,A>
72.83  0.27  <C,B,B,A>
72.63  0.2  <C,B,B,A>
72.1  0.53  <C,B,B,A>
71.83  0.27  <C,B,B,A>
71.77  0.06  <C,B,B,A>
70.8  0.97  <C,C,B,A>
70.4  0.4  <C,C,B,A>
70.23  0.17  <C,C,B,A>
70.2  0.03  <C,C,B,A>
70.2  0  <C,C,B,A>
69.43  0.77  <C,C,B,A>
69.17  0.26  <C,C,B,A>
69.17  0  <C,C,B,A>
69.1  0.07  <C,C,B,A>
68.77  0.33  <C,C,B,A>
68.6  0.17  <C,C,B,A>
68.27  0.33  <C,C,C,A>
67.87  0.4  <C,C,C,A>
67.77  0.1  <C,C,C,A>
67.63  0.14  <C,C,C,A>
67.63  0  <C,C,C,A>
67.57  0.06  <C,C,C,A>
67.33  0.24  <C,C,C,A>
67.1  0.23  <C,C,C,A>
67  0.1  <C,C,C,A>
66.77  0.23  <C,C,C,A>
66.73  0.04  <C,C,C,A>
66.4  0.33  <C,C,C,A>
66.37  0.03  <C,C,C,A>
66.37  0  <C,C,C,A>
66.1  0.27  <C,C,C,A>
65.87  0.23  <C,C,C,A>
65.8  0.07  <C,C,C,A>
64.77  1.03  <C,C,C,A>
64.73  0.04  <C,C,C,A>
64.73  0  <C,C,C,A>
64.57  0.16  <C,C,C,A>
64.57  0  <C,C,C,A>
64.3  0.27  <C,C,C,A>
64.17  0.13  <C,C,C,A>
64.13  0.04  <C,C,C,A>
63.93  0.2  <C,C,C,A>
63.9  0.03  <C,C,C,A>
63.57  0.33  <C,C,C,A>
63  0.57  <C,C,C,A>
62.83  0.17  <C,C,C,A>
60.63  2.2  <D,D,D,A>
60.33  0.3  <D,D,D,A>
59.83  0.5  <D,D,D,A>
58.93  0.9  <D,D,D,A>
58.87  0.06  <D,D,D,A>
58.53  0.34  <D,D,D,A>
58.47  0.06  <D,D,D,A>
58.27  0.2  <D,D,D,A>
57.53  0.74  <D,D,D,A>
57  0.53  <D,D,D,A>
56.77  0.23  <D,D,D,A>
55  1.77  <D,D,F,A>
54.8  0.2  <D,D,F,A>
54.57  0.23  <D,D,F,A>
54.5  0.07  <D,D,F,A>
54.5  0  <D,D,F,A>
54.43  0.07  <D,D,F,A>
54.37  0.06  <D,D,F,A>
53.8  0.57  <D,F,F,A>
53.73  0.07  <D,F,F,A>
53.37  0.36  <D,F,F,A>
53.37  0  <D,F,F,A>
52.87  0.5  <D,F,F,B>
52.47  0.4  <D,F,F,B>
52.1  0.37  <D,F,F,C>
52  0.1  <D,F,F,C>
51.97  0.03  <D,F,F,C>
51.8  0.17  <D,F,F,C>
50.9  0.9  <D,F,F,D>
50.7  0.2  <D,F,F,D>
50.2  0.5  <D,F,F,D>
50.1  0.1  <D,F,F,D>
45  5.1  <F,F,F,F>

The WD data set was graded as shown in Table 12. Our algorithm and the z score, K-means, and PAM methods yielded DBIs of 0.403, 0.452, 0.449, and 0.449, respectively. Although our algorithm outperformed the others in terms of DBI, recall that the WD data set has the exceptional pattern of such a significant gap that completely assigning all 5 grades may not be plausible. As shown in Table 12, only the z score method is capable of automatically skipping grades C and D.


Score  Grade

98  <A, A, A, A>
97  <A, A, A, A>
93  <B, A, B, A>
92  <B, A, B, A>
91  <B, A, B, A>
90  <B, A, B, A>
89  <B, A, B, A>
87  <B, A, B, A>
87  <B, A, B, A>
86  <B, A, C, A>
85  <B, A, C, A>
85  <B, A, C, A>
84  <B, A, C, A>
83  <B, A, C, A>
83  <B, A, C, A>
82  <B, A, C, A>
81  <B, B, C, B>
81  <B, B, C, B>
79  <C, B, C, C>
30  <D, F, D, D>
28  <F, F, D, D>
27  <F, F, D, D>
26  <F, F, D, D>
25  <F, F, D, D>
23  <F, F, F, D>
22  <F, F, F, D>
21  <F, F, F, D>
21  <F, F, F, D>
20  <F, F, F, D>
18  <F, F, F, F>
17  <F, F, F, F>

8. Result Analysis, Findings, and Discussion

Figure 7 compares all aforementioned DBIs with respect to each grading method and data set. They can be analyzed as follows. Our algorithm's DBIs have μ = 0.329 and σ = 0.058; the z score method's have μ = 0.454 and σ = 0.109; K-means' have μ = 0.365 and σ = 0.109; and PAM's have μ = 0.366 and σ = 0.110. The overall performance of each method is shown in Figure 8. Since a lower DBI indicates better clustering quality, the stack heights show that our algorithm performs best with the lowest overall DBI, whereas K-means and PAM perform worse, with DBIs 10.90% and 11.21% higher than ours, respectively. The z score method performs worst, with a DBI 38.03% greater than ours. These relative performance differences show the practical significance of our algorithm.
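For readers who wish to reproduce such comparisons, a minimal pure-Python computation of the Davies–Bouldin index for one-dimensional score clusters might look as follows. This sketches the standard DBI definition (centroid-based scatter and centroid separation); equation (3) itself is not restated in this section, so the exact symbol correspondence is an assumption.

```python
from statistics import mean

def davies_bouldin(clusters):
    """Davies-Bouldin index for 1-D score clusters (lower is better)."""
    centroids = [mean(cl) for cl in clusters]
    # intra-cluster scatter: mean absolute deviation from the centroid
    scatters = [mean(abs(x - c) for x in cl)
                for cl, c in zip(clusters, centroids)]
    k = len(clusters)
    worst = []
    for i in range(k):
        # for each cluster, keep its worst (largest) similarity ratio
        worst.append(max((scatters[i] + scatters[j])
                         / abs(centroids[i] - centroids[j])
                         for j in range(k) if j != i))
    return sum(worst) / k

# Example: two tight, well-separated grade groups give a low index.
print(davies_bouldin([[98, 97], [70, 69]]))
```

Here each inner list holds the scores assigned one grade; feeding the grade groups of Tables 7–12 into such a function is how the per-method DBIs above can be checked.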

We also conducted a paired (Student's) t-test to evaluate whether the DBI means of our algorithm are statistically significantly different from those of the other methods. Specifically, the paired t-test compared the DBI means produced by our algorithm with those of the z score, K-means, and PAM methods over the 6 data sets. We used the standard significance level of 0.05 and a hypothesized mean difference of 0 (i.e., the null hypothesis of no DBI difference between methods) to obtain the p value for a one-tailed t-test. A smaller p value means stronger evidence in favor of the alternative hypothesis (i.e., that a DBI difference between methods exists). Firstly, the DBI difference between our algorithm and z score had a p value of 0.040, which was less than 0.050; therefore, our algorithm outperformed z score with statistical significance. Secondly, the DBI difference between our algorithm and K-means had a p value of 0.144. Lastly, the DBI difference between our algorithm and PAM had a p value of 0.141. Therefore, our algorithm outperformed K-means and PAM without statistical significance. Note that, unlike practical significance, statistical significance only provides evidence that performance differences exist, since it is a mathematical construct that knows nothing about our subject area.
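The paired test above can be reproduced with a few lines of standard-library Python; the resulting t statistic is then compared with the one-tailed critical value t(0.05, df = 5) ≈ 2.015. The DBI vectors below are illustrative placeholders (only the WD pair matches values reported in the text; the full per-data-set DBIs appear in Figure 7 rather than here).

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic: mean of the pairwise differences over its standard error."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# Hypothetical DBIs over the 6 data sets; only the last pair (WD)
# is taken from the reported results (0.452 vs. 0.403).
ours   = [0.30, 0.28, 0.35, 0.33, 0.30, 0.403]
zscore = [0.40, 0.55, 0.45, 0.38, 0.47, 0.452]

t = paired_t(zscore, ours)  # positive t favors "our algorithm has lower DBI"
# Reject the null at the 0.05 level (one-tailed) when t > 2.015 for df = 5.
```

With real measurements, `scipy.stats.ttest_rel` would yield the two-sided p value directly; halving it gives the one-tailed p value used in the text when the difference lies in the hypothesized direction.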

Our algorithm and K-means lead to fairly similar grading results on the normal and heavily positively skewed distributions. Furthermore, examining Tables 7–12 shows that PAM produced the most A's and the fewest F's on average.

The behavior of our proposed algorithm can be discussed in terms of the definition of (3) as follows. The algorithm performed clustering effectively in almost all cases of data sets (i.e., ND, SD+, RD−, RD+, and WD) because Algorithm 1 always selects the maximum score gaps to draw cluster boundaries, that is, the maximum Δjj′. Although the algorithm does not deal with the minimization of δj, δj usually has less impact on DBI than Δjj′, since δj takes part in a summation (thus also requiring the minimization of the other term δj′), whereas Δjj′ is the sole divisor in (3). Nevertheless, in an exceptional case, merely maximizing Δjj′ is not enough, as substantiated by our algorithm performing worst when clustering the SD− data set.
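The gap-maximizing strategy can be illustrated with a short sketch (our reconstruction of the idea, not the paper's Algorithm 1 verbatim): sort the distinct scores, cut the ranking at the k − 1 widest gaps, and assign grades top-down.

```python
def gap_grade(scores, grades=("A", "B", "C", "D", "F")):
    """Assign grades by cutting the score ranking at its widest gaps.

    Illustrative sketch of the gap-maximizing idea: the len(grades) - 1
    largest gaps between consecutive distinct scores become boundaries.
    """
    ranked = sorted(set(scores), reverse=True)
    # rank positions ordered by the width of the gap just above them
    by_gap = sorted(range(1, len(ranked)),
                    key=lambda i: ranked[i - 1] - ranked[i], reverse=True)
    cuts = set(by_gap[:len(grades) - 1])
    grade_of, g = {}, 0
    for i, s in enumerate(ranked):
        if i in cuts:
            g += 1  # crossing a wide gap moves to the next lower grade
        grade_of[s] = grades[g]
    return [grade_of[s] for s in scores]
```

For instance, `gap_grade([98, 97, 70, 69, 50, 49, 30, 10])` cuts at the four widest gaps and returns A, A, B, B, C, C, D, F; any learner can verify a boundary simply by inspecting the gap it sits on, which is the explainability property argued for above.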

Key findings based on the result analysis are as follows. In general, Figure 7 reveals that the absolute degree of skewness, rather than its positive or negative polarity, has the greater impact on the methods' grading performance: the greater the absolute skewness, the lower the grading performance. This is because greater absolute skewness implies more dispersed, dissimilar scores.
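"Absolute skewness" here can be quantified with the standard adjusted Fisher-Pearson sample skewness; a stdlib-only helper (our illustration, not code from the paper) might be:

```python
from math import sqrt
from statistics import mean

def sample_skewness(xs):
    """Adjusted Fisher-Pearson sample skewness; requires len(xs) >= 3."""
    n = len(xs)
    m = mean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n   # second central moment
    m3 = sum((x - m) ** 3 for x in xs) / n   # third central moment
    return sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5
```

A symmetric score set yields a value near 0, a long low-score tail yields a negative value, and a long high-score tail a positive one; taking the absolute value gives the quantity the finding above relates to grading performance.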

Considering the nature of each method in conjunction with the above grading results leads to a guideline in Table 13 for appropriate method selection.


Method: K-means
Characteristic: It prioritizes intracluster similarity, that is, score similarity within each learner group.
Suitability: (i) This method is suitable when the same grade is always supposed to be held by learners with closely similar abilities. (ii) As indicated in Figure 7, K-means is also suitable for heavily skewed distributions such as the SD+ and SD− data sets.

Method: PAM
Characteristic: PAM produced the most A's and the fewest F's on average, implying that the learners' group GPA tends to be high when grading with PAM.
Suitability: PAM is also suitable for heavily skewed distributions such as the SD+ and SD− data sets, as indicated in Figure 7.

Method: Our algorithm
Characteristic: (i) In contrast with K-means, our algorithm prioritizes intercluster dissimilarity, that is, gaps between scores at the borders of different groups. (ii) Our algorithm handles not only heavily skewed distributions (i.e., SD+ and SD−) but also normal (i.e., ND) and slightly-to-moderately skewed distributions (i.e., RD−, RD+, and WD).
Suitability: (i) This method is a good choice when different grades are supposed to distinguish divides in learning ability. (ii) Our algorithm is generally appropriate for all kinds of data distributions, because its strategy is to find score gaps, which draw clear-cut cluster boundaries.

Method: z score
Characteristic: (i) The z score method disregards the notion of cluster (dis)similarity by engaging even ranges between the best and the worst scores within each learner group. (ii) The z score method is not good at norm-referenced grading in general, mainly because its operation is blind to inherent raw-score gaps.
Suitability: (i) This method should be used when all grades are supposed to encompass an equal score range. Consider Table 10: grade C produced by our algorithm ranges from 42 to 63.5 points, which is relatively wider than the score ranges of the other grades. This situation is avoided in z score's results; in other words, z score tries to equalize score ranges across all grades. (ii) Unlike the other methods, the z score method is recommended for grading a score set that holds some wide divide (i.e., WD) because it allows skippable grades.

As we have not tested our algorithm on data sets from other application domains, we do not claim applications beyond norm-referenced grading. However, potential applications of our algorithm might include real-life resource-consumer clustering problems where cluster-boundary explainability is the first practical priority: why two contiguously ranked data points (e.g., consumer profiles) belong to different clusters (i.e., different resource-allocation levels) needs to be straightforwardly acceptable to the data points' owners. A concrete example is the nationwide selection of government loan applicants. Otherwise, serious arguments or even protests might occur, not only between the data-clustering processor and the data owners but also among the differently treated data owners themselves. The main characteristic of our algorithm meets such requirements by providing a simple, clear-cut answer based on the widest gap at each cluster boundary; the other algorithms require data owners to fully understand complicated procedures to obtain such answers.

Last but not least, to present an unbiased view, we point out the limitations of the proposed algorithm as follows. Although our algorithm can justify grade changes through obvious score dissimilarity, the score ranges of the grades might differ considerably, unlike with z score. For instance, our algorithm might yield only a few learners receiving grade B and many more receiving grade C, which can be negatively perceived as unequal chances of receiving the two grades. Furthermore, unlike z score, our algorithm cannot skip a grade even when no one deserves it (i.e., in a criterion-based sense); an example lies in Table 12. However, this drawback holds only if some sense of criterion-referenced grading is introduced on top of pure norm-referenced grading.

9. Conclusions

This paper provides a comprehensive study of four unconditional norm-referenced grading methods: our new algorithm, z score, K-means, and PAM. We conducted experiments with multiple data sets of various distribution characteristics using the DBI performance metric. Overall, our algorithm outperforms the other methods; K-means ranks second, followed by PAM. z score performs worst overall but is appropriate for some cases. In fact, our algorithm is so simple that it can be implemented with a spreadsheet tool. We plan to conduct more experiments with grading constraints and to apply our algorithm to other domains.

Data Availability

The data used to support the findings of the study are included within the article.

Disclosure

The preliminary version of this paper was published under the title “Norm-Referenced Achievement Grading: Methods and Comparison” in the Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2020.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was financially supported by the Department of Computer Science, Faculty of Science, Kasetsart University, Thailand.

References

  1. S. Wadhwa, Handbook of Measurement and Testing, Ivy Publishing House, Delhi, India, 2008.
  2. R. K. Arora and D. Badal, "Evaluating student's performance using k-means clustering," International Journal of Computer Science and Technology, vol. 4, pp. 553–557, 2013.
  3. S. P. Borgavakar and A. Shrivastava, "Evaluating student's performance using K-means clustering," International Journal of Engineering Research & Technology, vol. 6, pp. 114–116, 2017.
  4. Z. Parveen, A. Alphones, and S. Naz, "Extending the student's performance via K-means and blended learning," International Journal of Engineering and Applied Computer Science, vol. 2, no. 4, pp. 133–136, 2017.
  5. S. Shankar, B. D. Sarkar, S. Sabitha, and D. Mehrotra, "Performance analysis of student learning metric using K-mean clustering approach," in Proceedings of the 6th International Conference on Cloud System and Big Data Engineering, Noida, India, January 2016.
  6. S. Xi, "A new student achievement evaluation method based on k-means clustering algorithm," in Proceedings of the 2nd International Conference on Education Reform and Modern Management, Hong Kong, China, April 2015.
  7. Z. Iqbal, A. Qayyum, S. Latif, and J. Qadir, "Early student grade prediction: an empirical study," in Proceedings of the 2nd International Conference on Advancements, Changsha, China, July 2019.
  8. K. Raman and T. Joachims, "Methods for ordinal peer grading," in Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York, NY, USA, August 2014.
  9. S. M. Bai and S. M. Chen, "Automatically constructing grade membership functions for students' evaluation for fuzzy grading systems," in Proceedings of the 2006 World Automation Congress, Budapest, Hungary, July 2006.
  10. T. Banditwattanawong and M. Masdisornchote, "Norm-referenced achievement grading: methods and comparison," in Proceedings of the 6th International Conference on Advanced Intelligent Systems and Informatics, Cairo, Egypt, 2020.
  11. I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers, Burlington, MA, USA, 2016.
  12. L. Kaufmann and P. Rousseeuw, Data Analysis Based on the L1-Norm and Related Methods, Springer, Berlin, Germany, 1987.
  13. B. Desgraupes, Clustering Indices, University of Paris, Paris, France, 2017.

Copyright © 2021 Thepparit Banditwattanawong and Masawee Masdisornchote. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

