Abstract

This paper first conducts knowledge point association analysis on a large amount of data collected in practical applications. The data mining process includes data collection, data preprocessing, the actual mining, and result analysis; on this basis, a knowledge point association rule table is established and a college English diagnostic practice system is developed. Then, starting from the existing paper composition modes of the system, the knowledge point association rule table is introduced, and a knowledge point association mining model is constructed using an association rule algorithm to explore the potential influence relationships between the knowledge points that affect the improvement of learning quality. Next, the data collected while the system is in use are preprocessed, and three dimensions are modeled separately: learning status evaluation, question-type association analysis, and college English score prediction. Finally, these submodels are combined into a relatively complete and reliable diagnostic evaluation model, which is then verified.

1. Introduction

With the increasing popularity of the Internet, a large number of online English learning systems have emerged. These platforms break the limitations of traditional paper materials and have two major advantages: a free learning mode and timely recording. They give learners more flexibility in time and space and record learners’ status at any time to help them progress in stages. Diagnostic evaluation, also known as “teaching evaluation,” generally refers to the prediction of students’ knowledge, skills, and emotions before a certain teaching activity or during the teaching process [1]. In the process of English learning, each learner has a different learning foundation and ability. Different levels lead to different learning effects and progress. There are bound to be different learning states, there are gaps in the understanding of different knowledge points, and individuals face different learning obstacles. Therefore, in order to effectively improve the level of learners, it is necessary to evaluate learning results in a personalized manner and to diagnose learning obstacles periodically [2]. A large body of teaching practice in the field of pedagogy has shown that refining the learning content into individual knowledge points and combining these into a knowledge point system helps students learn more effectively and improves learning efficiency [3]. When learners encounter difficulties with a certain knowledge point, the problem may come from the knowledge point itself, or it may come from other knowledge points related to it; that is, there are front or back obstacle knowledge points (knowledge points that cause learning obstacles in other knowledge points).

At present, there are many types of targeted teaching for English exams, and there are many ways to predict the knowledge points of English exams. However, most predictions of exam content and most correlation analyses of knowledge points are based on the teacher’s own experience, so teaching practice is subjective and one-sided. Most candidates cannot grasp the relationships between the various knowledge points when studying and taking tests, and teachers and tutoring materials mostly string knowledge points together according to their own experience. Only by analyzing genuine data and revealing the connections hidden in the data can we make up for the shortcomings of personal experience, and only by using data mining technology for association analysis can we discover the association rules between knowledge points [3]. The current diagnostic practice system for college English lacks association rules for knowledge points, has no algorithm for recommending test questions based on knowledge point association, and is therefore deficient in recommending test questions according to the user’s mastery of knowledge points. In response to these problems, the project team believes that a large amount of real data should be combined with current data mining technology to study the association rules of college English knowledge points and, based on the analysis results, to propose a paper generation algorithm based on knowledge point association rules. Such an algorithm can target the English test questions and the actual situation of the majority of candidates, effectively improve learners’ performance, and in particular better meet the needs of personalized teaching [4].

This paper conducts knowledge point association analysis on a large amount of data collected in practical applications. Through data mining, including data collection and data preprocessing, a knowledge point association rule table is established and a college English diagnostic practice system is developed. Then, starting from the existing model of the system, the knowledge point association rule table is introduced. The college English diagnostic evaluation model proposed in this paper can accurately evaluate learners’ learning status and dynamically diagnose learners’ learning obstacles. Comparative experiments show that the proposed diagnostic evaluation model can provide better practice guidance and test question recommendations tailored to learners’ learning status and their knowledge point and question-type obstacles, can effectively improve learners’ performance, and has broad application prospects and research value.

In 2013, the Gates Foundation provided a report on adaptive learning technology research. The report pointed out that more than 30 large companies are doing a lot of related research in this area. Among them, Knewton’s new teaching model has a relatively large influence. This teaching model can detect learners’ grasp of a variety of knowledge points, accurately discover the strengths and weaknesses of each learner, and mark the knowledge points with a low degree of mastery [5], which provides targeted guidance for teachers’ teaching work and facilitates the retrieval of knowledge points for learners. In addition, Trier University in Germany and Carnegie Mellon University and Stanford Institute of Learning in the United States are all carrying out different adaptive learning research projects, including the ELM-ART system, the InterBook project, and the personalized access project of a distributed learning knowledge base [6].

In China, there are also many projects devoted to research on diagnostic learning. In order to establish a college English test system and framework that meets the learning needs of Chinese college students, members of the National College English Test Committee have researched and developed the “Computer Adaptive Test” project for many years. Dillon et al. [7] have researched and developed a college English simulation test system, which can intelligently evaluate and diagnose various types of questions and provide users with diagnostic evaluation and correction opinions. It is a relatively good English learning system with diagnostic functions. In addition, major universities have researched and developed many college English learning systems in recent years. For example, Wang et al. [8] developed a college English online learning system based on the B/S structure, and Zhu [9] researched and developed a personalized English learning system for mobile devices, mainly discussing personalized learning on such devices. Wiegel et al. [10] have studied the key technologies of personalized English learning systems, including the classification of English sentence reading difficulty, and proposed a text retrieval ranking model based on multiple similarities. The online English learning system developed by Gan et al. [11] proposes the concept of setting diagnostic checkpoints, which is a relatively primitive form of diagnostic system. In 2014 alone, there were more than 10,000 papers on English learning on CNKI. It can be seen that research on English learning systems is very active.

Most of the English learning systems developed today still have imperfections. The most obvious one is that most systems do not consider the significance of the knowledge point system and cannot effectively diagnose and evaluate the learner’s mastery of knowledge points. Without a personalized and targeted paper generation algorithm, it is impossible to gain a comprehensive grasp of the learner’s knowledge system, which inevitably causes a certain deviation in the learning effect.

3. Diagnostic Evaluation Model

3.1. Design of Diagnostic Evaluation Model

In the college English diagnostic practice system studied in this article, diagnostic evaluation mainly means that the system automatically provides learners with guidance and test question recommendations based on their academic performance and specific conditions, so that learners can adjust their study direction in time and improve their performance in a targeted manner. The college English diagnostic practice system provides learners with two main functions. (1) Practice function, including six types of test papers: random test papers, real test papers, question-type test papers, knowledge point test papers, knowledge point weakness test papers, and score maximization test papers. The random test paper is a set of questions randomly assembled by the system [12, 13]. The real test mode lets learners choose any set of real English test questions from past years. In the question-type mode, learners freely choose question types to form a paper. In the knowledge point mode, learners freely choose knowledge points to form a paper. These four are relatively simple and intuitive paper generation modes. Knowledge point weakness test papers and score maximization test papers are the paper generation modes with diagnostic functions. Knowledge point weakness test papers are assembled automatically according to the knowledge points the user has mastered poorly. Score maximization test papers are assembled by the system based on the frequency of knowledge points in past English test questions and the user’s own mastery, so as to maximize the coverage of knowledge points. (2) Score details: this function is also an important part of the diagnostic function. In terms of the overall situation, users can view the trend chart of recent test paper score rates, the distribution of knowledge point mastery, and the distribution of question-type score rates, and the system also gives learning suggestions based on these. In addition, users can view a detailed report of each test paper they have answered, as well as the overall situation of the class; the report gives the user’s answering status for each question type and each knowledge point.

3.2. Framework of Diagnostic Evaluation Model

The diagnostic evaluation model is a dynamic evaluation model that diagnoses learning disabilities, evaluates learning conditions, adjusts learning methods, and warns of learning status for learners. Accordingly, the framework of the diagnostic evaluation model is shown in Figure 1.

The operating process of the diagnostic evaluation model is as follows:

(1) Count the learner’s test and practice information and perform analysis and calculation. Extract the learner’s test and practice information from the database of the system and, after processing, use the diagnostic evaluation model for analysis and calculation.

(2) Analyze and evaluate the overall learning status of learners. Starting from the learner’s scores on knowledge points, question types, etc., learners are analyzed and evaluated according to their scores and the degree of learning stability. Through the quantitative judgment of learners’ learning status, the system gives early warning to learners whose status is unstable and whose learning progress is lagging behind, so as to stimulate and promote their learning motivation.

(3) Diagnose learners’ knowledge point obstacles and question-type obstacles. The knowledge point association rule table already owned by this system has been verified to be accurate and reasonable, so it can be added directly to the diagnostic evaluation model. Question-type association rules support association analysis on the learner’s question-type scores, infer the learner’s question-type obstacles, and help learners find their own weaknesses.

(4) Predict the learner’s English performance. Features are extracted from the learner’s test and practice information, and random forest and multiple linear regression models are trained and integrated to build an English score prediction model. The predicted scores let learners understand their own English proficiency and encourage them to strengthen their practice.

(5) Make targeted test recommendations to learners. In order to guide learners’ learning strategies and learning pace, the diagnostic evaluation model provides learners with targeted test questions.

In the fourth chapter, the author designs and verifies two paper generation algorithms based on the diagnostic evaluation model for learners to use. It can be seen from the above process that the learning status evaluation model, the question-type association analysis, and the English score prediction model need to be researched and designed separately and finally integrated into the diagnostic evaluation model.
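Step (2) of the process above can be illustrated with a minimal sketch. The paper does not give the formula used for the learning status evaluation, so the following is only an assumed illustration: the function name `evaluate_learning_status`, the use of the mean score rate as the performance measure, the population standard deviation as the stability measure, the two thresholds, and the choice to warn when either condition holds are all hypothetical.

```python
from statistics import mean, pstdev


def evaluate_learning_status(score_rates, min_mean=0.6, max_spread=0.15):
    """Summarize a learner's recent test score rates (each between 0 and 1).

    min_mean and max_spread are hypothetical thresholds, not values from the paper.
    """
    avg = mean(score_rates)          # overall performance level
    spread = pstdev(score_rates)     # assumed stability measure: larger means less stable
    lagging = avg < min_mean
    unstable = spread > max_spread
    return {
        "mean": avg,
        "spread": spread,
        "lagging": lagging,
        "unstable": unstable,
        "warn": lagging or unstable,  # assumed rule: warn if either condition holds
    }


steady = evaluate_learning_status([0.80, 0.82, 0.78])
erratic = evaluate_learning_status([0.30, 0.90, 0.40])
print(steady["warn"], erratic["warn"])
```

In a real deployment the thresholds would need to be calibrated on the collected practice data rather than fixed by hand.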

4. Association Analysis of Knowledge Points

In the preliminary research on the college English diagnostic practice system, the knowledge points have been studied thoroughly, including the knowledge point association rules and the function of generating papers according to these rules [14-16]. However, there is a lack of research on the question types of English exams, and papers can only be grouped by question type in a simple way. The question types are an important part of the college English test and, after many years of reform, differ greatly from those of the past, so they must be taken seriously. Therefore, this section takes a large amount of raw data collected during use of the system and, after data processing and data stratification, performs two levels of association analysis on the question-type data, finally obtaining a relatively complete and reliable question-type association rule table.

4.1. Mining and Analysis of Knowledge Points Association Rules

Association rules are one of the most commonly used mining methods in the field of data mining. Association rule mining is the process of extracting hidden information and knowledge that people do not know beforehand from massive, random, noisy, and fuzzy data. The discovered knowledge or information is often expressed in the form of frequent itemsets [17, 18]. Association rules were originally proposed for the problem of shopping basket analysis, where they are used to discover the associations between different commodities in customer purchase transaction data and to obtain general rules about customer purchase patterns. The mining process of association rules mainly includes two parts. The first is to generate frequent itemsets: given a minimum support, find all itemsets whose support is not less than the minimum support. The second is to generate rules: given a minimum confidence, find in each maximal frequent itemset the rules whose confidence is not less than the minimum confidence. These rules are usually called strong rules. Normally, the amount of calculation required to generate frequent itemsets is much greater than the amount required to generate rules.
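The two-part mining process can be sketched as follows. This is a minimal, Apriori-style illustration and not the authors’ implementation: the function names `frequent_itemsets` and `strong_rules` and the toy transactions (sets of hypothetical knowledge point labels) are assumptions made for the example, and for simplicity rules are generated from every frequent itemset rather than only from maximal ones.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Part 1: level-wise search for all itemsets with support >= min_support."""
    n = len(transactions)
    tsets = [frozenset(t) for t in transactions]
    items = sorted({i for t in tsets for i in t})
    current = [frozenset([i]) for i in items]  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in tsets if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        # Join: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: keep a candidate only if all of its (k-1)-subsets are frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


def strong_rules(frequent, min_confidence):
    """Part 2: derive rules antecedent => consequent with confidence >= min_confidence."""
    rules = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                conf = supp / frequent[a]  # every subset of a frequent itemset is frequent
                if conf >= min_confidence:
                    rules.append((a, itemset - a, supp, conf))
    return rules


# Hypothetical toy data: each transaction lists labels observed together.
transactions = [
    {"tense", "clause", "vocab"},
    {"tense", "clause"},
    {"tense", "vocab"},
    {"clause", "vocab"},
    {"tense", "clause", "vocab"},
]
freq = frequent_itemsets(transactions, min_support=0.5)
rules = strong_rules(freq, min_confidence=0.7)
for a, c, supp, conf in rules:
    print(sorted(a), "=>", sorted(c), round(supp, 2), round(conf, 2))
```

The support counting inside the `while` loop scans every transaction for every candidate, which is why frequent itemset generation dominates the cost; rule generation only reuses the supports already computed.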

4.1.1. Core Concepts of Association Rules

Definition 1 (item and itemset). The smallest indivisible unit in the data is called an item, denoted by $a$. An itemset is a collection of items. Let $A = \{a_1, a_2, \ldots, a_i\}$ denote an itemset; if the number of items in $A$ is $i$, then $A$ is called an $i$-itemset.

Definition 2 (transaction). A data record of the form $T = (tid, I_T)$ is called a transaction, where $tid$ is the unique identifier of the transaction and $I_T$ is the itemset in the transaction. Let $X$ be an itemset; if $X \subseteq I_T$, then transaction $T$ contains itemset $X$.

Definition 3 (support of association rules). The support of an association rule represents the probability that the rule appears. It measures the scope of application of the rule and reflects its generality or applicability across all transactions. Normally a minimum support threshold must be set, which indicates the minimum degree of applicability of an itemset in a statistical sense. The support is calculated as

$$\mathrm{support}(A \Rightarrow B) = P(A \cup B).$$

That is, the probability that itemsets $A$ and $B$ occur at the same time can be expressed as

$$P(A \cup B) = \frac{\sigma(A \cup B)}{N},$$

where $\sigma(A \cup B)$ is the number of transactions that contain both $A$ and $B$ and $N$ is the total number of transactions. If the relative support of an itemset $I$ meets the predefined minimum support, then $I$ is considered a frequent itemset. Frequent $k$-itemsets are usually denoted $P_k$.

Definition 4 (confidence of association rules). The confidence of an association rule indicates the extent to which the rule is correct. It measures the accuracy of the rule and reflects the probability that the rule holds given that its premise holds. A minimum confidence threshold is usually set, which represents the minimum accuracy required of an association rule. The confidence is calculated as

$$\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X).$$

That is, if itemset $X$ occurs, the probability that itemset $Y$ also occurs can be expressed as

$$P(Y \mid X) = \frac{P(X \cup Y)}{P(X)}.$$

Definition 5 (lift of association rules). The lift of an association rule indicates the degree of correlation between the itemsets in the rule. A lift greater than 1 indicates a positive correlation, and the higher the lift, the more valid the strong association rule; a lift less than 1 indicates a negative correlation, and the lower the lift, the less valid the rule; a lift equal to 1 means that there is no correlation between the itemsets, that is, they are independent of each other. The lift is expressed as the ratio of the probability that itemset $Y$ occurs given that itemset $X$ occurs to the overall probability that itemset $Y$ occurs. Normally, among rules that reach the support and confidence thresholds, a rule with a higher lift is considered more meaningful. The lift is calculated as

$$\mathrm{lift}(X \Rightarrow Y) = \frac{P(Y \mid X)}{P(Y)} = \frac{P(X \cup Y)}{P(X)\,P(Y)}.$$
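The three measures defined above can be checked on a small example. The sketch below assumes transactions are represented as Python sets; the function names and the toy data are illustrative only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, x, y):
    """P(Y | X): support of X and Y together divided by the support of X."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


def lift(transactions, x, y):
    """Confidence of X => Y divided by the overall support of Y."""
    return confidence(transactions, x, y) / support(transactions, y)


# Four hypothetical transactions.
data = [{"X", "Y"}, {"X", "Y"}, {"X"}, {"Z"}]

s = support(data, {"X", "Y"})     # 2 of 4 transactions contain both X and Y
c = confidence(data, {"X"}, {"Y"})  # 2 of the 3 transactions with X also contain Y
l = lift(data, {"X"}, {"Y"})      # (2/3) / (2/4), greater than 1
print(s, round(c, 4), round(l, 4))
```

Here the lift is above 1, so in this toy data the presence of X makes Y more likely than its baseline frequency.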

4.2. Data Analysis Method and Data Processing Process

The overall flowchart of association rule mining and analysis between subject knowledge points in this research is shown in Figure 2. The specific analysis process mainly includes the following steps:

(1) Selectively extract data from the learning result record database, and complete the extraction of newly added dimension data to form the historical data and incremental data required for analysis.

(2) Perform data preprocessing and exploratory analysis on the two data sets formed in step 1, including missing value processing, outlier detection, attribute specification, cleaning, and transformation operations.

(3) Using the preprocessed modeling data from step 2, apply the association rule algorithm to model the associations between front and back knowledge points, analyze the features of the relationships between knowledge points, and visualize the analysis results [19-22].

(4) According to the results of the model, different association rules are obtained, and different support and service strategies are adopted to help teachers discover the in-depth influence relationships between knowledge points.

5. Application of Knowledge Point Association Analysis

The existing knowledge point weakness test paper mode in the college English diagnostic practice system does not yet consider the relationships between knowledge points, that is, the other knowledge points that act as front or back obstacles to each knowledge point. This chapter uses data mining and association analysis on sample data to discover the relationships between learners’ knowledge points. The project team did related work at an early stage, but the amount of data was relatively small and the resulting association rule table was relatively simple. Therefore, this chapter collects a large amount of raw data from practical applications and, after cleaning, conversion, and other preprocessing, carries out association analysis on this large and complex data set from multiple aspects, including processing continuous data, mining knowledge points with high and low accuracy rates separately, and stratifying according to different data volumes. Finally, a relatively complete and reliable table of association rules between knowledge points is obtained [23, 24].
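One way to process continuous score rates so that high and low accuracy can be mined separately is to compare each score rate with the average for that item. The sketch below is an assumed illustration, not the authors’ procedure: it takes "high" to mean a score rate greater than or equal to the item’s average and "low" to mean a score rate below it (matching the wording used later for the T and F analyses), and the function name, record layout, and labels are hypothetical.

```python
from statistics import mean


def binarize_score_rates(records):
    """Turn per-learner score rates into two transaction lists.

    records: list of dicts mapping an item label (knowledge point or question
    type) to that learner's score rate in [0, 1]; labels may be missing.
    Returns (averages, high_transactions, low_transactions).
    """
    labels = sorted({label for r in records for label in r})
    # Average score rate per label, computed over learners who attempted it.
    averages = {lab: mean(r[lab] for r in records if lab in r) for lab in labels}
    high, low = [], []
    for r in records:
        high.append({lab for lab, v in r.items() if v >= averages[lab]})
        low.append({lab for lab, v in r.items() if v < averages[lab]})
    return averages, high, low


# Hypothetical learners; the third has no listening score (missing data).
records = [
    {"reading": 0.75, "listening": 0.25},
    {"reading": 0.50, "listening": 0.75},
    {"reading": 0.25},
]
averages, high_tx, low_tx = binarize_score_rates(records)
print(averages)
print(high_tx)
print(low_tx)
```

Each resulting list can then be passed to an association rule miner as ordinary transactions, one list for the high-accuracy analysis and one for the low-accuracy analysis.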

5.1. Data Collection

In 2019, during the initial trial, the project team arranged for the diagnostic practice system to be used in 14 classes taught by 4 English teachers at a university. There were nearly 1000 students in the school, and the students who applied for the college English test that year were mainly concentrated in the first and second grades. The school requires every student who signs up for the English test to use the college English diagnostic practice system to complete at least two sets of exercises. After the one-year usage period, a total of 4,590 students had used the system, and most of them had completed more than two sets of test papers. The number of questions of each type in the college English diagnostic practice system is shown in Figure 3.

5.2. Data Layering

In the past few years that the college English diagnostic practice system has been used online, the English test questions have been reformed, with some changes from the past. In the listening part, the short conversation listening question type was removed, and the short news question type was added. In the reading part, the quick reading and cloze question types were removed, and the information matching question type was added. Therefore, there are no short news and information matching question types in the early exercises of the system, so some learners have missing scores on these two types of questions. After sorting according to the proportion of the number of people from high to low, the number of learners of each question type and the proportion of the total number of students are shown in Figure 4.

It can be seen from Figure 4 that, except for the new question types of information matching and short news, the sample proportions of the other question types are all above 98%. Because the current English test has adopted the new question types, the new and old question types should be separated for association analysis. However, since the reform of the question types has just begun, there are not many real questions for the new types, and the quality of simulated questions is uneven. Old question types such as short conversation listening still have practice value, and English proficiency can still be improved by practicing them. Therefore, the new question types should also be included when the old question types are stratified, so that the relationship between the new and old question types can be explored and the system question bank can be fully utilized while the number of real questions is limited.

5.3. Algorithm Parameter Setting

In association analysis, there are no fixed values for the minimum support and the minimum confidence; they need to be determined according to the training set and the scenario. When these two values are set larger, the association among the items in the resulting frequent itemsets is stronger, but fewer frequent itemsets are found, so the two parameters need to be chosen by balancing the number of frequent itemsets against the strength of their association [19]. In the early stage of this study, the minimum support was first set to 30%, the minimum confidence to 90%, and the maximum number of antecedents to 3. The results are shown in Figure 5.

It can be seen from Figure 5 that, under the current parameter settings, the numbers of association rules in the first and second levels differ considerably, which is conducive neither to diagnosing learners’ question-type obstacles nor to the subsequent test question recommendation. Therefore, the author decided to set the parameters separately for each layer, so as to obtain accurate and reliable association rules.

5.3.1. Analysis of the First Layer

After many experiments, in the association analysis between the question types of the first layer, the minimum support is set to 40%, the minimum confidence to 90%, and the maximum number of antecedents to 3. The number of association rules obtained in the T analysis is in line with expectations, while the number obtained in the F analysis is relatively small. To obtain more association rules, the minimum support can be reduced. In the F analysis, when the minimum support is reduced to 35%, 25 rules are obtained.

5.3.2. Analysis of the Second Layer

After many experiments, in the association analysis between the question types of the second layer, the minimum support is set to 20%, the minimum confidence to 90%, and the maximum number of antecedents to 3. The number of association rules obtained is shown in Figure 6.

It can be seen that the number of association rules obtained in the T analysis is in line with expectations, while the number obtained in the F analysis is too large. The number of association rules can be reduced by appropriately increasing the minimum support. In the F analysis, when the minimum support is increased from 20% to 25%, 28 association rules are mined.
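The per-layer tuning described in Sections 5.3.1 and 5.3.2 amounts to recounting the rules while the minimum support is varied and the other parameters are held fixed. The following sketch illustrates that effect on hypothetical toy data; it is a brute-force counter written for clarity, it assumes rules with a single consequent, and neither the data nor the function name `count_rules` comes from the paper.

```python
from itertools import combinations


def count_rules(transactions, min_support, min_confidence, max_antecedents=3):
    """Count rules antecedent => single consequent that meet both thresholds."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    count = 0
    for size in range(1, max_antecedents + 1):
        for antecedent in combinations(items, size):
            a = frozenset(antecedent)
            support_a = supp(a)
            if support_a < min_support:
                continue  # the rule cannot reach min_support either
            for consequent in items:
                if consequent in a:
                    continue
                support_rule = supp(a | {consequent})
                if support_rule >= min_support and support_rule / support_a >= min_confidence:
                    count += 1
    return count


# Hypothetical transactions (for example, sets of below-average question types).
tx = [frozenset(s) for s in (
    "abc", "abc", "ab", "ab", "a", "c", "bc", "abc", "ab", "d",
)]

# Minimum confidence and antecedent limit as in the text; only support varies.
counts = {ms: count_rules(tx, ms, 0.9, 3) for ms in (0.2, 0.3, 0.4)}
print(counts)
```

Lowering the minimum support can only keep or increase the number of rules, and raising it can only keep or decrease it, which is the lever used above to bring the F analysis of each layer to a workable number of rules.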

5.4. Result Analysis of Association Rules of Question Type

After the above data processing, data stratification, and association analysis, the final number of association rules is obtained, as shown in Figure 7.

According to Figure 7, a total of 129 rules have been obtained. The number of times each question type occurs as an antecedent and as a consequent in these 129 association rules is shown in Figure 8.

Due to space limitations, it is not possible to show and explain all of the association rules one by one. The T analysis and the F analysis each have two levels, giving four groups in total. The association rule with the highest confidence and support in each group is selected. These rules are shown in Figure 9.

In Figure 9, the support represents the probability that the antecedent and the consequent appear together, and the confidence represents the probability that the consequent also appears when the antecedent appears. The first-level association rule of the T analysis is based on the 541 samples in the first level. In these 541 samples, the proportion in which the score rate of the consequent question type (reading selection) and the score rates of the antecedent question types (information matching and fast reading) are all greater than or equal to their average score rates is 42.68%. When the score rates of the two antecedent question types, information matching and fast reading, are both greater than or equal to their average score rates, the probability that the score rate of the consequent, reading selection, is also greater than or equal to its average score rate is 96.18%.
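As a worked check of how these figures relate, the reported support and confidence of the first-level T rule can be converted into approximate sample counts. The inputs below are the values quoted in the text; the derived counts are approximations, since the percentages are rounded, and are not figures reported by the authors.

```python
n_samples = 541           # first-level samples, from the text
rule_support = 0.4268     # antecedent and consequent both at or above average
rule_confidence = 0.9618  # consequent at or above average, given the antecedent

# Samples in which antecedent and consequent hold together.
joint_count = round(rule_support * n_samples)

# confidence = support(rule) / support(antecedent), so the antecedent's own
# support follows by division.
antecedent_support = rule_support / rule_confidence
antecedent_count = round(antecedent_support * n_samples)

print(joint_count, round(antecedent_support, 4), antecedent_count)
```

Under these assumptions, roughly 231 of about 240 learners who scored at or above average on both antecedent question types also did so on reading selection.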

The second-level association rule of the T analysis is based on the 524 samples in the second level. In these 524 samples, the proportion of all samples in which the score rate of the consequent question type, reading selection, and the score rate of the antecedent question type, information matching, are both greater than or equal to the average score rate of 47.18% is 44.13%. When the score rate of the antecedent, information matching, is greater than or equal to its average score rate, the probability that the score rate of the consequent, reading selection, is also greater than or equal to its average score rate is 13.41%.

The first-level association rule of the F analysis is based on a total of 524 samples in the first level. In these 524 samples, the proportion in which the score rate of the consequent question type (short conversation listening) and the score rates of the antecedent question types (long conversation listening and short essay listening comprehension) are all lower than their average score rates is 46.14%. When the score rates of the antecedent question types, long conversation listening and short essay listening comprehension, are both lower than their average score rates, the probability that the score rate of the consequent, short conversation listening, is also lower than its average score rate is 98.21%.

The second-level association rule of the F analysis is based on a total of 524 samples in the second level. In these samples, the proportion in which the score rate of the consequent question type (information matching) and the score rates of the antecedent question types (reading choice and reading word choice) are all less than their average score rates is 31.71%. When the score rates of the antecedent question types, reading choice and reading word choice, are both less than their average score rates, the probability that the score rate of the consequent, information matching, is also less than its average score rate is 51.14%.

From the above four rules with the highest confidence and support, it is not difficult to see that the question types within each rule belong to the same broad category. For example, the listening question types are very closely related: when a learner has a relatively high error rate on long conversation listening and short essay listening comprehension, the error rate on short conversation listening is also high. This also verifies the earlier prediction that old and new question types within the same broad category are closely related. By practicing real questions of the old question types, learners can also improve their grasp of the new question types.

6. Conclusion

After analyzing personalized learning systems, this article explores the defects of the diagnostic evaluation module of the current college English diagnostic practice system. It conducts research on three aspects, namely learning status evaluation, question-type association analysis, and college English score prediction, and combines them with the existing knowledge point association rules of the system to construct the diagnostic evaluation model. Using the practice data of more than 500 people collected during use of the system, the data are modeled from these three perspectives and combined with the existing knowledge point association rules, and a relatively complete and reliable diagnostic evaluation model is finally obtained. The model dynamically diagnoses and evaluates according to the user’s practice, adjusts the learning pace individually, and gives timely warnings about learning status; as users complete more exercises in the system, the evaluation accuracy of the model will also improve. In the future development of the college English diagnostic practice system, a lot of work remains. The first item is the automatic update of the diagnostic evaluation model. In the knowledge point association analysis and the English score prediction model designed here, data collection and processing, data mining, and machine learning were all carried out manually. In subsequent research, it is hoped that the system administrator can add a function for automatically constructing the model so that the diagnostic evaluation is updated in real time. The second item is to collect more experimental data to optimize the paper generation algorithm based on the diagnostic evaluation model. Due to limited time, this article used only two classes for comparative experiments. As the system is used throughout the school, more user data and user feedback can be collected to improve the paper generation algorithm.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no known conflicts of interest or personal relationships that could have appeared to influence the work reported in this paper.