Research Article  Open Access
A MOOC Video Viewing Behavior Analysis Algorithm
Abstract
MOOCs (massive open online courses) are developing rapidly, but they also face many problems. Course videos are the most important resource of a MOOC and strongly influence learning. This article defines the ratio $R$ of average viewing duration to video length, which reflects the popularity of a video. By analyzing the relationship between video length, release time, and $R$, we found significant negative linear correlations between video length and $R$ and between video release time and $R$. However, when the number of videos is below a threshold, the release time has less influence on $R$. This paper presents a video viewing behavior analysis algorithm based on multiple linear regression. The residual independence test showed that the algorithm approximates the data well. It can predict the popularity of similar course videos to help producers optimize video design.
1. Introduction
MOOCs [1] are massive open online courses, usually provided by universities and shared over the network [2, 3]. Coursera, Udacity, and edX [4] are currently the three largest providers in the world. The Chinese University MOOC [5], launched by the Higher Education Agency of China, is the largest operator in China.
MOOCs [2–4] tend to have a large number of learners, who gain a new learning experience and a chance to take part in higher education. At the same time, feedback from the learners can help universities improve their courses.
MOOCs are developing rapidly, but they still face many problems, such as high dropout rates, low resource utilization, and the lack of an effective profit model [6]. Sometimes the number of enrolled students reaches tens of thousands, but only 1% of them complete the course. Take Advanced Mathematics in the Chinese University MOOC as an example: it had 4317 students in 2017, while only 18 learners watched more than 50% of the videos. The utilization of course videos is strikingly low. How to make MOOC videos more attractive is therefore an important research topic.
Philip J. Guo et al. [7] studied the influence of the teacher's image, video length, and speaking rate on learners. Their research, based on data analysis and questionnaires, shows that videos within 6 minutes work best. Many researchers explore the rules of online learning by analyzing learning behavior data, and video viewing behavior data are a research hotspot in which much valuable work has been done. Tanmay Sinha et al. [8] operationalize video lecture clickstreams of students into cognitively plausible higher-level behaviors. Their results illustrate how such a metric, inspired by cognitive psychology, can help answer critical questions regarding students' engagement, their future click interactions, and participation trajectories that lead to in-video and course dropouts. Nan Li et al. [9] present a video interaction analysis that provides empirical evidence on how in-video interactions reflect perceived video difficulty. They find that speed decreases, frequent and long pauses, and infrequent seeking with a high amount of skipping and rewatching indicate a higher level of video difficulty. Geza Kovacs [10] analyzes how users interact with in-video quizzes and how in-video quizzes influence users' lecture viewing behavior. Through data analysis, the peak period in which students think about the questions was found. Juho Kim et al. [11] identify five student activity patterns that can explain peaks: starting from the beginning of new material, returning to missed content, following a tutorial step, replaying a brief segment, and repeating a nonvisual explanation. Christopher G. Brinton et al. [12] studied student video-watching behavior and quiz performance. They found that some of these behaviors are significantly correlated with changes in the likelihood that a student will be Correct on First Attempt in answering quiz questions, in ways that are not necessarily intuitive. Qing Chen et al. [13] introduce a comprehensive visualization system called PeakVizor. This system enables course instructors and education experts to analyze the peaks, that is, the video segments that generate numerous clickstreams.
These studies provided many analysis methods for video viewing behavior and proposed some options to optimize video design. However, they mainly offer preliminary statistics and data processing, without an in-depth analysis of the functional relationship. The purpose of our research is to build a mathematical model that describes the impact of video length and release time on the attractiveness of a video.
The parameter $R$ is defined in this article as the ratio of average viewing duration to video length. Its value reflects the popularity of the video: the larger the value of $R$, the higher the popularity. Our research shows that there are significant negative linear correlations between $R$ and video length, and between $R$ and release time. Overly long videos tend to have small $R$ values, because learners cannot stay interested in a video for a long time. At the same time, learners' interest later in the course is lower than in the initial stage. However, there is a threshold for this effect. When the number of videos is less than a certain threshold, the learners' interest in the course videos does not fluctuate significantly.
Multiple regression is used to analyze the video viewing data in this article. A comparison of the standardized regression coefficients shows that the release time has a much greater effect on $R$ than the video length. If the length of a single video is reduced, the total number of videos increases. So, it is unwise to rely on video segmentation alone to shorten the videos. A balance must be found between the number of videos and the length of each video. The algorithm proposed in this paper can predict the attractiveness of a course video and help providers optimize video segmentation.
2. Data Description and Preprocessing
Mining MOOC learning behavior data [14, 15] has become an important research field. Our data are provided by the Chinese University MOOC. The total time each learner spends watching each video is recorded. The Advanced Mathematics course is chosen as the research object, and the data were collected from 2016.09.26 to 2016.11.26. Only the impact of video length and video number on attractiveness is studied here. Although the content of each video differs and the teacher's performance fluctuates, these differences are not considered in this paper.
Table 1 lists the basic information of the course. The course site is https://www.icourse163.org/course/NUDT9004. The course contains 129 videos, and 27664 students enrolled in it. Many of these students did not watch any video, so the data contain 41033 video viewing records.

The sampling period for the data is 10 seconds. Thus, the viewing duration is an integer multiple of 10, and the unit is seconds. The data collection method is to record the total duration of watching a video, so each learner has one record per video. Each learner's record is therefore an $n$-dimensional vector, where $n = 129$ is the number of videos. Let the number of learners be $m$. The viewing duration data then form an $m \times n$ matrix $T = (t_{ij})$, where $t_{ij}$ is the total time learner $i$ spent watching video $j$.
Because a very short viewing time is not enough for learning, the data must be preprocessed to remove invalid records. A histogram of the 41033 records is drawn to identify such invalid behavior.
As shown in Figure 1, many learners watched a video for a total of less than 60 seconds (red bar). These learners may have just browsed the video or opened it accidentally. Because such records do not reflect effective learning, they are filtered out.
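The preprocessing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout `(learner_id, video_id, seconds)`, the function name `filter_valid`, and the sample values are all assumptions; only the 60-second threshold comes from the text.

```python
MIN_VALID_SECONDS = 60  # threshold stated in the text


def filter_valid(records, min_seconds=MIN_VALID_SECONDS):
    """Keep only records whose total viewing time reaches the threshold.

    Each record is a hypothetical (learner_id, video_id, seconds) tuple.
    """
    return [rec for rec in records if rec[2] >= min_seconds]


# Made-up sample records; durations are multiples of the 10-second sampling period.
records = [
    ("u1", 1, 10),
    ("u1", 2, 420),
    ("u2", 1, 50),
    ("u2", 2, 60),
    ("u3", 1, 900),
]
valid = filter_valid(records)
```

Records shorter than 60 seconds are dropped, while a record of exactly 60 seconds is kept, matching the "less than 60 seconds" wording above.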
3. Ratio of Viewing Duration to Video Length
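To analyze the relationship between the viewing duration and the length of the video, we calculate the average viewing duration of each video. This average is calculated after removing the invalid data. Let $\bar{t}_j$ be the average viewing time of the $j$th video. Its definition is
$$\bar{t}_j = \frac{1}{m_j} \sum_{i=1}^{m_j} t_{ij},$$
where $t_{ij}$ runs over the $m_j$ valid viewing records of the $j$th video and $j = 1, 2, \ldots, n$.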
The average viewing duration of the 129 videos and the length of each video are plotted in Figure 2. The blue curve in Figure 2 represents the learners' average viewing duration, and the red curve represents the video length. Figure 2 shows that almost all average viewing times are less than the video length. Occasionally the average viewing time is greater than the length of the video, which is caused by learners viewing the same video repeatedly. Because the length of the video affects the average viewing duration, it is unreasonable to compare the average viewing times directly to determine the popularity of a video. The ratio of the average viewing duration to the length of the video, however, can reflect the students' preference for the video. Let the length of the $j$th video be $l_j$ and the average viewing duration of the $j$th video be $\bar{t}_j$. The ratio of the $j$th average viewing duration to the video length is
$$R_j = \frac{\bar{t}_j}{l_j}.$$
Two rules can be found based on Figure 2. One is that the longer the video, the smaller the ratio $R$. The other is that the ratio in the later period of the course is smaller than in the early stage. Because the number of a video represents its order of release, it can be used to represent the stage of the course. So, we can conclude that $R$ is negatively related to the video length and the video number. However, these judgments are only qualitative and require a more accurate quantitative analysis.
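The ratio defined in this section can be computed from the filtered records as in the sketch below. The record layout, the function name `viewing_ratio`, and the sample numbers are illustrative assumptions, not the authors' data.

```python
from collections import defaultdict


def viewing_ratio(valid_records, video_length):
    """Return {video_id: average valid viewing duration / video length}.

    valid_records: hypothetical (learner_id, video_id, seconds) tuples,
                   already filtered for invalid (too short) views.
    video_length:  {video_id: length in seconds}.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for _learner, video_id, seconds in valid_records:
        total[video_id] += seconds
        count[video_id] += 1
    return {v: (total[v] / count[v]) / video_length[v] for v in total}


# Made-up example: video 2 is rewatched by one learner, so its ratio exceeds 1.
valid_records = [("u1", 1, 300), ("u2", 1, 500), ("u1", 2, 600), ("u3", 2, 1800)]
video_length = {1: 400, 2: 1000}
ratio = viewing_ratio(valid_records, video_length)
```

A ratio above 1, as for video 2 here, corresponds to the repeated-viewing case discussed for Figure 2.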
4. Correlation Analysis
The Pearson correlation coefficient, which has been widely used in many research fields [16], is a statistical indicator of the strength of the linear relationship between variables. In our study, the correlation coefficient is used to analyze the relationship between video length, video number, and $R$. This work also verifies the rationality of the regression analysis.
Let
$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{D(X)}\,\sqrt{D(Y)}}$$
be the correlation coefficient of $X$ and $Y$, where $\mathrm{Cov}(X, Y)$ is the covariance of $X$ and $Y$, defined as $\mathrm{Cov}(X, Y) = E\{[X - E(X)][Y - E(Y)]\}$, and $D(X)$ and $D(Y)$ are the variances of $X$ and $Y$. The coefficient of correlation describes the strength of the linear relationship between $X$ and $Y$. For the Pearson correlation coefficient, the following conclusion holds: when $|\rho_{XY}|$ is large, the mean squared error of the best linear approximation is small, so $X$ and $Y$ show a strong linear relationship.
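As an illustration of the definition, the sample Pearson coefficient can be computed directly from paired observations. The function name `pearson` and the video length and ratio values below are made up for the example and are not taken from the course data.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # The 1/n factors of covariance and variances cancel in the ratio.
    return cov / math.sqrt(var_x * var_y)


# Hypothetical video lengths (seconds) and ratios that fall as length grows.
length = [300, 600, 900, 1200, 1500]
ratio = [0.9, 0.8, 0.65, 0.5, 0.45]
rho = pearson(length, ratio)
```

For these made-up values the coefficient is close to -1, the kind of strong negative linear relationship the section describes.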
To analyze the relationship between $R$ and the video length and the video number, scatter plots between them are drawn.
Figure 3 is a plot of $R$ and video length. The abscissa is $R$, and the ordinate is the video length. There is a clear negative linear correlation between them. The Pearson correlation coefficient is negative, and the correlation is significant at the 0.01 level.
Because the video number is the corresponding release order, it can be used to reflect the progress of the course. The relationship between the video number and $R$ can be observed in Figure 4. The abscissa of Figure 4 is $R$, and the ordinate is the video number. The values of $R$ at the end of the course are smaller than at the beginning. The Pearson correlation coefficient again shows a significant negative correlation between $R$ and video number.
Because the absolute value of the correlation coefficient represents the strength of correlation, the video length has a stronger linear correlation with $R$ than the video number. This conclusion is obtained by comparing the absolute values of the two correlation coefficients.
Figure 5 is a scatter plot of video length and video number. There is no visible linear relationship between them. In fact, the length of a video is content-related. So, the video length and video number can be treated as independent.
Figure 6 shows a three-dimensional scatter plot of $R$, video length, and video number. The vertical coordinate is $R$, the abscissa is the video length, and the ordinate is the video number. The image shows that the distribution of points is nearly linear. Therefore, multiple linear regression [17] can be used to establish the functional relationship between $R$ and the video length and number.
5. Algorithm
5.1. Multiple Linear Regression Algorithm
The ratio $R$ is affected by the length of the video and the time of its release. Figures 3 and 4 show that $R$ has a significant negative linear correlation with each of them. Here, $R$ is used as the dependent variable $y$. The independent variables are the video length $x_1$ and the video number $x_2$. The dependent variable and the independent variables can be expressed as a three-dimensional vector. There are $n$ sets of data, and the $i$th set of data is recorded as $(x_{i1}, x_{i2}, y_i)$. When $y$ has a linear relationship with $x_1$ and with $x_2$, a multiple linear regression model can be used to describe the relationship between them:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i, \quad i = 1, 2, \ldots, n,$$
where $\beta_0$ is a constant coefficient, $\beta_1$ and $\beta_2$ are partial regression coefficients, and $\varepsilon_i$ represents the residual. The residual is the difference between the measured value and the estimated value of the dependent variable, which is not determined by the independent variables. The following expression calculates the sum of squared differences between the predicted and measured values:
$$Q(\beta_0, \beta_1, \beta_2) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} \right)^2 .$$
Solving $\partial Q / \partial \beta_k = 0$ for $k = 0, 1, 2$ gives the coefficients at which $Q$ takes its minimum value.
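The least-squares step can be sketched in plain Python by solving the normal equations that follow from setting the partial derivatives of the sum of squares to zero. The helper names `solve_linear` and `fit_ols` and the sample data are assumptions made for illustration; in practice a numerical library would normally be used instead.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ols(x1, x2, y):
    """Return (b0, b1, b2) minimising sum (y - b0 - b1*x1 - b2*x2)^2."""
    rows = [(1.0, a, b) for a, b in zip(x1, x2)]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    return tuple(solve_linear(xtx, xty))


# Made-up data generated from known coefficients so the fit can be checked.
x1 = [300, 600, 900, 1200, 600, 900]   # hypothetical video lengths (s)
x2 = [1, 2, 3, 4, 5, 6]                # hypothetical video numbers
y = [1.0 - 0.0004 * a - 0.05 * b for a, b in zip(x1, x2)]
b0, b1, b2 = fit_ols(x1, x2, y)
```

Because the sample `y` values are generated without noise, the fit recovers the generating coefficients up to rounding error.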
5.2. Abnormal Data Filtering
Figures 3 and 4 show several abnormal data points, which may influence the calculation of the regression coefficients. Therefore, an algorithm is designed to filter them out. According to the Durbin-Watson test theory [18], points whose residuals fall outside an interval centered on $\mu$, with a half-width that is a fixed multiple of $\sigma$, are filtered out. Here $\mu$ is the mean of the residuals and $\sigma$ is their standard deviation. The process can be divided into the following 5 steps.
(a) Calculate the regression coefficients $\beta_0$, $\beta_1$, $\beta_2$.
(b) Calculate the residuals $e_i = y_i - \hat{y}_i$.
(c) Calculate the mean $\mu$ and standard deviation $\sigma$ of the residuals.
(d) Find the abnormal data, namely the points whose residual $e_i$ falls outside the interval.
(e) Filter out the abnormal data from the ratio set, the video length set, and the video number set.
Let the data sets after filtering be $R'$, $L'$, and $N'$, each containing $n'$ elements. Obviously, $n' \le n$ here.
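Steps (c) to (e) can be sketched as below. The width of the interval did not survive in the text, so the multiplier `k` (set to 3 here) is purely an assumption, as are the function name `filter_outliers` and the sample residuals.

```python
import statistics


def filter_outliers(residuals, k=3.0):
    """Return the indices of points whose residual lies within mean +/- k * std.

    k is an assumed multiplier; the source does not state the interval width.
    """
    mu = statistics.fmean(residuals)
    sigma = statistics.pstdev(residuals)
    return [i for i, e in enumerate(residuals) if abs(e - mu) <= k * sigma]


# Made-up residuals with one clearly abnormal point at the end.
residuals = [0.02, -0.01, 0.03, -0.02, 0.01, -0.03,
             0.00, 0.02, -0.02, 0.01, -0.01, 0.9]
kept = filter_outliers(residuals, k=3.0)
```

The returned indices can then be used to select the matching entries of the ratio, video length, and video number sets before the regression is refitted.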
5.3. Regression Coefficient and Residual Test
After the previous 5 steps, the abnormal data are filtered out. Multiple linear regression is then performed on the filtered ratio, video length, and video number data sets. Let the regression function be
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2,$$
where $x_1$ is the video length, $x_2$ is the video number, $\hat{y}$ is the predicted ratio, and $b_0$, $b_1$, $b_2$ are the estimated coefficients.
Calculating the regression equation is not enough; the correctness of the model must also be verified. The verification method is to test the residuals for independence. When the residuals obey the normal distribution $N(0, \sigma^2)$, the model has a high degree of approximation. The Kolmogorov-Smirnov test [19] can be used here.
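A normality check in the spirit of the Kolmogorov-Smirnov test can be sketched with the standard library alone. The sketch compares the empirical distribution of the residuals with a normal distribution fitted to them and uses the large-sample 5% critical value $1.36/\sqrt{n}$; because the mean and standard deviation are estimated from the same data, this threshold is conservative. The function names and the synthetic residuals are assumptions, not the authors' code or data.

```python
import math
import statistics


def ks_normal_statistic(residuals):
    """Largest gap between the empirical CDF and the CDF of a fitted normal."""
    mu = statistics.fmean(residuals)
    sigma = statistics.stdev(residuals)
    dist = statistics.NormalDist(mu, sigma)
    xs = sorted(residuals)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = dist.cdf(x)
        # Check the gap just after and just before the jump at x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def looks_normal(residuals, coeff=1.36):
    """Compare the statistic with the asymptotic 5% critical value coeff / sqrt(n)."""
    return ks_normal_statistic(residuals) <= coeff / math.sqrt(len(residuals))


# Synthetic residuals placed at normal quantiles (128 points, as after filtering).
normal_like = [statistics.NormalDist(0.0, 0.12).inv_cdf((i + 0.5) / 128)
               for i in range(128)]
# A two-valued sample that is clearly not normal.
two_valued = [0.0] * 100 + [1.0] * 28
```

With these synthetic inputs, `looks_normal(normal_like)` accepts normality and `looks_normal(two_valued)` rejects it.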
6. Experiment
6.1. Abnormal Data Filtering
Firstly, a multiple linear regression is performed on the preprocessed data; then the residuals are calculated. However, the video number and the video length are variables with different units, so their influence on $R$ cannot be compared directly. The standardized regression coefficients can be used to analyze the contribution of each independent variable to $R$. The variables are standardized as $x_j^{*} = (x_j - \bar{x}_j)/s_{x_j}$ for $j = 1, 2$ and $y^{*} = (y - \bar{y})/s_y$, where the bar denotes the sample mean and $s$ the sample standard deviation. The standardized regression equation is $\hat{y}^{*} = \beta_1^{*} x_1^{*} + \beta_2^{*} x_2^{*}$. Note that the constant coefficient here is 0. The results are shown in Table 2.
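One standard way to obtain standardized coefficients, sketched below, is to rescale each unstandardized slope by the ratio of the predictor's standard deviation to that of the dependent variable, which is equivalent to refitting on the standardized variables. The function name, the slopes, and the data are hypothetical and only illustrate the conversion.

```python
import statistics


def standardized_coefficients(slopes, predictors, y):
    """Convert unstandardized slopes b_j into standardized ones: b_j * s_xj / s_y."""
    s_y = statistics.stdev(y)
    return [b * statistics.stdev(x) / s_y for b, x in zip(slopes, predictors)]


# Hypothetical slopes and data; the ratio is generated exactly from them.
length = [300, 600, 900, 1200, 600, 900]
number = [1, 2, 3, 4, 5, 6]
b1, b2 = -0.0004, -0.05
ratio = [1.0 + b1 * l + b2 * n for l, n in zip(length, number)]
beta = standardized_coefficients([b1, b2], [length, number], ratio)
```

The standardized values are unit-free, so their absolute values can be compared to judge which predictor contributes more.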

Comparing the two standardized regression coefficients in Table 2 gives a clear conclusion: compared to the length of the video, the video number has a greater influence on $R$. The coefficients in Table 2 are not used for data fitting. Their purpose is to calculate residuals and find abnormal points.
Table 3 lists the residual statistics. The mean of the residuals is negligibly small, and the variance is 0.1053. According to the previously proposed algorithm, abnormal data are removed. In this course, only one data set needs to be removed. After filtering, there are 128 sets of data, which are denoted $R'$, $L'$, and $N'$ for the ratio, the video length, and the video number. They are used in a second multiple regression analysis to obtain new coefficients.

6.2. Calculate the Regression Coefficient
The unstandardized regression coefficients can be used to predict $R$, while the standardized regression coefficients can be used to compare the influence of video length and number on $R$. Here, both are calculated.
The data in Table 4 show that the standardized coefficients of the video number and video length variables are 0.622 and 0.155, respectively. For this course, the video number had about 4 times the influence of the video length on $R$. The regression equation can be used to calculate $R$.

The blue line in Figure 7 is calculated from the data, and the red line is the curve predicted by the regression equation. As can be seen from Figure 7, the predicted value reflects the trend of $R$. The regression equation can be used for the video design of similar courses. Given the video length and number, the values of $R$ can be predicted, and the predicted values can serve as an important reference. The video producer can optimize the design by segmenting or merging videos to increase their attractiveness. Because the appeal of a video is also influenced by the course content, the parameters used for prediction should come from similar courses.
6.3. Residual Test
The residual independence test is a very important step in the algorithm. When the residuals follow the normal distribution [20] $N(0, \sigma^2)$, the mathematical model can be considered accurate.
The formula $e_i = y_i - \hat{y}_i$ is used to calculate the residuals. The residual set contains 128 values ($i = 1, 2, \ldots, 128$). The residual independence test checks whether the $e_i$ are normally distributed.
Figure 8 is a histogram of the residuals, which is graphically very close to the normal distribution. The Kolmogorov-Smirnov test [19] is used to examine whether the residuals are normally distributed. The result shows that the residuals follow a normal distribution at a significance level of 0.05. The mean of the residuals is negligibly small, and the standard deviation is equal to 0.12374. This means that the mathematical model has a high degree of approximation to the data.
7. Conclusion
This article analyzes the relationship between the ratio $R$ and video length and video number. Our research shows that there is a significant negative linear correlation between video length and $R$, and between video number and $R$. The video length and video number are independent. Based on these results, a multiple linear regression equation is used to establish the functional relationship between them. The Durbin-Watson [18] test theory and principles helped us remove abnormal data effectively. Because the effect of abnormal data on small samples is significant, it is necessary to remove them. The standardized regression coefficients in Tables 2 and 4 both indicate that the video number has a greater influence on $R$.
In order to reduce the length of the video, the content of the video needs to be decomposed. However, decomposition will bring about an increase in the number of videos. Therefore, reducing the video length and reducing the number of videos cannot be achieved at the same time. We need to find a balance between these two issues.
We also found an interesting rule. When the number of course videos is small, video appeal is hardly affected by the release time. For courses with fewer than 40 videos in total, the value of $R$ is hardly affected by the video number. This means that learners' "patience" with the video number has a threshold.
Two courses of the Chinese University MOOC, Game Theory and First Aid Knowledge, were chosen to illustrate this phenomenon. Game Theory contains 38 videos, and First Aid Knowledge contains 19 videos. It can be seen from Figure 9 that $R$ is hardly affected by the release time in these courses. Therefore, the number of videos in a course should be kept within the threshold of learners' patience. However, we do not claim that the threshold is exactly 40; determining it requires more in-depth study. When the content of a course is really large, it can be split into sub-courses that are released separately. These sub-courses are assessed and certified independently, which helps to sustain learners' interest.
The following suggestions are therefore given for video design.
(i) Do not publish videos that are too long. The teaching content must be streamlined to reduce the length of the video. Philip J. Guo et al. [7] consider videos within 6 minutes to be the best.
(ii) Try to keep the number of videos within the threshold. When a certain limit is exceeded, the release time of the video will seriously affect $R$.
(iii) Do not publish long videos late in the course. Their value of $R$ is often very low.
(iv) After the video is recorded, it must be clipped to balance the number and length of the videos. When there are too many videos, consider splitting them into several courses and publishing them separately.
(v) The method proposed in this article can predict $R$. MOOC operators can use this method to establish multiple linear regression models for various courses. It will help video producers optimize the video design and make the videos more attractive. To ensure accuracy, the regression equation of a similar course should be used for prediction.
In future work, we should research the best number of videos for a course. The courses studied in this article are all taught in Chinese. Whether language has an influence on the regression coefficient is another issue that we are concerned about. We expect that researchers with such data sets will be able to give answers.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This study is supported by the Open Research Fund of National Ministry of Education Higher Education Research Center Grant NO. 2017008 and supported by the Open Research Fund of Hunan Provincial Key Laboratory of Network Investigational Technology Grant NO. 2017WLZC003.
Supplementary Materials
The supplementary materials are provided by the Chinese University MOOC, which are viewing video behavior data of three courses. Advanced Mathematics contains 129 videos. The other two courses are Game Theory which contains 38 videos and First Aid Knowledge which contains 19 videos.
References
[1] L. Pappano, “The year of the MOOC,” The New York Times, vol. 2, no. 12, 2012.
[2] P. Adamopoulos, “What makes a great MOOC? An interdisciplinary analysis of student retention in online courses,” in Proceedings of the International Conference on Information Systems, ICIS 2013, pp. 4720–4740, Association for Information Systems, Milano, Italy, December 2013.
[3] J. Kay, P. Reimann, E. Diebold, and B. Kummerfeld, “MOOCs: So many learners, so much potential,” IEEE Intelligent Systems, vol. 28, no. 3, pp. 70–77, 2013.
[4] R. McGuire, “The best MOOC provider: A review of Coursera, Udacity, and edX,” http://www.skilledup.com, 2014.
[5] M. Zhou, “Chinese university students' acceptance of MOOCs: A self-determination perspective,” Computers & Education, vol. 92, pp. 194–203, 2016.
[6] M. Formanek, W. Matthew, S. Buxner, and C. D. Impey, “Motivational differences between MOOC and undergraduate astronomy students,” American Astronomical Society Meeting Abstracts, vol. 231, 2018.
[7] P. J. Guo, J. Kim, and R. Rubin, “How video production affects student engagement: An empirical study of MOOC videos,” in Proceedings of the 1st ACM Conference on Learning @ Scale, pp. 41–50, ACM, Atlanta, GA, USA, March 2014.
[8] T. Sinha, P. Jermann, N. Li, and P. Dillenbourg, “Your click decides your fate: Inferring information processing and attrition behavior from MOOC video clickstream interactions,” 2014.
[9] N. Li, L. Kidzinski, P. Jermann, and P. Dillenbourg, “How do in-video interactions reflect perceived video difficulty?” in Proceedings of the European MOOCs Stakeholder Summit 2015, pp. 112–121, PAU Education, Mons, Belgium, 2015.
[10] G. Kovacs, “Effects of in-video quizzes on MOOC lecture viewing,” in Proceedings of the 3rd Annual ACM Conference on Learning at Scale, L@S 2016, pp. 31–40, ACM, Scotland, UK, April 2016.
[11] J. Kim, P. J. Guo, D. T. Seaton, P. Mitros, K. Z. Gajos, and R. C. Miller, “Understanding in-video dropouts and interaction peaks in online lecture videos,” in Proceedings of the 1st ACM Conference on Learning @ Scale, pp. 31–40, ACM, Atlanta, GA, USA, March 2014.
[12] C. G. Brinton, S. Buccapatnam, M. Chiang, and H. V. Poor, “Mining MOOC clickstreams: video-watching behavior vs. in-video quiz performance,” IEEE Transactions on Signal Processing, vol. 64, no. 14, pp. 3677–3692, 2016.
[13] Q. Chen, Y. Chen, D. Liu, C. Shi, Y. Wu, and H. Qu, “PeakVizor: Visual Analytics of Peaks in Video Clickstreams from Massive Open Online Courses,” IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 10, pp. 2315–2330, 2016.
[14] L. P. Rieber, “Participation patterns in a massive open online course (MOOC) about statistics,” British Journal of Educational Technology, vol. 48, no. 6, pp. 1295–1304, 2017.
[15] S. Yin, X. Yang, and H. R. Karimi, “Data-Driven Adaptive Observer for Fault Diagnosis,” Mathematical Problems in Engineering, vol. 2012, Article ID 832836, 21 pages, 2012.
[16] T. K. Koo and M. Y. Li, “A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research,” Journal of Chiropractic Medicine, vol. 15, no. 2, pp. 155–163, 2016.
[17] D. J. Olive, “Multiple linear regression,” in Linear Regression, pp. 17–83, Springer International Publishing, Cham, Switzerland, 2017.
[18] M. L. King and D. E. A. Giles, Specification Analysis in The Linear Model, Taylor & Francis Inc, Bosa Roca, USA, 2018.
[19] J. Rajan, A. J. Den Dekker, and J. Sijbers, “A new nonlocal maximum likelihood estimation method for Rician noise reduction in magnetic resonance images using the Kolmogorov-Smirnov test,” Signal Processing, vol. 103, pp. 16–23, 2014.
[20] V. Bignozzi and A. Tsanakas, “Parameter Uncertainty and Residual Estimation Risk,” Journal of Risk and Insurance, vol. 83, no. 4, pp. 949–978, 2016.
Copyright
Copyright © 2018 Yong Luo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.