Abstract

Industrial Internet of Things (IIoT) has attracted much attention from researchers worldwide and has been applied in many fields, such as medical treatment, transportation, and education. This paper addresses an IIoT-oriented education problem and gives a corresponding solution. Heterogeneous educational resources contain multisource target data, so it is necessary to integrate repetitive data and data with the same attributes. However, because models built by traditional methods track poorly, the mining technology loses part of the data characteristics and degrades the integration of multisource foreign language education data. This article therefore studies an integration mechanism for heterogeneous foreign language educational resources based on time series analysis. The mechanism adopts a data cleaning and fusion method based on time series similarity measurement: it uses symbolic aggregate approximation, the Euclidean distance algorithm, and similar sequences with adjusted similarity weights to clean the heterogeneous foreign language education resource data, and it then applies a multisource heterogeneous data fusion algorithm to complete the integration. Experiments with foreign language education resources at all levels in a certain city show that the mechanism can detect abnormal data, fill in missing data, reduce data redundancy, and integrate heterogeneous data. After the data are cleaned by the multisource heterogeneous data fusion algorithm, the credibility of the measurement data is reflected, with a mean absolute percentage error of only 6.25%. Data quality is improved as a whole, providing reliable basic data for the application of foreign language education resources.

1. Introduction

In recent years, the Industrial Internet of Things (IIoT) has been widely applied with the explosive increase of mobile devices and cloud platforms. In the process of building smart cities, IIoT plays an important role, for example, in improving the level of medical treatment, the efficiency of transportation, and the quality of education. Especially during the COVID-19 period, online education has become more and more significant. Therefore, the IIoT-oriented education resource allocation problem has attracted much attention from researchers worldwide.

With the advancement of foreign language education informatization, most schools, education departments, and related institutions have established their own foreign language education and teaching resource systems. However, because these systems were built in different periods and lack both unified technical specifications and a shared understanding of how education resources should be utilized, they have led to repeated investment in hardware facilities, repeated development of software platforms, and repeated construction of online courses. The result is heterogeneity: uneven distribution of digital foreign language education resources, low standardization, and difficulty in integration and sharing. This problem seriously hinders the effective use and reasonable distribution of digital foreign language education resources. In view of the current heterogeneity of digital foreign language education resources and the need for their rational allocation and effective use in the industrialization of digital foreign language education, researching and realizing the integration of heterogeneous foreign language educational resources has become a key task of the current digital education industrialization project. Only by establishing a reasonable development strategy for digital foreign language education resources can we better promote their construction, give full play to their maximum benefits, and serve the construction of educational informatization [1].

Data cleaning is a method used to detect and eliminate errors and inconsistencies in data [2]. In recent years, researchers have proposed a variety of data cleaning technologies to improve data quality, such as missing data imputation, duplicate object detection, anomaly detection, logical error detection, and data inconsistency detection [3]. However, current data cleaning methods have high computational complexity and detect missing data inaccurately. In 2017, Keogh's group at the University of California, Riverside [4] completed feature extraction based on transform-domain feature representation, symbolic representation, and piecewise linear representation; however, because the data dimensionality in the feature extraction process is large, the method has high computational complexity and lacks scale invariance, which weakens its cleaning effect on multivariate time series data.

Similarity measurement technology is the basis of sequence data analysis. Sequence analysis can reflect the characteristics of and relationships between data and judge data outliers based on the relationships mined, attracting a large number of scholars to conduct in-depth research. At present, in addition to the longest common subsequence distance and the edit distance, similarity measurement methods mainly include the Euclidean distance [5], the dynamic time warping distance, and methods based on singular value decomposition and point distribution. The commonly used methods for data correction include the interpolation model [6], random replacement model, mean replacement model, and regression model [7]. In the identification of abnormal data, however, the referenced "correct" sequence may not exist.

Data integration is essentially the collaborative processing of data from multiple parties to achieve the purposes of reducing redundancy, comprehensive complementation, and capturing collaborative information. This technology has become a research hotspot in the fields of data processing, target recognition, situation assessment, and intelligent decision making. In [8], Yu et al. studied multisensor data integration technology based on statistics and artificial intelligence (AI) methods; in [9], Lai et al. studied the organization and management of multisource heterogeneous data in mobile geographic information systems and established a multisource heterogeneous data fusion model; in [10], Premkumar and Ganesh combined wireless sensor networks and data fusion technology and proposed a Kalman filter batch estimation fusion algorithm; in [11], Lasheng and Yiquang studied a massive multisource heterogeneous data fusion method in the Internet of Things environment and successfully applied it to target positioning and tracking; in [12], Zhang et al. studied an intelligent maintenance decision-making architecture for the high-speed rail signal system based on heterogeneous data fusion, which improved the accuracy and effectiveness of decision making; in [13], Wen et al. studied multisource heterogeneous data fusion technology in the digital mine construction process, ensuring the safety, stability, and efficiency of the basic information platform in the construction of digital mines.

In view of this, in order to improve the quality of foreign language education resource data and support large-scale data collection and storage, this article focuses on data cleaning and fusion in the construction of a heterogeneous foreign language resource integration system and conducts a preliminary analysis of the characteristics of multisource foreign language education data. A practical foreign language education resource data cleaning and fusion algorithm based on time series similarity measurement is proposed, and experiments show that it achieves a good cleaning and fusion effect.

The rest of this paper is structured as follows. In Section 2, the integration of foreign language education resources is studied. In Section 3, the time series analysis-based integration mechanism for heterogeneous foreign language education resources is studied. The experimental results are presented in Section 4, and Section 5 concludes this paper.

2. The Integration of Foreign Language Education Resources

At present, foreign language education resource data are characterized by diversified data sources and data types on the one hand, and by high heterogeneity and high overall data value on the other; their greatest value lies in realizing cross-system and cross-platform data exchange and sharing. The integrated application of big data in foreign language education aims to break down "data islands," establish a data governance system for foreign language education resources, and form the total foreign language education data assets of the smart city ecosystem. The outstanding problems in the integration of foreign language education resource data are as follows: no unified data standard, unclear data sources, unsynchronized data exchange, and scattered, disordered data storage. This series of factors results in low data quality, chaotic data flow, insufficient data sharing, and poor data lifecycle management, which greatly restricts what foreign language education big data-assisted smart application terminals can achieve [14, 15].

The integration and application of foreign language education resource data focus on three aspects, "management + governance + application," and the key problems are mainly reflected in the following three areas:

(1) Business data cannot effectively follow a unified data standard. Data standards regulate the consistency and accuracy of data used and exchanged within and across regions at all levels, are bound by normative documents, and are enforced through data standardization control and a data standard management organization, providing a unified data definition standard and logical model for foreign language education resource platforms at all levels. However, because the various platforms were built in different periods and with different structural levels, they were not defined in accordance with a unified data standard at the initial stage of construction, which brings considerable inconvenience to the exchange and sharing of foreign language education resource data. To address such problems, we should first start with the top-level design of informatization and intelligent information services, formulate unified data standards, establish a scientific and standardized data application assessment and evaluation mechanism, carry out the transformation in stages and steps, and perform comprehensive data cleaning on the source data.

(2) The data source of foreign language education resources is not unique, and the data flow is unreasonable. The producer of the data must determine the focal point, which guarantees the uniqueness of the data source: the content of the data cannot be maintained by multiple systems at the same time; otherwise, the uniqueness and accuracy of the data source cannot be guaranteed. The flow of data aims at exchange and sharing, and a public data platform (public data pool) completes cross-business data interaction [16]. This type of problem requires establishing the relevant organizational structures through administrative management, determining the authority for data generation, clarifying the unit responsible for each dataset, constructing a data flow relationship table, and providing a complete data flow for the data sources connected to the system and the data interfaces it publishes, so that foreign language education resource data application requirements can be combed and managed in a unified way.

(3) The quality of business data is not high, with certain degrees of "missing data" and "wrong data." Data quality describes the applicability of the data, that is, how well the data meet users' needs, measured along multiple dimensions such as completeness, consistency, accuracy, timeliness, and legitimacy. In a business platform, data quality provides clean and structured data and is a necessary prerequisite for the data platform to develop data products, provide data services, and realize the value of big data; it is also a key factor in managing foreign language education data assets at all levels and in all regions. Currently, data quality is generally not high at any level or region. On the one hand, it is necessary to improve data quality through in-depth data governance (analysis, correlation, cleaning, and exchange of multisource heterogeneous data); on the other hand, it is necessary to establish a data quality improvement process and assessment system.

3. Integration Mechanism of Heterogeneous Foreign Language Education Resources Based on Time Series Analysis

In view of the missing data and wrong data in the foreign language education resources described above, this paper proposes a data cleaning method based on time series similarity measurement to detect abnormal data and fill in missing data in foreign language materials.

The data cleaning and fusion process is mainly divided into four steps: first, the approximate symbol aggregation algorithm is used to discretize and symbolize the foreign language resource data; second, the Euclidean distance algorithm is used to calculate the similarity between the symbol sequences; third, the curve of the foreign language data is fitted according to the similar sequences to identify and correct abnormal data and fill in missing data; finally, the cleaned data are fused.

3.1. Approximate Symbol Aggregation Algorithm

The symbolic aggregate approximation (SAX) algorithm is a method of discretizing time series data that has emerged in recent years. Its basic idea is to convert numerical time series data into discrete symbol sequences [17]. Through specified mapping rules, the SAX algorithm can weaken the influence of abnormal and missing data on the local fluctuations of the time series and can also generate smaller symbolic, nonnumeric sequences, which improves subsequent aggregation efficiency and strengthens later similarity comparison.

SAX performs equal-length partitioning based on the piecewise aggregate approximation (PAA). If the partition length is long, segments with equal means may still differ greatly internally; a key-point improvement method can be adopted to address this, but it increases algorithm complexity. In this paper, SAX is used to reduce a time series of arbitrary length $n$ to a string of length $w$ ($w < n$), with symbols usually drawn from an alphabet of at most the 26 English letters.

SAX first converts the data to the PAA representation, reducing the time series from dimension $n$ to dimension $w$, then maps all PAA coefficients to equal-probability intervals, and finally symbolizes the PAA representation into a discrete string. The following is a brief execution process of SAX on the original time series $X = \{x_1, x_2, \ldots, x_n\}$.

(1) Normalization: normalization converts each time series to an average value of 0 and a standard deviation of 1, expressed as $\hat{X} = \{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n\}$. The element is

$$\hat{x}_i = \frac{x_i - \mu}{\sigma},$$

where $\mu$ is the average value of the original time series and $\sigma$ is the standard deviation.

(2) Dimensionality reduction by PAA: the normalized sequence of dimension $n$ is reduced to dimension $w$. In the process, the $w$-dimensional time series $\bar{X} = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_w\}$ is obtained. The element in $\bar{X}$ is calculated as

$$\bar{x}_i = \frac{w}{n} \sum_{j = \frac{n}{w}(i-1) + 1}^{\frac{n}{w} i} \hat{x}_j,$$

where $\bar{x}_i$ is the mean value of the $i$th segment when the original time series vector is divided into $w$ segments; $n/w$ is called the compression rate, that is, the interval length of each segment.

After converting the time series set into the PAA representation, it is further converted into discrete symbol form; that is, the elements of the PAA representation of the time series are mapped to equal-probability symbols. Since the normalized time series has an approximately Gaussian distribution, the "breakpoints" are determined by looking up the Gaussian distribution statistics table, thereby generating $\alpha$ regions of equal size, that is, regions with the same probability, where $B = \{\beta_1, \beta_2, \ldots, \beta_{\alpha-1}\}$ is an ordered list of breakpoint values and the area under the Gaussian curve between adjacent breakpoints $\beta_i$ and $\beta_{i+1}$ is $1/\alpha$.

After querying and comparing the breakpoints $B$, the time series collection is transformed into the string collection $\hat{S} = \{\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_w\}$, namely,

$$\hat{s}_i = l_j, \quad \text{if } \beta_{j-1} \le \bar{x}_i < \beta_j,$$

where $L = \{l_1, l_2, \ldots, l_\alpha\}$ is the alphabet; if the $i$th element $\bar{x}_i$ of the $w$-dimensional time series lies between $\beta_{j-1}$ and $\beta_j$, the $j$th element of the alphabet is taken as the $i$th element $\hat{s}_i$ of the string.
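To make the SAX steps above concrete, the following is a minimal Python sketch (normalization, PAA, Gaussian equal-probability breakpoints, symbolization), assuming NumPy and SciPy are available; the function name sax_transform and the synthetic input series are illustrative, not from the paper.

```python
import numpy as np
from scipy.stats import norm

def sax_transform(series, w, alphabet_size=8):
    """Sketch of SAX: z-normalize, PAA-reduce to w segments, then map
    each segment mean to an equal-probability symbol."""
    x = np.asarray(series, dtype=float)
    # (1) Z-normalization: zero mean, unit standard deviation.
    x = (x - x.mean()) / x.std()
    # (2) PAA: mean of each of w (near-)equal-length segments.
    paa = np.array([seg.mean() for seg in np.array_split(x, w)])
    # (3) Breakpoints from the Gaussian quantile function, so that each of
    #     the alphabet_size regions has probability 1/alphabet_size.
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    # (4) Symbolize: index of the region each PAA coefficient falls into.
    return "".join(chr(ord("a") + s) for s in np.searchsorted(breakpoints, paa))

# Illustrative input: a synthetic noisy daily-usage series.
daily = np.sin(np.linspace(0, 4 * np.pi, 120)) + np.random.normal(0, 0.1, 120)
print(sax_transform(daily, w=12))
```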

3.2. Similarity Measurement

The Euclidean distance is one of the most widely used measures in similarity measurement. In application, the sequences to be compared must have the same length, with points corresponding one to one, and the difference between the two sequences is computed pointwise [18]. The Euclidean distance can quickly calculate the similarity of SAX symbolic representations with low computational complexity: the greater the distance between the SAX representations of two foreign language education data sequences, the lower their similarity. The similarity of two foreign language education data time series curves is therefore

$$D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2},$$

where $Q$ and $C$ are the two time series, respectively; $q_i$ is the $i$th point of sequence $Q$; and $c_i$ is the $i$th point of sequence $C$.
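As an illustration of this measure, the hedged helper below computes the pointwise Euclidean distance and retrieves the indices of the most similar candidate sequences; both function names are ours, not the paper's.

```python
import numpy as np

def euclidean_distance(q, c):
    """Euclidean distance between two equal-length sequences."""
    q, c = np.asarray(q, dtype=float), np.asarray(c, dtype=float)
    assert q.shape == c.shape, "sequences must have equal length"
    return np.sqrt(np.sum((q - c) ** 2))

def most_similar(target, candidates, k=3):
    """Indices of the k candidates closest to target (smaller = more similar)."""
    return np.argsort([euclidean_distance(target, c) for c in candidates])[:k]
```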

3.3. Similarity Curve Adjustment

After approximate symbol aggregation and similarity measurement of the time series, the set of time series similar to the SAX representation of the series to be cleaned is obtained, where $X$ denotes the original series and the similar set usually comprises the 30 time series in a month. The similar time series are adjusted by the weighted adjustment method (fitted curve algorithm) to obtain a reference curve relative to the original time series $X$. If there is a missing value in the original time series, it is filled with the value of the corresponding point in the reference curve.

Whether a data point is abnormal is judged by comparison with the reference curve, which is calculated by weighting the similar sequences across all foreign language education resources. This article uses an improved maximum threshold method to determine whether a foreign language data point is abnormal: the method uses the more accurate weighted average of the similar time series to calculate the detection threshold.

A point that does not meet this threshold criterion is considered abnormal data.
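The following sketch ties Sections 3.2 and 3.3 together: it builds a weighted reference curve from similar series, fills gaps, and flags anomalies. Since the paper's exact threshold formula is not reproduced here, a simple k-times-standard-deviation rule is assumed in its place, and the weights are taken as given similarity weights.

```python
import numpy as np

def clean_with_reference(series, similar, weights, k=3.0):
    """Fill gaps and flag anomalies against a weighted reference curve.

    series  : 1-D array with np.nan at missing points
    similar : (m, n) array of m similar time series
    weights : length-m similarity weights (assumed given; see lead-in)
    """
    series, similar = np.asarray(series, float), np.asarray(similar, float)
    weights = np.asarray(weights, float)
    weights = weights / weights.sum()        # normalize similarity weights
    reference = weights @ similar            # weighted reference curve
    filled = np.where(np.isnan(series), reference, series)   # fill gaps
    deviation = np.abs(filled - reference)
    threshold = k * deviation.std()          # assumed threshold rule
    anomalies = deviation > threshold        # flag abnormal points
    cleaned = np.where(anomalies, reference, filled)          # correct them
    return cleaned, anomalies
```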

3.4. Integration of Foreign Language Heterogeneous Educational Resource Data

Foreign language multisource heterogeneous data integration focuses on computing structured and comparable heterogeneous foreign language data, with the goal of improving data quality and obtaining more significant data characteristics. The Kalman filter is an efficient recursive filter that uses a series of measurement data to estimate the state vector of a dynamic system and is particularly effective for heterogeneous structured data [19]. In this paper, the concept of the information pair is introduced into the Kalman filter algorithm, yielding the distributed Kalman algorithm, in which the cleaned data are exchanged and merged with the adjacent data sequences. The information pair $(\Omega_k, q_k)$ is defined as

$$\Omega_k = P_{k|k}^{-1}, \qquad q_k = P_{k|k}^{-1}\,\hat{x}_{k|k},$$

where $P_{k|k}$ is the posterior estimated covariance matrix at time $k$ and $\hat{x}_{k|k}$ is the estimated state value at time $k$. The recursive form of the distributed Kalman filter is

$$\Omega_{k|k-1} = \left(A\,\Omega_{k-1|k-1}^{-1}\,A^{T} + Q\right)^{-1}, \qquad q_{k|k-1} = \Omega_{k|k-1}\,A\,\Omega_{k-1|k-1}^{-1}\,q_{k-1|k-1},$$
$$\Omega_{k|k} = \Omega_{k|k-1} + H^{T} R^{-1} H, \qquad q_{k|k} = q_{k|k-1} + H^{T} R^{-1} z_{k},$$

where $Q$ and $R$ are the covariance matrices of the system noise and the observation noise, respectively; $A$ is the state transition matrix; $H$ is the observation matrix; and $z_k$ is the observation at time $k$. In order to improve the accuracy of local fusion, with the total dataset denoted $S = \{s_1, s_2, \ldots, s_N\}$, each series can send its local posterior information pair to the adjacent series and perform data fusion with the local posterior information pairs of its neighbors $N_i$; the fusion calculations are, respectively,

$$\Omega_i' = \sum_{j \in N_i} w_{ij}\,\Omega_j, \qquad q_i' = \sum_{j \in N_i} w_{ij}\,q_j,$$

where $w_{ij}$ is the combination weight, positive and satisfying $\sum_{j \in N_i} w_{ij} = 1$ for any node.

The multisource heterogeneous data fusion algorithm flow is as follows (a minimal sketch of one fusion step is given after the list):

(1) Initialize the data; the time series form the data space.
(2) Observe the status of the integrated data, and update the information pair of the time series through the recursive update above.
(3) Transmit the information pair of the time series to the adjacent time series. If the time series data are complete and safe, the time series receives the information pairs sent from its neighbors.
(4) Using the fusion formulas above, fuse the local information pair with the information pairs of adjacent data to obtain the fused information pair.
(5) Update the local filter value.
(6) Return to Step (2).
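Below is a minimal sketch of one node's fusion step under the information-pair formulation above, assuming a linear observation model; H, R, the neighbor values, and the weights are illustrative placeholders rather than values from the paper.

```python
import numpy as np

def local_update(Omega, q, H, R, z):
    """Information-form measurement update:
    Omega <- Omega + H^T R^-1 H,  q <- q + H^T R^-1 z."""
    Rinv = np.linalg.inv(R)
    return Omega + H.T @ Rinv @ H, q + H.T @ Rinv @ z

def consensus_fuse(pairs, weights):
    """Convex combination of neighbors' information pairs (weights sum to 1)."""
    Omega = sum(w * O for w, (O, _) in zip(weights, pairs))
    q = sum(w * qi for w, (_, qi) in zip(weights, pairs))
    return Omega, q

# One fusion step for a node with one neighbor (illustrative shapes/values).
n = 2                                    # state dimension
H, R = np.eye(n), 0.1 * np.eye(n)        # assumed observation model
Omega, q = np.eye(n), np.zeros(n)        # local prior information pair
z = np.array([1.0, 0.5])                 # local observation
Omega, q = local_update(Omega, q, H, R, z)
neighbor = (2.0 * np.eye(n), np.array([2.0, 1.0]))
Omega, q = consensus_fuse([(Omega, q), neighbor], weights=[0.6, 0.4])
x_hat = np.linalg.solve(Omega, q)        # recover the fused state estimate
print(x_hat)
```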

4. Experimental Results and Analysis

Section 3 described the SAX method, the similarity measurement method, and the data cleaning and data fusion methods used in this experiment. Foreign language resources at all levels in a city were selected as the experimental dataset, and data cleaning and integration experiments were carried out on it. The experimental data consisted of 16 series of the three data types mentioned above, with a collection interval of 1 min and a sampling frequency of 10 kHz for the multisource data.

In order to prove the effectiveness of the multisource heterogeneous data fusion algorithm proposed in this paper, in addition to the traditional Kalman filtering algorithm [20], this paper also compares the case of not applying any filtering algorithm and draws the corresponding conclusions. Table 1 shows the ratio of effective data obtained after data fusion using the three methods: no data fusion, the traditional Kalman filter algorithm, and the distributed Kalman algorithm proposed in this paper, in which each iteration fuses the dataset with its neighbor nodes. On smaller datasets the effect is better; as the size of the dataset increases rapidly, data fusion efficiency declines rapidly. This is because with many series the resource consumption is large, and a larger dataset leads to a lower data fusion ratio and higher data redundancy. The multisource heterogeneous data fusion algorithm can effectively reduce redundant data information and thus approach the actual effective data value. Compared with the unfused data method, it reduces resource consumption considerably and can align different data more effectively and resolve feature conflicts between data.

Data sequences with no more than 1% gaps were selected as the experimental set, and data points were set to be empty at a random ratio; the method in this article was then used to clean the data, the predicted values of the gaps were compared with the original values, and the mean absolute percentage error, root mean square error, and standardized root mean square error were calculated. The mean absolute percentage error (MAPE) is the average absolute value of the relative error ratio, which can reflect the credibility of the measurement data, namely,

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \times 100\%,$$

where $y_i$ is the original value, $\hat{y}_i$ is the predicted value, and $n$ is the number of compared points.
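For reference, MAPE as defined above reduces to a few lines of Python; the helper name is ours.

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return 100.0 * np.mean(np.abs((predicted - actual) / actual))

# e.g., mape([100, 200, 400], [95, 210, 380]) == 5.0
```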

In addition to the traditional Gaussian filter algorithm [21], this paper also compares wavelet threshold data cleaning, as shown in Table 2. The mean absolute percentage error of the algorithm proposed in this paper is 6.25%, which shows that the prediction results are relatively accurate, and the root mean square error is within the numerical range of the standard point, meeting the requirements.

At the same time, an error analysis of the integrated foreign language education data indicators is carried out, and the accuracy of the algorithm in this paper is evaluated according to the RMSE, MAPE, and t-test [22]. Table 3 reflects the RMSE, MAPE, and t-test results of the two algorithms from a quantitative perspective. From the analysis of the results, the error of the algorithm in this paper tends to be flat or reduced, which proves that the algorithm has good performance and can meet the accuracy requirements.

5. Conclusions

In the era of AI education, building a reusable and sharable educational data model is one of the urgent problems in the development of foreign language education. The model needs to standardize the multisource, heterogeneous foreign language education data in the AI education environment so as to achieve a high degree of sharing of heterogeneous foreign language data. In this paper, an IIoT-oriented environment is considered, and a method based on time series similarity measurement is proposed to clean and integrate heterogeneous foreign language education resource data. SAX reduces the dimensionality of the time series data piecewise, thereby reducing noise; the Euclidean distance similarity measure finds a set of similar time series with low time complexity; the maximum threshold method detects outliers; and a reference curve obtained by the weighted adjustment method fills in the vacant values. This is more accurate than the traditional method that uses the maximum threshold of the average value of similar days. The cleaned data then undergo the multisource heterogeneous data fusion algorithm: through data fusion between adjacent series, redundant data can be better fused to ensure data quality and provide more practical value for the high sharing of heterogeneous foreign language education resources.

This research still leaves much work to be explored further, such as how to deepen the integration of different types of foreign language education resources and how to bring different types of resources, such as online teaching, online vocational training, online examination, and synchronized online education guidance, onto the same platform; this is the future research and development direction of the foreign language education resource integration platform. We believe that foreign language education can be improved through different angles and approaches to enhance its fairness, universality, and sharing, a goal that we all have been working toward together.

Data Availability

All data used to support the findings of the study are included within the article.

Conflicts of Interest

The author declares no conflicts of interest.

Acknowledgments

This paper was supported by the 13th Five-Year Social Science Research Project of the Education Department of Jilin Province (no. JJKH20200438SK) and the Social Science Fund Project of Jilin Province (no. 2020C113).