Abstract
Real-time data stream mining algorithms are largely based on binary datasets and cannot handle continuous quantitative data streams, which are common in the medical data mining field. Therefore, this paper proposes a new weighted sliding window fuzzy frequent pattern mining algorithm over data streams based on interval type-2 fuzzy set theory (WSWFFPT2), which requires only a single scan and is evaluated on synthetic medical data streams. A weighted fuzzy frequent pattern tree based on type-2 fuzzy set theory (WFFPT2tree) and a fuzzy-list sorted structure (FLSS) are designed to mine the fuzzy frequent patterns (FFPs) over the medical data stream. The experiments show that the proposed WSWFFPT2 algorithm is well suited to mining quantitative data streams and is not limited to static databases; its performance is reliable and stable under the weighted sliding window. Moreover, the proposed algorithm outperforms existing algorithms in mining FFPs in terms of recall and precision rates.
1. Introduction
Frequent itemset mining (FIM) is the primary processing step of the association rule mining (ARM, Apriori 1993) algorithm, which mines associations in the form of rules, such as "IF eats more than usual THEN has obesity," from a dataset. ARM has irreplaceable advantages in real-world applications, especially in medical diagnosis and health prediction: it can mine potential disease-related rules, and it provides traceability for disease-relevant results, since it can easily uncover hidden associations between diseases and their related features in a dataset. For this reason, this paper focuses on mining frequent patterns for computer-aided medical diagnosis and health prediction. In a database, the number of times an item appears is referred to as the item's support, and an item is a frequent item (FI) if its support exceeds a defined threshold. The Apriori [1], Eclat [2, 3], and FP-Growth [4, 5] algorithms are the three most classical FIM algorithms.
A significant number of improved algorithms, such as AFOPT [6–8], FP-Growth variants [9–11], CPF-Growth [11], and BFP-Growth [12], emerged after the classic FP-Growth algorithm [4, 5] was proposed. The key step of the association rule mining (ARM) algorithm is frequent pattern mining (FPM), which can process and analyze association relationships over Boolean variables, sequences, or transactions, but cannot directly process quantitative values represented by floating-point numbers [13, 14].
To solve this problem, researchers began studying the discretization of continuous attribute features into categorical attributes to discover implicit association rules. Lee (1997) introduced fuzzification into the ARM algorithm [15], so that the user could influence the discovery process and obtain more detailed rules. The aim of the fuzzy transformation is to convert quantitative data into fuzzy data.
ARM has been extended to process continuous data by introducing new rules or integrating several algorithms, such as incorporating a new "interest" metric [16], inserting additional rule templates [17], implementing a bottom-up grouping method [18, 19], and introducing genetic algorithms [19, 20]. Moreover, to validate the ARM results [21–23], the threshold conditions on minimum support and confidence can be tuned iteratively.
The approaches above directly transform continuous data into crisp categories. Sharp boundaries emerge [24] when the transformed numerical attributes lie near the interval segmentation points, which affects both the matching at the segmentation points and the precision of the algorithm.
To minimize the sharp boundary issue, researchers began to apply the fuzzy association rule mining (FARM) algorithm. These methods [25–28] apply fuzzy theory to expand crisp intervals into membership value intervals for mining fuzzy association rules from quantitative data. The fuzzy sets [29, 30] that correspond to the memberships preserve the intervals of the numerical attributes during fuzzification [31], which increases the amount of information carried by the transformed numerical data and alleviates the sharp boundary issue.
Hong et al. [32] introduced two different support thresholds, instead of a single conventional threshold, to divide the candidate frequent itemsets into parts with different meanings, thus reducing the workload of rescanning the initial dataset. Fuzzification extends the data space into a fuzzy space. The main reason for the imbalance of fuzzy sets is that the amount of data in the central fuzzy set obtained by the fuzzy transformation increases, thereby inflating the additional fuzzy memberships. Hong et al. [33] proposed a FARM algorithm based on type-1 fuzzy set theory. Type-2 fuzzy set theory [34] was then introduced to solve the FARM problem in regular databases [35]. The methods in [32–35] motivated this work on fuzzy association rule mining over medical quantitative data streams based on type-2 fuzzy set theory.
To the best of the authors' knowledge, no existing research has applied a weighted sliding window to mine FFIs over quantitative data streams based on type-2 fuzzy set theory, bridging the understanding between computers and human beings without expert suggestions. This paper improves on previous models and proposes a new weighted sliding window frequent pattern mining algorithm for quantitative data streams based on type-2 fuzzy set theory.
In this paper, the WSWFFPT2 algorithm is proposed to efficiently mine the FFPs with one scan over medical data streams. For the dynamic FFP mining process, a fuzzy-list sorted structure and a WFFPT2tree are proposed. The main contributions of this article are as follows:
(1) To distinguish the weights of historical and recent transactions, a weighted sliding window with a decay factor is adopted, which simultaneously mitigates unexpected concept drift in the item counts of data streams. Recall and precision rates are applied to control the influence of the decay factor on the frequent patterns and the critical frequent patterns of the sliding window.
(2) A novel WSWFFPT2 algorithm over medical data streams is proposed to efficiently mine the FFPs with only one scan using a fuzzy-list sorted structure (FLSS) and a WFFPT2tree. Construction and deletion strategies based on a linked list are used to build and maintain the WFFPT2tree. Various examples and figures are provided to clarify the interval type-2 fuzzy set mining process over the quantitative data stream.
(3) The proposed WSWFFPT2 algorithm is compared with previous related algorithms to evaluate its performance. Several synthetic data streams, using decay factors with sliding windows, are used in the performance evaluation. The experimental results show that the proposed algorithm outperforms the previous ones.
The rest of this paper is organized as follows. Related works and preliminaries are introduced in Sections 2 and 3, respectively. The proposed algorithm is presented in Section 4. An experimental study is illustrated in Section 5. Finally, Section 6 concludes this paper.
2. Related Works and Research Problems
This section briefly reviews the studies concerning the sliding window [36] with fading factor, type-2 fuzzy set theory, and pattern-tree-based algorithms for quantitative data streams.
2.1. Sliding Window with Fading Factor
In data streams in particular, new transactions are much more important than historical transactions during mining. To distinguish new from old transactions and account for the importance of the latest data, MSW [37] and SWP-tree [38] assign different weights to old and new transactions; without such weighting, mining would produce a large number of redundant patterns. The classical algorithms based on the window model include Lossy Counting [39] and DSM_FT [40]. The most common window model is the sliding window model; for example, MSW [38], MFI-CBSW [41], MFI-TimeSW [42], and TMFI [43] use sliding window models to mine frequent patterns in data streams. The classical algorithms based on the decay model are estDec [44] and estDec+ [45]. According to the characteristics of the pattern, mining methods can be divided into frequent pattern, frequent sequence pattern, frequent tree pattern, and frequent graph pattern mining.
2.2. Interval Type-2 Fuzzy Set Theory
Based on the pattern growth mechanism, Lin et al. [46] used a fuzzy frequent pattern tree (FFP-tree) to discover the set of FFIs. When the transaction volume is large, this method may require a significant amount of computation. Therefore, the compressed fuzzy frequent pattern (CFFP) tree algorithm [47] was proposed to reduce the size of the frequent pattern tree. Compared with the FFP-tree algorithm, the CFFP algorithm significantly reduces the number of tree nodes. However, each node must reserve an additional array to store the currently processed node and its membership value, so the CFFP algorithm requires a large amount of memory, which is inefficient for very sparse databases. To solve this problem, the UBFFPT-tree algorithm [48] not only retains the simplified tree structure but also mines fuzzy frequent itemsets within limited memory, effectively mining the FFIs with the same tree node size as the CFFP-tree algorithm. The algorithms above use only one linguistic term to represent each item being processed in the database, so the information found may be incomplete and insufficient. Moreover, they use type-1 fuzzy set theory without considering uncertainty.
Fuzzy set theory has also been widely used in intelligent information systems. In data mining, when dealing with a transaction database containing sales quantity information, fuzzy sets can be used to transform item quantities into linguistic terms. Expressing concepts with semantic terms is more in line with human thinking. Fuzzy data mining algorithms have been proposed for linguistic association rules [49] and related issues.
To overcome this limitation, researchers designed and implemented type-2 fuzzy set theory, which accounts for uncertainty. In 1965, Zadeh [50] proposed the type-1 fuzzy set (T1FS) and then the type-1 fuzzy logic system (T1FLS) based on T1FS to model uncertain, fuzzy, and inaccurate problems. Since the T1FLS is represented by an exact membership function, its ability to directly handle the uncertainty of fuzzy rules is limited. Karnik and Mendel [51] proposed the type-2 fuzzy set (T2FS), which blurs the membership degree in the fuzzy set, that is, the membership degree itself is a T1FS, enhancing the T2FS. Chen et al. [52] proposed an Apriori-based standard method to discover FFIs with type-2 fuzzy sets. The algorithm used a generate-and-test mechanism based on type-2 fuzzy membership functions to find the candidates level by level.
The interval type-2 fuzzy set [53] is an extension of the traditional type-1 fuzzy set; its essence is to make the membership value in the fuzzy set itself fuzzy. Interval type-2 fuzzy set theory can directly handle the uncertainty of fuzzy rules. Using the membership functions of type-2 fuzzy set theory to mine FFPs requires a centroid-based mechanism to convert the fuzzy value (interval) of a linguistic term into a crisp fuzzy value, which incurs a higher time complexity when generating the candidates. The proposed algorithm adopts type-2 fuzzy set theory.
Type-2 FSs are described by MFs characterized by more parameters than the MFs of type-1 FSs; hence, type-2 FSs provide more design degrees of freedom [54]. Therefore, type-2 fuzzy sets have the potential to outperform type-1 fuzzy sets, especially in uncertain environments.
3. Preliminaries
At present, most FFP algorithms evolve from the frameworks of Apriori [1] and FP-Growth [4]. However, these methods cannot handle a data stream on the fly. During data stream mining, some items rarely appear for a time but may become frequent in the future as new transactions continue to arrive. As the data changes dynamically, the importance of the data itself also changes, so different processing models should be adopted according to how the data evolves. When the importance of the data gradually decreases, a decay model should be used, whose key parameter is the decay factor.
Definition 1 (quantitative data stream (QDS)). Given a set of items I = {i1, i2, ..., im}, the QDS = {T1, T2, ..., Tn, ...} is a sequence of transactions that arrives continuously at a certain speed and is represented by quantitative values, where Tn represents the n-th transaction and every item in Tn has an associated quantitative value.
Definition 2 (weighted sliding window [36]). The start and end of the sliding window are determined by the batch and the current time, and each batch has its own decay factor during the data stream mining process. A window contains 3 batches, and each batch contains two time series. As shown in Figure 1, the first sliding window SW1 contains three batches, B1, B2, and B3; batch B1 contains two transactions, T1 and T2. The number of batches can be adjusted according to the actual data stream, meaning the algorithm implements a variable, dynamic weighted sliding window.
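The batch-based window of Definition 2 can be sketched as a fixed-capacity queue of batches: when a new batch arrives and the window is full, the oldest batch is evicted (and its contribution can later be removed from the tree). The Java sketch below is illustrative only; the class and method names are ours, not the paper's.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Illustrative sketch of a batch-based sliding window (Definition 2).
// Names (SlidingWindow, push) are ours, not from the paper.
class SlidingWindow {
    private final int capacity;                  // number of batches per window
    private final Deque<List<int[]>> batches = new ArrayDeque<>();

    SlidingWindow(int capacity) { this.capacity = capacity; }

    // Add a new batch; return the evicted (oldest) batch once the window
    // is full, so the caller can remove its contribution from the tree.
    public List<int[]> push(List<int[]> batch) {
        List<int[]> evicted = null;
        if (batches.size() == capacity) {
            evicted = batches.pollFirst();       // transactions leaving the window
        }
        batches.addLast(batch);
        return evicted;
    }

    public int size() { return batches.size(); }
}
```

With a capacity of 3 batches, the fourth push evicts the first batch, mirroring how SW2 drops the oldest batch of SW1.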
Definition 3 (decay factor [55, 56]). The decay factor describes the degree to which data is attenuated over time. In the sliding window model, where data fades over time, the knowledge contained in the patterns of the latest window is more important or valuable than that of historical windows. For example, currently monitored medical data has a larger reference value for medical diagnosis and prediction than past data.
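As a minimal illustration of Definition 3, the sketch below assumes an exponential decay scheme in which a batch that is `age` steps older than the newest batch is weighted by lambda^age; the paper's exact weighting may differ, and the method names are ours.

```java
// Sketch of Definition 3: older batches are down-weighted by a decay
// factor lambda in (0, 1]. A batch `age` steps old gets weight lambda^age.
class Decay {
    // Weight applied to a batch `age` steps older than the newest batch.
    public static double weight(double lambda, int age) {
        return Math.pow(lambda, age);
    }

    // Decayed count of a pattern across batches; counts[0] is the newest.
    public static double decayedCount(double[] counts, double lambda) {
        double total = 0.0;
        for (int age = 0; age < counts.length; age++) {
            total += counts[age] * weight(lambda, age);
        }
        return total;
    }
}
```

For lambda = 0.5, a pattern appearing once in each of three batches has decayed count 1 + 0.5 + 0.25 = 1.75, illustrating why decayed supports are smaller than raw counts.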
Definition 4 (frequent pattern). Let |DS| be the number of transactions currently in the data stream and θ be the minimum support threshold. If a pattern X satisfies sup(X) ≥ θ × |DS| (Equation (1)), then X is a frequent pattern.
Definition 5 (error factor). The error factor ε is the maximum allowable error threshold (ε < θ). If a pattern X satisfies ε × |DS| ≤ sup(X) < θ × |DS| (Equation (2)), then X is a critical frequent pattern. If sup(X) < ε × |DS|, then X is an infrequent pattern.
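Definitions 4 and 5 together partition patterns into three classes. A minimal sketch, assuming a (possibly decayed) support value has already been computed; the class and label names are ours:

```java
// Sketch of Definitions 4 and 5: with minimum support theta, error factor
// epsilon (epsilon < theta), and n transactions, a pattern's support
// decides its class. Critical patterns are kept to limit concept drift.
class PatternClass {
    public static String classify(double support, double theta,
                                  double epsilon, double n) {
        if (support >= theta * n)   return "FREQUENT";
        if (support >= epsilon * n) return "CRITICAL";
        return "INFREQUENT";
    }
}
```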
Definition 6 (lingual variable [35]). A lingual variable takes its values from a set of linguistic terms (e.g., low, middle, high); the value of the lingual variable is determined according to the membership functions given by the user.
Example: the membership relationship in Figure 2 can be transformed into intervals through the type-2 fuzzy process. Figure 2 shows 10 sequences through which the data stream passes, together with the corresponding data items; the quantitative values of the items in the first transaction are transformed into interval membership values accordingly.
4. Proposed WSWFFPT2 Algorithm
The proposed WSWFFPT2 algorithm consists of four phases for mining the FFPs. The first phase is the data stream input, whose values are quantitative data about the objects rather than a crisp database. The second phase consists of two steps, the weighted sliding window and the fuzzy preprocessing based on interval type-2 fuzzy theory, including the original fuzzification and the fuzzy compression process; here the quantitative data is transformed into linguistic membership values, as detailed below. To avoid the problem of concept drift [57], the decay and error factors are used in the weighted sliding window over the data stream to generate the frequent patterns and the critical frequent patterns. The third phase is the construct-delete process of the WFFPT2tree with the fuzzy-list sorted structure. The WFFPT2tree-based generation of FFPs is the last phase. Figure 3 presents the structure of the WSWFFPT2 algorithm.
4.1. Preprocessing Fuzzification
4.1.1. Prefuzzification Based on Interval Type-2 Fuzzy Theory
For the designed WSWFFPT2 algorithm to mine FFPs on a data stream, the quantitative attributes are first fuzzified by membership functions defined on interval type-2 fuzzy theory. This step is called the preprocessing fuzzification because the fuzzified data is compacted in a subsequent step. During it, each quantitative attribute is transformed into a linguistic value (low, middle, or high) as shown in Figure 4. Equations (3)–(5) are the transformation functions for the three groups, and the value of the distance "d" can be defined by the user according to actual demand. For example, the original quantitative data shown in Figure 2 is transformed into the membership attributes of Figure 4, and the fuzzified results are listed in Figure 5.
The membership functions of interval type-2 fuzzy sets can be constructed by the individual membership function method or the interval endpoint method. Figure 4 shows the membership functions of type-2 fuzzy set theory: the abscissa is the quantity in the transactions, and the ordinate is the membership degree. The linguistic terms low, medium, and high denote the division into different intervals. Users can also formulate the linguistic terms according to actual application requirements.
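To illustrate how a quantity maps to an interval membership under an interval type-2 membership function, the sketch below assumes a triangular upper membership function and a lower membership obtained by scaling the upper one by 0.8; the actual shapes and footprint of uncertainty are given by the paper's Equations (3)–(5) and Figure 4, and the scaling is our assumption.

```java
// Minimal sketch of interval type-2 membership evaluation: a quantity x
// maps to an interval [lowerMF(x), upperMF(x)] for one linguistic term.
// Triangular shape and the 0.8 footprint scaling are assumptions for
// illustration only.
class IT2Membership {
    // Triangular membership with peak at b on the support (a, c).
    static double tri(double x, double a, double b, double c) {
        if (x <= a || x >= c) return 0.0;
        return x < b ? (x - a) / (b - a) : (c - x) / (c - b);
    }

    // Interval membership [lower, upper] for one term.
    public static double[] interval(double x, double a, double b, double c) {
        double upper = tri(x, a, b, c);
        double lower = 0.8 * upper;   // assumed footprint of uncertainty
        return new double[]{lower, upper};
    }
}
```

At the peak (x = b) the interval is [0.8, 1.0]; outside the support it collapses to [0, 0].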
After the prefuzzification based on interval type-2 fuzzy set theory, the fuzzified data is represented by pairs consisting of the item name Iname and the membership value calculated from Equations (3)–(5). To unify the abbreviations after the initial fuzzy conversion, the linguistic attributes low, mid, and high are denoted by corresponding abbreviated fuzzy terms.
Definition 7. The linguistic term set and the corresponding membership degrees (also called fuzzy values) satisfy the membership relationship of type-2 fuzzy theory, defined as follows:
For instance, an item in Tid_{1} with a corresponding value of 4 can be converted as follows:
Therefore, the items in the first transaction are transformed from quantitative data into linguistic terms, and the conversion results of all itemsets in Figure 2 are indicated in Figure 5 based on the defined split point number.
4.1.2. The Fuzzy Compression Process
To reduce the complexity of the preprocessing step, the interval value is compressed by the fuzzy compression process defined in Definition 8. For example, in the transaction shown in Figure 5, an item with quantity 4 is first transformed into an interval membership; the interval is then reduced using the type-reduction method with the defined split point number according to Equation (9). After that, the two interval bounds are reduced to a single membership value, as shown in Figure 6.
Definition 8. The membership degree of a linguistic term in the fuzzified dataset is denoted and defined as follows [35]:
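One common way to realize the type reduction sketched above is to average evenly spaced sample points across the interval, controlled by a split point number. This is an assumption for illustration; the paper's Equation (9) may weight the points differently, and the names below are ours.

```java
// Sketch of fuzzy compression (type reduction): collapse an interval
// membership [lower, upper] into one crisp value by averaging k evenly
// spaced sample points across the interval (k >= 2). For a symmetric
// sampling this equals the interval midpoint.
class TypeReduce {
    public static double reduce(double lower, double upper, int k) {
        double sum = 0.0;
        for (int i = 0; i < k; i++) {
            sum += lower + (upper - lower) * i / (k - 1);
        }
        return sum / k;
    }
}
```

For example, the interval [0.8, 1.0] reduces to 0.9 regardless of the split point number, because the sample points are symmetric around the midpoint.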
4.2. Weighted Sliding Window Model
In this paper, a sliding window model based on the average decay factor (ADF) is applied to construct the WFFPT2tree of the type-2 fuzzy set [51] for mining the fuzzy frequent patterns (FFPs). The ADF was proposed by Han and Ding [55] for mining FPs on transactional data streams, but not for quantitative data streams. Combining the sliding window with the ADF not only considers the latest data in a limited time window but also fully accounts for the influence of historical data, which effectively improves the completeness of the mined patterns. Moreover, the sliding window uses an improved decay factor to assign different weights to new and old data from different periods. As the data stream continues to arrive, the itemsets change steadily, and the error factor avoids the problem of concept drift within the sliding window.
Because a data stream is continuous, the knowledge contained in it changes over time, and the recent transactions are usually more important or valuable. Therefore, a decay factor is used in this paper to weight the transactions. By the definition of the decay factor, the smaller its value, the less important the corresponding data and the faster the attenuation; conversely, the larger the value, the greater the importance and the slower the decay.
In the time decay model [58, 59], all data in the sliding window share the same decay factor. In a sliding window, the support value of an item can be represented as follows:
The literature [55] used a time decay model to mine the closed frequent patterns of a data stream. Experiments showed that the support counts obtained after applying the decay factor in the time decay model are much smaller than the counts without decay; mining with the conventional minimum support would therefore lose possible frequent patterns. To solve this problem, a maximum allowable error factor is introduced: both the FPs and the critical FPs (Definitions 4 and 5) are retained during the mining process.
The decay factor of the sliding window is used to weight the fuzzified values across the three batches. Each batch contains two itemsets in chronological order in the data stream, and the size of the sliding window can be changed.
After determining the minimum support and the maximum allowable error thresholds according to the literature [38], assumed 100% recall and 100% precision rates are used to estimate the decay factor. When the recall rate is 100%, Equation (14) is satisfied, giving the lower limit; when the precision rate is 100%, Equation (15) is satisfied, giving the upper limit.
Since recall and precision cannot both be maximized at the same time, the choice of decay factor must balance the two. Therefore, this article uses the average of the upper and lower limits, as in the literature [59].
Definition 9. The support value of a linguistic term is the scalar cardinality of the fuzzy frequent item [35], defined as follows:
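The scalar cardinality of Definition 9 is simply the sum of a term's membership degrees over the transactions in the window. A minimal sketch with illustrative names:

```java
// Sketch of Definition 9: the fuzzy support of a linguistic term is its
// scalar cardinality, i.e., the sum of its membership degrees over the
// transactions currently in the window.
class FuzzySupport {
    public static double scalarCardinality(double[] memberships) {
        double sum = 0.0;
        for (double m : memberships) sum += m;
        return sum;
    }
}
```

For memberships 0.5, 0.9, and 0.2 across three transactions, the term's support is 1.6, which is then compared against the (decayed) minimum support threshold.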
According to Definitions 4 and 5, when the minimum support, the average decay factor, and the error factor are known, the frequent patterns and the critical frequent patterns weighted by the sliding window can be obtained according to Equations (11) and (13), respectively. The patterns are then sorted in descending order by the accumulated count values of the frequent and critical frequent patterns, as shown in Figure 7. Although the prefuzzification step is performed first, the weighted sliding window will significantly change the count value of an item after multiple iterations, owing to the variability and continuity of the data stream. The FFPs and the critical FFPs are listed in Figure 7. After the weighted sliding window process, the critical FFPs reflect the incremental part relative to the user-defined minimum support, which reduces the concept drift problem.
4.3. The Construction-Renew Process of WFFPT2Tree
This paper uses the fuzzy-list sorted structure and the WFFPT2tree to generate fuzzy frequent patterns. The fuzzy-list sorted structure stores the attributes of the fuzzy frequent items after the prefuzzification: the name of the fuzzy frequent item, its count, and its link node. The construction process of the WFFPT2tree consists of adding and deleting transactions based on the fuzzy-list sorted structure. The proposed WFFPT2tree is an improvement of the FP-tree [5]. Together with the fuzzy-list sorted structure, it reduces memory usage and speeds up the mining process.
In the WFFPT2tree structure, each node represents a fuzzified semantic tree transaction, which includes the weighted fuzzified semantic representation of the itemset and the corresponding membership and count values. Since the list is sorted in descending order first, the cache is stored in the form of a queue. Figure 8 shows the construction process of the WFFPT2tree according to Figure 7.
The idea of generating fuzzy pattern trees in descending order is derived from the tail expansion strategy. This paper adopts horizontal links to quickly find corresponding nodes in the prefix pattern tree. A horizontal link connects the nodes with the same fuzzy linguistic term in the descending prefix pattern tree, although their count values may differ, which makes it convenient to search for and insert new subtrees or single nodes during the sliding window process.
4.3.1. Inserting Phase of the New Tree Transaction
The construction of the WFFPT2tree is based on the fuzzy-list sorted structure. When a new tree transaction is generated, it is added to the tree with its link node and sorted into the fuzzy-list structure. The specific implementation process is as follows. The fuzzy-list sorted structure table contains the frequent and critical frequent patterns obtained after the weighted sliding window, sorted by count value, as shown in Figure 9.
When the WFFPT2tree is empty, the first fuzzified transaction passing through the weighted sliding window, which contains three fuzzy frequent items, is inserted under the root header in descending order. When the second transaction comes, the descending subtree contains the same patterns, so the counts are increased as shown in Figure 8. When the next transaction enters and there is no identical node in the current prefix pattern tree, a new subtree is established. Figure 8 shows the detailed process of adding new branches to the fuzzy frequent pattern tree.
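The insertion step described above follows the usual prefix-tree idea: shared prefixes of sorted transactions merge and accumulate counts, and a new branch is created where a prefix diverges. The sketch below omits the membership values and horizontal links of the real WFFPT2tree; all names are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of inserting a sorted, fuzzified tree transaction into a prefix
// tree (the WFFPT2tree insertion idea). Shared prefixes merge and
// accumulate counts; a new branch is created where the prefix diverges.
class PrefixTree {
    final Map<String, PrefixTree> children = new LinkedHashMap<>();
    double count = 0.0;

    // Insert items (already in descending support order) with a weight.
    public void insert(String[] items, double weight) {
        PrefixTree node = this;
        for (String item : items) {
            node = node.children.computeIfAbsent(item, k -> new PrefixTree());
            node.count += weight;
        }
    }

    // Accumulated count of the node reached by following `path` from here.
    public double countOf(String... path) {
        PrefixTree node = this;
        for (String item : path) {
            node = node.children.get(item);
            if (node == null) return 0.0;
        }
        return node.count;
    }
}
```

Inserting two transactions that share the prefix "A.mid" produces one shared node with count 2 and two divergent children, mirroring the branch creation in Figure 8.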
A similar process applies when the subsequent transactions are added to the WFFPT2tree in Figure 9. Figure 9 shows the sorted list structure of all items in the sliding window, including the fuzzy semantic names, the decay factor, and the link nodes.
4.3.2. Deleting Phase of the Old Tree Transaction
This operation occurs during the sliding phase of the weighted sliding window over the data stream. For example, after the tree transactions shown in Figure 9 are established and the window slides to the next stage, the new fuzzy frequent transactions should be added to the WFFPT2tree (shown in blue in Figure 10), and the old tree transactions should be deleted (shown in red in Figure 10). First, a record is popped from the front of the fuzzy-list sorted structure table, and a pointer to the tail of the corresponding tree transaction in the WFFPT2tree is used to find the transaction that needs to be deleted. Then, from the tail of the WFFPT2tree to the root, the count values of the nodes along the path are updated. As seen from the path, the red nodes whose weighted count values do not meet the minimum support are deleted from the tree, and their horizontal links are deleted at the same time.
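The deletion phase can be sketched as subtracting an expired transaction's weight along its path and then pruning nodes whose counts fall below the minimum support. The simplified node type below omits the horizontal links that the real WFFPT2tree also unlinks; names are illustrative.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch of the deletion phase: when a batch slides out of the window,
// its contribution is subtracted along its path, and nodes whose counts
// fall below the minimum support are pruned (dropping their subtrees).
class TreePrune {
    static class Node {
        final Map<String, Node> children = new HashMap<>();
        double count = 0.0;
    }

    // Subtract `weight` along `path`, then prune children below `minCount`.
    public static void remove(Node root, String[] path, double weight,
                              double minCount) {
        Node node = root;
        for (String item : path) {
            node = node.children.get(item);
            if (node == null) return;        // path no longer in the tree
            node.count -= weight;
        }
        prune(root, minCount);
    }

    static void prune(Node node, double minCount) {
        Iterator<Map.Entry<String, Node>> it =
                node.children.entrySet().iterator();
        while (it.hasNext()) {
            Node child = it.next().getValue();
            if (child.count < minCount) it.remove();  // drops the subtree
            else prune(child, minCount);
        }
    }
}
```

After subtracting the expired weight, a node whose count drops below the threshold is removed together with its subtree, matching the red deletions in Figure 10.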
Figure 10 shows the new nodes of the sliding window in blue, together with the deletion results. Figure 11 shows the WFFPT2tree of the current sliding window after the nodes are deleted. Figures 12 and 13 show the adding and deleting processes, respectively, after the window slides again. Figures 8–13 illustrate the entire sliding window process of the WSWFFPT2 algorithm based on the fuzzy-list sorted structure, including constructing the tree and adding and deleting the fuzzy nodes.
4.3.3. FFPs Mining Phase
Frequent patterns are mined by querying the mining results of the current window. In the WFFPT2tree, all frequent patterns can be mined by recursively traversing the entire tree; the patterns with support greater than the threshold are frequent patterns. Although the proposed WSWFFPT2 algorithm mines frequent patterns with a fixed window size, the insertion and deletion of nodes are independent of each other, so increasing or reducing the sliding window size does not affect the correctness of the algorithm. This is also tested and verified in the subsequent experiments. The algorithm can easily be applied to a variable-length sliding window model.
4.3.4. The Pseudocode of the Proposed WFFPT2Tree Algorithm

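The paper's original pseudocode is not reproduced in this excerpt. The outline below is a hedged reconstruction of the main loop from the descriptions in Sections 4.1–4.3; the step names and their ordering are inferred from the text, not quoted from the source.

```
Input:  quantitative data stream DS, window size w, minimum support θ,
        error factor ε, decay factor λ (average of the recall/precision bounds)
Output: fuzzy frequent patterns (FFPs) of the current window

for each incoming batch B in DS:
    fuzzify B with the interval type-2 membership functions (Eqs. (3)-(5))
    compress each interval membership by type reduction (Eq. (9))
    apply the decay factor λ to the counts of all existing entries
    update the fuzzy-list sorted structure (FLSS) with the terms of B
    insert the sorted tree transactions of B into the WFFPT2tree
    if the window holds more than w batches:
        subtract the expired batch's weighted counts along its tree paths
        prune nodes whose counts fall below the error threshold ε,
        removing their horizontal links as well
    mine the WFFPT2tree recursively; report patterns with support ≥ θ
```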
5. Experimental Study
5.1. Experimental Environment and Data Stream
This section presents the experimental results and the performance analysis of the WSWFFPT2 algorithm. All experiments are performed on a Windows 10 64-bit system with an Intel Pentium processor, and all algorithms are implemented in Java (IntelliJ IDEA 2019.3.3 x64) [60]. The data stream studied in this article is generated from the breast.w.arff dataset (Wisconsin Prognostic Breast Cancer Database) to support the findings of this study.
The dataset contains 10 attributes: Clump_Thickness, Cell_Size_Uniformity, Cell_Shape_Uniformity, Marginal_Adhesion, Single_Epi_Cell_Size, Bare_Nuclei, Bland_Chromatin, Normal_Nucleoli, and Mitoses (all numeric), plus Class {benign, malignant}. The class values benign and malignant give the breast cancer diagnosis. The numeric attributes take integer values in [1, 10].
5.2. Performance Analysis
Figures 14 and 15 show the time and memory consumption of the proposed algorithm on the breast cancer database, which is replayed as a data stream with 64000 transactions. For both memory consumption and running time, the window size is gradually expanded from 4 to 250. Figure 14 shows that as the window grows, the time consumption varies within a certain interval, but the overall trend is stable. The reason is that the algorithm maintains a record of all tree transactions in the sliding window: it must add the newly arrived transactions to the list and delete the nodes of tree transactions that leave the window. Since the length of the transaction itemsets in a data stream is uncertain, the fuzzification time is also uncertain, which stems from the characteristics of stream data. At the same time, the first step of the algorithm is to perform type-2 fuzzification on the numeric data stream, which increases the number of nodes to be processed and thus the time overhead of the algorithm. However, the overall running time is stable and concentrated within an acceptable range.
Figure 15 shows that the memory grows with the window size, indicating that the number of tree transactions and list entries to be processed increases with the sliding window. After the transactions in the sliding window are fuzzified, the fuzzification multiplies the number of original transactions by a constant factor, so the memory consumption grows linearly with the window. Excessive memory consumption can be avoided by choosing an appropriate sliding window for the data stream. The authors will continue to study how to dynamically adjust the sliding window for the data stream.
The analysis above shows that as the sliding window grows, the memory consumption of the algorithm increases linearly. Further experiments verify the relationship between the number of transactions and the memory and running-time consumption when the sliding window is fixed. Figures 16 and 17 show the running time and memory against the transaction volume when the sliding window is fixed at 4 and the number of transactions continues to increase.
As can be seen from Figure 16, the running time keeps increasing as the transaction volume grows. This is because more transactions need to be processed while the sliding window remains fixed: the algorithm must update the fading factor for every sliding window, and the transaction fuzzification time also increases. Figure 17 shows that, under the same sliding window, the memory consumption tends to be stable as the transaction volume increases. As the window keeps moving, new data is added to the list and the tree is built. When the sliding window moves across the data stream, the nodes that are not in the current window must be cleared from the list structure, and the resulting input and output operations make the memory consumption fluctuate. However, because the amount of data held in the sliding window is stable in the Java implementation, the memory consumption of the algorithm varies steadily within a specific interval. This is also a benefit of the sliding windows: the memory consumption of algorithms without sliding windows keeps increasing because memory is released relatively slowly.
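The fading-factor update mentioned above can be sketched as a time-decayed support count, where older windows contribute less than the newest one. The decay value 0.9 is an assumed illustration, not the parameter used in the paper.

```python
def weighted_support(counts_per_window, fading=0.9):
    """Combine per-window counts of an item into one time-decayed
    support: the newest window gets weight 1, each older window is
    discounted by an extra factor of 'fading' (illustrative sketch)."""
    total = 0.0
    for age, count in enumerate(reversed(counts_per_window)):
        total += count * (fading ** age)  # age 0 = newest window
    return total

# counts of one item in three consecutive windows, oldest first
s = weighted_support([3, 2, 4])  # 4*1 + 2*0.9 + 3*0.81 ≈ 8.23
```

Recomputing this weight for every window on each slide is the per-window update cost that, together with fuzzification, makes the running time grow with the transaction volume in Figure 16.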
In order to further analyze the performance of the proposed algorithm, Table 1 shows the running-time breakdown of each function of the algorithm when the number of transactions is fixed at 64000. The tested parameters include the proportion of time spent reading the data, fuzzifying the data stream, updating nodes, deleting nodes, and performing the remaining operations.
Figure 18 shows that when the number of transactions is fixed at 64000 and the size of the sliding window increases from 4 to 128, the time proportions of each part of the algorithm are as follows. Figures 18(a)–18(f) show that the time spent reading the data increases from 17.8% to 23%, the fuzzification process decreases from 44.4% to 28%, updating the nodes increases from 1.8% to 3.3%, deleting the nodes increases from 2.2% to 5.7%, and the other operations vary from 27.8% to 46.45%. Figure 19 shows the average time distribution of each part. When selecting the size of the sliding window, Figures 18(a)–18(f) can serve as a reference for choosing an appropriate value.
In the overall operation of the proposed algorithm, the time consumption is mainly concentrated in the data reading and fuzzification processes, because when the window is fixed, updating and deleting the nodes during the construction of the lists and trees account for a relatively small share, no more than 10%.
5.3. Results and Comparisons
In the experiments, the database is tested according to the predefined membership functions and support with different numbers of transactions. As reported in [35], the LFFP algorithm outperforms the Apriori algorithm; thus, it is unnecessary to evaluate the performance of the Apriori algorithm in our experiments. The FP-Growth_itemsets algorithm consumes more execution time than . Because the FP-Growth_itemsets algorithm is not adapted to the data stream, the authors run the experiments at the same number of transactions with the sliding window.
From Figure 20, it can be observed that the execution time of the proposed algorithm is longer than that of the FP-Growth_itemsets algorithm, but within an acceptable range. The first reason is that the fuzzification process of the proposed algorithm, based on type-2 fuzzy set theory, generates a number of fuzzy nodes, including the frequent patterns and the critical frequent patterns shown in Figures 18(a)–18(f). The second reason is that the input/output operations and the fuzzification process consume nearly 50% of the whole running time, as shown by the experimental results in Figure 19. However, the proposed algorithm can provide linguistic frequent patterns over the data streams, so that people can understand the meaning of the results without the help of experts. In addition, the proposed algorithm uses the precision and recall rates to mitigate concept drift during the data stream mining process.
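The precision and recall rates used above can be computed as follows. The pattern sets in the example are hypothetical placeholders; the paper's actual reference patterns come from its experimental setup.

```python
def precision_recall(mined, reference):
    """Precision and recall of a mined pattern set against a
    reference pattern set (standard definitions; sets here are
    illustrative, not the paper's data)."""
    mined, reference = set(mined), set(reference)
    tp = len(mined & reference)                       # true positives
    precision = tp / len(mined) if mined else 1.0     # correct among mined
    recall = tp / len(reference) if reference else 1.0  # found among reference
    return precision, recall

p, r = precision_recall([("a",), ("a", "b")], [("a",), ("b",)])
# one shared pattern -> precision 0.5, recall 0.5
```

Tracking these two rates over successive windows gives a simple signal for when the mined pattern set starts to diverge from the reference, i.e., when concept drift may be occurring.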
It can be observed from Figure 21 that the proposed WSWFFPT2 algorithm generally has lower memory consumption than the FP-Growth_itemsets algorithm at the same support with different numbers of transactions. As the number of transactions increases, the memory consumption of the FP-Growth_itemsets algorithm grows linearly, whereas the proposed algorithm keeps a stable memory consumption at a lower level.
Based on the above observations, it can be concluded that the proposed WSWFFPT2 algorithm, with a single scan and a varied weighted sliding window, can mine MFFIs over the data stream efficiently and with high performance compared with the other related algorithms.
6. Conclusion
In this paper, an efficient fuzzy-list sorted structure and the WSWFFPT2 tree are presented to keep the necessary fuzzy information for mining the MFFIs based on artificial datasets of medical data streams. The weighted sliding window and the construction strategy of the fuzzy frequent pattern tree are designed to adapt to the real-time data stream and to reduce memory consumption while mining the MFFIs. The precision and recall factors are designed to ensure the accuracy of the proposed algorithm. The experimental results and comparative analysis demonstrate that the proposed algorithm outperforms both the classical and the latest algorithms on real-time quantitative medical data streams.
Data Availability
The Wisconsin Prognostic Breast Cancer database (breast.cancer.arff) used to support the findings of this study is released upon application and can be obtained on GitHub at https://github.com/renatopp/arffdatasets/tree/master/classification.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
The research is sponsored by the National Natural Science Foundation of China (No. 61762071, No. 61872196, No. 61872194, and No. 61902196), the Scientific and Technological Support Project of Jiangsu Province (No. BE2019740, No. BK20200753, and No. 20KJB520001), the Major Natural Science Research Projects in Colleges and Universities of Jiangsu Province (No. 18KJA520008), the Six Talent Peaks Project of Jiangsu Province (RJFW111), and the Postgraduate Research and Practice Innovation Program of Jiangsu Province (No. KYCX19_0909, No. KYCX19_0911, No. KYCX20_0759, No. KYCX21_0787, No. KYCX21_0788 and No. KYCX21_0799) and BSYKJ2021ZZ01.