Research Article | Open Access
Decomposition of Data Mining Algorithms into Unified Functional Blocks
The present paper describes a method of creating data mining algorithms from unified functional blocks. The method splits algorithms into independently functioning blocks, which must have unified interfaces and implement pure functions. It allows us to create new data mining algorithms from existing blocks and to improve existing algorithms by optimizing single blocks or the whole structure of an algorithm. This becomes possible due to a number of important properties inherent in pure functions and, hence, in functional blocks.
1. Introduction
At present, many data mining algorithms have been developed for the tasks of classification, clustering, mining association rules, and so forth. We can split them into several groups. Each group is a family of algorithms that are modifications of a basic data mining algorithm, for example, the family of Apriori algorithms: AprioriTID [1], DHP [2], Partition [3], and so forth. Each of them differs from Apriori by some modifications.
The similarity between data mining algorithms is due to the fact that most of them rest on a common theoretical basis and/or hypothesis (e.g., probability theory, statistical methods, metric methods, or neural network theory). Such a theoretical basis is shared by all algorithms belonging to one group. The algorithms within a group differ from each other by various improvements, which quite often depend on the conditions under which they are executed (calculation of a probability density function, parameter fitting, kernel calculation, etc.) and on the type of analyzed data. Modifying blocks, adding new blocks, or restructuring separate parts of known data mining algorithms often allows us to obtain new algorithms of higher quality.
Most implementations of data mining algorithms have a solid (monolithic) structure. In other words, these implementations are not designed for further modification, and therefore modifications are quite difficult to perform. As a result, in order to implement an algorithm that differs from an existing one in only a few blocks, it is necessary to write and debug the code of the whole algorithm. The solid structure also makes it impossible to restructure an algorithm (by adding new blocks or replacing existing ones) in order to find the structure that is optimal for the analyzed data.
The solid structure of the algorithms does not allow us to parallelize them either. In order to execute an algorithm in parallel, it is necessary to modify it completely. The resulting parallel versions differ from the sequential ones by changes to separate blocks or to the structure of the algorithm. As a result, new algorithms are created that can be executed in parallel only under particular conditions.
In the paper, we suggest an approach to data mining algorithm decomposition for dealing with the problems listed above.
2. Related Works
Research in the area of algorithm decomposition and algorithm construction on the basis of separate blocks has been conducted since the beginning of the development of algorithm theory. There are many fundamental works in this area. Most of them are directed at investigating properties of algorithms [4–6]: estimates of execution time, convergence, and so forth.
There are also works aimed at presenting algorithms in the form of separate blocks in order to analyze the possibility of parallel execution. Among them are the presentation of an algorithm in the form of a Petri net [7, 8], the presentation of algorithm structure in a multilevel parallel form [9], and so forth.
All these works are theoretical. They have proved to be effective for theoretical investigation of the algorithms, but they do not allow us to proceed directly from the theoretical description to practical implementation. This often leads to theoretical results differing significantly from practical ones.
At present, functional programming languages are very popular (e.g., Haskell [10], Lisp [11], and others). Such languages allow a program (algorithm) to be presented as a sequence of pure functions. Each of these functions is an independent block of the algorithm, and the blocks can be evaluated in any order. A number of papers on implementing data mining algorithms in functional languages have been published [12, 13]. However, such functions generally do not have unified interfaces; therefore, they cannot be rearranged or replaced arbitrarily. Additionally, functional programming languages have drawbacks of their own, such as lower performance and higher complexity.
In the area of data mining algorithms, there are also investigations aimed at decomposing data mining tasks into simpler subtasks [14]. Two different approaches exist:
(i) Model assemblage: several mining models for the same task are constructed using different algorithms, and these models are then joined together (assembled) into a single model, thus improving each other [15, 16].
(ii) Task decomposition: a classification task is divided into separate subtasks, and each subtask is solved using a separate algorithm; the results for the subtasks are later joined together to solve the original task.
The second approach also allows us to solve the decomposed subtasks in a parallel and distributed way. However, this approach decomposes the task while leaving the algorithms themselves sequential. This makes it possible to modify the solution of the final task by combining subtasks, but it does not allow us to modify the algorithms that create the models.
At present, data mining libraries such as RapidMiner [17, 18], Weka [19, 20], and R [21–23] include only solid implementations of the algorithms (implementations that have not been decomposed into separate interchangeable blocks). These libraries support task decomposition, which can be used for assembling an analysis from different existing blocks. However, the most complicated block, the one that analyzes data and constructs the mining model, is solid. As a result, in order to add new algorithms to the libraries, it is necessary to create new software modules implementing these algorithms.
Decomposition of data mining algorithms into separate blocks is used in the NIMBLE project of IBM [24]. This project is directed at developing an infrastructure for parallel execution of data mining algorithms using tools that implement the MapReduce concept. This concept implies that the algorithm has been decomposed into separate parts (map and reduce), which can be performed in parallel. The main drawbacks of such an approach are the lack of a theoretical basis for algorithm decomposition (which would make it possible to carry out rigorous proofs of their performance in a parallel and distributed environment) and the dependence on the MapReduce concept.
In this paper, we suggest a formal model for presenting a data mining algorithm on the basis of λ-calculus theory [25–27] and its practical implementation in an object-oriented language in accordance with the principles of functional programming languages.
3. Functional Model for Data Mining Algorithms
An algorithm can be created from separate blocks if these blocks have the following features: (1) they are interchangeable; (2) they can be executed in arbitrary order.
The first feature can be implemented by unifying the input and output interfaces of the blocks: all blocks must take the same kind of input arguments and return the same kind of output result.
The second feature can be implemented in the same way as a function in functional programming languages. These languages are based on λ-calculus theory, which includes the Church-Rosser theorem [25, 27]: when applying reduction rules to terms in the lambda calculus, the order in which the reductions are chosen does not make a difference to the eventual result. So, in accordance with the Church-Rosser theorem, λ-functions can be executed in any order (and even in parallel), because a λ-function has the features of a pure function.
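As a small illustration of the theorem, the same term can be reduced outermost-first or innermost-first. The term below is an illustrative choice and is not taken from the cited works; $z$ is a free variable.

```latex
\begin{align*}
\text{outermost first:}\quad
(\lambda x.\, x\, x)\,((\lambda y.\, y)\, z)
  &\to_\beta ((\lambda y.\, y)\, z)\,((\lambda y.\, y)\, z)
   \to_\beta z\,((\lambda y.\, y)\, z)
   \to_\beta z\, z \\
\text{innermost first:}\quad
(\lambda x.\, x\, x)\,((\lambda y.\, y)\, z)
  &\to_\beta (\lambda x.\, x\, x)\, z
   \to_\beta z\, z
\end{align*}
```

The two orders take a different number of steps but arrive at the same normal form $z\, z$. The theorem guarantees this agreement only when a normal form is actually reached.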
We call a block of an algorithm that has a unified interface and implements a pure function a functional block.
We would like to extend the λ-calculus for presenting data mining algorithms while preserving its principles (use of pure functions, absence of program state, etc.).
Furthermore, we will use the notation accepted in functional programming languages (here and in the following we keep to the notation of the Haskell programming language [10]):
(i) $T :: T_1 \to T_2 \to T_3$ for defining the type $T$ of a function with two arguments of types $T_1$ and $T_2$ and a result of type $T_3$.
(ii) $f :: T$ for defining a function with the name $f$.
(iii) $f\ x_1\ x_2 = E(x_1, x_2)$ for defining the implementation of the function with the name $f$, two arguments ($x_1$ of type $T_1$ and $x_2$ of type $T_2$), and a return value of type $T_3$ (here the expression $E$ shows the functional dependence of the result on the two arguments $x_1$ and $x_2$).
(iv) $f\ a_1\ a_2$ for defining the substitution (application of β-reduction) of the arguments $a_1$ and $a_2$ of types $T_1$ and $T_2$, respectively, into the function $f$.
Data mining algorithms perform data processing and construct a mining model. In order to use data mining algorithms in a functional model, we will introduce two new types:
(i) A dataset $d$ of type $D$ is introduced as a sequence of two lists, $d = [A, X]$: the list of attributes $A = [a_1, \ldots, a_p]$ and the list of vectors $X = [x_1, \ldots, x_z]$.
(ii) A mining model $m$ of type $M$ can be presented as a sequence of multitype elements: $m = [e_1, \ldots, e_g]$.
Mining model is based on knowledge extracted from data. This knowledge can be applied to new data for a particular type of analysis (e.g., classification and clustering). Thus, extracted knowledge depends on a function determined by the model. For this reason the structure of the mining model will not be covered in the paper.
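The two types introduced above could be sketched in Java as immutable value classes. This is only an illustration under stated assumptions: the class names `Dataset` and `MiningModel`, the field names, and the element types (attribute names as strings, vectors as `double[]`) are hypothetical, since the paper leaves the structure of the mining model open.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class DataTypesSketch {

    /** Dataset: a list of attributes plus a list of vectors (hypothetical element types). */
    static final class Dataset {
        final List<String> attributes;
        final List<double[]> vectors;

        Dataset(List<String> attributes, List<double[]> vectors) {
            // Defensive, unmodifiable copies: blocks can read the lists but not change them.
            this.attributes = Collections.unmodifiableList(new ArrayList<>(attributes));
            this.vectors = Collections.unmodifiableList(new ArrayList<>(vectors));
        }
    }

    /** Mining model: a sequence of elements of mixed types. */
    static final class MiningModel {
        final List<Object> elements;

        MiningModel(List<Object> elements) {
            this.elements = Collections.unmodifiableList(new ArrayList<>(elements));
        }

        /** Returns a new model with one more element; the receiver is left unchanged. */
        MiningModel with(Object element) {
            List<Object> copy = new ArrayList<>(elements);
            copy.add(element);
            return new MiningModel(copy);
        }
    }
}
```

Returning a new model from `with` instead of mutating the old one is what lets a block behave as a pure function of its arguments.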
A data mining algorithm, which performs processing of data $d$ and constructs a mining model $m$, can be defined by the function type
$\mathrm{DMA} :: D \to M$,
where $D$ is the type of datasets and $M$ is the type of mining models.
The simplest element of an algorithm is a step (a single operation). At each step, a data mining algorithm analyzes the dataset and builds the mining model. The model built at one step is passed to the next step. So each step of a data mining algorithm must take the dataset and the mining model as input arguments, and the result of a step is a new mining model. Accordingly, a step of a data mining algorithm has the unified interface:
(i) input: the dataset and the mining model;
(ii) output: the mining model.
The functions executed at each step of the algorithm, which construct a mining model on the basis of two arguments (the analyzed dataset and a mining model), are called functional blocks in the following. They define a new function type:
$\mathrm{FB} :: D \to M \to M$
A data mining algorithm can be presented as a sequence of calls of functions of this type, $fb_1, \ldots, fb_n$; the first function, $fb_1$, receives no mining model and produces the initial one:
$fb_n\ d\ (fb_{n-1}\ d\ (\ldots (fb_1\ d) \ldots))$
In order to satisfy the unification requirement, we will consider the function $fb_1$ to have the type $\mathrm{FB}$ as well, but its second argument will be an empty value, written here as $\mathrm{nil}$:
$fb_n\ d\ (fb_{n-1}\ d\ (\ldots (fb_1\ d\ \mathrm{nil}) \ldots))$
Thus, all functional blocks (functions of the type $\mathrm{FB}$) have a unified interface. The second necessary property, purity, is provided by their implementation in the form of functions in accordance with the principles of λ-calculus theory. We will show that a data mining algorithm can be constructed on the basis of such functional blocks.
Statement. A data mining algorithm can be presented as an embedded functional expression of the following form:
$dma :: \mathrm{DMA}$
$dma\ d = fb_n\ d\ (fb_{n-1}\ d\ (\ldots (fb_1\ d\ \mathrm{nil}) \ldots))$,
which corresponds to the principles of the λ-calculus. Here $dma$ stands for the name of the data mining algorithm.
Proof. Using λ-expressions, a data mining algorithm can be introduced in the following way:
$\lambda d.\ fb_n\ d\ (fb_{n-1}\ d\ (\ldots (fb_1\ d\ \mathrm{nil}) \ldots))$
Application of β-reduction makes it possible to transform this functional expression into the result in the form of a mining model. For example, an algorithm having 3 functional blocks, applied to a dataset $d$, has the λ-expression in the following form:
$fb_3\ d\ (fb_2\ d\ (fb_1\ d\ \mathrm{nil}))$
Using applicative-order reduction, in which the innermost application is reduced first, we obtain the following expressions:
$fb_3\ d\ (fb_2\ d\ m_1) \to fb_3\ d\ m_2 \to m_3$, where $m_1 = fb_1\ d\ \mathrm{nil}$, $m_2 = fb_2\ d\ m_1$, and $m_3 = fb_3\ d\ m_2$.
As a result, the mining model $m_3$ will be calculated by sequential execution of the blocks $fb_1$, $fb_2$, and $fb_3$.
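The nested application of three blocks used in the proof can be sketched in Java with `java.util.function.BiFunction` as the unified interface (dataset, model) -> model. All concrete choices here are assumptions made for illustration: the dataset is a list of integers, the model is a single `Integer`, `null` plays the role of the empty model, and the three blocks `FB1`, `FB2`, `FB3` are invented.

```java
import java.util.List;
import java.util.function.BiFunction;

final class NestedApplicationSketch {

    // Every block has the same shape: (dataset, model) -> model.
    static final BiFunction<List<Integer>, Integer, Integer> FB1 =
            (d, m) -> d.size();                                          // ignores the empty model
    static final BiFunction<List<Integer>, Integer, Integer> FB2 =
            (d, m) -> m + d.stream().mapToInt(Integer::intValue).sum(); // adds the sum of the data
    static final BiFunction<List<Integer>, Integer, Integer> FB3 =
            (d, m) -> m * 2;                                             // doubles the model

    /** The third block applied to the result of the second, applied to the result of the first. */
    static Integer dma(List<Integer> d) {
        return FB3.apply(d, FB2.apply(d, FB1.apply(d, null)));
    }
}
```

Java evaluates arguments before the call, so the innermost block runs first, which corresponds to the applicative-order reduction in the proof. For the dataset `[1, 2, 3]` the intermediate models are 3, 9, and 18.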
Decomposition splits any algorithm into separate logical blocks, cycles, decisions, and so forth. Additionally, data mining algorithms have special blocks: the cycle for vectors, the cycle for attributes, and others. In order to characterize these elements in the form of functional expressions, we will add embedded functions and show how they can be used to present the enumerated structural elements in functional form.
In order to simplify the representation of the new embedded functions, we will introduce a new function type for calculating a conditional expression on the basis of two arguments (the set of analyzed data $d$ and the mining model $m$) and returning the corresponding Boolean value:
$\mathrm{CF} :: D \to M \to \mathrm{Bool}$
Here $D$ and $M$ denote the types of datasets and mining models, and $\mathrm{FB}$ below denotes the type of functional blocks, $D \to M \to M$.
The cycle and the decision are also steps of a data mining algorithm, but they include sequences of other steps. The decision includes two sequences: one sequence of steps is executed when the condition is true, and the other, optional, sequence is executed when the condition is false. We will show that the conditional operator can be presented in the form of a functional expression, and we will introduce the corresponding function as part of the model.
Statement. The conditional operator of a data mining algorithm can be expressed as a higher-order function:
$decision :: \mathrm{CF} \to \mathrm{FB} \to \mathrm{FB} \to D \to M \to M$
$decision\ cf\ fb_t\ fb_f\ d\ m = \text{if } (cf\ d\ m) \text{ then } (fb_t\ d\ m) \text{ else } (fb_f\ d\ m)$,
where $decision$ stands for the name of the condition function; $cf$ is the function of the type $\mathrm{CF}$ for calculating the conditional expression; $fb_t$ is the function of the type $\mathrm{FB}$, which is executed if the result of $cf$ is true; $fb_f$ is the function of the type $\mathrm{FB}$, which is executed if the result of $cf$ is false.
Proof. In λ-calculus theory, Boolean values and conditional expressions are characterized by the following λ-expressions: $\mathrm{IF} = \lambda p.\lambda t.\lambda e.\ p\ t\ e$, where $p$ is the logical expression, which can be presented as a λ-expression in one of the following forms: $\mathrm{TRUE} = \lambda x.\lambda y.\ x$, $\mathrm{FALSE} = \lambda x.\lambda y.\ y$; $t$ is the expression executed if $p = \mathrm{TRUE}$; $e$ is the expression executed if $p = \mathrm{FALSE}$.
Thus, the function of the conditional operator can be presented by the following λ-expression:
$\lambda cf.\lambda fb_t.\lambda fb_f.\lambda d.\lambda m.\ (cf\ d\ m)\ (fb_t\ d\ m)\ (fb_f\ d\ m)$
It is necessary to remember that, according to the Church-Rosser theorem, the given expression can be executed in a parallel way due to parallel execution of the functions $cf$, $fb_t$, and $fb_f$. However, we need to take into account that, depending on the result of $cf$, the result of one of the functions $fb_t$ or $fb_f$ will not be used.
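The conditional operator described above can be sketched in Java as a higher-order function that takes a condition and two blocks and returns a new block with the same unified shape. The name `decision` and the generic parameters `D` (dataset) and `M` (model) are illustrative assumptions; `BiPredicate` and `BiFunction` come from `java.util.function`.

```java
import java.util.function.BiFunction;
import java.util.function.BiPredicate;

final class DecisionSketch {

    /**
     * Builds a block from a condition and two branch blocks.
     * The result again has the shape (dataset, model) -> model,
     * so it can be used wherever any other block is expected.
     */
    static <D, M> BiFunction<D, M, M> decision(BiPredicate<D, M> cf,
                                               BiFunction<D, M, M> ifTrue,
                                               BiFunction<D, M, M> ifFalse) {
        // The conditional operator evaluates only the selected branch.
        return (d, m) -> cf.test(d, m) ? ifTrue.apply(d, m) : ifFalse.apply(d, m);
    }
}
```

For example, `DecisionSketch.<int[], Integer>decision((d, m) -> m < d.length, (d, m) -> m + 1, (d, m) -> m)` yields a block that increments the model while it is smaller than the dataset length and otherwise returns it unchanged. Unlike the parallel evaluation mentioned in the proof, this sketch never computes the branch whose result would be discarded.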
Functional programming languages do not have cycles, because they have no assignment operators and no program state. Repeated actions can be performed without saving program state by using recursive function application.
A cycle includes a sequence of iteration steps (the iterative sequence). The cycle parameters (the initial value of the iterator, the change of the iterator, the exit condition, and others) are determined by the input arguments: the dataset and the mining model. We will show that a cycle can be presented as a functional expression and introduce the corresponding function to the model.
Statement. The cycle of a data mining algorithm can be presented using a recursive call of a higher-order function:
$loop :: \mathrm{CF} \to \mathrm{FB} \to \mathrm{FB} \to \mathrm{FB} \to D \to M \to M$
$loop\ cf\ fb_{init}\ fb_{pre}\ fb_{iter}\ d\ m = loop'\ cf\ fb_{pre}\ fb_{iter}\ d\ (fb_{init}\ d\ m)$
$loop'\ cf\ fb_{pre}\ fb_{iter}\ d\ m = \text{if } (cf\ d\ m) \text{ then } loop'\ cf\ fb_{pre}\ fb_{iter}\ d\ (fb_{iter}\ d\ (fb_{pre}\ d\ m)) \text{ else } m$,
where $cf$ is the function of the type $\mathrm{CF}$ (a condition on the dataset and the model, $D \to M \to \mathrm{Bool}$) determining the condition of a repeated iteration; $fb_{pre}$ is the function of the type $\mathrm{FB}$ (a functional block, $D \to M \to M$) executed in the cycle prior to execution of the main iteration; $fb_{iter}$ is the function of the type $\mathrm{FB}$ of the main iteration; $fb_{init}$ is the function of the type $\mathrm{FB}$ that initializes the cycle.
Proof. We are going to prove the correctness of the expression above by applying reductions sequentially, for example, for 3 iterations. In the end, we will obtain the expression that corresponds to the logic of cycle execution (see Algorithm 1).
As a result, the expression in the last line will be executed, and one can make sure that the following statements are correct:
(i) Cycle initialization (function $fb_{init}$) will be performed once at the beginning of expression execution.
(ii) The main block of the cycle (function $fb_{iter}$) will be performed three times, and the results of each stage will be transmitted to the next block with preprocessing (function $fb_{pre}$).
Data mining algorithms often have the cycle for vectors and the cycle for attributes. We can fix some blocks as constants for these cycles. For the cycle for vectors,
(i) the initialization function initializes a vector counter with the index of the first vector;
(ii) the conditional function checks whether all the vectors have been processed;
(iii) the preprocessing function changes the vector counter, assigning it the index of the next vector.
Thus, the cycle for vectors can be determined as an embedded higher-order function in which these three functions are fixed and only the main iteration block remains a parameter.
For the cycle for attributes,
(i) the initialization function initializes an attribute counter with the index of the first attribute;
(ii) the conditional function checks whether all the attributes have been processed;
(iii) the preprocessing function changes the attribute counter, assigning it the index of the next attribute.
Thus, the cycle for attributes can be determined in the same way as an embedded higher-order function whose only remaining parameter is the main iteration block.
Next, we consider the implementation of functional blocks, that is, typed blocks with a unified interface that implement pure functions.
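The recursive presentation of a cycle can be sketched in Java as follows. This is a minimal sketch under assumptions: the names `loop` and `loopRec` are illustrative, `D` and `M` stand for the dataset and model types, and the preprocessing block is assumed to run before the main iteration block in every pass, as the description of the statement suggests.

```java
import java.util.function.BiFunction;
import java.util.function.BiPredicate;

final class LoopSketch {

    /** Runs the initialization block once, then hands over to the recursive helper. */
    static <D, M> M loop(BiPredicate<D, M> cf, BiFunction<D, M, M> init,
                         BiFunction<D, M, M> pre, BiFunction<D, M, M> iter, D d, M m) {
        return loopRec(cf, pre, iter, d, init.apply(d, m));
    }

    /** While the condition holds, applies pre and then iter and recurses; otherwise returns the model. */
    private static <D, M> M loopRec(BiPredicate<D, M> cf, BiFunction<D, M, M> pre,
                                    BiFunction<D, M, M> iter, D d, M m) {
        if (!cf.test(d, m)) {
            return m;
        }
        // No variable is reassigned: the new model is passed on as an argument.
        return loopRec(cf, pre, iter, d, iter.apply(d, pre.apply(d, m)));
    }
}
```

With a `String` model, `init` returning `"I"`, `pre` appending `"p"`, `iter` appending `"b"`, and the condition `m.length() < 7`, the call returns `"Ipbpbpb"`: one initialization followed by three passes. Java does not eliminate tail calls, so for long cycles an equivalent `while` loop inside the method would avoid deep recursion without changing the result.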
4. Implementation of Functional Blocks
We implemented all the functional blocks described above as classes of the object-oriented language Java. Figure 1 shows the class diagram of the blocks (units) of a data mining algorithm. Here a simple step (functional block) of the data mining algorithm is described by the class Step.
The implementation of a step of the algorithm is contained in the method execute. Calling this method executes the step. The method returns a mining model and has the following input arguments:
(i) the mining model being constructed: model;
(ii) the dataset: inputData.
The method execute() is the basic method of a step; therefore, it must be implemented as a pure function. For this purpose, it works only with its input arguments (model and inputData) and does not refer to any other variables.
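What a pure execute method might look like can be sketched as follows. The class name `Step` and the argument names `model` and `inputData` follow the text; everything else is an assumption made for illustration: the model is reduced to an immutable counter, the dataset to an `int[][]`, and `CountVectorsStep` is an invented step.

```java
final class StepSketch {

    /** Hypothetical, simplified mining model: just a counter, immutable by construction. */
    static final class Model {
        final int count;

        Model(int count) {
            this.count = count;
        }
    }

    /** Simple step of an algorithm: execute(model, inputData) returns a mining model. */
    abstract static class Step {
        abstract Model execute(Model model, int[][] inputData);
    }

    /** Hypothetical step that adds the number of vectors in the dataset to the model. */
    static final class CountVectorsStep extends Step {
        @Override
        Model execute(Model model, int[][] inputData) {
            // Reads only its two arguments and returns a new object: no fields, no side effects.
            return new Model(model.count + inputData.length);
        }
    }
}
```

Because the step neither changes its arguments nor touches any other state, calling it twice with the same arguments gives the same result, which is the property the composition of blocks relies on.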
For a data mining algorithm, it is typical to process data vector by vector and also by the values of each attribute. Since the decision and the cycle are steps of the algorithm, the classes corresponding to them are inherited from the class Step. The class DecisionStep corresponds to the decision and the class CyclicStep to the cycle. Both of these classes contain the necessary links to the dataset (inputData) and the mining model (model).
In addition, the class DecisionStep contains
(i) the sequence of steps executed when the condition is true: trueBranch;
(ii) the sequence of steps executed when the condition is false: falseBranch.
The condition itself is defined in the method condition.
The class CyclicStep, in addition to the methods and attributes of the class Step, defines the sequence of steps that make up one iteration of the cycle: iteration. Initialization of the cycle is defined in the method initLoop. The condition of cycle (loop) termination is implemented in the method conditionLoop. In addition, the class CyclicStep defines the methods beforeIteration, for preprocessing before each iteration, and afterIteration, for postprocessing after each iteration.
The class VectorsCycleStep implements the cycle for vectors. It implements the necessary methods of the class CyclicStep, providing selection of the vectors of a dataset. The processing of each vector is determined by the sequence of steps added to iteration.
Similarly, the class AttributesCycleStep is implemented for processing the values of attributes.
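How the methods of CyclicStep could fit together is sketched below. The class and method names follow the text, but the bodies are assumptions: the model is reduced to an index plus a running total, `conditionLoop` is assumed to return true while another iteration is required, the vector counter is advanced in `afterIteration`, and `SumFirstAttributeStep` is an invented iteration step.

```java
final class CyclicStepSketch {

    /** Simplified stand-in for the mining model: current vector index plus a running total. */
    static final class Model {
        final int index;
        final int total;

        Model(int index, int total) {
            this.index = index;
            this.total = total;
        }
    }

    abstract static class Step {
        abstract Model execute(Model model, int[][] inputData);
    }

    abstract static class CyclicStep extends Step {
        private final Step iteration;

        CyclicStep(Step iteration) {
            this.iteration = iteration;
        }

        abstract Model initLoop(Model model, int[][] inputData);

        /** Assumption: returns true while another iteration is required. */
        abstract boolean conditionLoop(Model model, int[][] inputData);

        Model beforeIteration(Model model, int[][] inputData) { return model; }

        Model afterIteration(Model model, int[][] inputData) { return model; }

        @Override
        final Model execute(Model model, int[][] inputData) {
            Model current = initLoop(model, inputData);
            while (conditionLoop(current, inputData)) {
                current = beforeIteration(current, inputData);
                current = iteration.execute(current, inputData);
                current = afterIteration(current, inputData);
            }
            return current;
        }
    }

    /** Cycle for vectors: fixes initialization, condition, and counter update. */
    static final class VectorsCycleStep extends CyclicStep {
        VectorsCycleStep(Step iteration) { super(iteration); }

        @Override Model initLoop(Model m, int[][] d) { return new Model(0, m.total); }

        @Override boolean conditionLoop(Model m, int[][] d) { return m.index < d.length; }

        @Override Model afterIteration(Model m, int[][] d) { return new Model(m.index + 1, m.total); }
    }

    /** Hypothetical iteration step: adds the first value of the current vector to the total. */
    static final class SumFirstAttributeStep extends Step {
        @Override Model execute(Model m, int[][] d) {
            return new Model(m.index, m.total + d[m.index][0]);
        }
    }
}
```

The `while` loop uses a local variable, but `execute` still depends only on its arguments and returns a new model, so the step as a whole remains pure from the caller's point of view.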
The classes DecisionStep and CyclicStep use the class StepSequence to define sequences of steps. A sequence of steps is itself a step of the algorithm; therefore, the class StepSequence is inherited from the class Step. An object of this class contains a set of objects of the class Step, which correspond to steps of the algorithm and are executed sequentially one after another. The method addStep adds a step to the sequence.
The class MiningAlgorithm is defined to form the target algorithm. It contains the sequence of all steps of the algorithm, steps, and also the following methods:
(i) initSteps: initializes the steps of the algorithm;
(ii) runAlgorithm: launches the algorithm;
(iii) buildModel: builds the model.
In essence, the method initSteps forms the structure of the algorithm by creating the steps and determining the sequence and nesting of their execution.
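The role of initSteps can be illustrated with a small sketch in which two algorithm classes share blocks and differ only in how they assemble them. The method names follow the text; the rest is assumed for illustration: `Step` is written as an interface so that lambdas can be used, the model is a list of strings recording which blocks ran, the block names are invented, and `buildModel` is assumed to form the structure and then launch the algorithm.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class MiningAlgorithmSketch {

    /** Unified step interface; the model is simplified to a list of strings. */
    interface Step {
        List<String> execute(List<String> model, List<String> inputData);
    }

    /** Hypothetical block factory: each block records its name in a copy of the model. */
    static Step block(String name) {
        return (model, inputData) -> {
            List<String> next = new ArrayList<>(model);
            next.add(name);
            return next;
        };
    }

    abstract static class MiningAlgorithm {
        protected final List<Step> steps = new ArrayList<>();

        /** Subclasses define the structure of the algorithm here. */
        protected abstract void initSteps();

        /** Runs the steps in order, starting from an empty model. */
        List<String> runAlgorithm(List<String> inputData) {
            List<String> model = Collections.emptyList();
            for (Step step : steps) {
                model = step.execute(model, inputData);
            }
            return model;
        }

        /** Assumed relationship: form the structure, then launch the algorithm. */
        List<String> buildModel(List<String> inputData) {
            steps.clear();
            initSteps();
            return runAlgorithm(inputData);
        }
    }

    static class BaseAlgorithm extends MiningAlgorithm {
        @Override protected void initSteps() {
            steps.add(block("generateCandidates"));
            steps.add(block("countSupport"));
        }
    }

    /** A variant that reuses the base structure and adds one block in front of it. */
    static final class VariantAlgorithm extends BaseAlgorithm {
        @Override protected void initSteps() {
            steps.add(block("partitionData"));
            super.initSteps();
        }
    }
}
```

In this arrangement a new variant changes only the algorithm class, while the blocks themselves stay untouched and already tested.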
To make parallel execution of parts of an algorithm possible, the class StepSequence implements the interface java.lang.Runnable, which means that a sequence of steps can be launched in a separate thread. Which parts of the algorithm are placed in separate sequences (and, consequently, can later be executed in separate threads) is defined in the method initSteps, when the algorithm and the corresponding objects of the class StepSequence are formed.
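A sequence of steps that can be launched in its own thread might look as follows. The names `StepSequence`, `addStep`, and `execute` follow the text; the rest is an assumption for illustration: the model is an `int`, the dataset is an `int[]`, the inputs used by `run()` are bound in the constructor, the result is read through a hypothetical `result()` accessor, and `SumStep` is an invented pure step.

```java
import java.util.ArrayList;
import java.util.List;

final class ParallelSequenceSketch {

    abstract static class Step {
        abstract int execute(int model, int[] inputData);
    }

    /** Pure step: adds the sum of the input values to the model. */
    static final class SumStep extends Step {
        @Override
        int execute(int model, int[] inputData) {
            int sum = model;
            for (int value : inputData) {
                sum += value;
            }
            return sum;
        }
    }

    /** A sequence of steps that is itself a step and can also run in its own thread. */
    static final class StepSequence extends Step implements Runnable {
        private final List<Step> steps = new ArrayList<>();
        private final int initialModel;   // assumption: inputs for run() are bound at construction
        private final int[] inputData;
        private int result;

        StepSequence(int initialModel, int[] inputData) {
            this.initialModel = initialModel;
            this.inputData = inputData;
        }

        StepSequence addStep(Step step) {
            steps.add(step);
            return this;
        }

        @Override
        int execute(int model, int[] data) {
            int current = model;
            for (Step step : steps) {
                current = step.execute(current, data);
            }
            return current;
        }

        @Override
        public void run() {
            result = execute(initialModel, inputData);
        }

        int result() {
            return result;
        }
    }

    /** Processes two parts of a dataset in two threads and combines the partial models. */
    static int sumInTwoThreads(int[] left, int[] right) {
        StepSequence a = new StepSequence(0, left).addStep(new SumStep());
        StepSequence b = new StepSequence(0, right).addStep(new SumStep());
        Thread ta = new Thread(a);
        Thread tb = new Thread(b);
        ta.start();
        tb.start();
        try {
            ta.join();   // after join() the result written by run() is visible to this thread
            tb.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for step sequences", e);
        }
        return a.result() + b.result();
    }
}
```

Since the steps are pure and each sequence works on its own part of the data, the two threads share no mutable state, and the only synchronization needed is waiting for both to finish.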
5. Examples of Algorithm Building
Now we consider applying the suggested method to the family of Apriori algorithms. In order to illustrate the correctness and efficiency of the suggested approach, we will implement the algorithms of one group on the basis of the functional blocks described earlier. We will show that decomposing algorithms into functional blocks and implementing these blocks using the suggested approach allows us to obtain new modifications of algorithms by minor changes in program code.
The Apriori algorithm [1] was described in 1994 by Agrawal and Srikant. In 1994 and 1995, modifications of this algorithm were suggested:
(i) AprioriTID: the feature of this algorithm is that the database is not used for counting support after the first pass [1].
(ii) DHP: uses a hash table to reduce the dataset that is handled after the first iteration [2].
(iii) Partition: splits the dataset into parts small enough to fit in main memory [3].
We implemented the functional blocks of each of these algorithms as Java classes. Figure 2 shows the class diagram of all functional blocks. In the lower part of the diagram, there are the algorithm classes (successors of the class MiningAlgorithm) containing the sequences of functional block calls.
The first algorithm to be implemented was AprioriTID, together with all the necessary functional blocks.
The next algorithm to be implemented was DHP. As can be seen from its description, this algorithm is a modification of AprioriTID that uses hash tables for storing frequent sets (modifications are highlighted in bold). Its implementation required us to add three new functional blocks and the corresponding block calls, which have been included in the algorithm class (a successor of MiningAlgorithm).
We have also implemented the Partition algorithm, which is another modification of the AprioriTID algorithm and divides the whole dataset into parts for more rapid processing. Three new blocks were created for its implementation, and the corresponding changes have been made in the algorithm class.
Using existing functional blocks, we can easily create a new data mining algorithm that has the features of AprioriTID, DHP, and Partition. Algorithm 4 shows this algorithm. It was not necessary to implement any new functional blocks; the corresponding changes have been made only to the algorithm class (a successor of the class MiningAlgorithm).
Table 1 shows the changes made in the code of the described algorithms. Changes are measured in effective lines of code (the ELOC metric). All the algorithms were implemented in Java. The implemented algorithms were tested on similar test datasets. The results obtained in testing coincide with the reference ones, which confirms that the implementations are correct when the algorithms are sets of functional blocks.
Thus, relative to the implementation of the AprioriTID algorithm, the software implementation required the following:
(i) DHP required us to change or add 22% of the code.
(ii) Partition required us to change or add 17% of the code.
(iii) The combined algorithm required us to change or add 2% of the code.
This is significantly less than writing the algorithms from scratch. Furthermore, the algorithms are constructed from functional blocks that have already been debugged, which reduces the time and effort for debugging the algorithms.
6. Conclusion
The present paper discusses a model for representing data mining algorithms in a functional way. This model is an extension of the λ-calculus and retains its principles. The model includes new types for defining datasets and mining models, and also a number of functions. Furthermore, new functions have been added in accordance with the main elements of data mining algorithms.
As a result, the functional model of a data mining algorithm presents the algorithm as a set of embedded unified pure functions (functional blocks). This makes it possible to modify algorithms due to block switching and block substitution. It also helps reduce the time and effort needed for data mining algorithm modification and creation of new algorithms.
The paper also contains a description of the software implementation of the functional model as a set of Java classes. These classes were used for implementing algorithms from the Apriori family: AprioriTID, DHP, and Partition. These algorithms were implemented by adding several classes and changing several lines. Furthermore, combining existing functional blocks allowed us to obtain an algorithm that has all the properties of the algorithms listed above.
Further research will be conducted in the area of algorithm parallelization. We plan to explore possibilities of data mining algorithm parallelization when the algorithms consist of functional blocks. Parallelization will be based on both data and tasks. Solution of this task will allow us to dramatically reduce time and effort for converting a sequential algorithm into a parallel one.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The paper has been prepared within the scope of the state project “Organization of scientific research” of the main part of the state plan of the Board of Education of Russia as well as the project part of the state plan of the Board of Education of Russia (Task no. 2.136.2014/K).
References
[1] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules in large databases,” in Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), pp. 487–499, Santiago, Chile, 1994.
[2] J. S. Park, M.-S. Chen, and P. S. Yu, “Using a hash-based method with transaction trimming for mining association rules,” IEEE Transactions on Knowledge and Data Engineering, vol. 9, no. 5, pp. 813–825, 1997.
[3] A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large databases,” in Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95), pp. 432–444, Zurich, Switzerland, 1995.
[4] D. Knuth, The Art of Computer Programming, vol. 1-4A, Addison-Wesley Professional, 2011.
[5] T. H. Cormen, C. E. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, MIT Press, Cambridge, Mass, USA, 3rd edition, 2009.
[6] G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” in Proceedings of the Spring Joint Computer Conference (AFIPS '67), vol. 30, pp. 483–485, Thompson Books, Washington, DC, USA, April 1967.
[7] J. L. Peterson, Petri Net Theory and the Modeling of Systems, Prentice-Hall, Englewood Cliffs, NJ, USA, 1981.
[8] T. Murata, “Petri nets: properties, analysis and applications,” Proceedings of the IEEE, vol. 77, no. 4, pp. 541–580, 1989.
[9] Vl. V. Voevodin and V. V. Voevodin, “Analytical methods and software tools for enhancing scalability of parallel applications,” in Proceedings of the Intelligent Conference on HiPer, pp. 489–493, 1999.
[10] Haskell programming language, http://www.haskell.org/.
[11] Common Lisp Language, https://common-lisp.net/.
[12] N. Kerdprasop and K. Kerdprasop, “Mining frequent patterns with functional programming,” International Journal of Computer, Information, Systems and Control Engineering, vol. 1, no. 1, pp. 120–125, 2007.
[13] L. Allison, “Models for machine learning and data mining in functional programming,” Journal of Functional Programming, vol. 15, no. 1, pp. 15–32, 2005.
[14] O. Maimon and L. Rokach, Decomposition Methodology for Knowledge Discovery and Data Mining: Theory and Applications, Machine Perception and Artificial Intelligence, World Scientific, 2005.
[15] T. G. Dietterich, “Ensemble methods in machine learning,” in Proceedings of the 1st International Workshop on Multiple Classifier Systems, pp. 1–15, Springer, Berlin, Germany, 2000.
[16] L. Todorovski and S. Džeroski, “Combining multiple models with meta decision trees,” in Principles of Data Mining and Knowledge Discovery, vol. 1910 of Lecture Notes in Computer Science, pp. 54–64, Springer, Berlin, Germany, 2000.
[17] RapidMiner, http://rapidminer.com/.
[18] P. M. Gonçalves Jr., R. S. M. Barros, and D. C. L. Vieira, “On the use of data mining tools for data preparation in classification problems,” in Proceedings of the IEEE/ACIS 11th International Conference on Computer and Information Science (ICIS '12), pp. 173–178, IEEE, Shanghai, China, June 2012.
[19] Waikato Environment for Knowledge Analysis (Weka), http://www.cs.waikato.ac.nz/ml/weka/.
[20] I. H. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, and S. J. Cunningham, “Weka: practical machine learning tools and techniques with java implementations,” in Proceedings of the Workshop on Emerging Knowledge Engineering and Connectionist-Based Information Systems (ICONIP/ANZIIS/ANNES '99), pp. 192–196, 1999.
[21] R language, http://www.r-project.org/.
[22] S. Tippmann, “Programming tools: adventures with R,” Nature, vol. 517, no. 7532, pp. 109–110, 2014.
[23] F. Morandat, B. Hill, L. Osvald, and J. Vitek, “Evaluating the design of the R language: objects and functions for data analysis,” in Proceedings of the 26th European Conference on Object-Oriented Programming (ECOOP '12), J. Noble, Ed., pp. 104–131, Springer, Beijing, China, June 2012.
[24] A. Ghoting, P. Kambadur, E. Pednault, and R. Kannan, “NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on MapReduce,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pp. 334–342, San Diego, Calif, USA, August 2011.
[25] H. P. Barendregt, The Lambda Calculus: Its Syntax and Semantics, vol. 103 of Studies in Logic and the Foundations of Mathematics, North-Holland, 1985.
[26] A. Church and J. B. Rosser, “Some properties of conversion,” Transactions of the American Mathematical Society, vol. 39, no. 3, pp. 472–482, 1936.
[27] F. Joachimski, “Confluence of the coinductive λ-calculus,” Theoretical Computer Science, vol. 311, no. 1–3, pp. 105–119, 2004.
[28] Common Warehouse Metamodel (CWM) Specification, OMG, Version 1.1, 2003, http://www.omg.org/cwm/.
Copyright © 2016 Ivan Kholod et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.