An Extension of Totohasina’s Normalization Theory of Quality Measures of Association Rules
In binary data mining, Totohasina’s normalization theory offers a unifying view of probabilistic quality measures of association rules. Because it relies primarily on affine homeomorphisms, however, it has drawbacks: it cannot normalize certain interestingness measures, which are identified below. This paper extends the theory with a new normalization method based on proper homographic homeomorphisms, which appears to be the most consistent choice.
In implicative statistical analysis (ISA), one cannot set aside the probabilistic quality measures that assess the strength of the implicative link between the two patterns of an association rule. Given the very large number of such measures in the literature on binary databases, researchers have been working in parallel [2–7] to find more general relationships that allow these measures of interest to be classified, partially or entirely. This led to the well-founded concept of “normalization under five constraints” of quality measures in data mining, which appeared in 2003. This definition of normalized measures is now established, and it turns out that all measures normalizable by affine homeomorphism become comparable [2, 8]. Although that author has treated the subject thoroughly, the problem of normalizing probabilistic quality measures is not fully resolved. Indeed, the affine homeomorphism he used has a weakness: it cannot normalize a measure that tends to infinity in at least one of the more or less intuitive reference situations, namely incompatibility, statistical independence, and logical implication [1, 9–12]. Examples are the measures Cost Multiplying, Sebag, Conviction, Odd-Ratio, Informal Gain, and Ratio of Example counter-example. As the author himself announced (paragraph 3.2, page 65), “the problem remains open concerning the transformation allowing normalization of other measures in a way still to be specified”. This article addresses that issue and proposes a partial solution, followed by a discussion. First, let us recall our definition of a normalized quality measure, which is built on three intuitive reference situations: incompatibility, stochastic independence, and logical implication.
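Definition 1 (cf. [6, 13–15]). Let $(T, I, R)$ be a binary data mining context, where $T$ is the set of all transactions, $I$ the set of attributes (called items or patterns), and $R$ the binary relation from $T$ to $I$. Let $P$ be the uniform discrete probability on the probabilizable space $(T, \mathcal{P}(T))$, $\mu$ a probabilistic quality measure, and $X \to Y$ an association rule, where $X$ and $Y$ are itemsets with $X \cap Y = \emptyset$. For an itemset $X$, let $X'$ denote its extension, i.e., the set of all transactions that contain the pattern $X$; $X'$ is hence an event of the probability space $(T, \mathcal{P}(T), P)$. A quality measure $\mu$ of an association rule is said to be normalized if it satisfies the five conditions below:
(i) $\mu(X \to Y) = -1$ if $P(X' \cap Y') = 0$; i.e., $X'$ and $Y'$ are two incompatible events, which means that the two patterns $X$ and $Y$ are incompatible;
(ii) $-1 < \mu(X \to Y) < 0$ if $0 < P(X' \cap Y') < P(X')P(Y')$; i.e., $X'$ and $Y'$ are two negatively dependent events, and so the patterns $X$ and $Y$ are negatively dependent;
(iii) $\mu(X \to Y) = 0$ if $P(X' \cap Y') = P(X')P(Y')$; i.e., the two events $X'$ and $Y'$ are independent, which means that $X$ and $Y$ are independent itemsets;
(iv) $0 < \mu(X \to Y) < 1$ if $P(X')P(Y') < P(X' \cap Y') < P(X')$; i.e., $X'$ and $Y'$ are positively dependent, and so the two itemsets $X$ and $Y$ are positively dependent;
(v) $\mu(X \to Y) = 1$ if $P(Y' \mid X') = 1$, i.e., $X' \subseteq Y'$: the extension of $X$ is then completely included in that of $Y$, so every transaction that contains the pattern $X$ also contains the pattern $Y$.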
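Example 2. According to the definition above, it is easy to show that the measure $M_{GK}$ defined by
$$M_{GK}(X \to Y) = \begin{cases} M_{GK}^{f}(X \to Y), & \text{if } X \text{ favors } Y,\\ M_{GK}^{d}(X \to Y), & \text{if } X \text{ disfavors } Y,\end{cases}$$
where $M_{GK}^{f}(X \to Y) = \dfrac{P(Y' \mid X') - P(Y')}{1 - P(Y')}$ if $X$ favors $Y$, and $M_{GK}^{d}(X \to Y) = \dfrac{P(Y' \mid X') - P(Y')}{P(Y')}$ if $X$ disfavors $Y$ (with $X'$ and $Y'$ the extensions of $X$ and $Y$), is normalized.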
Note that this measure was discovered independently by three groups of authors on three continents: by S. Guillaume in France, in her thesis; by A. Totohasina in Madagascar, during his research on normalization, who at the time named it ION (“Implication Oriented Normalised”); and by X. Wu, C. Zhang, and S. Zhang in the USA, under the name CPIR (“Conditional Probability Incrementation Ratio”). Its rich mathematical properties are studied in [6, 15]. It is also historically interesting that this measure was partially discovered as the Certainty Factor (CF) by Edward H. Shortliffe and Bruce G. Buchanan in the USA in 1975. Fernando Berzal et al. established some statistical properties of CF and its relation to common interestingness measures such as Confidence and Conviction.
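As an illustration, the following minimal Python sketch evaluates the measure of Example 2 on a toy context. It assumes the usual two-branch form of the measure (excess of the conditional probability over $P(Y')$, divided by $1 - P(Y')$ when $X$ favors $Y$ and by $P(Y')$ when $X$ disfavors $Y$). The transaction table `TOY` and the function names are hypothetical, and the sketch assumes $P(X') > 0$ and $0 < P(Y') < 1$.

```python
from fractions import Fraction

# Hypothetical toy context: transaction id -> set of items it contains.
TOY = {
    1: {"a", "b", "d"},
    2: {"a", "b"},
    3: {"b", "c", "d"},
    4: {"c"},
}

def extension(transactions, itemset):
    """Ids of the transactions that contain every item of `itemset`."""
    return {tid for tid, items in transactions.items() if itemset <= items}

def m_gk(transactions, x, y):
    """Two-branch normalized measure under the uniform probability.

    Assumes P(X') > 0 and 0 < P(Y') < 1 (no guard against division by zero).
    """
    n = len(transactions)
    ext_x = extension(transactions, x)
    ext_y = extension(transactions, y)
    p_y = Fraction(len(ext_y), n)
    p_y_given_x = Fraction(len(ext_x & ext_y), len(ext_x))
    if p_y_given_x >= p_y:
        # X favors Y (independence falls here and gives 0).
        return (p_y_given_x - p_y) / (1 - p_y)
    # X disfavors Y.
    return (p_y_given_x - p_y) / p_y

print(m_gk(TOY, {"a"}, {"b"}))  # logical implication
print(m_gk(TOY, {"a"}, {"d"}))  # independence
print(m_gk(TOY, {"a"}, {"c"}))  # incompatibility
print(m_gk(TOY, {"b"}, {"c"}))  # negative dependence
```

On this toy table the three reference situations give the values 1, 0, and -1 respectively, which is what the five conditions of a normalized measure require.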
This paper is divided into five sections. Section 2 recalls some properties of homographic homeomorphisms, the main tool of our contribution. Section 3 proposes a new normalization process based on homographies in order to solve the aforementioned problem. Section 4 gives the resulting normalized form of each measure considered. Section 5 concludes.
2. About Homographic Function Processing Tool
2.1. Definition and Reminders
In mathematics, a homographic function is a function that can be written as the quotient of two affine functions. It is bijective, and its inverse is again a homographic function. In the commutative field $\mathbb{R}$, a homographic function $f$ is a function from $\mathbb{R}$ to itself (defined wherever the denominator is nonzero) given by $f(x) = \dfrac{ax + b}{cx + d}$, where $a$, $b$, $c$, and $d$ are real numbers such that $ad - bc \neq 0$. The quantity $ad - bc$ is required to be nonzero in order to exclude constant functions. Sometimes the condition “$c$ not zero” is added, since the case $c = 0$ corresponds to affine functions, but then we lose the group structure of the set of homographic functions under composition.
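We will retain that a homographic function is a homeomorphism of the form $f(x) = \dfrac{ax + b}{cx + d}$, where $ad - bc \neq 0$. For $c \neq 0$, this function determines a bijection from $\mathbb{R} \setminus \{-d/c\}$ to $\mathbb{R} \setminus \{a/c\}$ whose inverse bijection is $f^{-1}(y) = \dfrac{dy - b}{-cy + a}$, which has the same determinant $ad - bc$. Note that $f^{-1}$ is a homography of the same type as $f$, and that the graphs of $f$ and $f^{-1}$ are hyperbolas.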
If we extend $f$ by setting $f(-d/c) = \infty$ and $f(\infty) = a/c$, we obtain a projective map of $\overline{\mathbb{R}} = \mathbb{R} \cup \{\infty\}$ onto itself.
Derivative and Variation. In the real case, the derivative of the homography $f(x) = \dfrac{ax + b}{cx + d}$ is $f'(x) = \dfrac{ad - bc}{(cx + d)^2}$, where $ad - bc$ is the determinant of the matrix $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$; it is therefore called the determinant of the homography and denoted $\det f$. The variations of the homographic function follow: if $\det f$ is positive, then $f$ is strictly increasing on each of its two intervals of definition; if $\det f$ is negative, then $f$ is strictly decreasing on each of them.
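The properties recalled in this subsection can be checked numerically. The sketch below assumes the standard form $f(x) = (ax + b)/(cx + d)$ with $ad - bc \neq 0$; the class name `Homography` and its methods are illustrative only and are not notation from the paper.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class Homography:
    """f(x) = (a*x + b) / (c*x + d) with a nonzero determinant a*d - b*c."""
    a: Fraction
    b: Fraction
    c: Fraction
    d: Fraction

    def __post_init__(self):
        if self.det() == 0:
            raise ValueError("ad - bc must be nonzero (constant function otherwise)")

    def det(self):
        return self.a * self.d - self.b * self.c

    def __call__(self, x):
        # Undefined at the pole x = -d/c (raises ZeroDivisionError there).
        return (self.a * x + self.b) / (self.c * x + self.d)

    def derivative(self, x):
        # f'(x) = (ad - bc) / (cx + d)^2: its sign is the sign of the determinant.
        return self.det() / (self.c * x + self.d) ** 2

    def inverse(self):
        # f^{-1}(y) = (d*y - b) / (-c*y + a), which has the same determinant.
        return Homography(self.d, -self.b, -self.c, self.a)

f = Homography(Fraction(2), Fraction(1), Fraction(1), Fraction(3))
g = f.inverse()
print(f.det(), g.det())                # equal determinants
print(g(f(Fraction(1))))               # round trip gives back 1
print(f(Fraction(0)) < f(Fraction(1)))     # increasing to the right of the pole -3
print(f(Fraction(-5)) < f(Fraction(-4)))   # increasing to the left of the pole -3
```

Exact rational arithmetic (`Fraction`) is used here only so that the equalities hold exactly; floats would work as well, up to rounding.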
2.2. Canonical Form
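Let $f$ be a homography such that $f(x) = \dfrac{ax + b}{cx + d}$, with $ad - bc \neq 0$.
In case $c$ is not zero, the canonical form (also called reduced form) of a homographic function is $f(x) = \beta + \dfrac{\lambda}{x - \alpha}$, where $\alpha = -\dfrac{d}{c}$, $\beta = \dfrac{a}{c}$, and $\lambda = \dfrac{bc - ad}{c^2}$. By changing the frame of reference so that the point $S$ of coordinates $(\alpha, \beta)$ becomes the new origin, the expression of the homographic function becomes $Y = \dfrac{\lambda}{X}$, which is the inverse function multiplied by the scalar $\lambda$.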
Upshot. Any proper homographic function with nonzero determinant can thus be reduced to a homographic function of the type $x \mapsto \beta + \dfrac{\lambda}{x - \alpha}$, with $\lambda \neq 0$.
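The reduction to canonical form can be verified on a concrete homography. The sketch below assumes the standard decomposition $\frac{ax + b}{cx + d} = \beta + \frac{\lambda}{x - \alpha}$ with $\alpha = -d/c$, $\beta = a/c$, and $\lambda = (bc - ad)/c^2$; the function names are hypothetical.

```python
from fractions import Fraction

def canonical_form(a, b, c, d):
    """Return (alpha, beta, lam) such that (ax+b)/(cx+d) = beta + lam/(x - alpha).

    Requires c != 0 (proper homography) and ad - bc != 0 (nonconstant).
    """
    if c == 0:
        raise ValueError("c must be nonzero: affine maps have no canonical form")
    if a * d - b * c == 0:
        raise ValueError("ad - bc must be nonzero")
    alpha = Fraction(-d, c)           # pole of f (vertical asymptote)
    beta = Fraction(a, c)             # value at infinity (horizontal asymptote)
    lam = Fraction(b * c - a * d, c * c)
    return alpha, beta, lam

def homography(a, b, c, d, x):
    """Direct evaluation of (ax+b)/(cx+d)."""
    return Fraction(a * x + b, c * x + d)

alpha, beta, lam = canonical_form(2, 1, 1, 3)
print(alpha, beta, lam)
for x in (0, 1, 5, -7):
    # Both expressions agree away from the pole x = alpha.
    assert homography(2, 1, 1, 3, x) == beta + lam / (x - alpha)
```

For $f(x) = (2x + 1)/(x + 3)$ this gives the centre $S = (-3, 2)$ and $\lambda = -5$, i.e., $f(x) = 2 - 5/(x + 3)$.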
From now on, we are interested in homographies of the type $f(x) = \dfrac{ax + b}{cx + d}$, with $ad - bc \neq 0$ and $c \neq 0$: such a proper homography sends infinity to a finite real number. Since our main objective is to “make the infinite finite”, we consider the measures that take an infinite value in at least one of the three reference situations (logical implication, stochastic independence, and incompatibility) and apply the homographic function above to them. For any situation that does not lead to infinity, it remains appropriate to use Totohasina’s theory, which is based on an affine homeomorphism. Since affine maps belong to the larger family of homographies, where they appear as degenerate homographies sending infinity to infinity, we work with the whole family of nonconstant homographies. The present theory thus combines Totohasina’s affine theory with the one just proposed, and is a natural extension of it.
For convenience, let us denote by $\mu$ the value $\mu(X \to Y)$ of the probabilistic quality measure, by $\mu_{imp}$ the value of $\mu$ at logical implication, by $\mu_{ind}$ that of $\mu$ at independence, and by $\mu_{inc}$ the value of $\mu$ at incompatibility.
3. Normalization Process by Homography
Let be the homography normalization of quality measure , the semihomographic normalized of , at right semihomographic normalized of , and at left semihomographic normalized of .
As announced in Totohasina’s theory, the main objective of normalizing a quality measure $\mu$ is to bring its values into $[-1, 1]$, under the three conditions that the normalized measure takes the value $-1$ at incompatibility, $0$ at independence, and $1$ at logical implication, so that two normalizable measures can be compared. Recall that if the three reference values of $\mu$ are finite and pairwise distinct, Totohasina’s approach already solves the normalization problem for probabilistic quality measures by using the following expression for the normalized measure $\mu_n$ of $\mu$:
$$\mu_n(X \to Y) = \begin{cases} x_f\,\mu(X \to Y) + y_f, & \text{if } X \text{ favors } Y,\\ x_d\,\mu(X \to Y) + y_d, & \text{if } X \text{ disfavors } Y,\end{cases}$$
where the four coefficients $x_f$, $y_f$, $x_d$, and $y_d$, called normalization coefficients, are determined by taking one-sided limits at the reference situations (incompatibility, independence, and logical implication), using the continuity of the normalized measure in both zones: attraction (positive dependence) and repulsion (negative dependence) [2, 6, 17].
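To make the affine case concrete, here is a minimal Python sketch. It assumes that the normalized measure is affine in each zone, with one slope and intercept on the attraction side and another on the repulsion side, and that the four coefficients are fixed by requiring the values $-1$, $0$, and $1$ at incompatibility, independence, and logical implication. The closed-form coefficients below are obtained by solving these four conditions directly; the function and argument names are hypothetical.

```python
from fractions import Fraction

def affine_coefficients(mu_inc, mu_ind, mu_imp):
    """Solve the four limit conditions for finite, pairwise distinct values.

    x_f*mu_imp + y_f = 1,  x_f*mu_ind + y_f = 0   (attraction zone)
    x_d*mu_ind + y_d = 0,  x_d*mu_inc + y_d = -1  (repulsion zone)
    """
    if len({mu_inc, mu_ind, mu_imp}) < 3:
        raise ValueError("reference values must be pairwise distinct")
    x_f = 1 / (mu_imp - mu_ind)
    y_f = -mu_ind / (mu_imp - mu_ind)
    x_d = 1 / (mu_ind - mu_inc)
    y_d = -mu_ind / (mu_ind - mu_inc)
    return x_f, y_f, x_d, y_d

def affine_normalize(mu, favors, mu_inc, mu_ind, mu_imp):
    """Piecewise-affine normalization of the value `mu` of a measure."""
    x_f, y_f, x_d, y_d = affine_coefficients(mu_inc, mu_ind, mu_imp)
    return x_f * mu + y_f if favors else x_d * mu + y_d

# Example: Confidence P(Y|X), whose reference values are 0 (incompatibility),
# P(Y) (independence) and 1 (logical implication); here P(Y) = 1/2.
p_y = Fraction(1, 2)
refs = dict(mu_inc=Fraction(0), mu_ind=p_y, mu_imp=Fraction(1))
print(affine_normalize(Fraction(3, 4), True, **refs))   # attraction zone
print(affine_normalize(Fraction(1, 3), False, **refs))  # repulsion zone
```

With Confidence as input, the two branches reduce to $(P(Y' \mid X') - P(Y'))/(1 - P(Y'))$ and $(P(Y' \mid X') - P(Y'))/P(Y')$, respectively. The sketch breaks down as soon as one reference value is infinite, which is the situation the homographic extension is meant to handle.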
If one or two of the three values , , and are infinite and in case we have two infinite values, it is necessary that is excluded, which leads us to use one of the three following expressions to find the four real coefficients, , , , and : These four coefficients are still determined by passing unilateral limits in situations of reference (incompatibility, independence, and logic implication) by taking into account the continuity evolution in both zones: attraction (positive dependence) and repulsion (negative dependence). In case where favors , can be infinite, , and ; then we obtain the following system of equations:As and , so you only use the theories in  for the left normalization. We can write the system of four nonlinear equations, with the following four unknowns:We are just here to take and we have four equations with four unknowns, with this particularity that the coefficient can be infinite. Hence we have the following proposition.
Proposition 3. (i) If with , , and pairwise distinct, then the system of equations (4) admits four real solutions; (ii) if = and , then the system of equations (4) has four real solutions such that = 1, = -, , and = -; (iii) otherwise, the system of equations has no solution.
Proof. The system of equations (4) is equivalent to the system of equations (5):It became a system of linear equations. The matrix writing system of equations (5) is given by the vector equation (6):Let us call = So we must have = ; so (i); for , just take ; for , if , the last two equations do not make sense.
The following system of equations presents the common features of equation (5) with the only difference that can be infinite, and . This gives the system of equations (7).Hence we obtain the following proposition.
Proposition 4. (i) If , with , , and pairwise distinct, then the system of equations (7) admits four real solutions; (ii) if = and , then the system of equations (7) has four real solutions such that = -1, and , ; (iii) otherwise, the system of equations has no solution.
Proof. The system of equations (7) is equivalent to the system of equations (8):It became a system of linear equations. The matrix writing the system of equations (8) is given by the vector equation (9):Let us call = So we must have = ; so (i); for , just take lim ; for , if , then the last two equations do not make sense.
The following system is similar to the previous equation (7), this time with and , including case where ; then in this case, m must be nonzero. Take, for example, m = 1.Then we obtain the following proposition.
Proposition 5. (i) If , with , , , and pairwise distinct, then the system of equations (10) has four real solutions; (ii) if , , and , then the system of equations (10) has four real solutions such that , and , ; (iii) otherwise, the system of equations has no solution.
Proof. The system of equations (10) is equivalent to the system of equations (11):It is reduced to a system of linear equations. The matrix writing of the system of equations (11) is given by the vector equation (12):Let us call = Thus we must have = ; so (i); for , just take ; for , if , then the last two equations do not make sense.
Proposition 6. (i) If , with , , and pairwise distinct, then the system of equations (13) has four real solutions; (ii) if , and if , then the system of equations (13) has four real solutions such that , and , ; (iii) otherwise, the system of equations has no solution.
Proof. (i) The system of equations (13) is equivalent to the system of equations (14):It therefore comes to a system of linear equations. The matrix writing system of equations (14) is given by the vector equation (15): Let us call = Thus, we must always have ; for , just take and ; for , if , then all the equations do not make sense.
Note 7. Note that the matrices associated with the four propositions above have the same determinant; this means that the four cases share the same condition of normalizability.
4. Application of These Four Propositions
We recall in Table 1 the respective definitions of the various measures that lead to the results below.
The normalized measure of “Cost Multiplying” is
The normalized measure of “Sebag” is
The normalized measure of “Example counter-example” is
The normalized measure of “Odd-Ratio” is
The normalized measure of “Conviction” is
The normalized measure of “Informal Gain” is
Note that the normalized Informal Gain measure has no relationship with the measure $M_{GK}$ of Example 2, but it is continuous. Table 1 recalls the respective definitions of the various measures.
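To illustrate how a proper homography handles an infinite reference value, the sketch below treats Conviction on the attraction zone. It assumes the standard definition Conviction $= (1 - P(Y'))/(1 - P(Y' \mid X'))$, which equals $1$ at independence and $+\infty$ at logical implication, and it uses the homography $h(v) = (v - 1)/v = 1 - 1/v$, one admissible choice that sends $1$ to $0$ and $\infty$ to $1$. This is a sketch under those assumptions, not necessarily the coefficients derived in the propositions above; the function names are hypothetical.

```python
import math
from fractions import Fraction

def conviction(p_y, p_y_given_x):
    """Standard Conviction (assumed definition); +inf at logical implication."""
    if p_y_given_x == 1:
        return math.inf
    return (1 - p_y) / (1 - p_y_given_x)

def normalize_conviction_favoring(v):
    """Proper homography h(v) = (v - 1)/v on the attraction zone (v >= 1)."""
    if math.isinf(v):
        return 1  # projective extension: h(inf) = a/c = 1
    return 1 - 1 / v

p_y = Fraction(1, 2)
for p_y_given_x in (Fraction(1, 2), Fraction(3, 4), Fraction(1)):
    v = conviction(p_y, p_y_given_x)
    excess_ratio = (p_y_given_x - p_y) / (1 - p_y)
    # On this zone the homographic image equals (P(Y|X) - P(Y)) / (1 - P(Y)).
    assert normalize_conviction_favoring(v) == excess_ratio
    print(v, normalize_conviction_favoring(v))
```

The infinite value at logical implication is mapped to the finite value $1$, which no affine map could achieve. On the repulsion zone Conviction stays finite, so an affine branch remains available there.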
The following theorem supplements in its part (i) the statement in  (paragraph 3.2, page 62).
Theorem 8. A probabilistic quality measure is normalizing if and only if, for any association rule , the following conditions are met: (i) the following inequalities are satisfied , if ; (ii) if , then it is necessary that any of the three values must be infinite; or in the case where two values are infinite, must always be finite, and .
Proof. (i) and if , then the determinant of the matrix (i = 1, 2, 3, 4) associated with all the equations (6), (9), (12), and (15) is equal to and is, thus, nonzero; for (ii), if , and one or two of them are infinite, then it is necessary to use Propositions 3, 4, 5, and 6 (ii) above. Hence the theorem is stated.
5. Conclusion and Perspectives
Through the use of homographic functions, this study makes normalizable a significant number of quality measures that are not affine-normalizable (e.g., Sebag, Odd-Ratio, Example counter-example, Informal Gain, Cost Multiplying, and Conviction). We have shown that each measure has its own position relative to the three intuitive reference situations: incompatibility, independence, and logical implication. When infinity appears as a reference value, we can use a homographic homeomorphism, or combine a homographic homeomorphism with an affine one; the homographic homeomorphism simply has to be chosen in accordance with its properties and with the need at hand. Our work covers only the situations in which a reference value is infinite; moreover, an infinite value at stochastic independence has not yet arisen. In these situations, proper homographic functions, i.e., nonconstant homographies that send infinity to a finite real number, are sufficient to solve the normalization problem. A small exception is the measure Informal Gain, because it is infinite at incompatibility and equal to zero at stochastic independence; for it, a nonzero parameter $m$ must be introduced, for example $m = 1$. Finally, this study treats normalization by homography as a complement to normalization by affine maps; the result still revolves around the quality measure $M_{GK}$ and its two components, up to multiplicative factors. This reinforces the unifying property of $M_{GK}$ relative to the measures existing in the literature. Although homographic and affine homeomorphisms are now available to normalize interestingness measures, there is still a long way to go. Indeed, a group of measures resists these new tools, namely Klosgen, Pondered dependency, One way support, Bilateral Support, Coverage, and Prevalence, because they do not meet the conditions for normalization.
The problem therefore remains open with respect to the transformation that would allow the normalization of those quality measures, in a sense still to be specified.
No data were used to support this study.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
J. Diatta, H. Ralambondrainy, and A. Totohasina, “Towards a unifying probabilistic implicative normalized quality measure for association rules,” in Quality Measures in Data Mining, pp. 237–250, 2007.
J. Diatta, H. Ralambondrainy, and A. Totohasina, “Towards a unifying probabilistic implicative normalized quality measure for association rules,” in Quality Measures in Data Mining, chapter 10, pp. 237–250, Springer-Verlag, 2007.
W. Aljandal, W. Hsu, V. Bahirwani, D. Caragea, and T. Weninger, “Validation-based Normalization and Selection of Interestingness Measures for Association Rules,” in Proceedings of the 18th International Conference on Artificial Neural Networks in Engineering, pp. 1–8, 2008.
A. Totohasina, “Towards a theory unifying implicative interestingness measures and critical values consideration in MGK,” in Educação Matemática Pesquisa, vol. 16, pp. 881–900, Special ASI, São Paulo, Brazil, 13 edition, 2014.
E. M. Nguifo, D. Grissa, and S. Guillaume, “Catégorisation des mesures d’intérêt pour l’extraction des connaissances,” in Revue des Nouvelles Technologies de l’Information, D. A. Zighed and G. Venturini, Eds., pp. 117–143, 2012.
M. Hahsler and K. Hornik, New Probabilistic Interest Measures for Association Rules, University of Economics and Business Administration, Vienna, Austria, 2018.
G. Piatetsky-Shapiro, P. Smyth, and U. M. Fayyad, “Knowledge discovery and data mining: towards a unifying framework,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 82–88, 1996.
D. R. Feno, Mesure de qualité des règles d’association : normalisation et caractérisation des règles d’association des bases [PhD thesis], Université de La Réunion, spécialité : Mathématiques Informatique, 2007.
N. Bhargava and M. Shukla, “Survey of Interestingness Measures for Association Rules Mining: Data Mining, Data Science for Business Perspective,” IRACST - International Journal of Computer Science and Information Technology & Security (IJCSITS), vol. 6, no. 2, 2016.
A. Totohasina, Contribution à l’étude des mesures de qualité des règles d’associations: normalisation sous cinq contraintes et cas de MGK: propriétés, bases composites des règles et extension en vue d’applications en statistique et en sciences physiques [PhD thesis], Université d’Antsiranana, Madagascar: Mathématiques Informatique, 2008.
A. Totohasina, “Normalisation de mesures probabilistes de la qualité des règles,” in Proceedings of the SFDS'03, XXXV ième Journées de Statistiques, pp. 985–988, Lyon, France, 2003.
A. Totohasina and H. Ralambondrainy, “ION: A pertinent new measure for mining information from many types of data,” in Proceedings of the 1st IEEE International Conference on Signal-Image Technology and Internet-Based Systems, SITIS 2005, pp. 202–207, Yaoundé, Cameroon, December 2005.