Abstract

In epidemiology, association rules are used to identify the factors at the origin of diseases, which makes implicative statistical analysis a necessary tool in this field as well. To quantify the implicative link between an exposure and a disease, epidemiologists have most often chosen the odds ratio. To obtain reliable results, we need to be sure that the odds ratio is really the most relevant measure available, which requires a study of its mathematical properties. This paper proposes a comparative study of the behaviour and mathematical properties of three measures: the odds ratio, the measure of Guillaume–Khenchaff (MGK), and the normalised odds ratio. We have chosen MGK because the literature considers it, on the basis of its mathematical properties, to be a good measure for extracting implicative association rules. The results in this paper concern only probabilistic data.

1. Introduction

In epidemiology, the search for the etiological factors (i.e., causes) of diseases focuses on the association between an exposure E (or risk factor) and a disease D. Quantifying such an association requires a measure of association. We have noted that several theses and publications, such as those of Gail et al. [1]; Antony [2]; Rao and Miller [3]; Axame et al. [4]; Adwar et al. [5]; Agresti [6]; Bland and Altman [7]; and Sedgwick [8], use the odds ratio in their studies. Several epidemiologists, such as Held [9] and Bouyer [10], also affirm that the odds ratio is very effective in case-control studies.

Furthermore, the literature attests that the Guillaume–Khenchaff measure (MGK) is a good measure for extracting implicative association rules, on the basis of its mathematical properties studied in Buchanan and Shortliffe [11], Guillaume [12], Totohasina et al. [13], Wu et al. [14], Totohasina et al. [15], Rakotomalala et al. [16], Rakotomalala [17], and Ralahady and Totohasina [18]. According to Guillaume [12], Totohasina [19], and Feno [20], MGK is inspired by the Loevinger measure, with its capacity to identify zones of attraction and repulsion, and is guided by the desire to overcome the disadvantages of the confidence measure, in particular that of selecting rules located in the repulsion zone between the premise and the consequent of an association rule. Guillaume [12] asserted that MGK satisfies the principles of [21]. Independently of Guillaume [12], Wu et al. [14] used MGK to extract positive and negative rules in a data mining context. They insisted especially on the fact that this measure allows the extraction of negative rules of the forms $X \to \overline{Y}$, $\overline{X} \to Y$, and $\overline{X} \to \overline{Y}$. Regarding this last, bilateral negative rule, let us recall that the MGK-based algorithm depends only on the favouring component $M_{GK}^{f}$, which is fortunately implicative; that is, $M_{GK}^{f}(X \to Y) = M_{GK}^{f}(\overline{Y} \to \overline{X})$ [13, 19]. This mathematical property makes MGK more relevant and compatible with the language of implication that researchers naturally wish to use when interpreting valid association rules.

A comparative study of the MGK and odds ratio measures was already carried out in 2017 in our own Master's work, Masonova [22]: the odds ratio has a flaw in an intuitive situation, as it does not take a fixed value at logical implication. This makes it difficult to define a minimum threshold and to interpret the association rules. Following the study carried out in [23, 24], we propose to deepen the comparative study of MGK and the normalised odds ratio according to the 22 properties proposed by [25].

In this article, we take the work of [25] as our reference. That thesis identifies 22 properties for evaluating the performance of quality measures; we review them and evaluate them on the three measures MGK, odds ratio, and normalised odds ratio, in order to guide epidemiologists in their choice of measure.

The remainder of this paper is organised as follows. Section 2 presents the different tools and concepts used in our work. Section 3 presents our results and discusses the comparison of the MGK, odds ratio, and normalised odds ratio measures across the different studies. Section 4 concludes and discusses some perspectives.

2. Materials and Methods

2.1. Notation and Definition

We present in this section some notations and definitions around data mining and the measurement of association rules.

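Let $\mathcal{I}$ be a set of items or attributes, also called variables, and $\mathcal{T}$ a set of n transactions or entities defined on the set of attributes. A subset X of $\mathcal{I}$ is called a pattern or an itemset.

According to Ganter and Wille [26], in Formal Concept Analysis, a formal context is a triplet $\mathbb{K} = (\mathcal{T}, \mathcal{I}, \mathcal{R})$, where $\mathcal{T}$ and $\mathcal{I}$ are finite sets and $\mathcal{R}$ is a binary relation from $\mathcal{T}$ to $\mathcal{I}$.

Definition 1. Let us consider a binary context $\mathbb{K} = (\mathcal{T}, \mathcal{I}, \mathcal{R})$, where $\mathcal{T}$ and $\mathcal{I}$ are finite sets. Such a context will be called a data mining context.
Let $\mathbb{K}$ be a data mining context, and let X and Y be two patterns of $\mathcal{I}$.
(i) For a pattern X, its dual X′ is defined by $X' = \{t \in \mathcal{T} : \forall i \in X,\ t \,\mathcal{R}\, i\}$; X′ is also called the extension of the pattern X.
(ii) We define the uniform probability P on the finite discrete probability space $(\mathcal{T}, \mathcal{P}(\mathcal{T}))$ as follows: for any event E of $\mathcal{P}(\mathcal{T})$, $P(E) = |E| / |\mathcal{T}| = |E| / n$, where $|E|$ designates the cardinality of the event E.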
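As a minimal sketch of Definition 1, the following Python code computes the extension of a pattern and its uniform probability on a toy context. The context `CONTEXT`, its transaction and item names, and the helper names `extension` and `probability` are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from fractions import Fraction

# Hypothetical binary context: each transaction is mapped to the items it holds.
CONTEXT = {
    "t1": {"a", "b"},
    "t2": {"a"},
    "t3": {"b", "c"},
    "t4": {"a", "b", "c"},
}


def extension(pattern, context):
    """Return X', the set of transactions containing every item of the pattern."""
    return {t for t, items in context.items() if set(pattern) <= items}


def probability(event, context):
    """Uniform probability of an event (a set of transactions): |E| / n."""
    return Fraction(len(event), len(context))


ext_ab = extension({"a", "b"}, CONTEXT)
print(sorted(ext_ab), probability(ext_ab, CONTEXT))
```

Exact fractions are used so that the probabilities can later be compared without rounding effects.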

Definition 2. According to Feno [20], an association rule is an implication of the form $X \to Y$, expressing the fact that the attributes in X tend to appear with those in Y.

Definition 3. According to Feno [20] and Totohasina [27], a quality measure or interest measure of rules is a function $\mu$ from the set of association rules with values in $\mathbb{R}$ such that, for any association rule $X \to Y$, the value $\mu(X \to Y)$ depends exclusively on the four parameters $n$, $P(X')$, $P(Y')$, and $P(X' \cap Y')$, where P designates the uniform discrete probability on the probability space and n denotes the cardinality of the set of transactions.

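Definition 4. Let $X \to Y$ be an association rule. According to Totohasina [19], a quality measure $\mu$ is said to be normalised if it verifies the following five conditions:
(i) $\mu(X \to Y) = -1$, if $P(X' \cap Y') = 0$.
(ii) $-1 < \mu(X \to Y) < 0$, if $P(Y' \mid X') < P(Y')$; that is, X and Y are negatively dependent (in partial repulsion).
(iii) $\mu(X \to Y) = 0$, if $P(Y' \mid X') = P(Y')$ (i.e., X and Y are independent).
(iv) $0 < \mu(X \to Y) < 1$, if $P(Y' \mid X') > P(Y')$ (i.e., if X favours Y or X and Y are partially attracted).
(v) $\mu(X \to Y) = 1$, if $P(Y' \mid X') = 1$ (or if X totally implies Y).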

Definition 5. Let X and Y be two patterns of a context of the data mining.
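Guillaume [12], Feno [20], and Totohasina [27] define the measure MGK as

$$M_{GK}(X \to Y) = \begin{cases} \dfrac{P(Y' \mid X') - P(Y')}{1 - P(Y')}, & \text{if } X \text{ favours } Y,\\[2ex] \dfrac{P(Y' \mid X') - P(Y')}{P(Y')}, & \text{if } X \text{ disfavours } Y. \end{cases}$$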

Definition 6. Let X and Y be two patterns of a data mining context.

Definition 7. Let X and Y be two patterns of a data mining context.
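Tan et al. [28] defined the odds ratio (OR) measure as

$$OR(X \to Y) = \frac{P(X' \cap Y')\,P(\overline{X'} \cap \overline{Y'})}{P(X' \cap \overline{Y'})\,P(\overline{X'} \cap Y')}.$$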
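The two measures can be sketched in Python as follows. This is only an illustration under stated assumptions: MGK is taken in its usual form from the literature (the deviation $P(Y' \mid X') - P(Y')$ divided by $1 - P(Y')$ in the attraction zone and by $P(Y')$ in the repulsion zone), the odds ratio is taken as the cross-product ratio of the four cell probabilities, and the function names, the argument convention, and the numerical example are hypothetical.

```python
import math
from fractions import Fraction


def mgk(p_x, p_y, p_xy):
    """MGK of X -> Y from P(X'), P(Y'), P(X' ∩ Y'); needs p_x > 0 and 0 < p_y < 1."""
    conf = p_xy / p_x  # P(Y' | X')
    if conf >= p_y:  # X favours Y (independence gives 0 in either branch)
        return (conf - p_y) / (1 - p_y)
    return (conf - p_y) / p_y  # X disfavours Y


def odds_ratio(p_x, p_y, p_xy):
    """Odds ratio of X -> Y from the same three probabilities."""
    p_x_not_y = p_x - p_xy
    p_not_x_y = p_y - p_xy
    p_not_x_not_y = 1 - p_x - p_y + p_xy
    denominator = p_x_not_y * p_not_x_y
    if denominator == 0:
        return math.inf  # convention chosen here when an off-diagonal cell is empty
    return p_xy * p_not_x_not_y / denominator


# Hypothetical rule: P(X') = 2/5, P(Y') = 1/2, P(X' ∩ Y') = 3/10.
p_x, p_y, p_xy = Fraction(2, 5), Fraction(1, 2), Fraction(3, 10)
print(mgk(p_x, p_y, p_xy), odds_ratio(p_x, p_y, p_xy))
```

On this hypothetical rule the sketch gives an MGK of 1/2 and an odds ratio of 6: the first value is directly readable on the scale from −1 to 1, whereas the second has no upper reference value.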

Definition 8. According to Feno [20] and Totohasina [27], let X and Y be two patterns of a data mining context.
X favours Y, if and only if Y favours X.

Proof.

2.2. Normalisation of a Measure according to an Affine Homeomorphy

The normalisation of quality measures was introduced by Totohasina [19] with the aim of providing a unifying view of the measures found in the binary data mining literature.

2.2.1. Reference Situations

Definition 9. Let X and Y be patterns of a data mining context and P the uniform discrete probability on the associated probability space [19]. For the sake of consistency with the principle of duality in formal concept analysis, we characterise the reference situations by the properties of their respective extensions $X'$ and $Y'$, seen as events of the σ-algebra of the probability space:
(i) X and Y are incompatible, if their extensions are incompatible, that is, if $X' \cap Y' = \emptyset$ (i.e., $P(X' \cap Y') = 0$), where $X'$ is the extension of X.
(ii) X and Y are negatively dependent (or X and Y are mutually unfavourable), if $P(X' \cap Y') < P(X')P(Y')$ (which is equivalent to $P(Y' \mid X') < P(Y')$).
(iii) X and Y are positively dependent (or X and Y favour each other), if $P(X' \cap Y') > P(X')P(Y')$ (which is equivalent to $P(Y' \mid X') > P(Y')$).
(iv) X logically (totally) implies Y, if $X' \subseteq Y'$, that is, $P(Y' \mid X') = 1$ [27].
Thus, the quantities $P(Y' \mid X') - P(Y')$ and $P(X' \mid Y') - P(X')$ measure the deviations from independence of the two patterns X and Y.
In general, these two indicators of the degree of statistical dependence are not equal, despite the mutuality of attraction or repulsion depending on whether the link is positive or negative.
Nevertheless, the notions of positive and negative dependence are linked, as shown by the following lemmas [27].

Lemma 1. Let X and Y be two patterns.
(1) The following three conditions are equivalent: (i) X disfavours Y; (ii) X favours $\overline{Y}$; and (iii) $\overline{X}$ favours Y.
(2) The following four conditions are equivalent: (i) X favours Y; (ii) $\overline{X}$ disfavours Y; (iii) $\overline{X}$ favours $\overline{Y}$; and (iv) X disfavours $\overline{Y}$ [27].

Note that the two quantities $P(Y' \mid X') - P(Y')$ and $P(X' \mid Y') - P(X')$ are increasing functions of the number of examples $|X' \cap Y'|$, the marginals $|X'|$ and $|Y'|$ remaining constant. Moreover, the literature already suggests the following five principles [27].

The three Piatetsky–Shapiro principles state that a measure of interest of an association rule must (1) be zero in case of statistical independence of the premise and the consequent, (2) be a strictly increasing function of the number of examples, the other parameters being fixed, and (3) be a strictly decreasing function of the cardinality of the dual of its premise or of the dual of its consequent, the other parameters being kept constant [27].

The fourth principle of Major and Mangano: a measure of interest of an association rule must be a strictly increasing function of its coverage (i.e., the cardinal of the intersection of the two extensions), once its confidence is kept constant above a previously fixed minimum value [27].

The fifth principle, due to Totohasina [27], corrects the symmetrical character of the Piatetsky–Shapiro index: a quality measure of an association rule must be nonsymmetrical.

In view of the mathematical objectives of normalisation and the five principles mentioned, [19] introduced the definition of a normalised quality measure as follows.

Definition 10. Let X and Y be patterns of a binary data mining context $\mathbb{K} = (\mathcal{T}, \mathcal{I}, \mathcal{R})$, where $\mathcal{T}$ is the set of transactions, $\mathcal{I}$ the set of attributes (items), and $\mathcal{R}$ the binary relation from $\mathcal{T}$ to $\mathcal{I}$; let P be the uniform discrete probability on the associated probability space [19], $\mu$ a probabilistic quality measure, and $X \to Y$ an association rule. The quality measure $\mu$ is said to be normalised if it verifies the following five conditions:
(i) $\mu(X \to Y) = -1$, if $P(X' \cap Y') = 0$, that is, $X'$ and $Y'$ are two incompatible events: we say that the two patterns X and Y are then incompatible.
(ii) $-1 < \mu(X \to Y) < 0$, if $P(Y' \mid X') < P(Y')$, that is, $X'$ and $Y'$ are two negatively dependent events: we say that the two patterns X and Y are then negatively dependent (in partial repulsion).
(iii) $\mu(X \to Y) = 0$, if $P(Y' \mid X') = P(Y')$, that is, the two events $X'$ and $Y'$ are independent: we say that the two patterns X and Y are then independent.
(iv) $0 < \mu(X \to Y) < 1$, if $P(Y' \mid X') > P(Y')$, that is, X favours Y or the two patterns X and Y attract each other partially.
(v) $\mu(X \to Y) = 1$, if $P(Y' \mid X') = 1$: the pattern X is then said to be totally included in Y.
The distribution of the values of a normalised measure is represented schematically in Figure 1 [27].
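According to this definition, it is easy to show that the measure MGK, defined by $M_{GK}(X \to Y) = M_{GK}^{f}(X \to Y)$ if X favours Y and $M_{GK}(X \to Y) = M_{GK}^{d}(X \to Y)$ if X disfavours Y, where $M_{GK}^{f}(X \to Y) = \frac{P(Y' \mid X') - P(Y')}{1 - P(Y')}$ and $M_{GK}^{d}(X \to Y) = \frac{P(Y' \mid X') - P(Y')}{P(Y')}$, is normalised and continuous.
Indeed, in case of incompatibility between two patterns X and Y, $P(Y' \mid X') = 0$, as $P(X' \cap Y') = 0$.
     According to the definition, at incompatibility, $M_{GK}^{d}(X \to Y) = \frac{0 - P(Y')}{P(Y')}$.
     Therefore, $M_{GK}(X \to Y) = -1$.
In the case of independence between the premise and the consequent, $P(Y' \mid X') = P(Y')$.
     According to the definition, at independence, $P(Y' \mid X') - P(Y') = 0$.
     Therefore, $M_{GK}(X \to Y) = 0$.
Finally, in the case of logical implication, $X' \subseteq Y'$, so $P(Y' \mid X') = 1$.
     From the definition, at logical implication, $M_{GK}^{f}(X \to Y) = \frac{1 - P(Y')}{1 - P(Y')}$.
     Therefore, $M_{GK}(X \to Y) = 1$.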
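The three reference values of MGK can also be checked numerically. The sketch below assumes the usual form of MGK from the literature (division by $1 - P(Y')$ in the attraction zone and by $P(Y')$ in the repulsion zone); the function name `mgk` and the marginal probabilities are hypothetical.

```python
from fractions import Fraction


def mgk(p_x, p_y, p_xy):
    """MGK of X -> Y from P(X'), P(Y'), P(X' ∩ Y'); needs p_x > 0 and 0 < p_y < 1."""
    conf = p_xy / p_x  # P(Y' | X')
    if conf >= p_y:  # attraction zone (independence included)
        return (conf - p_y) / (1 - p_y)
    return (conf - p_y) / p_y  # repulsion zone


# Hypothetical marginals; only P(X' ∩ Y') changes between reference situations.
p_x, p_y = Fraction(3, 10), Fraction(2, 5)
reference_situations = {
    "incompatibility": Fraction(0),  # X' and Y' disjoint
    "independence": p_x * p_y,       # P(X' ∩ Y') = P(X') P(Y')
    "logical implication": p_x,      # X' included in Y'
}
for name, p_xy in reference_situations.items():
    print(name, mgk(p_x, p_y, p_xy))
```

The loop prints the values −1, 0, and 1 for incompatibility, independence, and logical implication, whatever admissible marginals are chosen.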

2.2.2. Reminder of the Process of Normalisation by Affine Homeomorphism

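Let us recall in passing that, according to Totohasina, the normalised measure $\mu_n$ associated with a measure $\mu$ has the expression

$$\mu_n(X \to Y) = \begin{cases} x_f\,\mu(X \to Y) + y_f, & \text{if } X \text{ favours } Y,\\ x_d\,\mu(X \to Y) + y_d, & \text{if } X \text{ disfavours } Y. \end{cases}$$

The four coefficients $x_f$, $y_f$, $x_d$, and $y_d$ are determined by taking one-sided limits at the reference situations (incompatibility, independence, and logical implication), using the continuity of the measure in the two zones: attraction (positive dependence) and repulsion (negative dependence).

Let us denote by $\mu_{imp}$ the value of $\mu$ at logical implication, by $\mu_{ind}$ its value at independence, and by $\mu_{inc}$ its value at incompatibility [27]. In the case where X favours Y, we obtain

$$x_f\,\mu_{imp} + y_f = 1, \qquad x_f\,\mu_{ind} + y_f = 0.$$

In the case where X disfavours Y, we obtain

$$x_d\,\mu_{ind} + y_d = 0, \qquad x_d\,\mu_{inc} + y_d = -1.$$

This gives the following system of linear equations:

$$\begin{cases} x_f\,\mu_{imp} + y_f = 1,\\ x_f\,\mu_{ind} + y_f = 0,\\ x_d\,\mu_{ind} + y_d = 0,\\ x_d\,\mu_{inc} + y_d = -1. \end{cases}$$

When $\mu_{imp} \neq \mu_{ind}$ and $\mu_{ind} \neq \mu_{inc}$, its solution is $x_f = \frac{1}{\mu_{imp} - \mu_{ind}}$, $y_f = -\frac{\mu_{ind}}{\mu_{imp} - \mu_{ind}}$, $x_d = \frac{1}{\mu_{ind} - \mu_{inc}}$, and $y_d = -\frac{\mu_{ind}}{\mu_{ind} - \mu_{inc}}$.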

2.2.3. Example of Normalisation of Quality Measures

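To illustrate the normalisation process, here are some details of the calculation of the normalised measure associated with two quality measures. Let $X \to Y$ be an association rule of a data mining context [27]. In both cases, the normalised measure is an affine function of the original measure in each zone, with slope $x_f$ and intercept $y_f$ in the attraction zone and slope $x_d$ and intercept $y_d$ in the repulsion zone.
Support: $\mathrm{supp}(X \to Y) = P(X' \cap Y')$, whose values at incompatibility, independence, and logical implication are respectively $0$, $P(X')P(Y')$, and $P(X')$. So, we get $x_f = \frac{1}{P(X')(1 - P(Y'))}$ and $y_f = -\frac{P(Y')}{1 - P(Y')}$. Therefore, $x_d = \frac{1}{P(X')P(Y')}$ and $y_d = -1$. Finally, we find $\mathrm{supp}_n(X \to Y) = M_{GK}(X \to Y)$.
Confidence: $\mathrm{conf}(X \to Y) = P(Y' \mid X')$, whose values at incompatibility, independence, and logical implication are respectively $0$, $P(Y')$, and $1$. Therefore, $x_f = \frac{1}{1 - P(Y')}$, $y_f = -\frac{P(Y')}{1 - P(Y')}$, $x_d = \frac{1}{P(Y')}$, and $y_d = -1$. Finally, we find $\mathrm{conf}_n(X \to Y) = M_{GK}(X \to Y)$.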
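The normalisation of these two measures can be sketched in Python. The helper `affine_normalise` below is a hypothetical name; it assumes a piecewise-affine map that sends the values of a measure at incompatibility, independence, and logical implication to −1, 0, and 1, and it assumes that these three values are finite and strictly increasing. The probabilities of the example rule are invented for illustration.

```python
from fractions import Fraction


def affine_normalise(value, v_inc, v_ind, v_imp):
    """Piecewise-affine map sending (v_inc, v_ind, v_imp) to (-1, 0, 1).

    Assumes v_inc < v_ind < v_imp, all finite, and that values above v_ind
    correspond to the attraction zone.
    """
    if value >= v_ind:  # attraction zone: slope 1 / (v_imp - v_ind)
        return (value - v_ind) / (v_imp - v_ind)
    return (value - v_ind) / (v_ind - v_inc)  # repulsion zone


# Hypothetical rule: P(X') = 2/5, P(Y') = 1/2, P(X' ∩ Y') = 3/10.
p_x, p_y, p_xy = Fraction(2, 5), Fraction(1, 2), Fraction(3, 10)

# Support: reference values 0, P(X')P(Y'), P(X').
supp_n = affine_normalise(p_xy, 0, p_x * p_y, p_x)
# Confidence: reference values 0, P(Y'), 1.
conf_n = affine_normalise(p_xy / p_x, 0, p_y, 1)
print(supp_n, conf_n)
```

On this example both normalised values coincide (1/2), which is the value that the usual MGK formula gives for the same rule.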

Regarding this normalisation procedure, the author himself noted that certain measures resist normalisation by an affine homeomorphism. For this reason, he stated the following theorem, which gives a condition for the normalisability of a measure.

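A measure of quality $\mu$ is normalisable if and only if, for any rule $X \to Y$, the following conditions are verified:
(i) The quantities $\mu_{imp}$, $\mu_{ind}$, and $\mu_{inc}$ (the values of $\mu$ at logical implication, independence, and incompatibility) are finite.
(ii) The following inequalities are verified: $\mu_{imp} \neq \mu_{ind}$ and $\mu_{ind} \neq \mu_{inc}$ [27].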

2.2.4. Study of the Normalisability of the OR Measure

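The odds ratio measure is defined by

$$OR(X \to Y) = \frac{P(X' \cap Y')\,P(\overline{X'} \cap \overline{Y'})}{P(X' \cap \overline{Y'})\,P(\overline{X'} \cap Y')}.$$

Now, according to Definition 9, it is always necessary to express the measure of the quality of the association rules using the quantities $P(X')$, $P(Y')$, and $P(Y' \mid X')$.

So, to express the measure $OR$ as a function of $P(Y' \mid X')$, we must use the Bayes theorem: $P(X' \cap Y') = P(X')\,P(Y' \mid X')$.

After the transformation, we obtain

$$OR(X \to Y) = \frac{P(Y' \mid X')\,\bigl[1 - P(X') - P(Y') + P(X')\,P(Y' \mid X')\bigr]}{\bigl[1 - P(Y' \mid X')\bigr]\,\bigl[P(Y') - P(X')\,P(Y' \mid X')\bigr]}.$$

In the case of incompatibility between two patterns X and Y, $P(Y' \mid X') = 0$ translates the incompatibility.

     It is obvious from the expression above that $OR(X \to Y) = 0$ at incompatibility.

In the case of independence between the premise and the consequent, we have $P(Y' \mid X') = P(Y')$.

     As the numerator and the denominator are then both equal to $P(Y')\,(1 - P(X'))\,(1 - P(Y'))$, their ratio is 1.

     Therefore, $OR(X \to Y) = 1$ at independence.

Finally, in the case of logical implication, $X' \subseteq Y'$, so $P(Y' \mid X') = 1$.

As $1 - P(Y' \mid X') = 0$, the denominator vanishes at logical implication, while the numerator equals $1 - P(Y')$.

Now, a probability lies in the interval [0, 1], so $1 - P(Y')$ is positive whenever $P(Y') < 1$.

So, $OR(X \to Y) = +\infty$ at logical implication.

These values show that the measure OR does not take the values −1, 0, and 1 at the reference situations and becomes infinite at logical implication. This proves that the measure OR is not normalised and is not normalisable by an affine homeomorphism.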
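The behaviour of the odds ratio at the three reference situations can be checked numerically. The sketch below assumes the cross-product form of the odds ratio written in terms of $P(X')$, $P(Y')$, and $P(X' \cap Y')$; the function name, the marginals, and the convention of returning `math.inf` when the denominator vanishes are choices made for this illustration.

```python
import math
from fractions import Fraction


def odds_ratio(p_x, p_y, p_xy):
    """Odds ratio of X -> Y from P(X'), P(Y') and P(X' ∩ Y')."""
    p_x_not_y = p_x - p_xy                 # examples of X that are not in Y
    p_not_x_y = p_y - p_xy                 # examples of Y that are not in X
    p_not_x_not_y = 1 - p_x - p_y + p_xy   # neither X nor Y
    denominator = p_x_not_y * p_not_x_y
    if denominator == 0:
        return math.inf  # convention chosen here for a vanishing denominator
    return p_xy * p_not_x_not_y / denominator


# Hypothetical marginals; only P(X' ∩ Y') changes between reference situations.
p_x, p_y = Fraction(3, 10), Fraction(2, 5)
at_incompatibility = odds_ratio(p_x, p_y, Fraction(0))
at_independence = odds_ratio(p_x, p_y, p_x * p_y)
at_implication = odds_ratio(p_x, p_y, p_x)
print(at_incompatibility, at_independence, at_implication)
```

The value at logical implication is infinite, so the affine coefficient $1 / (\mu_{imp} - \mu_{ind})$ cannot be computed for this measure.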

2.3. Normalisation of a Measure according to a Homography Homeomorphy

Thanks to the collaborative work in [23, 24], this measure can nevertheless be normalised, this time by means of a proper homography, according to the following approach.

If one or two of the three values , , and are infinite and in the case where we have two infinite values, it is necessary that is excluded, which leads us to use the following expression to find the four real coefficients, , , , and :

These four coefficients are always determined by crossing unilateral limits in reference situations (incompatibility, independence, and logical implication) due to the continuity of the evolution in the two zones: attraction (positive dependence) and repulsion (negative dependence). In the case where favours , can be infinite, , and , then we obtain the system of equations as

As and , it is therefore sufficient to use the theory in [19] for the left-hand normalisation. We can write the following system of four nonlinear equations with four unknowns:

Here, we only need to take and we have four equations with four unknowns, with the particularity that the coefficient can be infinite. Hence, we have the following proposition.

Proposition 1. (i)If (, , ) , with , , and are two distincts, then the system of (16) admits four real solutions(ii)If  =  and , then the system of (16) has four real solutions such that  = 1,  = −, and  = −(iii)If , and if , then the system of (16) has four real solutions such that , , and (iv)Otherwise, this system of equations has no solutionIt is sufficient to take .
Let us take advantage of this proposition with odds ratio: such that , , and .
We have , , and .
By replacing , , , , and with their values in the expression: For m = 0, we haveIt comesIn the case of incompatibility between two patterns of , as , we obtain .
 Therefore, .In the case of independence between the premise and the consequent, we have .
 As , we obtain at independence: .
 Therefore, .In the case of logical implication , .
As , we obtain the logical implication: .
Therefore, .The relations of equations (25)–(27) show that the measure takes particular values in reference situations other than the equilibrium situation. These properties prove that the homography-normalised odds ratio is well normalised. Figure 2 summarises the particular values of in the reference situations.
However, .

2.4. Case-Control Study

According to Held [9], a case-control study examines the degree of association between exposure to a potentially harmful agent and the prevalence of a disease. To do this, a number of people (cases) with a disease (e.g., lung cancer) are first identified. Another group of people (the controls) with the same profile (in terms of age, gender distribution, blood pressure, other medications, and concomitant diseases) as the case group is then selected, except that the controls do not have the disease under study. Finally, the number of people who have been exposed to the toxic agent under investigation (e.g., smoking) is determined for each of the two groups.

2.5. Cross-Tabulation

According to Held [9], a cross-tabulation table, or contingency table in probability, is a table that makes it possible to compare the distributions (absolute or relative frequencies) of a population according to characteristics such as sex (male/female) or smoking status (smoker/nonsmoker), but also according to variables that can take more than two values. A 2 × 2 table is used when the responses are dichotomous (can take two values), for example, yes/no or present/absent.
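As an illustration of how a 2 × 2 table from a case-control study feeds the odds ratio, here is a minimal Python sketch. The class name `TwoByTwo`, its field names, and the counts are hypothetical and are not the data used in this paper; the odds ratio is computed with the standard cross-product formula $(a \cdot d) / (b \cdot c)$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts of a 2 x 2 case-control table (exposure by disease status)."""
    exposed_cases: int       # a
    unexposed_cases: int     # b
    exposed_controls: int    # c
    unexposed_controls: int  # d

    def odds_ratio(self):
        """Cross-product ratio (a * d) / (b * c); fails if b or c is zero."""
        return (self.exposed_cases * self.unexposed_controls) / (
            self.unexposed_cases * self.exposed_controls
        )


# Hypothetical counts, chosen only to show the arithmetic.
table = TwoByTwo(exposed_cases=60, unexposed_cases=40,
                 exposed_controls=30, unexposed_controls=70)
print(table.odds_ratio())
```

With these invented counts the odds of exposure among cases are 60/40 and among controls 30/70, which gives an odds ratio of 3.5.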

3. Results and Discussion

3.1. Summary of the Study of the Behaviour of the Three Measures at Reference Situations

In this section, we present the main properties of the measures MGK, odds ratio, and homography-normalised odds ratio. To this end, we consider a data mining context and two patterns X and Y of this context.

(1) The results of the following propositions follow from the definitions of the measures.

Proposition 2. (reference situations). According to [20], for any patterns X and Y, we have the following:
(i) X and Y are incompatible, if and only if $M_{GK}(X \to Y) = -1$
(ii) X disfavours Y, if and only if $-1 < M_{GK}(X \to Y) < 0$
(iii) X and Y are independent, if and only if $M_{GK}(X \to Y) = 0$
(iv) X favours Y, if and only if $0 < M_{GK}(X \to Y) < 1$
(v) X logically implies Y, if and only if $M_{GK}(X \to Y) = 1$
The above properties express the fact that $M_{GK}$ takes its values on the interval $[-1, 1]$ while reflecting the reference situations.

Proposition 3. (reference situations). According to [20], for any patterns X and Y, we have the following:
(i) X and Y are incompatible, if and only if $OR(X \to Y) = 0$
(ii) X disfavours Y, if and only if $0 < OR(X \to Y) < 1$
(iii) X and Y are independent, if and only if $OR(X \to Y) = 1$
(iv) X favours Y, if and only if $1 < OR(X \to Y) < +\infty$
(v) X logically implies Y, if and only if $OR(X \to Y) = +\infty$
The above properties express the fact that the odds ratio (OR) takes its values on the interval $[0, +\infty]$ while reflecting the reference situations.

Proposition 4. (reference situations). According to [29], for any patterns X and Y, we have the following:
(i) X and Y are incompatible, if and only if $OR_{hn}(X \to Y) = -1$
(ii) X is unfavourable to Y, if and only if $-1 < OR_{hn}(X \to Y) < 0$
(iii) X and Y are independent, if and only if $OR_{hn}(X \to Y) = 0$
(iv) X favours Y, if and only if $0 < OR_{hn}(X \to Y) < 1$
(v) X logically implies Y, if and only if $OR_{hn}(X \to Y) = 1$
The above properties express the fact that the homography-normalised odds ratio (ORhn) takes its values on the interval $[-1, 1]$ while reflecting the reference situations.
(2) So far, we have the following.
Starting from , we obtain .
Let us determine the critical values of ORhn.
For that, it is enough to express ORhn as a function of MGK.
We get .
So, .
 In any case, thanks to relation (16), we can write ORhn as a function of MGK.
Then, it is possible to obtain a table of critical values of ORhn from those of MGK.

3.2. Comparative Study of the Three Measures with the Binary Data

The two quality measures have varied behaviours with respect to the criteria desired for a good quality measure of the rules. To see the variation between the two formulas and the differences in the sets of values taken by the two measures, consider the data mining context presented in Table 1 formed by five people (A, B, C, D, and E) and six diseases (, , , , and ).

(1) For the MGK and OR Measures. The table presents the values taken by the two quality measures for some of the association rules considered.

Table 2 presents the behaviours of these two quality measures in the reference situations: the measures MGK and OR behave differently. The odds ratio does not take a fixed finite value at logical implication, which makes it difficult to define a minimum threshold; that is, we do not know from which value of this measure a rule becomes convincing (interesting). It is thus very difficult to interpret a rule over such an unbounded range of values. On the other hand, both positive and negative association rules are potentially relevant with the MGK measure. Taking the results in Table 3 as an example, the values taken by the MGK measure are easy to interpret, since they always lie in the interval [−1, 1].

(2) For the MGK and ORhn Measures. The table presents the values taken by the two quality measures for some of the association rules considered.

Table 4 presents the behaviours of these two quality measures in the reference situations: the measures MGK and ORhn behave in the same way. Table 5 presents the values taken by these two quality measures; for one of the computed rules, the value taken by ORhn is greater than the value taken by MGK. If, for example, we fix a minimum threshold of 60%, this rule is rejected by the MGK measure but validated by the ORhn measure. This shows that the MGK measure is more discriminating than the ORhn measure.

The MGK measure is normalised and normalisable, whereas the odds ratio measure is nonnormalised, nonnormalisable by an affine homeomorphism, and normalisable by a homographic homeomorphism. From the results we obtained, we conclude that the MGK measure is more efficient and more relevant than the odds ratio (OR) and the homography-normalised odds ratio (ORhn). We advise users to reconsider their choice of the odds ratio measure and to choose the MGK measure for the analysis in their epidemiological studies.

3.3. Study on the Relationship between Smoking and Bronchial Carcinoma

To apply our comparative study of the odds ratio, MGK, and normalised odds ratio measures in the field of epidemiology, we have chosen the study of the relationship between smoking and bronchial carcinoma. This study is presented in detail in the works of Held [9] and Held et al. [30], who demonstrated the link between smoking and bronchial carcinoma from the data in Table 6.

Here, we set the following:
(i) $C$: the set of all people affected by bronchial carcinoma
(ii) $\overline{C}$: the set of all people who are not affected by bronchial carcinoma
(iii) $S$: the set of all smokers
(iv) $\overline{S}$: the set of all nonsmokers

To measure the relationship between smoking and bronchial carcinoma, since and are the two reasons, we obtain the following probabilities:(i) and .(ii) and .

It follows that:
(1) According to Definition 8, for the favouring case: as $P(S \mid C) > P(S)$, C favours S.
(2) So

The following values are obtained from the three measurements:

For .

For

For

3.3.1. Recapitulation

The result is

3.3.2. Interpretation

In view of the results of these three measures, we can give the following interpretation: "If a person has bronchial carcinoma, then it is likely that he or she is a smoker."

3.3.3. Comparative Study of the Three Measures according to the Results Obtained

(1) For the MGK and OR Measures. As the odds ratio measure does not take a fixed value at logical implication, we do not know from which value of this measure a rule becomes convincing or interesting. On the other hand, the MGK measure here takes a value close to 1, which easily allows us to say that there is a strong link between bronchial carcinoma and smoking.

(2) For the MGK and ORhn Measures. In data mining, the volume of data to be explored yields a large number of possible rules, which obliges us to use discriminating measures in order to filter the rules well and keep only the most interesting ones.

If, for example, we fix a minimum threshold of 80%, the rule is rejected by the MGK measure but validated by the ORhn measure. This shows that the MGK measure is more discriminating than the ORhn measure.

3.4. Study of Behaviour with Grissa’s 22 Properties

We continue our study of the behaviours of the MGK, OR, and ORhn measures using the 22 properties in the work of [25]. The 22 properties are as follows:
P1: intelligibility or comprehensibility of the measure
P2: ease of setting a threshold for acceptance of the rule
P3: nonsymmetrical measure
P4: nonsymmetrical measure in the sense of the negation of the conclusion
P5: measure evaluating in the same way $X \to Y$ and $\overline{Y} \to \overline{X}$ in the case of logical implication
P6: increasing measure according to the number of examples
P7: increasing measure as a function of the size of the learning set
P8: decreasing measure according to the size of the consequent or the size of the premise
P9: measure admitting a fixed value in the case of independence
P10: measure admitting a fixed value in the case of logical implication
P11: measure admitting a fixed value in the case of equilibrium
P12: measure admitting identifiable values in the case of attraction between X and Y
P13: measure admitting identifiable values in the case of repulsion between X and Y
P14: measure capable of tolerating the first counterexamples
P15: measure invariant in case of dilation of certain numbers
P16: measure capable of differentiating between the rules $X \to Y$ and $X \to \overline{Y}$ according to an opposition relationship
P17: measure capable of differentiating between the rules $X \to Y$ and $\overline{X} \to Y$ according to an opposition relationship
P18: measure evaluating in the same way the rules $X \to Y$ and $\overline{X} \to \overline{Y}$
P19: measure with a random premise size
P20: statistical measure
P21: discriminant measure
P22: robust measure

3.4.1. Evaluation of the Measures MGK, OR, and ORhn according to the Properties

Reference [25] states that, among the 22 properties proposed, only 19 are studied: the properties P1 (intelligibility or comprehensibility of the measure), P2 (ease of setting a threshold for acceptance of the rule), and P22 (robust measure) are not retained in this study. Indeed, we think that the first two are subjective and depend on the user's knowledge of statistics, and that the third is difficult to study since it requires very advanced calculation tools in order to avoid calculation errors.

For the behaviour of the three measures MGK, OR, and ORhn studied in Table 7, the number 1 means that the measure in question verifies the property concerned Pi, with $i \in \{3, \ldots, 21\}$, and the number 0 means that the measure in question does not verify it. This work leads to the construction of a matrix, presented in Table 7, which summarises the behaviour of the three measures MGK, OR, and ORhn with respect to these properties.

By studying the mathematical properties of these three measures, we have obtained the following theorems.

According to [31], we define a fixed value in the case of equilibrium as follows.

Definition 11. Let m be a measure of quality of association rules. A fixed value c is a reference point which is the equilibrium when a rule has as many examples as counterexamples.(i)If does not admit a fixed value in the case of equilibrium, that is, if , such that and , then .(ii)If admits a fixed value in the case of equilibrium, that is, if such that , then .By studying the property P11, we obtain the following theorem.

Theorem 1. No measure that is normalisable by an affine or a homographic transformation admits a constant value in the case of equilibrium.

Proof. The result follows from the definition of normalised measures (2.4), from definition (3.1), and from the justification given in the demonstration of property P11.
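The absence of a fixed value at equilibrium can be illustrated numerically on MGK. The sketch below is an illustration, not a proof: it assumes the usual form of MGK from the literature and takes equilibrium to mean that a rule has as many examples as counterexamples, that is, $P(Y' \mid X') = 1/2$. The function name and the probabilities are hypothetical.

```python
from fractions import Fraction


def mgk(p_x, p_y, p_xy):
    """MGK of X -> Y from P(X'), P(Y'), P(X' ∩ Y'); needs p_x > 0 and 0 < p_y < 1."""
    conf = p_xy / p_x  # P(Y' | X')
    if conf >= p_y:  # attraction zone (independence included)
        return (conf - p_y) / (1 - p_y)
    return (conf - p_y) / p_y  # repulsion zone


# Equilibrium: half of the transactions of X' are also in Y'.
p_x = Fraction(2, 5)
p_xy = p_x / 2

# Two hypothetical consequents with different marginal probabilities P(Y').
values = {p_y: mgk(p_x, p_y, p_xy) for p_y in (Fraction(1, 5), Fraction(2, 5))}
for p_y, value in values.items():
    print(f"P(Y') = {p_y}: MGK = {value}")
```

The confidence equals 1/2 in both cases, while MGK takes the two different values 3/8 and 1/6: at equilibrium its value still depends on $P(Y')$.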
According to [32], we define a discriminating measure as follows.

Definition 12. A quality measure m is said to be discriminating if it is able to distinguish interesting rules as the size of the training set n increases.(i)If such that, then (nondiscriminant).(ii)If such that, then (discriminant).By studying the property P21, we obtain the following theorem.

Theorem 2. (i)A probabilistic quality measure is a discriminant measure if, and only if, for any association rule , the following condition is verified at the reference situation: .(ii)All the measures (affine or homography) which can be normalised and normalised are discriminating measures.

Proof. (i) This follows from Definition 12 of a discriminant measure and from the justification given in the demonstration of property P21 in the supplementary document. (ii) This follows from Definition 4 of normalised measures and from the justification given in the demonstration of property P21.

3.5. Recapitulation

From the analysis carried out, we claim that the measures MGK, OR, and ORhn are all good measures of the quality of association rules. Indeed, each of them validates about half of these properties. According to Table 7, the MGK measure validates 12 out of 19 properties, the odds ratio measure (OR) validates 11 out of 19 properties, and the homography-normalised odds ratio measure (ORhn) validates 9 out of 19 properties. As a consequence, the best measure among the three is the MGK measure. We have thus found that homography normalisation does not improve the overall behaviour of a measure; it only fixes its values at the reference situations.

4. Conclusion and Perspective

This paper has summarised the different comparative studies of the MGK, OR, and ORhn measures. We have seen that the MGK measure gives results that are more precise and easier to interpret. The odds ratio measure, on the other hand, does not take a fixed value at logical implication, which makes it difficult to define a minimum threshold and to interpret a rule. The homography-normalised odds ratio measure (ORhn) is a normalised measure but is neither more precise nor more relevant than the MGK measure. The comparative study of these properties allows us to identify the most relevant measure among the three: from the results obtained in this paper on the behaviour of MGK, OR, and ORhn, the MGK measure is the most relevant of the three. These results call into question the belief that the odds ratio measure has the best mathematical properties compared to the other measures. In epidemiology, in order to obtain results that are more reliable, more precise, and easier to interpret when quantifying the statistical implication link between an exposure and a disease, we advise analysts and epidemiologists to choose the MGK measure in their work. As a prospect, we propose to extend our comparative study of these measures to the statistical field, more specifically to logistic regression and statistical inference.

Data Availability

No data were used to support the results of this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors are very grateful to the leaders of the Doctoral School “Science, Culture, Society and Development” of the University of Toamasina who welcomed them in its team: Mathematics, Informatics and Applications (MIA) to carry out their research concerning this article.

Supplementary Materials

A supplementary file named "Supplementary material" presents brief reminders and the demonstrations that verify the matrix describing the three measures according to the 19 properties in Table 7. (Supplementary Materials)