Abstract

Association rule mining research typically focuses on positive association rules (PARs), generated from frequently occurring itemsets. In recent years, however, significant research has focused on finding interesting infrequent itemsets, leading to the discovery of negative association rules (NARs). The discovery of infrequent itemsets is far more difficult than that of their counterparts, the frequent itemsets; the problems include the discovery of the infrequent itemsets themselves, the generation of accurate NARs, and the huge number of such rules compared with positive association rules. In medical science, for example, one is interested in factors which can either confirm the presence of a disease or rule out its possibility. The vivid positive symptoms are often obvious; negative symptoms, however, are subtler and more difficult to recognize and diagnose. In this paper, we propose an algorithm for discovering positive and negative association rules among frequent and infrequent itemsets. We identify associations among medications, symptoms, and laboratory results using state-of-the-art data mining technology.

1. Introduction

Association rules (ARs), a branch of data mining, have been studied extensively and successfully in many application domains, including market basket analysis, intrusion detection, diagnosis decision support, and telecommunications. The efficient discovery of associations has been a major focus of the data mining research community [16].

Traditionally, association rule mining algorithms target the extraction of frequent features (itemsets), that is, features with high frequency in a transactional database. However, many important itemsets with low support (i.e., infrequent itemsets) are ignored by these algorithms. Despite their low support, these infrequent itemsets can produce potentially important negative association rules (NARs) with high confidence, which are not observable among frequent data items. Therefore, the discovery of potential negative association rules is important for building a reliable decision support system. The research in this paper extends discovery to positive as well as negative association rules of the forms A ⇒ B, A ⇒ ¬B, ¬A ⇒ B, and so forth.

The number of people discussing their health in blogs and other online discussion forums is growing rapidly [7, 8]. Patient-authored blogs have become an important component of modern-day healthcare and can be effectively used for decision support and quality assurance. Patient-authored blogs, where patients give an account of their personal experiences, offer near-accurate and complete problem lists with symptoms and ongoing treatments [9]. In this paper, we investigate an efficient mechanism for identifying positive and negative associations among medications, symptoms, and laboratory results using state-of-the-art data mining technology. Rules of the form A ⇒ B or A ⇒ ¬B can help explain the presence or absence of different factors/variables. Such associations can be useful for building decision support systems in the healthcare sector.

We target three major problems in association rule mining: (a) effectively extracting positive and negative association rules from text datasets, (b) extracting negative association rules from frequent itemsets, and (c) extracting positive association rules from infrequent itemsets.

The rest of this paper is organized as follows. In the next section, we present a brief introduction to data mining terminology and background. Section 3 reviews related work on association rule mining. In Section 4, we describe the methodology for identifying both frequent and infrequent itemsets of interest and for generating association rules based on these itemsets. The proposed model for extracting positive and negative association rules is presented in Section 5. Section 6 presents the experimental results and comparisons, and the conclusion and future directions are given in Section 7.

2. Terminology and Background

Let us consider I = {i1, i2, …, in} as a set of distinct literals/terms called items, and let D be a database of transactions (documents/blogs, etc.), where each transaction T is a set of items/terms such that T ⊆ I. Each transaction is associated with a unique identifier, called TID. Let A and B be sets of items; an association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = Φ. A is called the antecedent of the rule, and B is called the consequent of the rule. An association rule can have different measures denoting its significance and quality. In our approach, we have employed (i) support, denoted by supp, which is the percentage of transactions in database D containing both A and B; (ii) confidence, denoted by conf, which is the percentage of transactions in D containing A that also contain B, expressed in probability terms as P(B | A); and (iii) lift, denoted by lift, which characterizes the direction of the relationship between the antecedent and the consequent of the rule. Rules having support greater than a user-defined minimum support, minsupp (i.e., the itemset must be present in at least a threshold number of transactions), and confidence greater than a user-defined minimum confidence, minconf, are called valid association rules. The lift indicates whether the association is positive or negative: a lift value greater than 1 indicates a positive relationship between the itemsets; a value less than 1 indicates a negative relationship; and where the value of lift equals 1, the itemsets are independent and there exists no relationship between them.

Some of the above and derived definitions can be represented with the following equations:

supp(A ⇒ B) = P(A ∪ B),
conf(A ⇒ B) = P(B | A) = supp(A ∪ B)/supp(A),
lift(A ⇒ B) = conf(A ⇒ B)/supp(B),
supp(A ⇒ ¬B) = supp(A) − supp(A ∪ B),
conf(A ⇒ ¬B) = (supp(A) − supp(A ∪ B))/supp(A).

The huge number of infrequent items generates an even larger number of negative rules compared with positive association rules. The problem is compounded when dealing with text, where words/terms are the items and the documents are the transactions. It is also difficult to set a minimum support threshold for text because of the huge number of unique and sporadic items (words) in a textual dataset. Indexing (assigning weights to the terms) of the text documents is very important if they are to be used as transactions for extracting association rules. Indexing techniques from the information retrieval field [10] can be of great benefit in this regard. Index terms, the words whose semantics help in identifying the document's main subject matter [11], help describe a document. They have different relevance to a given document in the collection and are therefore assigned different numerical weights. Text mining aims to retrieve information from unstructured text and present the extracted knowledge to the users in a compact form [12]. The primary goal is to provide the users with knowledge for research and educational purposes.
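To make these measures concrete, the following minimal Java sketch (our own illustration with toy data, not the implementation evaluated later in this paper) computes support, confidence, and lift for a rule A ⇒ B, together with the corresponding negative-rule measures:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RuleMeasures {

    // supp(X): fraction of transactions containing every item of X.
    static double supp(List<Set<String>> db, Set<String> itemset) {
        long hits = db.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / db.size();
    }

    public static void main(String[] args) {
        // Toy "documents as transactions" database (hypothetical terms).
        List<Set<String>> db = List.of(
                Set.of("mole", "biopsy"),
                Set.of("mole", "cancer"),
                Set.of("mole", "benign"),
                Set.of("headache", "flu"));

        Set<String> a = Set.of("mole");
        Set<String> b = Set.of("cancer");
        Set<String> ab = new HashSet<>(a);
        ab.addAll(b);

        double suppAB = supp(db, ab);            // supp(A => B) = P(A U B)
        double conf = suppAB / supp(db, a);      // conf(A => B) = supp(A U B)/supp(A)
        double lift = conf / supp(db, b);        // lift(A => B) = conf(A => B)/supp(B)

        // Negative-rule measures, from the identities above.
        double confNeg = (supp(db, a) - suppAB) / supp(db, a); // conf(A => notB)
        double liftNeg = confNeg / (1.0 - supp(db, b));        // lift(A => notB)

        System.out.printf("conf=%.2f lift=%.2f conf(A=>notB)=%.2f lift(A=>notB)=%.2f%n",
                conf, lift, confNeg, liftNeg);
    }
}

Note that conf(A ⇒ ¬B) and lift(A ⇒ ¬B) are derived entirely from positive-itemset supports, so no additional database scan is needed to evaluate negative rules.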

It is, therefore, imperative to employ a weight assignment mechanism. We have used inverse document frequency (IDF), which denotes the importance of a term in a corpus. Selecting features on the basis of IDF values needs careful consideration, that is, deciding which range of IDF values to include. This can greatly affect the results: choosing a very low threshold value is feared to nullify the impact of IDF, while choosing a very high value may result in losing important terms/features of the dataset. We have proposed a user-tuned parameter, top-N%, where the value of N is chosen by the user depending upon the characteristics of the data.

2.1. IDF

The IDF weighting scheme is based on the intuition of term occurrence in a corpus. It surmises that the fewer the documents containing a term, the more discriminating, and hence the more important, that term becomes. IDF helps in identifying the terms that carry special significance in a certain type of document corpus, assigning high weight to terms that occur rarely in the corpus. In a medical corpus, for instance, the word "disease" is not likely to carry a significant meaning. Instead, a disease name, for example, "cholera" or "cancer," would carry significant meaning in characterizing the document.

Keeping this in view, a higher weight should be assigned to the words that appear in documents in close connection with a certain topic, while a lower weight should be assigned to those words that show up without any contextual background. IDF weighting is a broadly used method for text analysis [10, 13]. We can mathematically represent IDF as idf(t) = log(|D| / |{d ∈ D : t ∈ d}|), where t is the term whose weight is to be calculated, d represents the documents in which the term is present, and D symbolizes the document corpus.
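As an illustration of this weighting step, the following Java sketch (under our own naming, with documents modeled simply as sets of terms) computes IDF scores and performs the top-N% selection described below:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class IdfSelector {

    // idf(t) = log(|D| / |{d in D : t in d}|), as defined above.
    static Map<String, Double> idfScores(List<Set<String>> docs) {
        Map<String, Double> df = new HashMap<>();
        for (Set<String> d : docs)
            for (String term : d)
                df.merge(term, 1.0, Double::sum);          // document frequency of each term
        df.replaceAll((t, n) -> Math.log(docs.size() / n)); // convert counts to IDF
        return df;
    }

    // Keep the top N% of terms ranked by IDF (we used N = 60 in our experiments).
    static List<String> topNPercent(Map<String, Double> scores, int n) {
        int keep = scores.size() * n / 100;
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(keep)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}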

Table 1 shows an example set of words and their respective IDF scores in a document corpus, with a selection threshold of the top 60%.

The number of words in the documents before IDF score calculation and top-60% selection is 280254; after selecting the top 60% based on IDF scores, it goes down to 81733. The bold part of the table shows a sample of the eliminated words, which have IDF scores below the threshold value.

3. Literature Review

Association rule mining is aimed at the discovery of associations among the itemsets of a transactional database. Researchers have studied association rules extensively since their introduction in [14]. Apriori is the most well-known association rule mining algorithm. The algorithm follows a two-step process: frequent itemset generation (requiring multiple scans of the database) and association rule generation. The major advantage of Apriori over other association rule mining algorithms is its simple and easy implementation. However, the multiple scans over the database make the Apriori algorithm's convergence slower for large databases.

Another popular association rule mining algorithm, frequent pattern growth (FP-Growth), proposed by Han et al. [3], compresses the data into an FP-Tree for identifying frequent itemsets. The FP-Growth algorithm makes fewer scans of the database, making it practically usable for large databases such as text.

However, very little research has been done on finding negative association rules among infrequent itemsets. Association rule mining algorithms are seldom designed for mining negative association rules, and most existing algorithms can rarely be applied in their current form to negative association rule mining. The recent past, however, has witnessed a shift in the focus of the association rule mining community toward negative association rule extraction [1, 15–19]. Delgado et al. have proposed a framework for fuzzy rules that extends the interestingness measures for their validation from the crisp to the fuzzy case [20]. A fuzzy approach for mining association rules involving absent items, using the crisp methodology, is presented in [21].

Savasere et al. [22] presented an algorithm for extracting strong negative association rules. They combine frequent itemsets with domain knowledge, in the form of a taxonomy, for mining negative association rules. Their approach, however, requires users to provide a predefined, domain-dependent hierarchical classification structure, which makes it difficult to generalize. Morzy presented the DI-Apriori algorithm for extracting dissociation rules among negatively associated itemsets. The algorithm keeps the number of generated patterns low; however, the proposed method does not capture all types of generalized association rules [23].

Dong et al. [16] introduced an interest value for generating both positive and negative candidate itemsets; their algorithm is Apriori-like and mines positive as well as negative association rules. Antonie and Zaïane [15] presented a coefficient-based algorithm for mining positive and negative association rules; the coefficients need to be continuously updated while the algorithm runs, and the generation of all negative association rules is not guaranteed.

Wu et al. proposed a negative and positive rule mining framework based on the Apriori algorithm. Their work does not focus on itemset dependency; rather, it focuses on rule dependency measures. More specifically, they employ the Piatetsky-Shapiro argument [24] about association rules, that is, a rule A ⇒ B is interesting only if supp(A ∪ B) − supp(A) × supp(B) is sufficiently large. This can be further explained using the following.

A rule A ⇒ B is valid if and only if (1) lift(A ⇒ B) > 1, which indicates a positive relationship between A and B, and (2) supp(A) ≥ minsupp and supp(B) ≥ minsupp.

The second condition can greatly increase the efficiency of the rule mining process because all antecedent or consequent parts of ARs with supp < minsupp can be ignored. This check is needed because, when the itemset itself is infrequent, Apriori does not guarantee its subsets to be frequent.

Association rule mining research mostly concentrates on positive association rules. The negative association rule mining methods reported in the literature generally target market basket data or other numeric or structured data. The complexity of generating negative association rules from text data is thus twofold: dealing with the text itself and generating negative as well as positive association rules. As we demonstrate in later sections, negative association rule mining from text differs from discovering association rules in numeric databases, and identifying negative associations raises new problems, such as dealing with frequent itemsets of interest and the large number of infrequent itemsets involved. This necessitates the exploration of specific and efficient mining models for discovering positive and negative association rules from text databases.

4. Extraction of Association Rules

We present an algorithm for mining both positive and negative association rules from frequent and infrequent itemsets. The algorithm discovers a complete set of positive and negative association rules simultaneously. Few algorithms exist for mining association rules from both frequent and infrequent itemsets in textual datasets.

We can divide association rule mining into the following subproblems:
(1) finding the interesting frequent and infrequent itemsets in the database TD;
(2) finding positive and negative association rules from the frequent and infrequent itemsets obtained in the first step.

The mining of association rules appears to be the core issue; however, the generation and selection of interesting frequent and infrequent itemsets is equally important. We discuss the details of both below.

4.1. Identifying Frequent and Infrequent Itemsets

As mentioned above, the number of extracted items (both frequent and infrequent) from text datasets can be very large, with only a fraction of them being important enough for generating interesting association rules. Selection of the useful itemsets, therefore, is challenging. The support of an item is a relative measure with respect to the database/corpus size. Suppose that the support of an itemset X is 0.4 in 100 transactions; that is, 40% of the transactions contain the itemset. Now, if 100 more transactions are added to the dataset, and only 10% of the added transactions contain the itemset X, the overall support of X drops to 0.25; that is, 25% of the transactions (50 of 200) now contain it. Hence, we cannot rely on the support measure alone for the selection of important/frequent itemsets.

The difficulty of handling a large number of itemsets, which is the case when dealing with textual datasets, is most evident when dealing with infrequent itemsets, because the number of infrequent itemsets rises exponentially [1]. Therefore, we select from the collection of documents only those terms/items which have importance in the corpus. This is done using the IDF weighting method. We filter out words that either do not occur frequently enough or have a near constant distribution among the different documents. We use the top-N% (for a user-specified N; we used 60%) as the final set of keywords to be used in the text mining phase [25]. The algorithm sorts the keywords in descending order of their IDF scores.

4.2. Identifying Valid Positive and Negative Association Rules

By a valid association rule, we mean any expression of the form A ⇒ B, where A ⊂ I, B ⊂ I, A ∩ B = Φ, and A ∪ B ⊆ I, such that
(i) supp(A ∪ B) ≥ minsupp;
(ii) conf(A ⇒ B) ≥ minconf;
(iii) lift(A ⇒ B) > 1.
Let us consider an example from our medical "cancer" blogs text dataset, in which we analyze people's discussion of cancer and moles. In this dataset, the support of the itemset (mole ∪ cancer) is below the minimum support threshold; therefore, (mole ∪ cancer) is an infrequent itemset (inFIS), and mole ⇒ cancer cannot be generated as a valid rule under the support-confidence framework. Now, we try to generate a negative rule from this example: the support of (mole ∪ ¬cancer) exceeds the minimum support, and conf(mole ⇒ ¬cancer) = 87.5% exceeds the minimum confidence; therefore, we can generate mole ⇒ ¬cancer as a negative rule. Furthermore, lift(mole ⇒ ¬cancer) is much greater than 1, showing a strong relationship between the presence of a mole and the absence of cancer. We can therefore generate mole ⇒ ¬cancer as a valid negative rule, with 87.5% confidence and a strong association between the presence of a mole and the absence of cancer.

The above example clearly shows the importance of infrequent itemsets and the negative association rules generated from them, and their capability to track important implications/associations which would have been missed when mining only positive association rules.
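To see how such a confidence figure arises, consider hypothetical counts (ours, purely for illustration): if 200 transactions contain "mole" and 175 of those do not mention "cancer," then conf(mole ⇒ ¬cancer) = supp(mole ∪ ¬cancer)/supp(mole) = (175/|D|)/(200/|D|) = 0.875. From the lift definition, lift(mole ⇒ ¬cancer) = 0.875/(1 − supp(cancer)), which exceeds 1 whenever supp(cancer) > 0.125.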

4.3. Proposed Algorithm: Apriori FISinFIS (See Algorithm 1)

Input:  TD: Transaction (Document) Database; ms: minimum support threshold;
Output:  FIS: frequent itemsets; inFIS: infrequent itemsets;
(1) initialize FIS = Φ; inFIS = Φ;
(2) temp1 = top(N) IDF terms;          /* get all 1-itemsets in the top-N IDF items */
(2.1) FIS1 = {i | i ∈ temp1 and support(i) ≥ ms};    /* get frequent 1-itemsets */
(2.2) inFIS1 = temp1 − FIS1;             /* all infrequent 1-itemsets */
k = 2;                    /* initialize for itemsets of size greater than 1 */
(3) while (FISk−1 ≠ Φ) do begin
   (3.1) Ck = generate(FISk−1, ms);          /* candidate k-itemsets */
   (3.2) for each transaction t ∈ TD
     do begin               /* scan database TD */
     Ct = subset(Ck, t);           /* get the candidates contained in transaction t */
     for each candidate c ∈ Ct
       c.count++;            /* increase the count of an itemset if it exists in the transaction */
     end;
   (3.3) c.support = c.count/|TD|;             /* calculate support of each candidate k-itemset */
   (3.4) tempk = Ck;  /* store the candidate k-itemsets with their supports */
(4) FISk = {c | c ∈ tempk and c.support ≥ ms};       /* add c to frequent k-itemsets if supp(c) ≥ minsupp */
(5) inFISk = tempk − FISk;               /* infrequent k-itemsets, having support less than minsupp */
(6) FIS = FIS ∪ FISk;                   /* add generated frequent k-itemsets to FIS */
(7) inFIS = inFIS ∪ inFISk;               /* add generated infrequent k-itemsets to inFIS */
(8) k++;                     /* increment itemset size by 1 */
end;
(9) return FIS and inFIS;

The Apriori FISinFIS procedure generates all frequent and infrequent itemsets of interest in a given database TD, where FIS is the set of all frequent itemsets of interest in TD and inFIS is the set of all infrequent itemsets in TD. FIS and inFIS contain only frequent and infrequent itemsets of interest, respectively.

The initialization is done in Step (1). Step (2) generates temp1, all itemsets of size 1; in Step (2.1), we generate FIS1, all frequent itemsets of size 1, while, in Step (2.2), all infrequent itemsets of size 1 in database TD are collected in inFIS1 in the first pass of TD.

Step (3) generates FISk and inFISk for k ≥ 2 in a loop, where FISk is the set of all frequent k-itemsets whose support exceeds the user-defined minimum threshold in the kth pass of TD, and inFISk is the set of all infrequent k-itemsets whose support is below that threshold. The loop terminates when all the temporary itemsets have been tried, that is, when FISk−1 = Φ. Each pass of the database in Step (3), say pass k, consists of the following substeps.

Step (3.1) generates Ck, the candidate k-itemsets, where each k-itemset in Ck is generated by joining two frequent itemsets from FISk−1. The itemsets in Ck are counted in TD using a loop in Step (3.2). Step (3.3) calculates the support of each itemset in Ck, and Step (3.4) stores the generated itemsets in a temporary data structure. We used an implementation of "HashMap" in our experimentation as the temporary data structure.

Then FISk and inFISk are generated in Steps (4) and (5), respectively. FISk is the set of all potentially useful frequent k-itemsets in TD, which have support values greater than minsupp. inFISk is the set of all infrequent k-itemsets in TD, which have support values less than minsupp. FISk and inFISk are added to FIS and inFIS in Steps (6) and (7). Step (8) increments the itemset size. The procedure ends in Step (9), which outputs the frequent and infrequent itemsets in FIS and inFIS, respectively.
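For concreteness, the pass structure of Algorithm 1 can be sketched in Java as follows (a condensed paraphrase of ours, not the authors' exact implementation; candidate generation is shown as a plain join of frequent itemsets, without the usual subset-pruning refinement):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FisInFis {

    /* One Apriori-style loop that keeps both frequent (FIS) and
       infrequent (inFIS) itemsets, mirroring Algorithm 1. */
    static void mine(List<Set<String>> td, double ms,
                     Set<Set<String>> fis, Set<Set<String>> infis) {
        // Steps (2)-(2.2): split the 1-itemsets by the support threshold ms.
        Set<Set<String>> fisK = new HashSet<>();
        Set<String> vocabulary = new HashSet<>();
        for (Set<String> t : td) vocabulary.addAll(t);
        for (String item : vocabulary) {
            Set<String> one = Set.of(item);
            if (supp(td, one) >= ms) fisK.add(one); else infis.add(one);
        }
        fis.addAll(fisK);

        while (!fisK.isEmpty()) {                           // Step (3)
            Set<Set<String>> next = new HashSet<>();
            for (Set<String> a : fisK)                      // Step (3.1): join two
                for (Set<String> b : fisK) {                // frequent (k-1)-itemsets
                    Set<String> c = new HashSet<>(a);
                    c.addAll(b);
                    if (c.size() != a.size() + 1) continue; // keep only k-itemsets
                    // Steps (3.2)-(5): count support, then classify the candidate.
                    if (supp(td, c) >= ms) next.add(c); else infis.add(c);
                }
            fis.addAll(next);                               // Steps (6)-(7)
            fisK = next;                                    // Step (8): k++
        }
    }                                                       // Step (9): FIS, inFIS filled

    static double supp(List<Set<String>> td, Set<String> itemset) {
        long hits = td.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / td.size();
    }
}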

4.4. Algorithm: FISinFIS Based Positive and Negative Association Rule Mining

Algorithm 2 generates positive and negative association rules from both the frequent itemsets (FIS) and the infrequent itemsets (inFIS). Step (1) initializes the positive and negative association rule sets as empty. Step (2) generates association rules from FIS: in Step (2.1), positive association rules of the form A ⇒ B (or B ⇒ A), which have confidence greater than the user-defined threshold and lift greater than 1, are extracted as valid positive association rules. Step (2.2) generates negative association rules of the forms A ⇒ ¬B, ¬A ⇒ B, and so forth, which have confidence greater than the user-defined threshold and lift greater than 1 and are extracted as valid negative association rules (Figure 2).

Inputs:  min_sup: minimum support; min_conf: minimum confidence; FIS (frequent itemsets); inFIS (infrequent itemsets)
Output: PAR: set of +ve ARs; NAR: set of −ve ARs;
(1) PAR = Φ; NAR = Φ;
(2) /* generating all association rules from FIS (frequent itemsets). */
For  each itemset I in FIS
do begin
  for each itemset pair A, B such that A ∪ B = I and A ∩ B = Φ
  do begin
  (2.1) /* generate rules of the form A ⇒ B. */
    If  conf(A ⇒ B) ≥ min_conf and lift(A ⇒ B) > 1
      then output the rule A ⇒ B to PAR
    else
  (2.2) /* generate rules of the form A ⇒ ¬B, ¬A ⇒ B and ¬A ⇒ ¬B. */
      if  conf(A ⇒ ¬B) ≥ min_conf and lift(A ⇒ ¬B) > 1
        output the rule A ⇒ ¬B to NAR
      if  conf(¬A ⇒ B) ≥ min_conf and lift(¬A ⇒ B) > 1
        output the rule ¬A ⇒ B to NAR
      if  conf(¬A ⇒ ¬B) ≥ min_conf and lift(¬A ⇒ ¬B) > 1
        output the rule ¬A ⇒ ¬B to NAR
  end for;
end for;
(3) /* generating all association rules from inFIS. */
For each itemset I in inFIS
do begin
  For each itemset pair A, B such that A ∪ B = I, A ∩ B = Φ, supp(A) ≥ min_sup and supp(B) ≥ min_sup
  do begin
  (3.1) /* generate rules of the form A ⇒ B. */
    If  conf(A ⇒ B) ≥ min_conf and lift(A ⇒ B) > 1
      then output the rule A ⇒ B to PAR
    else
  (3.2) /* generate rules of the form A ⇒ ¬B, ¬A ⇒ B and ¬A ⇒ ¬B. */
      if  conf(A ⇒ ¬B) ≥ min_conf and lift(A ⇒ ¬B) > 1
        output the rule A ⇒ ¬B to NAR
      if  conf(¬A ⇒ B) ≥ min_conf and lift(¬A ⇒ B) > 1
        output the rule ¬A ⇒ B to NAR
      if  conf(¬A ⇒ ¬B) ≥ min_conf and lift(¬A ⇒ ¬B) > 1
        output the rule ¬A ⇒ ¬B to NAR
  end for;
end for;
(4) return PAR and NAR;

Step (3) generates association rules from inFIS: in Step (3.1), positive association rules of the form A ⇒ B (or B ⇒ A), which have confidence greater than the user-defined threshold and lift greater than 1, are extracted as valid positive association rules. Step (3.2) generates negative association rules of the forms A ⇒ ¬B, ¬A ⇒ B, or ¬A ⇒ ¬B, which have confidence greater than the user-defined threshold and lift greater than 1 and are extracted as valid negative association rules.
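A compact Java sketch of this generation loop (our own rendering; only the A ⇒ B and A ⇒ ¬B forms are shown, with the measures defined in Section 2) is given below:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RuleGenerator {

    /* Sketch of Algorithm 2 for the forms A => B and A => notB; the forms
       notA => B and notA => notB follow the same pattern. */
    static void rulesFrom(Set<String> itemset, List<Set<String>> td, boolean infrequent,
                          double minSupp, double minConf,
                          List<String> par, List<String> nar) {
        for (Set<String> a : properSubsets(itemset)) {
            Set<String> b = new HashSet<>(itemset);
            b.removeAll(a);                                 // consequent B = I \ A
            double sA = supp(td, a), sB = supp(td, b), sAB = supp(td, itemset);
            // Step (3): for infrequent itemsets, both subitems must be frequent.
            if (infrequent && (sA < minSupp || sB < minSupp)) continue;
            double conf = sAB / sA, lift = conf / sB;
            if (conf >= minConf && lift > 1) {              // Steps (2.1)/(3.1)
                par.add(a + " => " + b);
            } else {                                        // Steps (2.2)/(3.2)
                double confNeg = (sA - sAB) / sA;           // conf(A => notB)
                double liftNeg = confNeg / (1 - sB);        // lift(A => notB)
                if (confNeg >= minConf && liftNeg > 1)
                    nar.add(a + " => not" + b);
            }
        }
    }

    static List<Set<String>> properSubsets(Set<String> s) {
        List<String> items = new ArrayList<>(s);
        List<Set<String>> out = new ArrayList<>();
        for (int mask = 1; mask < (1 << items.size()) - 1; mask++) {
            Set<String> sub = new HashSet<>();
            for (int j = 0; j < items.size(); j++)
                if ((mask & (1 << j)) != 0) sub.add(items.get(j));
            out.add(sub);                                   // every nonempty proper subset
        }
        return out;
    }

    static double supp(List<Set<String>> td, Set<String> itemset) {
        long hits = td.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / td.size();
    }
}

The boolean flag distinguishes the FIS and inFIS cases: for infrequent itemsets, the subitem-frequency check of Step (3) is applied before any rule is evaluated.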

5. Discovering Association Rules among Frequent and Infrequent Items

Mining positive association rules from frequent itemsets is a relatively trivial task and has been extensively studied in the literature. Mining negative association rules of the forms A ⇒ ¬B, ¬A ⇒ B, or ¬A ⇒ ¬B from textual datasets, however, is a difficult task when the itemset (A ∪ B) is nonfrequent. A database (corpus) has an exponential number of nonfrequent itemsets; therefore, negative association rule mining requires the examination of a much larger search space than positive association rule mining.

In this work, we propose that, for itemsets occurring frequently, given the user-defined min-sup, the subitems can be negatively correlated, leading to the discovery of negative association rules. Similarly, infrequent itemsets may have subitems with a strong positive correlation, leading to the discovery of positive association rules.

Let I be the set of items in database TD, and let A, B ⊆ I be itemsets such that A ∩ B = Φ. The thresholds minsup and minconf are given by the user.

5.1. Generating Association Rules among Frequent Itemsets

See Algorithm 3.

Given:  supp(A ∪ B) ≥ minsupp,
If    conf(A ⇒ B) ≥ minconf, and lift(A ⇒ B) > 1
   then A ⇒ B is a valid positive rule, having the required minimum confidence, and there is a positive correlation between
   rule items A and B.
Else if    conf(A ⇒ B) < minconf, and lift(A ⇒ B) < 1
   then A ⇒ B is not a valid positive rule, having lower than the required minimum confidence, and there is a negative
   correlation between rule items A and B. Therefore, we try to generate a negative association rule from itemset (A ∪ B).
If    conf(A ⇒ ¬B) ≥ minconf, and lift(A ⇒ ¬B) > 1
   then A ⇒ ¬B is a valid negative rule, having the required minimum confidence, and there is a positive correlation between
   rule items A and ¬B.

5.2. Generating Association Rules among Infrequent Itemsets

For brevity, we only consider one form of association from each of the positive and negative types, that is, A ⇒ B and A ⇒ ¬B; the other forms can be extracted similarly.

In the descriptions of association rules shown in Algorithms 3 and 4, the condition supp(A ∪ B) ≥ minsupp guarantees that the association rule describes the relationship among items of a frequent itemset, whereas supp(A ∪ B) < minsupp guarantees that the association rule describes the relationship among items of an infrequent itemset; in the latter case, however, the subitems of the itemset must themselves be frequent, as enforced by the conditions supp(A) ≥ minsupp and supp(B) ≥ minsupp. The interestingness measure, lift, has to be greater than 1, articulating a positive dependency among the itemsets; a lift value of less than 1 articulates a negative relationship among the itemsets.

Given:  supp(A ∪ B) < minsupp, and
   supp(A) ≥ minsupp, and
   supp(B) ≥ minsupp,
If   conf(A ⇒ B) ≥ minconf, and
lift(A ⇒ B) > 1
   then A ⇒ B is a valid positive rule, having the required minimum confidence, and there
   is a positive correlation between rule items A and B.
Else If
conf(A ⇒ ¬B) ≥ minconf, and
lift(A ⇒ ¬B) > 1
   then A ⇒ ¬B is a valid negative rule, having the required minimum confidence, and there
   is a positive correlation between rule items A and ¬B.
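Both cases rely on the standard identities for negative-rule measures, so that only positive-itemset supports ever need to be counted:

conf(A ⇒ ¬B) = (supp(A) − supp(A ∪ B))/supp(A),   lift(A ⇒ ¬B) = conf(A ⇒ ¬B)/(1 − supp(B)).

Algorithm 3 applies these tests when supp(A ∪ B) ≥ minsupp, while Algorithm 4 applies them when supp(A ∪ B) < minsupp but supp(A) ≥ minsupp and supp(B) ≥ minsupp.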

The algorithm generates a complete set of positive and negative association rules from both frequent and infrequent itemsets. Frequent itemsets have traditionally been used to generate positive association rules; however, we argue that the items in a frequent itemset can be negatively correlated. As an illustration, suppose minsupp = 0.2 and minconf = 0.6, and let supp(A) = 0.7, supp(B) = 0.8, and supp(A ∪ B) = 0.25. The itemset (A ∪ B) is frequent, since supp(A ∪ B) ≥ minsupp; yet conf(A ⇒ B) = 0.25/0.7 ≈ 0.36 < minconf, and lift(A ⇒ B) ≈ 0.36/0.8 ≈ 0.45 < 1, indicating a negative correlation between A and B. At the same time, conf(A ⇒ ¬B) = (0.7 − 0.25)/0.7 ≈ 0.64 ≥ minconf and lift(A ⇒ ¬B) ≈ 0.64/0.2 ≈ 3.2 > 1, so A ⇒ ¬B qualifies as a valid negative rule.

The above example clearly shows that itemsets, despite being frequent, can have negative relationships among their item subsets. Therefore, in Step (2.2) of Algorithm 2, we try to generate negative association rules using frequent itemsets.

The infrequent itemsets, on the other hand, have either been ignored entirely when generating associations or used mostly to generate only negative association rules. However, infrequent itemsets can carry valid and important positive association rules with high confidence and a strongly positive correlation: an itemset (A ∪ B) may fall below minsupp while its subitems A and B are individually frequent, conf(A ⇒ B) is high, and lift(A ⇒ B) > 1.

We can see from such an example that an itemset, in spite of being infrequent, can have a very strong positive association among its items, in our dataset reaching 100% confidence with a positive correlation. Our proposed algorithm covers the generation of such rules, as explained in Step (3) of Algorithm 2.

6. Experimental Results and Discussions

We performed our experiments on medical blog datasets, mostly authored by patients writing about their problems and experiences. The datasets were collected from the following medical blog sites:
(i) Cancer Survivors Network [http://csn.cancer.org/];
(ii) Care Pages [http://www.carepages.com/forums/cancer].

The blog text was preprocessed before experimentation (stop word removal, stemming/lemmatization, nonmedical word removal, etc.). After preprocessing, we assigned weights to terms/items using the IDF scheme, in order to select only the important and relevant terms/items in the dataset. The main parameters of the databases are as follows:
(i) the total number of blogs (i.e., transactions) used in this experimentation was 1926;
(ii) the average number of words (attributes) per blog (transaction) was 145;
(iii) the smallest blog contained 79 words;
(iv) the largest blog contained 376 words;
(v) the total number of words (i.e., attributes) was 280254 before stop word removal;
(vi) the total number of words (i.e., attributes) was 192738 after stop word removal;
(vii) the total number of words selected using the top-N% of IDF-ranked words was 81733;
(viii) the algorithm is implemented in Java.

Table 2 summarizes the number of itemsets generated with varying minsup values. We can see that the number of frequent itemsets decreases as the minsup value increases, while a sharp increase in the number of infrequent itemsets can be observed. This can also be seen in Figure 1.

Table 3 gives an account of the experimental results for different values of minimum support and minimum confidence. The lift value has to be greater than 1 for a positive relationship between the itemsets; the resulting rule, however, may itself be positive or negative. The total numbers of positive and negative rules generated from both frequent and infrequent itemsets are given.

Although the experimental results depend greatly on the datasets used, they still demonstrate the importance of the IDF factor in selecting the frequent itemsets, along with the generation of negative rules from frequent itemsets and the extraction of positive rules from infrequent itemsets. The negative rules generated greatly outnumber the positive rules, not only because infrequent itemsets far outnumber frequent itemsets but also because the proposed approach finds negative correlations between frequent itemsets, leading to the generation of additional negative association rules.

The frequent and infrequent itemset generation using the Apriori algorithm takes only a little extra time compared to traditional frequent itemset mining with Apriori. This is because each item's support is already calculated for checking against the threshold support value; therefore, we obtain the infrequent itemsets in the same pass in which we obtain the frequent ones. However, the processing of frequent and infrequent itemsets for the generation of association rules differs. Frequent itemsets generated through the Apriori algorithm have the inherent property that their subsets are also frequent; we cannot guarantee this for the infrequent itemsets. Thus, when generating association rules among infrequent itemsets, we impose an additional check that their subsets are frequent.

Research on mining association rules among both frequent and infrequent itemsets has been scarce, especially for textual datasets. We have proposed an algorithm which can extract both types of association rules, positive and negative, among both frequent and infrequent itemsets. We give a sample of all four types of association rules extracted using the algorithm.

Table 4 gives a summary of the generated association rules; all four types are illustrated. Table 4(a) shows a sample of positive association rules generated from the frequent itemsets. Table 4(b) shows negative association rules generated from the frequent itemsets, a combination that had not previously been explored by the research community. A sample of positive association rules generated from the infrequent itemsets is presented in Table 4(c); this type of association rule is potentially useful, and researchers are interested in extracting them. To the best of our knowledge, no prior research had extracted positive association rules from infrequent itemsets in textual data. Table 4(d) shows negative association rules generated from the infrequent itemsets.

7. Concluding Remarks and Future Work

Identification of associations among symptoms and diseases is important in diagnosis. The field of negative association rule (NAR) mining holds enormous potential to help medical practitioners in this regard. Both positive and negative association rule mining (PNARM) can hugely benefit the medical domain. Positive and negative associations among diseases, symptoms, and laboratory test results can help a medical practitioner reach a conclusion about the presence or absence of a possible disease. There is a need to minimize errors in diagnosis and maximize the possibility of early disease identification by developing a decision support system that takes advantage of NARs. A positive association rule such as Flu ⇒ Headache can tell us that Headache is experienced by a person who is suffering from Flu. Conversely, a negative association rule such as ¬Throbbing ⇒ ¬Migraine tells us that if the Headache experienced by a person is not Throbbing, then he may not have Migraine, with a certain degree of confidence. The applications of this work include the development of medical decision support systems, among others, by finding associations and dissociations among diseases, symptoms, and other health-related terminologies. The current algorithm does not account for the context and semantics of the terms/items in the textual data. In the future, we plan to incorporate the context of the features into our work, in order to improve the quality and efficacy of the generated association rules.

In this paper, contributions to NARM research were made by proposing an algorithm for efficiently generating negative association rules along with positive association rules. We have proposed a novel method that captures negative associations among frequent itemsets and also extracts positive associations among infrequent itemsets, whereas traditional association rule mining algorithms have focused on frequent itemsets for generating positive association rules and have used infrequent itemsets only for the generation of negative association rules. The experimental results demonstrate that the proposed approach is effective, efficient, and promising.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.