Abstract

In the associative classification method, rules generated by association rule mining are converted into classification rules. The concept of association rule mining can be extended to the web mining environment to find associations between web pages visited together by internet users in their browsing sessions. Weighted fuzzy association rule mining techniques can find natural associations between items by considering the significance of their presence in a transaction. The significance of an item in a transaction is usually referred to as the weight of the item in that transaction, and finding associations between such weighted items is called fuzzy weighted association rule mining. In this paper, we present a novel web classification algorithm that uses the principles of fuzzy association rule mining to classify web pages into different web categories, depending on the manner in which they appear in user sessions. The results are represented in the form of classification rules, and these rules are compared with the results generated using the well-known Boolean Apriori association rule mining algorithm.

1. Introduction

Classification is a Data Mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants in a bank as low, medium, or high credit risks. A classification task begins with a data set in which the class assignments are known. A classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period of time. In addition to the historical credit rating, the data might track employment history, home ownership or rental, years of residence, number and type of investments, and so on. Credit rating would be the target, the other attributes would be the predictors, and the data for each customer would constitute a case. Classification techniques include decision trees, association rules, fuzzy systems, and neural networks. Classification has many applications in customer segmentation, business modeling, marketing, credit analysis, web mining and biomedical, and drug response modeling.

Classification models include decision trees, Bayesian models, association rules, and neural nets. Although association rules have been predominantly used for data exploration and description, the interest in using them for prediction has rapidly increased in the Data Mining community. When classification models are constructed from rules, they are often represented as a decision list (a list of rules where the order of the rules corresponds to their significance). Classification rules are of the form $X \rightarrow C$, where $X$ is a pattern in the training data and $C$ is a predefined class label (target) [1]. Association rule based classification was introduced by Liu et al. [2]. An association rule mining algorithm such as Apriori can be used for generating rules, and a second algorithm is used for building the classifier. The rules generated in this way are called classification association rules (CARs), as they have a predefined class label or target. From the generated CARs, a subset is selected based on the heuristic criterion that the subset of rules can classify the training set accurately.

Servers register a Web log entry for every single access they get, in which important pieces of information about the access are recorded, including the URL requested, the IP address from which the request originated, and a time-stamp. Applying Data Mining techniques to this web log data can reveal much interesting knowledge about the web users [3]. These web log data show the information accessed by the users and give their surfing patterns. When Data Mining techniques are applied to these logs to extract hidden patterns between the URLs requested by the users [4], the process is commonly known as Web Usage Mining. In recent years there has been an increasing interest and a growing body of work in Web usage mining [5] as an underlying approach for capturing and modeling Web user behavioral patterns and for deriving e-business intelligence. Web usage mining techniques rely on offline pattern discovery from user transactions. These techniques can be used to improve Web personalization based on historic browsing patterns. Association rule mining can bring out precise information about a user’s navigational behavior. When we apply association rule mining techniques to a web log file, the result will be of the form $A \rightarrow B$, where $A$ and $B$ are URLs [6]. It means that if a user accesses URL $A$, then the user is likely to access URL $B$ as well. The user’s navigational pattern information can be used in predictive prefetching of pages and web personalization. Development of such recommendation systems has become an active research area. Some recent studies have considered the use of association rule mining [7] in recommender systems [8, 9].
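As a small illustration of how a rule between two URLs can be scored, the following sketch computes the standard Boolean support and confidence of a rule over a handful of sessions. The session contents, the URL names, and the function name `rule_support_confidence` are hypothetical and serve only as an example.

```python
from typing import Iterable, Sequence, Tuple


def rule_support_confidence(
    sessions: Iterable[Sequence[str]], a: str, b: str
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule a -> b over the sessions."""
    total = 0      # number of sessions seen
    with_a = 0     # sessions that contain URL a
    with_both = 0  # sessions that contain both a and b
    for session in sessions:
        pages = set(session)  # Boolean view: only presence matters
        total += 1
        if a in pages:
            with_a += 1
            if b in pages:
                with_both += 1
    support = with_both / total if total else 0.0
    confidence = with_both / with_a if with_a else 0.0
    return support, confidence


# Hypothetical sessions, each a list of requested URLs.
sessions = [
    ["/home", "/news", "/sports"],
    ["/home", "/news"],
    ["/home", "/weather"],
    ["/news", "/sports"],
]
support, confidence = rule_support_confidence(sessions, "/home", "/news")
print(support, confidence)
```

Here two of the four sessions contain both URLs, and two of the three sessions that contain `/home` also contain `/news`.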

In this work, association rule mining techniques are used for web classification based on navigational patterns. A novel web classification algorithm is presented here, which is developed on the foundations of fuzzy association rule mining techniques. The concepts of weighted fuzzy transactions and the fuzzy support and confidence framework are used to derive this algorithm. This associative classification algorithm finds the longest possible access sequence patterns that lead to a web category. Here, each web category is considered as a class label. The identified classification rules can later be used for web personalization and predictive prefetching. The Boolean Apriori algorithm is also used in the same framework to find access sequences that lead to a particular web category as the consequent. The results are compared, and it is found that the new algorithm identifies more natural patterns.

In Data Mining area, general classification algorithms were designed to deal with transaction-like data. Such data has a different format from the sequential data, where the concept of an attribute has to be carefully considered. The association-rule representation is an extensively studied topic in Data Mining. Association rules were proposed to capture the co-occurrence of buying different items in a supermarket shopping. It is natural to use association rule generation to relate pages that are most often referenced together in a single server session [6]. In the association rule mining literature, weights of items are treated as insignificant until recently and a common weight of one (1) is assigned as a common practice. Some of the very recent approaches generalize this and give item weights to reflect their significance to the user. In weighted association rule mining, the weights may be as a result of particular promotions for the items or their profitability, and so forth, [10]. Fuzzy weighted support, confidence, and transactions are also defined in a fuzzy association rule framework [11, 12]. The concepts and methods used in weighted association rule mining can be extended to web mining [13].

Muyeba et al. [12] presented a novel approach for effectively mining weighted fuzzy association rules [14]. The authors address the issue of weight of each item according to its significance with respect to some user defined criteria. Most works on weighted association rule mining do not address the downward closure property while some make assumptions to validate the property. This paper generalizes the weighted association rule mining problem with binary and fuzzy attributes with weighted settings. This methodology follows an Apriori approach but employs T-tree data structure to improve efficiency of counting item sets. The authors’ approach avoids preprocessing and postprocessing as opposed to most weighted association rule mining algorithms, thus eliminating the extra steps during rules generation. The paper also presents experimental results on both synthetic and real-data sets and a discussion on evaluating the proposed approach.

In Boolean Apriori algorithm, all the products are treated uniformly, and all the rules are mined based on the occurrences of the products. However, in the social science research, the analysts may want to mine the rules based on the importance of the products, items or attributes. For example, total income attribute is more interesting than the height of a person in a household. Based on this generalized idea [15, 16], the items are given weights to reflect the importance to the users. The downward closure property of the support measure in the mining of association rules no longer exists in this approach. Here, they make use of a metric, called support bounds, in the mining of weighted fuzzy association rules. Furthermore, the authors introduce a simple sample method and the data maintenance method, based on the statistical approach, to mine the rules.

Mobasher et al. [4] proposed an effective and scalable technique for Web personalization based on association rule discovery from usage data. Here, the association rules are used for the development of a recommender system. In this work they proposed a scalable framework for recommender systems using association rule mining from click stream data. The recommendation algorithm utilizes a special data structure to produce recommendations efficiently in real-time, without the need to generate all association rules from frequent item sets. This method can overcome some of the limitations of low coverage resulting from high support thresholds or larger user histories and reduced accuracy due to the sparse nature of the data.

Suneetha and Krishnamoorti [17] suggested an improved version of the Apriori algorithm to extract interesting correlations, frequent patterns, and associations among web pages visited by users in their browsing sessions. In order to reduce repetitive disk reads, a novel top-down approach is proposed in that paper. The improved version of the Apriori algorithm greatly reduces the number of database scans and avoids the generation of unnecessary patterns, which reduces time and space consumption. Kumar and Rukmani [18] used the Apriori algorithm for web usage mining, focusing in particular on discovering the web usage patterns of websites from server log files. In that work, the memory usage and time usage of the Apriori algorithm are compared with those of the frequent pattern growth algorithm.

Ramli generated the usage patterns of a university E-Learning portal (UUM Educare) using the basic association rule algorithm, Apriori [19]. Server log files are used with the Apriori algorithm to produce the final results. Here, the web usage mining approach is combined with the basic association rule algorithm, Apriori, to optimize the content of the university E-Learning portal. The authors identified several Web access patterns by applying the well-known Apriori algorithm to the access log file data of this educational portal. The results include descriptive statistics and association rules for the portal, with support and confidence, to represent the Web usage and user behavior for UUM Educare. The results and findings of this experimental analysis can be used by Web administrators and content developers to plan upgrades and enhancements to the portal presentation.

Mei-Ling Shyu and Shu-Ching Chen proposed a new approach for mining user access patterns. The approach aims at predicting Web page requests on the website in order to reduce the access time and to assist the users in browsing within the website [20]. To capture the user access behavior on the website, an alternative structure of the Web is constructed from user access sequences obtained from the server logs, as opposed to static structural hyperlinks. Their approach consists of two major steps. First, the shortest path algorithm in graph theory is applied to find the distances between Web pages. In order to capture user access behavior on the Web, the distances are derived from user access sequences, as opposed to static structural hyperlinks. They refer to these distances as minimum reaching distance (MRD) information. The association rule mining (ARM) technique is then applied to form a set of predictive rules which are further refined and pruned by using the MRD information. In this paper, finally they propose a new method for mining user access patterns that allows the prediction of multiple nonconsecutive Web pages, that is, any pages within the website.

Srivastava et al. [21] proposed a data mining technique for finding frequently used web pages. These pages may be kept in a server’s cache to speed up web access. Existing techniques for selecting pages to be cached do not capture a user’s surfing patterns correctly. Here, they use a weighted association rule (WAR) mining technique that finds pages of the user’s current interest and caches them to give faster net access [5]. This approach captures both the user’s habit and interest, as compared to other approaches where the emphasis is only on habit. For example, suppose user A logs on to the Internet every day to read news and check emails, visiting googlenews.com and gmail.com in any order. In this case, association rule mining would give the rule (User A, googlenews.com) → (User A, gmail.com), and gmail can be pre-fetched to the cache to reduce the access time.

Among these classification methods in Data Mining, Association rule mining is simple and effective in classification. In fact rules generated from association rule mining can be easily converted to classification rules so it becomes a natural choice for classification in Data Mining. This technique is known as associative classification [1]. In the research work [22], the author focused on the construction of classification models based on association rules. In order to mine only rules that can be used for classification, the well-known association rule mining Apriori algorithm is modified to handle user-defined input constraints. Using this characterization, a classification system is implemented based on association rules. In this work, the performance of this classification method is compared with the performance of several model construction methods, including CBA (classification based on association). This classification algorithm mines for the best possible rules above a user-defined minimum confidence and within a desired range for the number of rules.

3. Finding Weighted Associations from Web Logs

We model the pieces of Web logs as sequences of events to find the associations between web pages on the basis of sequential patterns over a period of time. Each sequence is represented as an ordered list of discrete symbols, and each symbol represents one of several possible categories of web pages requested by the user. Let $E$ be a set of events. A Web log piece or (Web) access sequence $S = e_1 e_2 \cdots e_n$ ($e_i \in E$ for $1 \le i \le n$) is a sequence of events, while $n$ is called the length of the access sequence. An access sequence with length $n$ is also called an $n$-sequence. In an access sequence $S$, repetition is allowed. Duplicate references to a page in a web access sequence imply back traversals, refreshes, or reloads [6, 23]. For example, 1, 1, 2 and 1, 2 are two different access sequences, in which 1 and 2 are two events. Figure 1 shows a sample of such sequences. The data we used for the experiment come from Internet Information Server (IIS) logs for msnbc.com and news-related portions of msn.com for one entire day. Each sequence in the data set corresponds to the page views of a user during that day. There are 1 million records, and we selected 64,000 samples. Each event in the sequence corresponds to a request for a page. Requests are recorded only at the level of page category. There are 16 categories of pages, and these categories are given numeric codes from 1 to 16. The pages are assigned to one of these categories based on their content. The categories are front page (1), news (2), technology (3), local (4), opinion (5), on air (6), miscellaneous (7), weather (8), health (9), living (10), business (11), sports (12), summary (13), bbs (14), travel (15), and msn news (16). Although other information pertaining to the web access is available, we model only the categories of page requests.

When we try to find the hidden patterns in web access sequences using Boolean association, we can consider only the presence or absence of pages in an access sequence. No importance is given to the number of occurrences of a category of web pages in a sequence. If a particular category of web page appears several times in succession, such occurrences should also be given more significance. If we use weighted fuzzy association rule mining algorithms instead of the Apriori algorithm, we will be able to model the web sequences in a way that reveals more natural patterns by considering all of the above facts.

From Figure 1 it is clear that the weight of a category of web page in a browsing session of a user can be directly associated with the number of times the user visits that particular category of pages in his session. Again if the user is continuously visiting the same category of pages then more weight has to be given for such continuous accessing of same category of pages. Considering the above facts we define the following concepts for the development of the new web classification algorithm.

3.1. Definition  1. Fuzzy Weight of Web Page

The fuzzy weight of a web page is defined by considering the number of co-occurrences of a web category. The weight is calculated by expression (1) from the following quantities. Let $w_i$ denote the fuzzy weight of the $i$th category of web pages in a session. We assume that the sequence contains $k$ subgroups (runs of consecutive pages) of the $i$th category. Then $n_{ij}$ is the number of successive $i$th category web pages appearing in the $j$th group, and $n_i$ is the total number of $i$th category pages appearing in the session under consideration. Finally, $N$ is the total number of pages in the browsing session. The weight thus generated can exceed one in some cases, so the values are normalized by dividing each weight by the maximum weight in the session. A portion of the values is given in Table 1.
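The ingredients of the weight (runs of consecutive visits per category, the session length, and normalization by the session maximum) can be sketched as follows. The exact weight expression is not reproduced here, so the raw weight used below, the sum of squared run lengths divided by the session length, is only a hypothetical stand-in chosen because it grows with the number of visits, rewards consecutive visits, and can exceed one before normalization. The function names are likewise assumptions.

```python
from itertools import groupby
from typing import Dict, List


def run_lengths(session: List[int]) -> Dict[int, List[int]]:
    """Map each category to the lengths of its runs of consecutive visits."""
    runs: Dict[int, List[int]] = {}
    for category, group in groupby(session):
        runs.setdefault(category, []).append(len(list(group)))
    return runs


def fuzzy_weights(session: List[int]) -> Dict[int, float]:
    """Normalized weight per category (ASSUMED raw form, see lead-in)."""
    if not session:
        return {}
    total_pages = len(session)
    # Hypothetical raw weight: squared run lengths favor continuous visits.
    raw = {
        category: sum(n * n for n in lengths) / total_pages
        for category, lengths in run_lengths(session).items()
    }
    top = max(raw.values())
    # Normalize by the maximum weight in the session, as described in the text.
    return {category: weight / top for category, weight in raw.items()}


example = [1, 1, 2, 1, 3, 3, 3]  # category codes of one browsing session
print(run_lengths(example))
print(fuzzy_weights(example))
```

In this example category 3 is visited three times in a row and therefore receives the maximum normalized weight of one, even though category 1 is visited the same number of times in two separate runs.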

3.2. Definition  2. Web Class of a Session

The class of a web sequence is defined as the web category in the sequence with maximum weight. It is assumed that, in a web access sequence, the remaining page visits lead to this web category with maximum weight. In an association rule framework, the page category with maximum weight can be considered as the consequent and the remaining pages as the antecedents. A web class for an access sequence $S$ is defined as $C(S) = \arg\max_{i} w_i$, where $w_i$ is the fuzzy weight of the $i$th web category in $S$. This concept therefore works like a data mining classification problem, where the available sequence patterns are classified into groups leading to a particular class, and this information can later be used to predict user behavior in browsing sessions. Using the weight expression given above, all the web pages in the access patterns are converted into their corresponding weights. In this new approach, the web pages visited in a session are given weights after considering the number of visits in that session and the extent of continuous visits to the same category of pages. Since we have sixteen categories of web pages, a new database table is created with sixteen attributes such that each attribute corresponds to a web category. All the sequences are converted into this fixed database table format with matching weights for each category. The weights obtained in each session are normalized, as shown in Table 1, so that they lie within the range of zero to one.
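The conversion of a session into the fixed sixteen-attribute record and the selection of its web class can be sketched as below. The weights in the example are hypothetical values, and the tie-breaking rule (lowest category code wins) is an assumption that the text does not specify.

```python
from typing import Dict, List

NUM_CATEGORIES = 16  # category codes 1..16, as in the data set described above


def to_row(weights: Dict[int, float]) -> List[float]:
    """Fixed-width record: position c-1 holds the weight of category c."""
    # Categories that were not visited in the session get weight 0.0.
    return [weights.get(c, 0.0) for c in range(1, NUM_CATEGORIES + 1)]


def web_class(row: List[float]) -> int:
    """Category code with maximum weight (ties go to the lowest code)."""
    best_index = max(range(len(row)), key=lambda idx: row[idx])
    return best_index + 1


# Hypothetical normalized weights for one session: news (2) dominates.
weights = {1: 0.4, 2: 1.0, 12: 0.25}
row = to_row(weights)
print(row)
print(web_class(row))
```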

4. The Fuzzy Web Classification Algorithm (FWCA)

With the concept of web page weight and web access class, now the new algorithm for web classification can be derived. The algorithm has the following steps.

Step 1. Convert all the web pages in access sequences to corresponding fuzzy weights in a fixed database table format.

Step 2. Sort each web sequence in the descending order of weights and select the web page with maximum weight as the class (consequent) of a sequence.

Step 3. For each access sequence, the remaining pages (other than the consequent) are included into a classification rule sequence as long as the product of the weights is greater than a given support threshold.

Step 4. Select only those rules having a confidence value (associated with the number of times such rule sequences exist in the entire set of web access sequences) greater than the user-specified threshold. The confidence of the $i$th rule for the $j$th web category is defined in terms of three quantities: $s_{ij}$, the support of the $i$th rule for the $j$th web category; $n_{ij}$, the number of times the $i$th rule is identified for the $j$th web category; and $N_j$, the total number of rules with $j$ as the web class.

Algorithm 1 gives the detailed pseudocode. In the algorithm, $W[n, m]$ holds the weights of the $m$ page categories for the $n$ browsing sessions. Each weight is calculated using expression (1). Sort is used to sort the $i$th access sequence on the basis of the weights of its web pages. The function Max returns the web page category whose weight is the maximum in an access sequence. Finally, Seq stores the rules generated. The algorithm finds all fuzzy weighted classification rules from the web access sequences for user-specified support ($\alpha$) and confidence ($\beta$) threshold values.

Algorithm FWCA (n, m, α, β)
{
// n is the number of access sequences
// m is the number of web categories
// W[n, m] is the weights of web pages for sessions
// Seq is the rules generated
for i = 1 to n
{
 for j = 1 to m
 {
    W[i, j] = weight(i, j)
 }
}
Sort(W)
rule = 1
for i = 1 to n
 {
   k = 1
   support = Max(W[i])
   while support ≥ α
 {
  wm = Max(W[i])
  Seq[rule, k] = Addwebtype(wm)
  support = support * wm
  Delete(wm)
  k = k + 1
 }
 rule = rule + 1
 }
for i = 1 to rule − 1
 {
 if confidence(Seq[i]) > β
 print(Seq[i])
 }
}
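The steps of the algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each session is assumed to be already converted to normalized weights (a dictionary from category code to weight), an antecedent is admitted only while the running product of weights stays at or above alpha, and the confidence of a rule is assumed to be the share of rules for the same web class that are identical to it. The names `session_rule` and `fwca` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Rule = Tuple[Tuple[int, ...], int]  # (antecedent categories, consequent category)


def session_rule(weights: Dict[int, float], alpha: float) -> Rule:
    """Build one rule from a session's normalized category weights."""
    # Sort by descending weight; ties are broken by category code (assumption).
    ranked = sorted(weights.items(), key=lambda cw: (-cw[1], cw[0]))
    consequent, support = ranked[0]  # the maximum-weight category is the class
    antecedents: List[int] = []
    for category, weight in ranked[1:]:
        if support * weight < alpha:  # product would fall below the threshold
            break
        support *= weight
        antecedents.append(category)
    return tuple(antecedents), consequent


def fwca(sessions: List[Dict[int, float]], alpha: float, beta: float) -> Dict[Rule, float]:
    """Return the rules whose (assumed) confidence exceeds beta."""
    rules = [session_rule(w, alpha) for w in sessions if w]
    rule_counts = Counter(rules)
    class_counts = Counter(consequent for _, consequent in rules)
    kept: Dict[Rule, float] = {}
    for rule, count in rule_counts.items():
        # ASSUMED confidence: identical rules / all rules with the same class.
        confidence = count / class_counts[rule[1]]
        if confidence > beta:
            kept[rule] = confidence
    return kept


# Hypothetical normalized weights for four sessions.
sessions = [
    {1: 1.0, 2: 0.8, 3: 0.1},
    {1: 1.0, 2: 0.9, 4: 0.2},
    {1: 1.0, 6: 0.7},
    {2: 1.0, 1: 0.6, 3: 0.5},
]
for (antecedents, consequent), confidence in fwca(sessions, alpha=0.25, beta=0.5).items():
    print(antecedents, "->", consequent, round(confidence, 2))
```

Each session contributes exactly one rule in a single pass, and low-weight categories such as 3 in the first session are left out because they would push the product of weights below the support threshold.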

In the algorithm, the weights of each category of web pages are calculated and stored in the two-dimensional array $W$ for all available sequences. In the next step, the weights are sorted in decreasing order for every sequence. In each sequence, the web category with maximum weight is therefore placed first, and it is considered as the consequent of that sequence. The category with the next highest weight is placed in the second position, and so on. In each sequence the web categories are thus arranged in decreasing order of their importance. The web page categories are included in the rules (as antecedents in the association rule) in the order in which they are arranged.

We have a variable support whose initial value is the weight of the consequent. When a web category is included in the rule, its weight is multiplied with the support value. When the support becomes smaller than the given threshold alpha ($\alpha$), rule generation for that user session stops. The same process is then applied to the next user session sequence, and so on for all the available user session sequences, so that finally there are $n$ rules, one per session. In the next step, the global confidence of the rules is checked.

Since the web categories are included in the rules as antecedents in descending order of their weights, only the most significant web categories in a user session are included in the rule generated for that session. Once all rules are generated, their global confidence count is found (the number of times the same antecedent-consequent sequence appears across all sessions). All rules with confidence lower than the user-specified threshold beta ($\beta$) are removed from the set of rules. By applying the weighted association rule mining approach to the web mining problem, we identified fifty-five rules from the 64,000 sessions; these rules are given in Table 2.

5. Finding Web Page Associations Using Apriori Algorithm

To find the Boolean associations between co-occurring web pages, we first converted all the web sequences into true or false values. Since there are sixteen categories of web pages, here also we designed a database with sixteen fields. The presence or absence of a web page in a sequence is represented with a true or false value in the corresponding field of the record representing that web sequence. In this approach, we consider only the presence of a web page in the sequence and give no significance to the number of occurrences of the page in the sequence. With this preprocessing, the web sequence database is converted into a purely Boolean database with only true or false values indicating the presence or absence of web pages in the user sessions. Table 3 shows a portion of the Boolean database generated.
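The Boolean preprocessing can be illustrated with a short sketch. The function name `to_boolean_row` is hypothetical; the example shows that two different access sequences collapse to the same record once repetition is ignored.

```python
from typing import List

NUM_CATEGORIES = 16  # one Boolean field per web category (codes 1..16)


def to_boolean_row(session: List[int]) -> List[bool]:
    """True in position c-1 if category c occurs at least once in the session."""
    present = set(session)
    return [c in present for c in range(1, NUM_CATEGORIES + 1)]


# The sequences 1, 1, 2 and 1, 2 differ, but their Boolean records do not.
print(to_boolean_row([1, 1, 2]) == to_boolean_row([1, 2]))
```

This loss of repetition and ordering information is what the weighted representation is intended to avoid.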

To apply the Apriori algorithm to the new dataset, we used the IBM SPSS Modeler 14.1 data mining software. Table 4 lists the rules identified using the Apriori algorithm. By applying the Apriori algorithm, 45 rules were identified with a support value of 5 and a confidence of 20. Among these rules, only seven have more than one antecedent.

Comparing the rules generated by the two methods shows that the fuzzy weighted approach to associative classification using the proposed algorithm is superior to the Boolean Apriori algorithm in terms of the coverage of the rules and the inclusion of web categories in the classification rules. With the Apriori algorithm we obtained two antecedents in only five cases and a single antecedent in all the remaining cases, whereas with the weighted approach many rules have more than three web categories as antecedents.

6. Discussions

Web server log files contain a repository of web browsing information about internet users. Mining this data collection can bring out valuable information about the web access patterns of users. When we apply classification and prediction techniques in a web usage mining environment, the access patterns of web users can be predicted. The data used in this work contain sixteen web categories and 64,000 samples of web access sequences involving these web categories. Some of the web categories are so popular among web users that they appear in several access patterns. The importance of the web categories is evident from the graphical representations (Figures 2 and 3), which are directly linked to the number of occurrences of the web categories in access sequences.

Figure 2 shows the number of sequences in which the web categories appear. Figure 3 shows the total number of occurrences of the web categories in the different sequences (a web category may appear many times in a sequence). From the figures it is clear that some of the web categories are more important than others. The importance of web categories in access sequences is modeled using the concept of the fuzzy weight of web categories (Table 1).

In this paper, the associations are found between the web categories using conventional Boolean Apriori algorithm and the FWCA. By using the new algorithm, fifty-five rules are identified and forty-five rules are identified using Apriori algorithm. It is found that the rules generated using the FWCA algorithm have more coverage (classification rules are identified for more web categories) and it identified classification rules leading to fifteen web categories. A comparison between the two techniques in terms of the number of rules identified from each web category is given in Figure 4.

The classification rules generated using the techniques show the associations between the web categories. The number of web categories involved in each rule shows the ability of the rule generation technique to find more inclusive rules from the available web access sequences. The average number of web categories (antecedents) involved in each class of rules using the two rule generation techniques are given in Figure 5. It is evident from the figure that the FWCA is more inclusive (more web categories are included as antecedents in rules) while identifying the classification rules.

From the sample data, we also found the number of access sequences in which the web categories forming the antecedents of each web classification rule appear together. This is similar to the support count in association rule mining. Even though the rules generated using the fuzzy weighted algorithm have more antecedents, those rule patterns appear in more access sequences than the Boolean rules do (Figure 6). This shows that the new fuzzy based algorithm is more capable of identifying natural associations between web categories.

Finally, the fuzzy weighted rules outperform the Boolean rules in terms of the number of access patterns that actually satisfy the rules. This is equivalent to the confidence measure of the association rule mining technique. The validity of the rules is analyzed by finding the number of occasions on which the access sequences from the sample data fully satisfy the classification rules. The advantage of the fuzzy weighted rules over the Boolean rules, in terms of the number of cases from the sample data satisfying the rules, is demonstrated in Figure 7. The graph shows the number of cases satisfying the forty-five Boolean rules and the fifty-five fuzzy weighted rules.

From the above discussion, it follows that the FWCA algorithm presented here to classify the access sequences has a noticeable advantage over the Boolean Apriori method. The benefits of this method are as follows.

(i) In a web access sequence, the importance of web categories varies according to user preferences. By using FWCA, we can assign more weight to pages frequently visited by users. This consideration helps in evolving rules which represent natural access habits.

(ii) In the experiment, there are sixteen categories of web pages. A classification rule generation system for this sample data is effective if it can identify rules which lead to most of these web categories. FWCA identified rules which lead to fifteen web categories. The Apriori algorithm identified rules for only eleven web categories.

(iii) These web classification rules can be used for prediction and selective prefetching of web pages. The rules become more useful if a larger number of web categories are involved in them. Using FWCA, more web categories are included as antecedents in the rules. This helps in identifying a wide range of associations between web categories.

(iv) The number of sample sequences in which the antecedents of a rule appear together (the support count) is higher in the case of FWCA. This indicates that the rules generated by the algorithm closely reflect the access patterns of users.

(v) The number of sample sequences which actually satisfy the rules (the confidence measure) is also higher for FWCA. This indicates that the rules generated by the algorithm are valid, that is, the antecedent sequences identified by the rules do lead to the web category of the rule.

The main advantage of this algorithm is that it can identify the longest possible frequent patterns in a single step by using the concepts of fuzzy weighted association, while the Apriori algorithm requires many passes over the data to generate the rules.

7. Summary

In web mining, the concept of a market basket can be extended to the pages visited by a user in one session. Association rule mining techniques are used here to find associations between web pages visited by users. The problem is thus redefined as follows: which pages are most frequently visited together by web users? The Boolean Apriori algorithm for association rule mining is used to find the associations between the web pages visited together by users. With the Apriori algorithm, however, only the presence or absence of web pages in a browsing session is considered. Other important factors, such as the number of visits to a web category and the time spent on a web page, also have to be considered. This paper presented a novel web classification algorithm that uses the principles of fuzzy association rule mining to classify web pages into different classes in a single step, depending on the manner in which they appear in user sessions. In this approach, the page visits in a browsing session are converted into fuzzy weighted values, and association rules are generated from them. These fuzzy rules are used to classify access patterns in the form of classification rules.