Abstract

The dynamic nature of information resources and the continuous changes in users’ information demands have made it very difficult to provide effective methods for data mining and document ranking. This paper proposes an efficient particle swarm chaos optimization mining algorithm, based on chaos optimization and particle swarm optimization, that uses a feedback model of the user to provide a list of best-matching webpages. The proposed algorithm starts with an initial population of particles moving around in a D-dimensional search space, where each particle vector corresponds to a potential solution of the underlying problem, formed by a subset of webpages. Experimental results show that our approach significantly outperforms other algorithms in terms of response time, execution time, precision, and recall.

1. Introduction

With the rapid development of Internet technology, the growth in the number of webpages and in the volume of information content has led to an explosion in the amount of available information. While some webpages may be more relevant, popular, or authoritative than others, web users expect to find the most interesting and significant websites easily by specifying relevant keywords [1]. When a user enters a keyword query into a search engine, the engine provides a list of best-matching webpages according to its criteria [2]. It is becoming increasingly important to help users find useful information more easily and quickly.

It is well known that web search is one of the most universal and influential applications on the Internet. Web search engines can support users on a wide variety of topics across a comprehensive range of websites [3]. To provide rapid responses to specific queries over massive content collections, web mining technology has emerged from this technical background and social environment [4]. Web mining is a comprehensive approach that applies data mining techniques to web-related resources to extract interesting behavior, useful patterns, and implicit information; it involves web technology, data mining, computational linguistics, artificial intelligence, machine learning, and other fields.

Web mining can be viewed as the extraction of structure from an unlabeled, semistructured data set containing the characteristics of users and information [5]. Web mining is divided into web content mining and web usage mining. Web content mining mines webpage content and back-end transactional databases to obtain useful knowledge from web documents and from the descriptions of website content [6]. Web usage mining mines website log files and related data to discover frequent browsing patterns based on clickstream data analysis.

A number of novel optimization methods have been proposed to optimize the web usage evaluation function. The conventional web mining approach produces a type of relevance ranking, whereby a webpage may be relevant to a topic or theme. Given the amount of available information, the processing requires approaches suitable for extracting only the relevant, sometimes hidden, knowledge as the final result of the problem under consideration. A heuristic intelligent mining approach can be derived using particle swarm optimization (PSO) to address web mining.

This paper focuses on a feedback model of the user, combined with chaos optimization and particle swarm optimization, to help users find useful information as fast as possible. The dynamic nature of web information and the continuous changes in users’ information demands have made it very difficult to provide efficient and effective data mining approaches. To search for useful information effectively and efficiently, we have developed an efficient particle swarm chaos optimization mining algorithm (PSCOMA), based on chaos optimization and particle swarm optimization, that uses a feedback model of the user to provide a list of best-matching webpages. We compare our approach with the PSO, PCS, and HITS algorithms; as far as we know, it is currently the best method for the problem considered. Experimental results show that our approach significantly outperforms the other algorithms in most cases.

The remainder of the paper is organized as follows. A brief survey is given in Section 2. We study the user’s feedback behavior and formalize it as a mathematical optimization model in Section 3. Section 4 explains the details of the PSCOMA. Section 5 discusses the experimental results and compares them with other Web mining algorithms. Finally, Section 6 concludes the paper and discusses some future research directions.

2. Related Work

In this section, we focus our discussion on prior research on web mining. Web mining is a very important problem and has attracted the attention of many researchers.

Yin and Guo proposed a new formulation for the website structure optimization (WSO) problem based on a comprehensive survey of existing works and practical considerations [7]. In [8], a greedy-add algorithm with backtracking was introduced for obtaining the initial solution, and a better solution was then sought through VNS as well as tabu search. Chen et al. presented a page clipping synthesis (PCS) search method to extract relevant paragraphs from other web search results [9]. Spink et al. reported the results of a major study examining the overlap among four major web search engines for results retrieved for the same queries [10]. The hybrid algorithm proposed in [11] makes use of the strong global search ability of particle swarm optimization and the strong local search ability of tabu search to obtain high quality solutions in web mining. Herrera et al. presented a study of the impact of using several features extracted from the document collection and query logs on the task of automatically identifying the users’ goals behind their queries [12]. Ling and Van Schaik reported findings from two experiments that explored the influence of font type and line length on a range of performance and subjective measures [13]. Moawad et al. proposed a new multiagent system based approach for personalizing web search results. The proposed approach introduces a model that builds a user profile from initial and basic information and maintains it through implicit user feedback to establish a complete, dynamic, and up-to-date user profile [14]. Zhang and Dong proposed a novel and effective approach to exploit the relationships among users, queries, and resources based on the search engine’s log [15].

The rank value indicates the importance of a particular page. A hyperlink to a page counts as a vote of support. Link analysis is a method for determining which pages are good for particular topics based on both the quantity and quality of the links pointing to a document. PageRank is a link analysis algorithm, named after Larry Page and used by the Google Internet search engine, that assigns a numerical weighting to each element of a hyperlinked set of documents [16]. Hyperlink-induced topic search (HITS) is a link analysis algorithm developed by Jon Kleinberg. HITS provides a method of searching the web that returns a list of the most valuable sites on a given topic, plus a list of sites that index the subject. In [17], a comparison is made among the rankings of results for identical queries retrieved from several search engines. The method is based only on the set of URLs that appear in the answer sets of the engines being compared.
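
To make the two lists returned by HITS concrete, the following minimal Python sketch runs the usual hub and authority iteration on a small link graph. It illustrates the general technique only and is not the implementation used in the experiments; the function name hits, the toy graph, and the iteration count are assumptions introduced here.

```python
def hits(links, iterations=50):
    """Return (authority, hub) scores for a directed link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of the pages linking to p.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub update: sum of the authority scores of the pages p links to.
        new_hub = {p: sum(new_auth[dst] for dst in links.get(p, [])) for p in pages}
        # Normalize both score vectors so that they stay bounded.
        a_norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub


# Hypothetical toy graph: "a" and "b" both link to "c"; "c" links to "d".
toy_links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
authority, hub_score = hits(toy_links)
best_authority = max(authority, key=authority.get)
```

In this toy graph the page with the most in-links from good hubs ends up with the highest authority score, while the pages that point to it receive the highest hub scores.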

3. Optimal Feedback Model

In this section, we design the optimal feedback model of users for evaluating the weight of a webpage. Explicit feedback is feedback in which the user is asked to fill in a feedback form after obtaining the search results [18]. Since the user’s feedback behavior reflects the user’s preferences, we need to monitor the user’s response to the search results presented. The general definition of the problem is as follows.

Definition 1. The feedback model of the user is a vector of five factors recorded for each webpage: the order in which the user browses the webpages, the time for which the user browses each webpage, the number of clicks on each webpage, the user’s evaluation of each webpage, and the user’s response to each webpage.
The browsing-order factor of a webpage is determined by its position k in the sequence of webpages visited by the user. If a webpage is not browsed by the user at all before the next query is submitted, its browsing-order factor is set to a default value and its browsing-time factor is set to 0. One of the most widely used measures for evaluating web usage is the number of clicks needed to access the target webpage [19, 20]. The click frequency can be easily derived by scanning the user sessions; the click factor is initially 0 and is updated each time the user clicks the webpage. The evaluation factor indicates the level of importance that the webpage holds for the specified query, and its value varies with the rules and strategies used for the evaluation. The response factor is initially 0 and is set to 1 if the user participates in the discussion of the webpage.

When the feedback model of the user is complete, we define the weight of each webpage.

Definition 2. The weight of a webpage represents its importance and is given by (1), which combines the five factors of the feedback vector, each with its own coefficient, and involves the maximum time that the user is expected to spend examining the webpage. According to the user’s preferences and practical considerations, the user can set a different value for each of the five factors in the feedback vector. From (1), it is not difficult to see that the more time the user spends browsing a document, the more important it must be to the user. Likewise, if the user evaluates a document or responds to it, the document must be significant to the user. Therefore, this paper combines the above five components by simply taking their weighted sum, which gives the overall importance of a webpage.
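
The weighted-sum idea behind Definition 2 can be sketched in Python as follows. The exact form of (1) is not reproduced here, so every normalization in this sketch is an assumption: the reciprocal scoring of the browsing order, the capping of time and clicks, the [0, 1] range of the evaluation, the field names, and the equal coefficients are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    """Five feedback factors for one webpage (field names are illustrative)."""
    order: int         # position in the browsing sequence, 0 if not browsed
    time: float        # time spent browsing the webpage
    clicks: int        # number of clicks on the webpage
    evaluation: float  # user's rating, assumed here to lie in [0, 1]
    response: int      # 1 if the user joined the discussion, else 0


def page_weight(fb, coeffs, max_time, max_clicks):
    """Weighted sum of the five factors; every normalization is an assumption."""
    order_score = 1.0 / fb.order if fb.order > 0 else 0.0  # earlier visits score higher
    time_score = min(fb.time / max_time, 1.0)              # share of the expected maximum time
    click_score = min(fb.clicks / max_clicks, 1.0)         # capped click frequency
    factors = (order_score, time_score, click_score, fb.evaluation, float(fb.response))
    return sum(c * f for c, f in zip(coeffs, factors))


coeffs = (0.2, 0.2, 0.2, 0.2, 0.2)  # hypothetical equal coefficients
visited = Feedback(order=1, time=30.0, clicks=2, evaluation=0.5, response=1)
skipped = Feedback(order=0, time=0.0, clicks=0, evaluation=0.0, response=0)
w_visited = page_weight(visited, coeffs, max_time=60.0, max_clicks=4)
w_skipped = page_weight(skipped, coeffs, max_time=60.0, max_clicks=4)
```

Under these assumptions a webpage that the user browsed, rated, and discussed receives a higher weight than one that was never opened, which is the qualitative behavior described above.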

4. The Application of PSCOMA

4.1. Particle Swarm Optimization and Chaos Optimization

Particle swarm optimization is an evolutionary computation technique based on swarm intelligence and inspired by the social behavior of birds flocking for food; it was first introduced by Kennedy and Eberhart to optimize various continuous functions. It is computationally effective, converges quickly, and is easier to implement than other mathematical and evolutionary algorithms, with only a few parameters to adjust.

Chaos is a universal phenomenon in many nonlinear systems. Chaos optimization can escape from local minima more easily than other stochastic optimization algorithms, which escape from local minima by accepting some worse solutions with a certain probability [21]. A chaotic variable can traverse all states within a certain range, without repetition, according to its own laws. Recently, chaos optimization and PSO have been combined in different application fields for different purposes. Some of these works have aimed to show the chaotic behaviors in the PSO process [22].

4.2. The Basic Idea of PSCOMA

This paper proposes an efficient particle swarm chaos optimization mining algorithm that attempts to balance exploration and exploitation. PSCOMA makes full use of the strong global search ability of PSO and the strong local search ability of chaos optimization to obtain high quality solutions. PSCOMA uses the ergodicity, stochasticity, and regularity of chaos to guide the exploration of the particles.

The webpages that have higher weight are selected to compose an initial population that is then analyzed by PSCOMA. The basic idea of PSCOMA is as follows. The proposed algorithm starts with an initial population of particles moving around in a D-dimensional search space, where each particle vector corresponds to a potential solution of the underlying problem and is formed by a subset of webpages, starting from the webpages with higher weight, which have a high probability of being different from each other. Each particle at the tth iteration represents a candidate solution, and its quality is evaluated by a fitness function that considers the weight of a webpage. All of the particles then move repeatedly until a termination condition is satisfied. During each iteration, each particle’s individual best position and the swarm’s best position are determined. The particle adjusts its position based on its individual experience and the swarm’s intelligence. Each particle is further updated using a chaos optimization algorithm. When the algorithm terminates, the best particle is returned as the solution; it constitutes the subset of webpages to be returned to the user.

4.3. Fitness Function

Each particle is assigned a fitness value indicating the merit of the particle. Since all the particles represent candidate feasible solutions, we use (1) to assign a fitness value to each particle.

Definition 3. The fitness function of a particle at the tth iteration is defined in terms of the weights, computed by (1), of the webpages that correspond to the particle.

4.4. Formulation of the Movement of the Particle

During the search process, the particle successively adjusts its position and updates its velocity toward the global optimum using two “best” values: pbest and gbest. The pbest value is the best position encountered by the particle itself, and the gbest value is the best position encountered by any particle in the population. After finding these two best values, the velocity and position of the particle at the next iteration are calculated according to (4) and (5), in which c1 and c2 are cognitive coefficients, r1 and r2 are random real numbers drawn from the interval (0, 1), and ω is called the inertia weight.
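
As an illustration of the update just described, the following minimal Python sketch implements the standard inertia-weight form of the PSO update: the new velocity is w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x), and the new position is x plus the new velocity. It assumes that (4) and (5) take this standard form; the function name pso_step and the default parameter values are illustrative and are not taken from the experiments.

```python
import random


def pso_step(position, velocity, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=random):
    """Return the next (position, velocity) of one particle.

    w is the inertia weight, c1 and c2 are the cognitive coefficients,
    and r1, r2 are fresh random numbers for every dimension.
    """
    new_position = []
    new_velocity = []
    for x, v, p, g in zip(position, velocity, pbest, gbest):
        # random() draws from [0, 1), used here as a stand-in for (0, 1).
        r1, r2 = rng.random(), rng.random()
        v_new = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        new_velocity.append(v_new)
        new_position.append(x + v_new)
    return new_position, new_velocity


# A particle sitting on both best positions is moved only by its inertia term.
x_next, v_next = pso_step([1.0, 2.0], [0.5, -0.5], pbest=[1.0, 2.0], gbest=[1.0, 2.0])
```

When the particle already sits on both best positions, the two attraction terms vanish and only the inertia term moves it, which makes the role of the inertia weight easy to see.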

During the search for optimum values, it is possible for a particle to escape its search space in any of the dimensions. In each iteration, when the fitness values of the particles tend to converge or to settle in a local optimum, the inertia weight increases; when the fitness values of the particles are scattered, the inertia weight tends to decrease. Therefore, to keep the inertia weight within a reasonable range, the inertia weight in (4) is adjusted accordingly during the search.

4.5. Chaotic Local Search in PSO

PSCOMA uses chaotic local search methods during the search process; namely, the chaotic map is used to control the value of parameters in the velocity updating equation. The specific implementations are as follows.

Step 1. Map each individual to the interval [0, 1], using the lower and upper bounds of the jth dimensional variable.

Step 2. Apply the logistic equation iteratively to obtain a chaotic gene series.

Step 3. Convert the chaos variables back into decision variables.

Step 4. Evaluate the new solution on the basis of the decision variables. If the new solution is better than the initial solution, take the new solution as the result of the chaotic optimization. Otherwise, go to Step 2 and continue the chaotic iteration.
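
The four steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: it uses the common logistic map z <- mu*z*(1 - z) with mu = 4, linear scaling between the bounds, a maximization objective, and a cap max_iter on the number of chaotic iterations so that the search always terminates. The names chaotic_local_search and neg_sphere are hypothetical.

```python
def chaotic_local_search(x, lower, upper, fitness, max_iter=20, mu=4.0):
    """Return (solution, fitness) after a logistic-map search started from x."""
    start_fit = fitness(x)
    # Step 1: map every variable to [0, 1] using its lower and upper bounds.
    z = [(xj - lo) / (hi - lo) for xj, lo, hi in zip(x, lower, upper)]
    for _ in range(max_iter):
        # Step 2: one logistic-map iteration per variable.
        # Note: 0, 0.25, 0.5, 0.75 and 1 lead to fixed points when mu = 4.
        z = [mu * zj * (1.0 - zj) for zj in z]
        # Step 3: convert the chaos variables back into decision variables.
        candidate = [lo + zj * (hi - lo) for zj, lo, hi in zip(z, lower, upper)]
        # Step 4: accept the first candidate that improves on the start point.
        cand_fit = fitness(candidate)
        if cand_fit > start_fit:
            return candidate, cand_fit
    # No improvement within max_iter iterations: keep the original solution.
    return list(x), start_fit


def neg_sphere(x):
    """Toy fitness (to be maximized): negative squared distance to the origin."""
    return -sum(v * v for v in x)


start = [3.0, -2.0]
found, found_fit = chaotic_local_search(start, [-5.0, -5.0], [5.0, 5.0], neg_sphere)
```

Because the starting solution is returned unchanged when no candidate improves on it, the search can never make the solution worse.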

4.6. The Process of PSCOMA

The process of PSCOMA consists of the following seven steps.

Step 1. Data preparation: the training, validation, and test sets are prepared.

Step 2. Particle initialization and PSCOMA parameter setting: the webpages that have higher weight are selected to compose an initial population, whose particles move around in a D-dimensional search space. Set the PSCOMA parameters, including the number of iterations, the number of particles, the particle dimension, the inertia weight, and the cognitive coefficients.

Step 3. Randomly generate the position and velocity of particles.

Step 4. Evaluate the fitness of each particle. Store the current position and fitness of each particle as its individual best (pbest), and store the position and fitness of the best individual in the population as the global best (gbest).

Step 5. Update the velocity and position of each particle using (4) and (5).

Step 6. Perform the following chaotic local search for the best particles in the population and update their individual best (pbest) and global best (gbest).

Step 6.1. Map each individual to the interval [0, 1].

Step 6.2. Apply the logistic equation iteratively to obtain a chaotic gene series.

Step 6.3. Convert the chaos variables back into decision variables.

Step 6.4. Evaluate the new solution on the basis of the decision variables. If the fitness of the new individual is larger than that of the old one, take the new individual (solution) as the result of the chaotic optimization. Otherwise, return to Step 6.2.

Step 7. Stop condition checking: if the termination condition is satisfied, end the training and testing procedure; otherwise, go to Step 2 to begin the next iteration.
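
The overall flow of Steps 2 to 7 can be sketched in Python as follows. This is a simplified illustration on a continuous toy problem, not the implementation evaluated in Section 5: it assumes the standard PSO update, keeps the inertia weight fixed instead of adapting it, applies the chaotic local search only to the global best, clamps positions to the search box, and stops after a fixed number of iterations. The function name pscoma_sketch, the toy fitness, and all parameter values are hypothetical.

```python
import random


def pscoma_sketch(fitness, lower, upper, n_particles=20, n_iter=50,
                  w=0.7, c1=1.5, c2=1.5, chaos_iter=10, seed=0):
    """Maximize fitness over a box with PSO plus a chaotic local search."""
    rng = random.Random(seed)
    dim = len(lower)
    span = [hi - lo for lo, hi in zip(lower, upper)]

    def clamp(x):
        # Keep a position inside the search box.
        return [min(max(xj, lo), hi) for xj, lo, hi in zip(x, lower, upper)]

    # Steps 2-3: random initial positions and velocities.
    xs = [[lo + rng.random() * s for lo, s in zip(lower, span)]
          for _ in range(n_particles)]
    vs = [[(rng.random() - 0.5) * s for s in span] for _ in range(n_particles)]
    # Step 4: individual bests (pbest) and global best (gbest).
    pbest = [list(x) for x in xs]
    pfit = [fitness(x) for x in xs]
    g = max(range(n_particles), key=pfit.__getitem__)
    gbest, gfit = list(pbest[g]), pfit[g]

    for _ in range(n_iter):  # Step 7: fixed iteration budget as stop condition.
        # Step 5: velocity and position update for every particle.
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (gbest[d] - xs[i][d]))
            xs[i] = clamp([x + v for x, v in zip(xs[i], vs[i])])
            f = fitness(xs[i])
            if f > pfit[i]:
                pbest[i], pfit[i] = list(xs[i]), f
                if f > gfit:
                    gbest, gfit = list(xs[i]), f
        # Step 6: chaotic local search (logistic map) around the global best.
        z = [(xj - lo) / s for xj, lo, s in zip(gbest, lower, span)]
        for _ in range(chaos_iter):
            z = [4.0 * zj * (1.0 - zj) for zj in z]
            cand = [lo + zj * s for zj, lo, s in zip(z, lower, span)]
            f = fitness(cand)
            if f > gfit:
                gbest, gfit = cand, f
                break
    return gbest, gfit


def neg_sphere(x):
    """Toy fitness (to be maximized): negative squared distance to the origin."""
    return -sum(v * v for v in x)


lower_bounds, upper_bounds = [-5.0, -5.0], [5.0, 5.0]
best, best_fit = pscoma_sketch(neg_sphere, lower_bounds, upper_bounds)
```

The global best is only ever replaced by a strictly better solution, so the returned fitness is never worse than that of the best particle in the initial population.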

5. Simulation Experiment and Result Analysis

In this section, we present the experimental results, which include the algorithm parameter configuration and a performance comparison with other algorithms. The platform for conducting the experiments is a PC with an Intel Core 2 Duo E6300 processor running at 1.86 GHz. All programs are coded in C# on a Windows NT platform. The numerical results are the means of the outcomes from 50 independent runs of the algorithms.

The experiments compare PSCOMA with several typical web mining algorithms, including the PSO, PCS, and HITS algorithms. We experimented with a few queries on six popular search engines, namely, AltaVista, Netscape, Excite, Google, Direct Hit, and Yahoo. The number of webpages and the number of keywords are varied in the experiments. The algorithms are compared on important aspects of web search, namely response time, execution time, precision, and recall. Heuristic algorithms have different configuration parameters, which may affect the results. To ensure a fair comparison, this paper uses the same configuration parameters as in [11].

Figures 1 to 4 show the response time for different numbers of keywords. Figures 5 and 6 show the execution time for different numbers of generations. Figures 7 and 8 show the precision for different numbers of generations. Figures 9 and 10 show the recall for different numbers of generations. Figure 11 shows the contrast curve of recall and precision.

As we can see from Figures 1 to 11, the proposed PSCOMA outperforms the PSO, PCS, and HITS algorithms. In terms of response time, PSCOMA is faster than PCS and PSO by 28.6%, indicating that PSCOMA responds quickly. As the number of keywords increases, the response time curve of PSCOMA is the most stable. In terms of execution time, PSCOMA takes less time than the other algorithms. In addition, as the number of webpages or keywords increases, the increase for PSCOMA is the smallest. The simulation results show that the response time and execution time of the proposed algorithm are better than those of the other algorithms. In terms of precision and recall, PSCOMA performs best: its highest precision is close to 97.2% and its highest recall is close to 97.8%.

From the experimental results in the aspects of response time, execution time, precision, and recall, we can conclude that PSCOMA is more satisfactory than the PSO, PCS, and HITS algorithms.
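
For readers who want to reproduce the two quality measures, the following minimal Python sketch computes precision and recall under the standard set-based definitions: precision is the share of returned webpages that are relevant, and recall is the share of relevant webpages that are returned. How relevance was judged in the experiments is not specified here, so the function name precision_recall and the example page identifiers are purely illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) of a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 pages returned, 3 of them among the 5 relevant pages.
p, r = precision_recall(["p1", "p2", "p3", "p4"], ["p1", "p2", "p3", "p5", "p6"])
```

In this example three of the four returned pages are relevant, and three of the five relevant pages are returned.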

6. Conclusions

To prevent the user from being overwhelmed by a large amount of redundant, useless, or uninteresting information, effective data mining approaches are needed. In this paper, we have presented an approach to web mining based on chaos optimization and particle swarm optimization. As far as we know, this paper is the first to make full use of the strong global search ability of PSO and the strong local search ability of chaos optimization for web search, and it obtains higher quality solutions in terms of response time, execution time, precision, and recall. In the future, we will extend the PSCOMA algorithm to other domains of data mining and investigate the possibility of getting closer to the optimum by improving the chaotic local search.

Acknowledgments

This research work was supported by the Hubei Key Laboratory of Intelligent Wireless Communications (Grant no. IWC2012007) and the Special Fund for Basic Scientific Research of Central Colleges, South-Central University for Nationalities (Grant no. CZY11005). The authors gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.