Due to the exponential growth of Internet users and traffic, information seekers depend heavily on search engines to extract relevant information. The availability of large amounts of textual, audio, video, and other content has increased the responsibility of search engines. A search engine provides information relevant to a user's query based on content, link structure, and similar signals; however, it does not guarantee the correctness of that information. The performance of a search engine depends heavily on its ranking module, which in turn depends on the link structure of web pages, analyzed through Web structure mining (WSM), and on their content, analyzed through Web content mining (WCM). Web mining therefore plays a vital role in computing the rank of web pages. This article presents web mining types, techniques, tools, algorithms, and their challenges. It also provides a critical, comprehensive survey for researchers by presenting the features of web pages that are essential for assessing their quality. The authors review the approaches, techniques, algorithms, and evaluation methods used in previous research and identify critical issues in page ranking and web mining that provide future directions for researchers working in the area.

1. Introduction

The number of web documents on the World Wide Web (WWW) has increased exponentially as users have come to depend more on the Internet. An automatic system is required to fetch reliable information from such a huge collection of web documents, because the task is too challenging to perform manually. A search engine [13], such as Google, Yahoo, or Bing, is an information retrieval (IR) tool for the Web. A summary of various search engines is shown in Table 1. These search systems cannot always guarantee reliable and accurate information, but they still provide better results than experts performing the task manually. They often do not provide precise information because the IR system [6] returns information to Internet users based on specific retrieval criteria; for instance, it fetches web documents based on a given subject or title. Fetching a huge number of web documents related to a specific domain is easy and common. Therefore, search engines provide a ranking system to find reliable web documents for user queries. Generally, a ranking mechanism ranks web pages based on either keywords (reliability) or links (popularity).

The hyperlinked structure [7] was developed in 1989 to share information among researchers in Switzerland. Later, it became a platform for WWW development guided by the WWW association at the Massachusetts Institute of Technology (MIT) in Cambridge. The recent growth of the WWW has changed computer science and engineering as well as people's lifestyles and the economies of various countries.

Since its onset, the WWW has grown exponentially, as shown in Figure 1(a). Between 1995 and 2000, traffic increased by 10–106 terabytes in a month. Total web traffic increased from 1 to 7 exabytes between 2005 and 2010. As of 2020, Internet traffic is increasing by approximately 5.3 exabytes per day. According to Cisco, video was expected to account for 82% of all web traffic in 2021. In 2016, video accounted for 73% of all Internet traffic [8], as shown in Figure 1(b). People not only view large amounts of video but also use high bandwidth to view good-quality video.

All types of web content (such as video, Netflix, and webcams) generate demand. Live video is now a growing and integral part of the Internet. Video offerings from sources such as Facebook live video, Twitter broadcasts, YouTube live video, and live sports are expected to grow to approximately 13% of total video web traffic by 2021, as shown in Figure 1(c) [8]. The WWW is an essential and widely used tool for providing reliable information to Internet users. It provides an easy mechanism for accessing information such as static text and images, as well as dynamic and interactive services such as audio/video conferences. It makes it possible to view many types of information, including magazines, library resources in different sectors, and current and business news. The web is now an essential source of all kinds of information.

Information retrieval systems [9] were developed to store and search web pages efficiently as the size of the WWW increased exponentially. Generally, text documents are stored in text databases, and the IR system provides a framework for searching them. The IR system generates a list of documents in response to a query, generally listed in descending order of estimated relevance. Because most users glance only at the first 10–50 items at most, the algorithms try to put the most relevant documents at the top.

However, searching for information on the web is difficult for an information seeker. Web-based information retrieval systems called search engines [10] have made things easier for information seekers but do not guarantee the correctness of the information, which is often imprecise. A search engine is a program that searches documents for the specified query and returns the list of documents in which the query keywords were found.

It is important to understand that the term “popularity” normally refers to the result of link analysis, not user feedback. A web search engine, as shown in Figure 2, typically includes a ranking system that measures the importance of web pages [11, 12]. Using a hybrid approach, one can fetch content-based information from web documents [13]. The traffic of search engines is affected [14] by the following factors: size of the web, loading speed [15], web security condition, SEO crawling factors (title, heading, meta description of the web page, content, URL), and user behavior [11, 16]. A query-dependent web page ranking mechanism is presented in [17]; this approach was more effective, but it took more time to rank. In [18], the authors present a ranking mechanism based on link attributes, but it was not able to check the content quality of a web page. Some content-based ranking approaches are presented in [19–21]. The main issue in content mining is increased perceived latency, which is addressed in [22] by an additional component, the proxy server.

Search engines take the following steps to process user queries:
(a) Take the user query and, based on its keywords, form a precise query to process.
(b) Analyze and fetch the data corresponding to the user request from the web repository.
(c) Rank all fetched web pages.
(d) Return the list of URLs of the ranked web pages for the user request.
(e) Get the user's updated query, if any.

1.1. Working Process of Search Engine

frontend_search_engine (UserQuery) {
    (1) result_QP = QueryProcessor (UserQuery, Indexed_Web_Repository, Meta_data);
        /* Fetch the web records corresponding to the user query from the web repository and store them in “result_QP” */
    (2) ranked_web_pages = Ranking_system (result_QP, Meta_data);
        /* The ranking system processes the Query Processor result “result_QP” to arrange all web pages from high rank to low rank */
}

backend_search_engine (URL_List) {
    (1) WebPageRepository = Crawler (URL_List);
        /* The crawler crawls the web pages with the help of the robots.txt file, stores them in the web page repository, and adds new URLs to URL_List so that those pages are crawled as well */
    (2) indexed_web_page_Repository = indexer (WebPageRepository);
        /* The indexer analyzes all extracted documents, extracting relevant terms to create an index for searching documents against user queries */
    (3) new_list_of_URLs = contentAnalysis (WebPageRepository);
        /* Content analysis computes the relevance of a web page on the basis of its contents with respect to the user query */
    (4) Meta_data = contentAnalysis (WebPageRepository);
    (5) Update_URL_List (URL_List, new_list_of_URLs);
}

QueryProcessor (UserQuery, Indexed_Web_Repository, Meta_data) {
    (1) WebPageRepository = Crawler (URL_List);
        /* The crawler crawls the web pages with the help of the robots.txt file, stores them in the web page repository, and adds new URLs to URL_List so that those pages are crawled as well */
    (2) indexed_web_page_Repository = indexer (WebPageRepository);
        /* The indexer analyzes all extracted documents, extracting relevant terms to create an index for searching documents against user queries */
    (3) new_list_of_URLs = contentAnalysis (WebPageRepository);
        /* Content analysis computes the relevance of a web page on the basis of its contents with respect to the user query */
    (4) Meta_data = contentAnalysis (WebPageRepository);
    (5) Update_URL_List (URL_List, new_list_of_URLs)
}
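The front-end flow (query processing followed by ranking) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the in-memory `repository`, the example URLs, the whitespace tokenization, and the keyword-count scoring are all assumptions made only to keep the example self-contained.

```python
from collections import defaultdict

def build_index(repository):
    """Back end (simplified): map each lowercase term to the URLs containing it."""
    index = defaultdict(set)
    for url, text in repository.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def query_processor(user_query, index):
    """Return the URLs of pages that contain at least one query keyword."""
    matched = set()
    for term in user_query.lower().split():
        matched |= index.get(term, set())
    return matched

def ranking_system(result_qp, user_query, repository):
    """Order matched pages by keyword count (a stand-in score, highest first)."""
    terms = user_query.lower().split()

    def score(url):
        words = repository[url].lower().split()
        return sum(words.count(t) for t in terms)

    # Ties are broken alphabetically so the output is deterministic.
    return sorted(result_qp, key=lambda url: (-score(url), url))

def frontend_search_engine(user_query, repository, index):
    result_qp = query_processor(user_query, index)
    return ranking_system(result_qp, user_query, repository)

# Hypothetical three-page repository.
repository = {
    "a.example/mining": "web mining uses web data",
    "b.example/rank": "page rank uses links",
    "c.example/news": "sports news today",
}
index = build_index(repository)
print(frontend_search_engine("web rank", repository, index))
# ['a.example/mining', 'b.example/rank']
```

A real ranking module would replace the keyword count with link-based or content-based scores of the kind surveyed in Section 3.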

2. Web Mining

Data mining is used to discover relevant patterns or knowledge in repositories (such as databases, texts, and images); the patterns should be valid, valuable, and understandable. Text mining has become popular and reliable as text documents have become more widespread. Web mining [23–25] is used to fetch useful information, generate knowledge from it, personalize information, and learn about users. The hyperlink structure and the content of web pages are used to collect the relevant information. In web mining, data mining techniques, as shown in Figure 3 [25–28], are used to automatically fetch and discover relevant information from web pages and web services. Data mining services for extracting something useful from the Web are discussed in [29]. The following steps are performed for this purpose:
(i) Resource finding: Extract useful data or resources from web documents, which may be available online or offline.
(ii) Information selection and preprocessing: Apply preprocessing (cleaning, normalization, feature extraction) to the specific information that has been automatically selected.
(iii) Transformation: Transform the preprocessed data into valuable information by removing stop words to obtain the necessary phrases in the training corpus.
(iv) Generalization: Fetch patterns from a single website or across various websites by applying machine learning (ML) and other data mining techniques.
(v) Analysis: Analyze the mined patterns through validation and interpretation. Pattern mining plays an important role in this phase, and human beings play an important role in knowledge generation on the web.

Web mining draws on three basic kinds of information, namely previous usage patterns, the degree of shared content, and link structures, which are discussed below:

2.1. Web Usage Mining (WUM)

Web and application servers are the main sources of web log data. Log files are generated whenever an Internet user interacts with the web through search engines (shown in Figure 4).

The following techniques [3] are used in web usage mining:

2.1.1. Association Rules

In the Web domain, association rule generation relates pages that are frequently referenced together in a single server session. Association rule mining techniques discover unordered correlations between items observed in a repository of transactions. In web usage mining, association rules apply to groups of pages that are accessed together with a support value greater than a certain threshold, where the support value is the percentage of transactions that contain a specific pattern. The presence or absence of such rules can help web designers restructure their pages more effectively. Association rules can also serve as a trigger for prefetching documents, which reduces user-perceived latency when a page is loaded from a distant site. In WUM, association rules capture the relationship between web pages that frequently appear next to one another in user sessions [6, 7].

An association rule is written as follows:
$$A \Rightarrow B, \tag{1}$$
where A and B are sets of items in a series of transactions.

For example, the association rule Page A, Page B => Page C states that if a user views pages A and B, then page C will also be viewed in the same session.
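As a rough illustration of how a rule such as Page A, Page B => Page C can be scored, the sketch below computes its support over a handful of sessions. The session data are invented, and the confidence measure (the fraction of sessions containing A and B that also contain C) is a standard companion to support that is assumed here rather than taken from the text.

```python
def support(sessions, pages):
    """Fraction of sessions that contain every page in `pages`."""
    pages = set(pages)
    return sum(pages <= set(session) for session in sessions) / len(sessions)

def confidence(sessions, antecedent, consequent):
    """Fraction of sessions containing the antecedent that also contain the consequent."""
    base = support(sessions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(sessions, set(antecedent) | set(consequent)) / base

# Hypothetical user sessions, each the set of pages viewed.
sessions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "D"},
]

print(support(sessions, {"A", "B", "C"}))        # 0.5
print(confidence(sessions, {"A", "B"}, {"C"}))   # 2 of the 3 sessions with A and B
```

A rule would be kept only if its support exceeds the chosen threshold, as described above.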

2.1.2. Classifications

Classification maps a data item into one of several predefined classes.

In the Web domain, creating a profile of the users belonging to a class or category requires extracting and selecting the attributes that best characterize that class or category. In web usage mining, classification learns from existing data and predicts the behavior of new instances, identifying the class or category to which a user belongs. Classification techniques draw on machine learning (ML), neural networks (NN), and statistical methods. Decision tree classifiers, naive Bayesian classifiers, k-nearest neighbor classifiers, support vector machines, and other supervised inductive learning techniques can be used for classification, as shown in Figure 5.
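To make the idea concrete, the following sketch classifies a user with a k-nearest neighbor vote, one of the classifiers listed above. The two features (pages per session, minutes per session), the class labels, and the training data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Return the majority label among the k training points nearest to `query`.

    train: list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical profiles: (pages per session, minutes per session) -> user class.
train = [
    ((2, 1.0), "casual"),
    ((3, 1.5), "casual"),
    ((4, 2.0), "casual"),
    ((20, 15.0), "frequent"),
    ((25, 18.0), "frequent"),
    ((30, 22.0), "frequent"),
]

print(knn_classify(train, (22, 16.0)))  # frequent
print(knn_classify(train, (3, 2.0)))    # casual
```

In practice the features would be selected from web log attributes, and they would usually be scaled before distances are computed.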

2.1.3. Clustering

Clustering is one of the most challenging unsupervised learning problems. During the clustering process, objects are sorted into groups of related members; a cluster is thus a group of objects that are similar to one another and dissimilar to the objects in other clusters. Clustering analysis groups individuals or data objects (pages) with similar characteristics. Grouping user information or pages can aid the formulation and execution of future marketing plans, and user clustering helps discover groups of users who have similar navigation patterns. Clustering techniques build sets of similar items from a large volume of data by using distance functions that compute the similarity between items [30]. The contrast between individual users and user groups is an essential factor in this type of search. Two types of clustering are available in this area:
(i) User clustering
(ii) Page clustering

User clustering finds users who have the same browsing patterns, and page clustering finds web pages with similar content.
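A minimal sketch of user clustering is given below. It uses Jaccard similarity between the sets of pages that users visited and a simple greedy "leader" scheme in which each cluster is represented by its first member; both choices, the 0.5 threshold, and the visit data are assumptions for illustration, not a method prescribed by the survey.

```python
def jaccard(a, b):
    """Similarity of two page sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster_users(visits, threshold=0.5):
    """Greedy single-pass clustering of users by the pages they visited.

    Each user joins the first cluster whose first member (its representative)
    is at least `threshold` similar; otherwise the user starts a new cluster.
    """
    clusters = []
    for user, pages in visits.items():
        for cluster in clusters:
            if jaccard(pages, visits[cluster[0]]) >= threshold:
                cluster.append(user)
                break
        else:
            clusters.append([user])
    return clusters

# Hypothetical browsing data: user -> set of visited pages.
visits = {
    "u1": {"/home", "/sports", "/scores"},
    "u2": {"/home", "/sports", "/scores", "/news"},
    "u3": {"/shop", "/cart"},
    "u4": {"/shop", "/cart", "/checkout"},
}

print(cluster_users(visits))  # [['u1', 'u2'], ['u3', 'u4']]
```

Page clustering can be sketched the same way by swapping the roles, that is, by comparing pages through the sets of users who visited them or through their content terms.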

2.1.4. Sequential Analysis

Sequential analysis finds patterns in which one or more sets of pages are accessed one after another in a time sequence. It is applied to predict the behavior of future visitors, for example, to target advertising at user groups. Some techniques used for sequential analysis [31] are shown in Figure 5. A detailed description of various algorithms for WUM techniques is given in Table 2.
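The simplest form of such a pattern is a pair of pages visited consecutively. The sketch below counts these transitions across sessions and keeps the frequent ones; the session data and the `min_count` threshold are hypothetical, and full sequential pattern mining algorithms handle longer and non-contiguous sequences.

```python
from collections import Counter

def frequent_transitions(sessions, min_count=2):
    """Count consecutive page pairs over all sessions and keep the frequent ones."""
    counts = Counter()
    for pages in sessions:
        # zip(pages, pages[1:]) yields each page together with its successor.
        counts.update(zip(pages, pages[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Hypothetical sessions, each an ordered list of visited pages.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/reviews"],
    ["/products", "/cart", "/checkout"],
]

print(frequent_transitions(sessions))
# {('/home', '/products'): 2, ('/products', '/cart'): 2}
```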

2.2. Web Content Mining (WCM)

Web Content Mining (WCM) [13, 33, 34], as shown in Figure 6, is used to fetch relevant and reliable information from web pages, which may contain text documents, hyperlinks, structured data, audio, and video. The number of web pages on the WWW is increasing exponentially.

Fetching data relevant to user queries from an extensive collection of web pages is difficult and time-consuming. Web content mining uses several approaches [33] to extract user-relevant information from different types of data: unstructured, structured, and semistructured. The content mining algorithms [35] used by these techniques are shown in Table 3.

2.3. Web Structure Mining (WSM)

Web structure mining derives a structural summary of a web page and its linked web pages, with the architecture shown in Figure 7. It discovers the link (forward/backward) structure of a web page [33, 36]. It is used to classify and compare web documents and to integrate different web documents. Some popular web structure mining algorithms are summarized in Table 4.

Web structure mining (WSM), as shown in Figure 7, takes the following steps:
(i) Apply link analysis to a web page repository to extract a summary of the forward and backward links of the web pages.
(ii) Apply link mining techniques to the summary to determine the weight or quality of the web pages.
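The two steps can be sketched with Python's standard `html.parser` module. The tiny in-memory repository is hypothetical, and using the in-link count as the page weight in step (ii) is a deliberately crude stand-in for the link mining algorithms summarized in Table 4.

```python
from collections import defaultdict
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def link_summary(repository):
    """Step (i): return forward and backward link maps for a {url: html} repository."""
    forward = {}
    backward = defaultdict(set)
    for url, html in repository.items():
        parser = LinkExtractor()
        parser.feed(html)
        forward[url] = set(parser.links)
        for target in parser.links:
            backward[target].add(url)
    return forward, dict(backward)

# Hypothetical repository of three pages.
repository = {
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": '<a href="/c">C</a>',
    "/c": "<p>no links</p>",
}
forward, backward = link_summary(repository)

# Step (ii), simplified: weight each page by its number of backward links.
weights = {url: len(backward.get(url, ())) for url in repository}
print(weights)  # {'/a': 0, '/b': 1, '/c': 2}
```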

2.4. Challenges in Web Mining

Web mining faces both technical and nontechnical issues. Nontechnical issues arise from management, funding, and resources (such as skilled professionals). Some technical issues are discussed below:
(i) Inappropriate data: Collected data must be reliable and properly formatted for mining to succeed, but data are often incomplete or unavailable, and it is very difficult to ensure their accuracy.
(ii) Complexity of web pages: The structure of a web page is not predefined. Pages are stored in a digital library in their original format, with no defined order of data, so mining them is very complex.
(iii) Dynamic Web: On the dynamic web, data change frequently as new updates arrive (sports data, for example), which increases the complexity of mining.
(iv) Shortage of mining tools: Very few appropriate and complete mining tools are available, so new ones need to be developed.

2.5. Features of Web Page and Importance of These Features in a Ranking System

This section identifies the features of web pages and their importance in the ranking systems [29–31, 37–39] of search engines (shown in Table 5). Each web page has fifteen features, as given in the table, which are divided into seven groups. The seven groups are in turn categorized into three parts based on the web mining types (WCM, WUM, and WSM).
Page: It has two characteristics: the PageRank (PR) score and the age (AGE) of the web page in the search engine's index.
Links: It is associated with the links/URLs (forward/backward links) on the web page.
Query and Text Similarity: It indicates the similarity between the query keywords and the contents of a web page [40]. It has three main features:
(i) Frequency of query keywords inside the title
(ii) Frequency of query keywords inside each heading tag (H1, H2, …, H6), counted separately
(iii) Frequency of query keywords inside paragraphs
Head Tag: It contains two features, the title and the meta data. Both are based on the keywords inside the title and the meta description.
Body: It is associated with the density of keywords inside the body of a web page.
Content: It is associated with features that are part of content analysis, such as headings and links/URLs.
Session Specific: It counts the total number of clicks, the number of unique clicks, and the duration of a session.
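The three query and text similarity features can be computed as in the sketch below. The dictionary layout of `page` (title, headings keyed by tag, list of paragraphs) and the regular-expression tokenizer are assumptions made for the example; a real system would first parse these fields out of the HTML.

```python
import re

def keyword_frequency(query, text):
    """Total occurrences of the distinct query keywords in `text` (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    return sum(words.count(keyword) for keyword in set(query.lower().split()))

def similarity_features(query, page):
    """Keyword frequencies in the title, in each heading tag, and in the paragraphs."""
    return {
        "title": keyword_frequency(query, page["title"]),
        "headings": {
            tag: keyword_frequency(query, text)
            for tag, text in page["headings"].items()
        },
        "paragraph": keyword_frequency(query, " ".join(page["paragraphs"])),
    }

# Hypothetical, already-parsed web page.
page = {
    "title": "Web Mining Survey",
    "headings": {"h1": "Web Mining", "h2": "Ranking of web pages"},
    "paragraphs": [
        "Web mining extracts patterns from web data.",
        "Ranking uses mining results.",
    ],
}

print(similarity_features("web mining", page))
# {'title': 2, 'headings': {'h1': 2, 'h2': 1}, 'paragraph': 4}
```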

The above web parameters are used in mining by search engines to find high-quality, relevant web pages for users' queries. All the parameters are categorized according to mining technique. Web mining tools are discussed in Table 6.

3. Web Page Ranking System

Every day, millions of people access search engines to retrieve information according to their needs; search engines have thus become a common knowledge retrieval platform. The weight of ranking in expert search for web documents is explained in [41]. Search engines steer Internet users toward highly ranked web pages by using various web mining techniques [42]. The main objective of a website is therefore to attract Internet users so that it can maintain its ranking on renowned search engines. Reinforcement learning for Web Pages Ranking (WPR) algorithms is explained in [43]. There are several ways to improve the ranking of a web page on search engines; spam farms are a well-known method of enhancing a website's ranking. During rank calculation, the cognitive spammer framework (CSF) deletes all spam web documents [44]. Preference-based Universal Ranking Integration (PURI) [45] is a framework designed by combining various ranking mechanisms. The Internet is an important source of information. At the same time, almost all web pages contain considerable noise, such as advertisements, banners of different types, and unreliable links, which affects the performance of content- and structure-based search engines, question-answering systems, and web summarization [13]. A retrieval system fetches web documents based on a given subject or title, and fetching a huge number of web documents related to a specific domain is easy and common. Therefore, search engines provide a ranking system to find reliable, matching web documents for user queries. The g-index-based expert-ranking system, which mainly uses the Rep-FS, Exp-PC, and weighted Exp-PC techniques, is explained in [46].
The ranking system uses various web page ranking algorithms (as shown in Figure 8), such as PageRank [18, 47], Weighted PageRank [48], EigenRumor [49], HITS [50], Weighted Links Rank [21], distance ranking [51], tag rank [52], and query-dependent ranking [17], to compute the rank of a web page. It returns the web pages ordered by rank.

PageRank is frequently used to calculate the rank of a web page on the basis of its in-links and out-links. The rank of a web page A is calculated by equation (2):

$$\mathrm{PageRank}(A) = \sum_{B \in X_a} \frac{\mathrm{PageRank}(B)}{L(B)} \tag{2}$$

The PageRank of A depends on the PageRank value of each page B contained in the set X_a (the set of all pages linking to page A), divided by the number of links from page B. Here, L(B) is the number of out-links from page B, and PageRank(B) is the PageRank of page B.
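The computation can be sketched as a simple iteration in which every page B repeatedly passes PageRank(B) / L(B) to each page it links to. The `damping` factor below is the usual addition from the standard PageRank formulation and is an assumption here, since the text describes only the plain sum (which corresponds to `damping=1.0`). The three-page graph is hypothetical, and the sketch assumes that every page has at least one out-link and that all link targets appear as keys.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively compute PageRank for a graph given as {page: [linked pages]}."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts with the share that does not depend on links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for b, out_links in links.items():
            share = damping * rank[b] / len(out_links)  # PageRank(B) / L(B)
            for a in out_links:
                new_rank[a] += share
        rank = new_rank
    return rank

# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph, page C ends up ranked highest because it receives links from both A and B, followed by A and then B.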

Weighted PageRank is an extended version of the PageRank algorithm. It considers the popularity of web pages on the basis of their link structure (in-links and out-links). Rather than dividing a page's rank evenly, WPR assigns a different share of the rank to each of its out-links.

EigenRumor was proposed to overcome the limitations of PageRank and other web page ranking algorithms when applied to blogs; it assigns a rank value to each blog on the basis of the hub and authority weights of the blogger.

The query-dependent algorithm uses users' queries to improve the performance of page ranking. A component dedicated to calculating the similarities between user queries was incorporated into the page-ranking algorithm. The algorithm analyzes these similarities and uses that information to decide the final results returned to the user for a query.

A new approach (SimRank) was proposed that uses similarity from the vector space model to find the rank of a web page. The SimRank [17] algorithm effectively assigns ranks to the pages to be retrieved by the search engine. Most traditional page ranking algorithms use the link structure of web pages to find the page rank, and some ignore the content of the web pages entirely. The SimRank algorithm, in contrast, also incorporates the content of the web pages when computing the final rank score of a web page.
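The vector space ingredient of such an approach can be illustrated with cosine similarity between term-frequency vectors. This sketch is not the algorithm of [17]: the raw term-frequency weighting, the whitespace tokenization, and the example pages are assumptions, and it shows only how a content similarity score between a query and a page could be obtained.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction in the vector space
    return dot / (norm_a * norm_b)

# Hypothetical page contents.
pages = {
    "p1": "web mining and web ranking",
    "p2": "sports news and scores",
}
query = "web mining"

ranked = sorted(pages, key=lambda p: cosine_similarity(query, pages[p]), reverse=True)
print(ranked)  # ['p1', 'p2']
```

A content-aware ranker could then combine this score with a link-based score, which is the general idea the paragraph above describes.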

The HITS algorithm computes the rank of a web page from its popularity, which is determined from the page's in-links (authority) and out-links (hub).
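A minimal sketch of the hub and authority computation follows. It uses the standard mutually reinforcing updates (a page's authority is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to) with Euclidean normalization; the example graph and the fixed iteration count are assumptions, and every link target is assumed to appear as a key.

```python
import math

def hits(links, iterations=50):
    """Return (hub, authority) scores for a graph given as {page: [linked pages]}."""
    hub = {page: 1.0 for page in links}
    auth = {page: 1.0 for page in links}
    for _ in range(iterations):
        # Authority update: add up the hub scores of the pages linking in.
        auth = {page: 0.0 for page in links}
        for page, out_links in links.items():
            for target in out_links:
                auth[target] += hub[page]
        # Hub update: add up the authority scores of the pages linked to.
        hub = {page: sum(auth[t] for t in out_links) for page, out_links in links.items()}
        # Normalize both score vectors to unit length so values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hub, auth

# Hypothetical graph: A and B both link to C.
links = {"A": ["C"], "B": ["C"], "C": []}
hub, auth = hits(links)
print(max(auth, key=auth.get))  # C
```

Here C is the only authority, while A and B share equal hub scores because each points to it.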

R. Baeza-Yates and E. Davis developed the Weighted Links Rank (WLR) algorithm by extending the standard PageRank algorithm. This algorithm assigns a weight to each link on the basis of three attributes: the length of the anchor text, the tag in which the link appears, and its relative position in the web page.

ZarehBidoki and Yazdani [14] proposed an intelligent web page ranking mechanism called the distance rank algorithm, which is based on reinforcement learning. The distance between pages is computed as the shortest logarithmic distance between two pages, and ranks are assigned accordingly. This algorithm finds high-quality web pages quickly by using a distance-based solution. However, the crawler takes more time to compute the distance vector for each newly crawled web page. Table 7 shows a summary of web mining techniques and the ranking algorithms for each mining technique.

4. State-of-the-Art Review

As the amount of information available on the WWW has increased, so has the responsibility of the Internet. It is straightforward to collect information from the WWW using search engines, which return a large number of web pages for a user query. However, it is challenging for users to select reliable information among them. This section therefore discusses research papers whose authors try to improve search engine techniques that help users select reliable information.

In [54], the authors give an approach for fetching experts' attributes from the web by using text mining; that is, it is a recommendation model that returns a precise record. This research has shown the effectiveness of the proposed approach in box-office revenue prediction. In [55], the author proposed predicting movie revenue based on YouTube trailer reviews, which is mainly useful in business intelligence and decision-making. In [56], the author developed a framework for Geographic Information Mining (GIM). Microsoft discussion (MSD) forums used expert rank [57], a technique for finding experts that uses document-based relevance as well as authority. It does not consider MSD features such as user ratings, which are a more reliable feature for mining expert users. In [58], the author identified user activities in the Stack Overflow (SO) forum and compared them with the users' GitHub repositories and the feasible features of users active on both platforms. In [59], the author proposed user activity models for Stack Overflow, Wenwo Forums, and SinaWeibo to classify real experts. In [60], the model uses some basic features to compute the user weight. It uses the question-answer ratio to generate the user weight but ignores the consistency of the user. The quality of the tag was also not considered, although it may lead to more reliable and accurate recommendation systems. Link-based expert-finding techniques mainly use the structure of links instead of their contents. Link analysis has used question-answer relationships [61], citation networks [62], and e-mail communications [63] to find experts. In [64], the author presented an automatic expert-finding model for online users, in which the profile of user expertise was evaluated based on social network score and postconditions. Algorithms such as Z-score, PageRank, in-degree, and HITS were used to compute social network authority scores.
A search engine for biomedical information [65] returns all the documents from MEDLINE that correspond to a user query, based on word/concept indexes. Several researchers have investigated ranking approaches using different methodologies that increase the efficiency of search engines in providing highly relevant web pages for a particular user query.

According to [66], recent research in CARSs is mainly directed toward developing novel techniques, or adapting and combining existing ones, that can efficiently deal with the growing complexity and dynamicity of social networks.

The main findings of [67] are that (1) the ontologies of a corpus can be organized effectively, (2) no effort or time is required to select an appropriate ontology, (3) computational complexity is limited to the use of unsupervised learning techniques, and (4) because context awareness is not required, the proposed framework can be effective for any.

In [68], the author explores various explanation techniques for identifying the local contribution of ranking indicators based on the position of an instance in the ranking as well as the size of the neighborhood around the instance of interest. The generated explanations are evaluated on the Times Higher Education university ranking dataset as a benchmark of competitive ranking.

Table 8 summarizes various research papers [69–85] based on different attributes such as methodologies, approaches, and pros and cons. Additionally, futuristic research directions in similar areas are presented in [86–91].

4.1. Observations from the State-of-the-Art Review

The following observations are made after a critical review of the state of the art:
Observation 1: Search engines mostly return web pages relevant to user queries. The relevancy of a web page depends on its in-links and out-links (i.e., web structure mining) and on its popularity. Often, the web pages judged most relevant are less important for the user query, and pages that are important for the query may be missing from the results. New techniques are therefore required that consider the user query as an additional parameter when finding relevant web pages.
Observation 2: As the size of the web increases, search engines take longer to return a list of web pages to users. The delay between submitting a query and receiving the output is called perceived latency. A prefetching mechanism therefore needs to be developed to reduce the response time.
Observation 3: Even with a prefetching mechanism that aims to reduce user-perceived latency, unsuccessful predictions about which pages to prefetch may result in information overkill. A mechanism is thus required that makes credible predictions only for the more relevant pages, that is, correct predictions that minimize the problem of information overkill.
Observation 4: As the WWW and the number of Internet users grow, it is very difficult to fetch the information sought by a specific group of users. For example, all employees in an organization may request the same type of information. Approaches are therefore required that personalize the content of web pages with respect to the user's group.

A critical look at the available state-of-the-art reviews reveals the following major gaps:
(1) The possible existence of important but less popular pages, which may not be linked
(2) The delay in response as perceived by the user
(3) The need to search for information within a similar-interest group in an organization

5. Conclusion and Future Scope

Three categories of ranking algorithms are mainly discussed. The first category, based on the content of web pages, is known as content-based page ranking. The second category, which uses the link structure of the World Wide Web, is known as web structure-based page ranking, and the third category is a hybrid of the first two. Ranking systems rely heavily on web mining techniques, but some issues in web mining still need to be addressed, owing to improper data, a shortage of mining tools, and other challenges in classification and clustering techniques.

The existing ranking systems have several limitations, which define challenges and new research paths for researchers. The observations about existing research will help researchers select specific areas in which further research may be initiated.

Some challenges related to web page ranking are the following:
(i) Web structure-based page ranking algorithms may ignore web pages that have a low page ranking score but good content for a user query. Content-based page ranking algorithms take more time to compute the page rank because content mining is performed at query time.
(ii) The WWW is huge, so content mining to check the quality of web pages is a very time-consuming process. The time taken by search engines to return results needs to be reduced.
(iii) To improve the search results for user queries, information needs to be searched within a similar-interest group in an organization [107–113].

Conflicts of Interest

There are no conflicts of interest.


This research was supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R195), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.