Research Article | Open Access
Text Matching and Categorization: Mining Implicit Semantic Knowledge from Tree-Shape Structures
The diversity of large-scale semistructured data makes the extraction of implicit semantic information enormously difficult. This paper proposes an automatic and unsupervised method of text categorization in which tree-shape structures are used to represent semantic knowledge and to explore implicit information by mining hidden structures, without cumbersome lexical analysis. Mining implicit frequent structures in trees can discover both direct and indirect semantic relations, which largely enhances the accuracy of matching and classifying texts. The experimental results show that the proposed algorithm remarkably reduces the time and effort spent in training and classifying, outperforming established competitors in correctness and effectiveness.
The rapid development of social networks means explosive growth in users as well as dramatic changes in the services provided. Large-scale text classification and retrieval have therefore revived the interest of researchers. Traditional knowledge representations are characterized by strong pertinence and have great power in expressing empirical knowledge or rules, but they are insufficient for representing the complex and uncertain knowledge found in social webs. Texts share various forms of common structural components (from simple nodes and edges to paths [2, 3], subtrees, and summaries). Direct semantic information can be found easily, but hidden semantic information is extremely difficult to detect. Zaki and Aggarwal propose a structural rule-based classifier for semistructured data, called XRules, which can mine out both parent-child and ancestor-descendant frequent branches and handles structured or semistructured data well, but its shortcoming is the lack of semantic information in the text representation.
Semantic similarity assessment [7, 8] can be exploited to improve the accuracy of current information retrieval techniques, to automatically annotate documents [10, 11], to protect privacy [12, 13], to match web services, and to resolve problems based on knowledge reuse. Semantic networks [16–18] are more concerned with semantic information. Since semantic data mining can be based on text analysis, many semantic community detection algorithms adopt the latent Dirichlet allocation (LDA) model as their core: a generative model that allows sets of observations to be explained by unobserved groups accounting for why some parts of the data are similar [19, 20]. However, semantic analysis based on LDA [16, 21] is complicated, and semantic information mining is important for text matching and categorization, so a more efficient and friendly approach, whose results are precise and accurate, is needed.
A relation between two words can be unidirectional or bidirectional depending on their interrelationship, so it is reasonable to use graphs or trees to express a text. The proposed method can mine out implicit semantic information without cumbersome lexical analysis by making links express semantic knowledge and pointers record a traversal sequence that describes the differing abilities of nodes to express a text. The method not only extracts semantic information by creating trees but also calculates the similarities of coexisting hidden structures to measure the similarities of texts. The three main contributions of this work are as follows. The first is to represent all semantic information in a text using tree-shape structures. The second is to generate semantic trees based on the combination of pointers and a fixed traversal strategy and to use subtrees as auxiliary structures. The last is to discover implicit knowledge by analyzing semantic trees and mining coexisting hidden structures.
2. Representation of Semantic Information
Because a knowledge model is highly dependent on relations, it is reasonable to use trees to express a text. This paper employs tree-shape structures to describe a text, from which semantic information can be mined without cumbersome lexical analysis.
2.1. Semantic Graphs
A text is deemed as a sequence of sentences (denoted as , where is the th sentence in text ), and a sentence is viewed as a sequence of words (denoted as , where is the th noun in sentence ). Nouns are extracted from texts to create , because nouns are capable of describing the meaning of a sentence.
Definition 1 (SemGraph). It is a semantic graph and is denoted as , where is a set of nodes and each contains the information about the frequency and weight of a noun, is a collection of edges representing relations among nodes, and is a collection of pointers used as the guidelines when traversing a graph, which depicts the descending order of abilities of nodes in .
Based on the assumption that words in one sentence are deemed as having semantic relationships (a relationship is existent between and in , where , and ), the nodes arising in the same sentence are linked with each other in SemGraph.
Definition 2 (isolated node). The node has neither in-degree pointers nor out-degree pointers.
The process of building SemGraph is as in Algorithm 1.
The creation of a SemGraph is illustrated by the example in Figure 1. Scanning a text sentence by sentence and supposing nouns and appear in the first sentence, they are added into the SemGraph directly and their Counts are each set to 1 (Figure 1(a)). Since the Counts of the two nodes are equal, the direction of the pointer is set randomly. Figure 1(b) supposes and coexist in the second sentence. Because already exists in the SemGraph, there is no need to add again, but the Count of must be updated (). As a new node, is added to the SemGraph directly and its Count is set to 1. Because , the direction of the pointer is shifted from to . Similarly, for , the pointer between them is set from to . In short, pointers point from the more frequent node to the other. In Figure 1(c), the sentences in the text are supposed to be as follows: , , , , , , . After performing the same operations for each sentence, the final result is shown in Figure 1(c). If the last sentence is , the Counts of and are each incremented by one. Because and , the pointers between - and - must be changed, as shown in Figure 1(d).
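The construction steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the class name and field layout are assumptions. Nodes accumulate Counts, every pair of nouns in a sentence is linked, and each pointer is marked from the more frequent node to the other (ties keep the existing or an arbitrary direction).

```python
from collections import defaultdict

class SemGraph:
    """Minimal sketch of a SemGraph: nodes carry Counts, edges link
    co-occurring nouns, pointers run from the more frequent node."""

    def __init__(self):
        self.count = defaultdict(int)   # node -> Count
        self.edges = set()              # frozenset({u, v})
        self.pointer = {}               # frozenset({u, v}) -> head node

    def add_sentence(self, nouns):
        # every noun in the sentence increments its Count
        for n in nouns:
            self.count[n] += 1
        # nouns appearing in the same sentence are linked pairwise
        for i, u in enumerate(nouns):
            for v in nouns[i + 1:]:
                if u == v:
                    continue
                e = frozenset((u, v))
                self.edges.add(e)
                # pointer marks from the more frequent node to the other
                if self.count[u] > self.count[v]:
                    self.pointer[e] = u
                elif self.count[v] > self.count[u]:
                    self.pointer[e] = v
                else:
                    self.pointer.setdefault(e, u)  # tie: keep/choose arbitrarily

g = SemGraph()
for sentence in [["A", "B"], ["A", "C"]]:
    g.add_sentence(sentence)
```

After the two sentences, A's Count is 2, so the pointer on the A-C edge is marked from A, mirroring the Figure 1 walkthrough.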
After building an original SemGraph, redundant nodes (those whose Counts fall below threshold ) must be pruned to simplify the graph, as they contribute little to describing the text.
The following is the setting method of threshold :
is the sum of Counts. is the number of nodes. is the number of characters in the text. is the average number of characters in the samples. is an optional, manually set value. is a measurement of the SemGraph that considers the various parameters, and controls within the Scope of . If some texts are more authoritative or better represent a class, is reset to a smaller value based on specialist knowledge. The smaller is, the more important the text is; eventually, a smaller retains more of the information in the text.
Further explanations are as follows. (1) is inversely proportional to the mean of the Counts in the SemGraph. (2) More characters in a text lead to more redundant information, so the size of the text is used to fine-tune . (3) Experts can manually select some representative texts and assign a smaller to quickly build a SemGraph representing a class (denoted as class-SemGraph).
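The pruning step can be sketched as below. Since the exact formula for the threshold is not legible in the text above, the sketch combines the mean Count, the relative text length, and a manual factor purely as an illustrative stand-in for the elided formula; all parameter names are assumptions.

```python
def prune(count, n_chars, avg_chars, manual=1.0):
    """Prune nodes whose Count falls below a threshold K.

    Placeholder for the paper's elided formula: K scales the mean
    Count by the relative text length and an optional manual factor
    (a smaller manual factor retains more of the text's information).
    """
    mean_count = sum(count.values()) / max(len(count), 1)
    k = manual * mean_count * (n_chars / avg_chars)
    # keep only nodes strong enough to describe the text
    return {node: c for node, c in count.items() if c >= k}
```

For instance, with Counts `{"a": 5, "b": 1}` and a text of average length, the placeholder threshold is the mean Count 3, so only node `a` survives.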
2.2. Fusion Strategy
SemGraphs that represent texts are denoted as text-SemGraphs. A class-SemGraph is generated by merging the text-SemGraphs in the same class. Two problems need to be considered when merging different text-SemGraphs of the same class. (i) Weights of nodes must be recalculated. The formula is /, where represents a node and embodies the importance of node in the corresponding class. (ii) If the number of occurrences of a new relationship is larger than threshold , the link between the two nodes is added to the class-SemGraph. If an old relation occurs fewer times than threshold , the relationship is weak or nonexistent, so it is deleted from the class-SemGraph.
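A hedged sketch of this merging step follows. Edge bookkeeping by occurrence counting stands in for the paper's add/delete thresholds, and the weight formula (a node's share of the class's total Count) is an assumption, since the original formula is not legible here; the dictionary-based graph layout is likewise illustrative.

```python
from collections import Counter

def merge_text_semgraphs(text_graphs, add_thresh):
    """Sketch of building a class-SemGraph from text-SemGraphs.

    A relation is kept only if it occurs in more than `add_thresh`
    text-SemGraphs; rarer relations are treated as weak and dropped.
    Each input graph is a dict with a "count" map and an "edges" set.
    """
    node_count = Counter()
    edge_occurrences = Counter()
    for g in text_graphs:
        node_count.update(g["count"])      # sum Counts per node
        edge_occurrences.update(g["edges"])  # count each edge once per graph
    edges = {e for e, n in edge_occurrences.items() if n > add_thresh}
    total = sum(node_count.values())
    # node weight: share of the class's total Count (illustrative formula)
    weight = {v: c / total for v, c in node_count.items()}
    return {"count": node_count, "edges": edges, "weight": weight}
```

With two text-SemGraphs and `add_thresh=1`, only the relation present in both survives into the class-SemGraph.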
By adding newly classified texts to the corresponding class-SemGraph, high accuracy and real-time performance can be ensured. In conclusion, a merging operation that combines text-SemGraphs is needed to create or update class-SemGraphs. The implementation strategy is given in Algorithm 2.
judges whether there is a link between nodes and in SemGraph , evaluating to 1 if a link exists between the nodes and 0 otherwise.
Insignificant nodes in must be deleted periodically to ensure timeliness, which is done by Algorithm 3. Nodes deemed less capable of describing a class must satisfy the following conditions: , where is the sum of the in-degree and out-degree pointers of the th node and is an artificial threshold proportional to the average length of texts. The root of must be relocated whenever the network changes; it is the start node when traversing or mining frequent structures. The concrete implementation of finding a root is also given in Algorithm 3.
2.3. Formation of Trees
In order to analyze implicit frequent structures, SemGraphs are decomposed into several trees; studying the features of the trees is then equivalent to processing the SemGraph. Depth-First Search (DFS) or Breadth-First Search (BFS) alone cannot meet the requirements of social networks, because a fixed traversal strategy would miss or destroy some important relationships; pointers are therefore needed to achieve correct mining results when traversing graphs. The root is chosen as follows: (1) choose the node with the maximum Count; (2) if more than one node has the maximum Count, the node with more out-degree pointers is chosen as the root.
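The root-selection rule can be written directly as a small sketch; the dictionary-based inputs (Counts and out-degree pointer counts per node) are assumed representations.

```python
def choose_root(count, out_degree):
    """Pick the traversal root: maximum Count, ties broken by the
    larger number of out-degree pointers."""
    return max(count, key=lambda v: (count[v], out_degree.get(v, 0)))
```

For example, with Counts `{"a": 3, "b": 3, "c": 1}` and out-degrees `{"a": 1, "b": 2}`, nodes `a` and `b` tie on Count and `b` wins on out-degree pointers.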
BuildTree() is the semantic graph searching method proposed in this paper, which is based on DFS or BFS but loses no semantic information. BuildTree() usually generates more than one tree, so several trees together express all semantic relationships between nodes. Algorithm 4 gives the semantic graph searching strategy based on DFS.
In Figure 1, has the largest Count, so is chosen as the root. BuildTree() creates three sets of trees based on DFS and BFS, respectively, shown in Figures 2 and 3. Analysis shows that, although the two results differ, they do not affect follow-up processing, as they express exactly the same semantic information.
is a master subtree, while and are auxiliary subtrees. DFS or BFS alone creates only master subtrees, which omit some vital semantic relationships. For instance, indicates that and have no semantic relation, but they actually have one in the SemGraph, so is essential to replenish this missing relationship.
3. Mining Implicit Frequent Structures
Definition 3 (implicit frequent structure (IStruc)). IStruc is a frequent structure of SemGraph, which reserves ancestor-descendant relationships.
That is, an IStruc contains at least two connected nodes that are not linked in the SemGraph; a frequent structure like this is called an implicit frequent structure.
Definition 4 (Scope). It represents the Scope of node in a tree, whose format is , where is an index of node generated by traversing the tree according to DFS or BFS and is the maximum value of ’s among all successor nodes of .
Definition 5 (branch root). It meets the following conditions: is the smallest one in all ’s and is the smallest one in all ’s.
Definition 6 (List). It is denoted as [, , , ], where is the ID of a text, is the ID of a tree, and is the number of branch nodes.
In order to mine IStrucs, the new structures generated by connecting nodes one by one must be analyzed. But not all nodes need to be connected; for example, if two nodes in a tree have no common ancestor, they should not be connected. Therefore, it is essential to judge whether the nodes meet certain preconditions.
Preconditions. If node (with List ) and node (with List ) are linked in an IStruc, they must meet one of the following conditions. (1) If , is a child node of in the SemGraph. (2) If and have the same ancestor node, is a brother node of in the SemGraph. (3) If , is a child node of ’s branch root in the SemGraph. (4) If , is a brother node of ’s branch root in the SemGraph.
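Assuming Scopes are the DFS-index intervals defined above, the containment test underlying these preconditions can be sketched as follows: a node is an ancestor of another within the same tree exactly when the second node's index falls strictly inside the first node's Scope. The `(text_id, tree_id, scope)` item layout mirrors the List format and is illustrative.

```python
def is_ancestor(scope_x, scope_y):
    """Scope containment: x is an ancestor of y in the same tree when
    y's DFS index falls strictly inside x's Scope [lo, hi]."""
    return scope_x[0] < scope_y[0] <= scope_x[1]

def may_link(item_x, item_y):
    """Sketch of the precondition check for joining two List items
    (text_id, tree_id, scope): the nodes must come from the same tree
    of the same text and stand in an ancestor-descendant relation."""
    tx, rx, sx = item_x
    ty, ry, sy = item_y
    return tx == ty and rx == ry and is_ancestor(sx, sy)
```

A node with Scope (0, 5) may thus be linked above a node with Scope (1, 3) from the same tree, but not with a node from a different text.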
To illustrate the process of mining IStrucs by computing the Scopes of nodes, two sets of trees representing two texts are shown in Figure 4. The subscript of in Tree 0 of Text 1 is 0, as determined by DFS, so . All the direct successor nodes of are , and the ’s of those nodes are . Obviously, has the maximum , so the of is set to 5 and the Scope of is .
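The Scope computation walked through above can be sketched as a recursive DFS. The tree is represented as a child-list dictionary (an assumed representation); each node receives its DFS preorder index, and its Scope's upper bound is the largest index among its descendants, as in Definition 4.

```python
def compute_scopes(tree, root):
    """Assign each node a DFS preorder index and a Scope (index, m),
    where m is the largest index among the node's descendants
    (the node's own index if it is a leaf)."""
    scope, counter = {}, [0]

    def dfs(v):
        i = counter[0]
        counter[0] += 1
        scope[v] = [i, i]
        for child in tree.get(v, []):
            dfs(child)
            # widen the Scope to cover every descendant's index
            scope[v][1] = max(scope[v][1], scope[child][1])

    dfs(root)
    return {v: tuple(s) for v, s in scope.items()}
```

For a small tree `a -> (b -> d, c)`, the root's Scope covers all four indices, matching the hand computation in the example above.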
The final Lists are shown in (2). There are six different nodes in the tree, so six Lists are set up. Taking as an example, appears in Tree 0 of Text 1 and Tree 0 of Text 2, and the Scopes are and , so the List of has two items, which are and .
Lists of nodes are as follows. The format of an item in a list is :
A node appearing in only one text cannot be part of a common IStruc, so such nodes are deleted. , , and only appear in Text 1, so they are deleted. After deleting redundant nodes, the rest are . Assuming that is a root node, it is linked with the other nodes that meet the preconditions to create new IStrucs. Therefore, and are created, as shown in (3).
Lists of structures are as follows. The format is , :
The result generated by is interpreted as follows. After visiting the Lists of and , it is clear that and meet the preconditions (the text IDs of the two nodes are 1; the tree IDs are 0; ). Therefore, are in a father-child relationship and is a branch root, because the Scope of is wider than that of . Finally, the List of is . After obtaining the Lists of and , it can be seen that they also meet the preconditions, so structure is computed on the basis of and , as shown in (4).
The result of is as follows:
After several linking operations, two results are obtained. is the structure shown in Figure 5(a), existing in Tree 0 of Text 1. means that Tree 0 of Text 2 contains the structure shown in Figure 5(b). The nodes are not directly connected in the original trees, so -- is an implicit frequent structure. Although the mining results have different structures in their SemGraphs, they contain the same hidden knowledge. Without mining IStrucs, some implicit relations between texts would be ignored entirely, which would greatly reduce the accuracy of text matching.
4. Scoring Tactics
Semantic trees having common IStrucs do not by themselves prove that association relations exist, so it is essential to analyze the authorities of the IStrucs. The following scoring tactic for IStrucs computes the similarity between two texts or between an unknown-class text and a class.
Scoring rules:

where is a node in an IStruc, while is a node in a SemGraph. is the degree of variance between a text and a class.
From the above example, the difference between Text 1 and Text 2 can be computed. The score of the IStruc in Tree 0 of Text 1 is , and the score of the IStruc in Tree 0 of Text 2 is , so the distance between the two texts is .
Three datasets are used in this paper.
(1) SND: the dataset is gathered repeatedly from Sina (http://www.sina.com/). The training data is collected in different periods, and the testing set dynamically collects timely data from websites with a focus on hot topics. The training set contains 5200 documents in 5 different classes, while the testing set has 2500 documents.
(2) TREC: the dataset (http://trec.nist.gov/data.html), based on a subset of the AP newswire stories, has 242,918 stories. Over 50,000 texts are selected from TREC randomly, reporting events from areas as diverse as politics, finance, media, and entertainment.
(3) 20 Newsgroups Dataset (http://www.qwone.com/~jason/20Newsgroups/): this dataset is a collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups. 20,000 texts are randomly selected for the classification experiments; 3,000 of them are multilabel texts and the rest are single-labeled.
Three sets of baseline approaches are chosen for the experiments.
(1) k-NN approach: this approach finds the k nearest neighbors in the training set. After finding the neighbors, the number of neighbors belonging to each class can be counted, so the probability that a test point belongs to each class is obtained by dividing these counts by k.
(2) Term vector model: an algebraic model for representing texts as vectors of identifiers, used in information filtering, retrieval, indexing, and relevancy ranking. VSM assigns importance to topics via term weights computed as term frequencies.
(3) Multilabel classification approaches: MetaLabeler can determine the relevant set of labels for each instance without intensive human involvement or expensive cross-validation. Two steps are involved: one constructs the metadata; the other learns a metamodel. The first step can be considered a multiclass classification problem.
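The first baseline's probability estimate can be sketched as follows (an illustrative helper, not the experiment code): count how many of the k nearest neighbors carry each class label and divide by k.

```python
from collections import Counter

def knn_class_probs(neighbor_labels, k):
    """Probability of each class among the k nearest neighbors:
    the count of neighbors in that class divided by k."""
    counts = Counter(neighbor_labels[:k])
    return {label: c / k for label, c in counts.items()}
```

With neighbors labeled `["a", "a", "b"]` and k = 3, the test point belongs to class `a` with probability 2/3 and to class `b` with probability 1/3.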
The size of the training dataset should be kept within a reasonable scope. Too little data could undermine the authority of a SemGraph, while too much would incur unnecessary computational cost. After class-SemGraphs have been established, unknown-class texts are studied to ensure the timeliness and quality of the corresponding class-SemGraph, so a bigger training dataset is not always better. To illustrate the method of setting the training dataset size, texts are fed in sets of ten to build or update a class-SemGraph. If the information increment is small and the added information is of low importance, the learning process is stopped.
In Figure 6, a training dataset size within is reasonable. If a class contains a relatively large amount of information, the training set size should be set to a bigger value, as for Book.
Details of the training sets and the class-SemGraphs are shown in Table 1. The number of nodes in SemGraphs is compared with the size of datasets, which is shown in Figure 7. Take Car for an example; the generated knowledge is shown in Table 2 after studying 100 texts in Car. The weights in Table 2 are the sums of weights of the same nodes in different texts, which are calculated by the algorithms mentioned before.
Adding newly categorized texts to the corresponding class-SemGraph helps to update it in real time. Manual analysis of the training dataset shows that most texts can be classified into one or two classes (Figure 8), but only the closest matching class is selected. If the algorithm mapped one text to several matched classes, it would cause unnecessary trouble, because multiple matches blur the distinctions between classes and make text classification more difficult. For example, Car normally contains features of the following classes: finance, energy, transportation, and environmental protection. If texts in Car were used to update the SemGraph of finance, the two class-SemGraphs would become more similar and harder to distinguish, so this paper classifies a text only into the most similar class. The final classification results are shown in Table 3.
Analysis of the experimental results shows that the proposed algorithm is effective: it outperforms the other algorithms and handles different kinds of data stably. The result is shown in Table 4.
By analyzing the relationships between wrongly classified texts and their assigned classes, the errors of the proposed algorithm are found to be acceptable and reasonable. The errors do not affect users’ experience but may indirectly influence the accuracy of the class-SemGraphs. This shortcoming is simple to remedy: add a judgment step that filters inappropriate texts before updating class-SemGraphs. Instead of using all new texts to update class-SemGraphs, the improved method selects texts that match one class strongly and all other classes weakly. To keep class-SemGraphs entirely pure, only texts matching a single class are chosen to update the corresponding class-SemGraph. Figure 8 shows that texts belonging to exactly one class are the most numerous, so this method is feasible.
Compared with other mainstream methods, the proposed method is simple and able to discover implicit knowledge. In addition, the algorithm is more stable in dealing with different kinds of data. Analysis of the classification results shows that the errors fall within a reasonable range and that the relationship between an incorrectly classified text and its wrongly assigned class makes some sense.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments

This project is supported by the National Natural Science Foundation of China (60973040); the National Natural Science Foundation of China for Young Scientists (61300148); the Key Scientific and Technological Breakthrough Program of Jilin Province (20130206051GX); the China Postdoctoral Science Foundation Funded Project (2012M510879); and the Scientific Frontier and Cross Project of Jilin University (201103129).
References

- J. Manyika, M. Chui, and B. Brown, “Big data: the next frontier for innovation, competition, and productivity,” Tech. Rep., McKinsey Global Institute (MGI), 2011.
- G. Costa and R. Ortale, “On effective XML clustering by path commonality: an efficient and scalable algorithm,” in Proceedings of the IEEE 24th International Conference on Tools with Artificial Intelligence (ICTAI '12), pp. 389–396, IEEE, Athens, Greece, November 2012.
- P. Antonellis, C. Makris, and N. Tsirakis, “XEdge: clustering homogeneous and heterogeneous XML documents using edge summaries,” in Proceedings of the 23rd Annual ACM Symposium on Applied Computing (SAC '08), pp. 1081–1088, March 2008.
- M. J. Zaki and C. C. Aggarwal, “XRules: an effective algorithm for structural classification of XML data,” Machine Learning, vol. 62, no. 1-2, pp. 137–170, 2006.
- S. Tan, “An effective refinement strategy for KNN text classifier,” Expert Systems with Applications, vol. 30, no. 2, pp. 290–298, 2006.
- G. Costa, G. Manco, R. Ortale, and E. Ritacco, “Hierarchical clustering of XML documents focused on structural components,” Data and Knowledge Engineering, vol. 84, pp. 26–46, 2013.
- J.-B. Gao, B.-W. Zhang, and X.-H. Chen, “A WordNet-based semantic similarity measurement combining edge-counting and information content theory,” Engineering Applications of Artificial Intelligence, vol. 39, pp. 80–88, 2015.
- S. Joshi, N. Agrawal, R. Krishnapuram, and S. Negi, “A bag of paths model for measuring structural similarity in Web documents,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 577–582, August 2003.
- K. Robles, A. Fraga, J. Morato, and J. Llorens, “Towards an ontology-based retrieval of UML class diagrams,” Information and Software Technology, vol. 54, no. 1, pp. 72–86, 2012.
- H.-C. Chu, M.-Y. Chen, and Y.-M. Chen, “A semantic-based approach to content abstraction and annotation for content management,” Expert Systems with Applications, vol. 36, no. 2, pp. 2360–2376, 2009.
- D. Sánchez and D. Isern, “Automatic extraction of acronym definitions from the Web,” Applied Intelligence, vol. 34, no. 2, pp. 311–327, 2011.
- J. Marés and V. Torra, “On the protection of social networks user's information,” Knowledge-Based Systems, vol. 49, pp. 134–144, 2013.
- M. Batet, A. Erola, D. Sánchez, and J. Castellà-Roca, “Utility preserving query log anonymization via semantic microaggregation,” Information Sciences, vol. 242, pp. 49–63, 2013.
- M. Liu, W. M. Shen, Q. Hao, and J. W. Yan, “An weighted ontology-based semantic similarity algorithm for web service,” Expert Systems with Applications, vol. 36, no. 10, pp. 12480–12490, 2009.
- P. Resnik, “Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language,” Journal of Artificial Intelligence Research, vol. 11, pp. 95–130, 1999.
- A. Tagarelli, “Exploring dictionary-based semantic relatedness in labeled tree data,” Information Sciences, vol. 220, pp. 244–268, 2013.
- W. De Smet and M.-F. Moens, “Representations for multi-document event clustering,” Data Mining and Knowledge Discovery, vol. 26, no. 3, pp. 533–558, 2013.
- Y. Guo, Z. Shao, and N. Hua, “Automatic text categorization based on content analysis with cognitive situation models,” Information Sciences, vol. 180, no. 5, pp. 613–630, 2010.
- D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. 4-5, pp. 993–1022, 2003.
- Y. Xin, J. Yang, Z.-Q. Xie, and J.-P. Zhang, “An overlapping semantic community detection algorithm base on the ARTs multiple sampling models,” Expert Systems with Applications, vol. 42, no. 7, pp. 3420–3432, 2015.
- T. Zesch and I. Gurevych, “Wisdom of crowds versus wisdom of linguists—measuring the semantic relatedness of words,” Natural Language Engineering, vol. 16, no. 1, pp. 25–59, 2010.
- G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, 1989.
- L. Tang, S. Rajan, and V. K. Narayanan, “Large scale multi-label classification via MetaLabeler,” in Proceedings of the 18th International World Wide Web Conference (WWW '09), pp. 211–220, New York, NY, USA, April 2009.
Copyright © 2015 Lin Guo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.