Special Issue: New Trends in Networked Control of Complex Dynamic Systems: Theories and Applications
A Novel Mathematical Formula for Retrieval Algorithm
A method is proposed to retrieve mathematical formulas in LaTeX documents. First, we represent the formula to be retrieved as a binary tree according to its LaTeX description, normalize the structure of the binary tree, and obtain its structure code. We then search the mathematical formula database for the formula table that is named by the structure code and the formula elements of the first two levels of the binary tree. If the table exists, we search it for the preorder traversal sequence of the binary tree, taken after variable names have been normalized, and display the information of the documents that contain the mathematical formula. The experimental results show that the algorithm realizes the retrieval of mathematical formulas in LaTeX documents with high retrieval precision and fast retrieval speed.
1. Introduction
With the rapid development of the internet and digital libraries, more and more documents that contain mathematical formulas are stored on computers. In order to share and communicate these documents quickly, online retrieval of mathematical formulas has attracted much attention and has become an important research area.
Retrieval technology for text is already relatively mature [1–7]. However, how to effectively retrieve mathematical formulas in documents is still an open research issue. Some ideas from control, such as data-driven methods [9–13] and system switching [14–17], have also been employed for this purpose. Lee and Wang presented a system for mathematical formula recognition, but this system cannot handle multiline mathematical formulas or the more complex single-line ones. Fateman et al. designed a system for mathematical formula recognition, but the system can only recognize integral tables with a fixed format. Zanibbi et al. [20–22] proposed methods that achieve good results on scanned images of formulas and support automatic evaluation of recognition performance. Nonetheless, these methods cannot analyze expressions with two or more modifiers. MatheReader can recognize more kinds of mathematical expressions; however, it has not yet reached the level of practical application.
Mathematical formulas are mainly described in MathML, in LaTeX, or as images. Among them, LaTeX has been widely used to edit scientific papers, books, files, dissertations, manuscripts, personal letters, and a variety of complex symbolic formulas. In addition, documents in other formats can be easily converted to LaTeX. Therefore, this paper proposes a method to retrieve mathematical formulas in LaTeX documents.
The rest of the paper is organized as follows. Section 2 gives the binary tree description of mathematical formula. Section 3 introduces the design of database. Section 4 describes our mathematical formula retrieval method in detail. Experimental results are presented in Section 5. Conclusion is outlined in Section 6.
2. Binary Tree Representation of Mathematical Formula
2.1. Construction of Binary Tree
Because of its pronounced structure, a complicated mathematical formula in LaTeX form can be divided into multiple subexpressions, and each subexpression can then be divided into smaller ones. We repeat the procedure until no decomposable component is left. The final subexpressions are called formula elements.
Some operators have three operands; a summation sign, for example, is closely related to its top region, bottom region, and right region. We combine such an operator with its right subexpression by adding a “link” operator.
We traverse the formula element string with “link” from left to right to generate the priority list of formula elements and then the binary tree representation of a mathematical formula can be obtained according to its structural feature and the priority list. The data structure of the binary tree is given in Table 1.
We use a recursive approach to build the binary tree representation from the formula element string. The root, which is the element with the lowest priority, is created first. The left subtree is then created from the elements that precede the root element in the formula element string, and the right subtree is created from the elements that follow it.
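The recursive construction described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the priority table `PRIORITY`, the token list format, the handling of parentheses, and the helper names `build`, `strip_outer_parens`, and `preorder` are all assumptions introduced here, and the “link” operator is left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical priority table: a lower value means a lower priority,
# so the element ends up closer to the root.
PRIORITY = {"+": 1, "-": 1, "\\times": 2, "\\div": 2}


@dataclass
class Node:
    element: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def strip_outer_parens(tokens: List[str]) -> List[str]:
    """Drop parentheses that enclose the whole token list."""
    while len(tokens) >= 2 and tokens[0] == "(" and tokens[-1] == ")":
        depth = 0
        for i, tok in enumerate(tokens):
            if tok == "(":
                depth += 1
            elif tok == ")":
                depth -= 1
            if depth == 0 and i < len(tokens) - 1:
                return tokens  # the first "(" closes early: not an enclosing pair
        tokens = tokens[1:-1]
    return tokens


def build(tokens: List[str]) -> Optional[Node]:
    """Create the root from the lowest-priority element, then recurse on both sides."""
    tokens = strip_outer_parens(tokens)
    if not tokens:
        return None
    if len(tokens) == 1:
        return Node(tokens[0])
    depth, root_idx = 0, None
    for i, tok in enumerate(tokens):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
        elif depth == 0 and tok in PRIORITY:
            # "<=" keeps the rightmost candidate, which gives left associativity.
            if root_idx is None or PRIORITY[tok] <= PRIORITY[tokens[root_idx]]:
                root_idx = i
    if root_idx is None:
        raise ValueError("no operator found outside parentheses")
    return Node(tokens[root_idx],
                build(tokens[:root_idx]),       # elements before the root
                build(tokens[root_idx + 1:]))   # elements after the root


def preorder(node: Optional[Node]) -> List[str]:
    if node is None:
        return []
    return [node.element] + preorder(node.left) + preorder(node.right)
```

For instance, under these assumptions `build(["(", "x", "+", "y", ")", "\\times", "z"])` places `\times` at the root, with the subtree for `x + y` on the left and the leaf `z` on the right.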
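For each node, its element category and combination can be determined from the formula element. The height of each node can be calculated as $h(\mathrm{node}) = \max\bigl(h(\mathrm{lchild}), h(\mathrm{rchild})\bigr) + 1$, where $h(\mathrm{node})$ is the height of the node, $h(\mathrm{lchild})$ is the height of the left child of the node, and $h(\mathrm{rchild})$ is the height of the right child of the node.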
For example, consider the mathematical formula whose LaTeX form is (\sum_{i=1}ai+x\times y\times z)\times(x\times y+y\times z). The corresponding binary tree representation is given in Figure 1.
2.2. Normalization Processing
Some operators satisfy the commutative law: their operands can be exchanged arbitrarily, and the resulting expressions have identical meanings. The structures of the corresponding binary trees, however, are likely to differ. Hence, binary trees that differ in structure but are identical in meaning must be normalized. We traverse the binary tree in preorder; if the category of a formula element is OPS and the height of its left child is greater than that of its right child, we exchange the left and right subtrees of the node. Figure 2 shows the normalized binary tree corresponding to Figure 1.
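A minimal sketch of this structural normalization is given below. It assumes that the category OPS marks commutative operators, that a leaf has height 1 and an empty subtree height 0, and that nodes carry only an element and a category; the `Node` class and the function names are hypothetical and do not reproduce the data structure of Table 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    element: str
    category: str  # "OPS" is taken from the text; any other label is an assumption
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def height(node: Optional[Node]) -> int:
    """Height of a subtree: 0 for an empty subtree, 1 for a leaf (assumed convention)."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))


def normalize(node: Optional[Node]) -> Optional[Node]:
    """Preorder pass that moves the taller subtree of an OPS node to the right."""
    if node is None:
        return None
    if node.category == "OPS" and height(node.left) > height(node.right):
        node.left, node.right = node.right, node.left
    normalize(node.left)
    normalize(node.right)
    return node
```

Because exchanging two subtrees does not change the height of their parent, the heights could also be computed once and cached; the sketch recomputes them for brevity.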
After the binary tree has been normalized, the structure code of every node can be generated by traversing the binary tree in postorder. The structure code of a node is obtained from the structure code of its left child and the structure code of its right child.
Note that the variable names in a mathematical expression do not affect the meaning of the formula. For a binary tree with a given structure, we can get the corresponding sequence of formula elements according to a given traversal order. To make the sequence unique, we still need to normalize all the variable names in the sequence. The normalization approach is to use a fixed set of variable names to successively replace each formula element labeled “VAR” in the formula element sequence.
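The variable-name normalization can be illustrated with the sketch below. The fixed names `v1`, `v2`, ... and the representation of a sequence as `(element, category)` pairs are assumptions made for the example. The sketch also assumes that repeated occurrences of the same variable receive the same replacement name, which the text does not state explicitly.

```python
from typing import Dict, List, Tuple


def normalize_variables(sequence: List[Tuple[str, str]]) -> List[str]:
    """Replace each element labeled "VAR" by a name from a fixed series v1, v2, ..."""
    mapping: Dict[str, str] = {}
    result: List[str] = []
    for element, category in sequence:
        if category == "VAR":
            if element not in mapping:
                # First occurrence of this variable: assign the next fixed name.
                mapping[element] = f"v{len(mapping) + 1}"
            result.append(mapping[element])
        else:
            result.append(element)
    return result


# Two formulas that differ only in their variable names give the same sequence.
seq_a = [("+", "OPS"), ("a", "VAR"), ("\\times", "OPS"), ("b", "VAR"), ("a", "VAR")]
seq_b = [("+", "OPS"), ("x", "VAR"), ("\\times", "OPS"), ("y", "VAR"), ("x", "VAR")]
print(normalize_variables(seq_a) == normalize_variables(seq_b))
```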
3. Database Designing
The retrieval database of mathematical formulas contains two kinds of tables: the document information table and the formula information tables. Their structures are given in Tables 2 and 3. The name of a formula information table is formed from the structure code together with the formula element of the root, the formula element of the left child of the root, and the formula element of the right child of the root.
Mathematical formulas that share the same structure code, formula element of the root, formula element of the left child, and formula element of the right child are stored in the same table.
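The sketch below shows one way such a two-level organization could be mimicked in memory. It is not the ACCESS schema of Tables 2 and 3: the underscore-separated table name, the class `FormulaDatabase`, its fields, and the placeholder structure code `"c7"` used in the usage note are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set, Tuple


def table_name(structure_code: str, root: str, left: str, right: str) -> str:
    """Hypothetical naming format: the four parts joined by underscores."""
    return "_".join([structure_code, root, left, right])


class FormulaDatabase:
    def __init__(self) -> None:
        # Document information table: document id -> document information.
        self.documents: Dict[str, str] = {}
        # Formula information tables: table name -> {sequence -> document ids}.
        self.tables: Dict[str, Dict[Tuple[str, ...], Set[str]]] = defaultdict(dict)

    def add_formula(self, doc_id: str, structure_code: str, root: str,
                    left: str, right: str, sequence: Sequence[str]) -> None:
        name = table_name(structure_code, root, left, right)
        self.tables[name].setdefault(tuple(sequence), set()).add(doc_id)

    def lookup(self, structure_code: str, root: str, left: str,
               right: str, sequence: Sequence[str]) -> Set[str]:
        """Return the ids of documents containing the sequence, or an empty set."""
        table = self.tables.get(table_name(structure_code, root, left, right))
        if table is None:  # no such table: the search stops here
            return set()
        return table.get(tuple(sequence), set())
```

With this layout, a call such as `db.lookup("c7", "+", "v1", "\\times", seq)` first checks whether the named table exists and only then searches for the sequence inside it.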
4. Retrieval Algorithm
For the formula to be retrieved, we create the corresponding binary tree representation from its LaTeX form, obtain the structure code after normalizing the structure of the binary tree, and then search the formula database for the formula information table named by the structure code and the formula elements of the first two levels of the binary tree. If the table exists, we search it for the preorder traversal sequence of the binary tree. The retrieval algorithm is described in detail as follows.
Step 1. For a LaTeX document to be tested, extract all mathematical formulas to obtain the set of formulas to be retrieved, Formula, and go to Step 2.
Step 2. If Formula is nonempty, take a formula out of Formula, create its binary tree representation, and normalize the structure of the binary tree. Traverse the normalized tree in preorder and normalize the variable names to get its traversal sequence; then go to Step 3. Otherwise, go to Step 8.
Step 5. For each nonleaf node, if its element category is OPS and the heights of its left child and right child are identical, exchange its left and right subtrees. Traverse the tree in preorder and normalize the variable names to get the corresponding traversal sequence. If the sequence is not already in the sequence set, add it to the set. Finally, obtain the set of formula element sequences and go to Step 6.
Step 6. Search the table for a formula element sequence that is the same as one of the sequences in the set. If one exists, output the information of the documents containing the formula; otherwise, go to Step 7.
Step 8. End.
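Steps 5 and 6 can be illustrated with the following sketch, which follows one possible reading of Step 5: every combination of exchanges at OPS nodes whose children have equal height is enumerated, and duplicate sequences are dropped. The `Node` class, the function names, the fixed names `v1`, `v2`, ..., and the use of a Python set as a stand-in for a formula information table are assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Set, Tuple


@dataclass
class Node:
    element: str
    category: str  # "OPS" and "VAR" come from the text; other labels are assumed
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def height(node: Optional[Node]) -> int:
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


def preorder_variants(node: Optional[Node]) -> Iterator[List[Tuple[str, str]]]:
    """Yield a preorder sequence for every combination of allowed exchanges."""
    if node is None:
        yield []
        return
    orders = [(node.left, node.right)]
    if (node.category == "OPS" and node.left is not None
            and node.right is not None
            and height(node.left) == height(node.right)):
        orders.append((node.right, node.left))  # exchanged subtrees
    for first, second in orders:
        for head in preorder_variants(first):
            for tail in preorder_variants(second):
                yield [(node.element, node.category)] + head + tail


def rename_variables(sequence: List[Tuple[str, str]]) -> Tuple[str, ...]:
    mapping: dict = {}
    out = []
    for element, category in sequence:
        if category == "VAR":
            mapping.setdefault(element, f"v{len(mapping) + 1}")
            out.append(mapping[element])
        else:
            out.append(element)
    return tuple(out)


def candidate_sequences(root: Node) -> List[Tuple[str, ...]]:
    """Step 5 (as read here): collect the distinct normalized sequences."""
    found: List[Tuple[str, ...]] = []
    for variant in preorder_variants(root):
        sequence = rename_variables(variant)
        if sequence not in found:
            found.append(sequence)
    return found


def retrieve(root: Node, table: Set[Tuple[str, ...]]) -> bool:
    """Step 6 (as read here): succeed if any candidate sequence is in the table."""
    return any(sequence in table for sequence in candidate_sequences(root))
```

The number of candidate sequences can grow exponentially with the number of OPS nodes whose children have equal height, so a practical system would likely need to bound or prune this enumeration.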
5. Experimental Results
To verify the effectiveness of the proposed method on different types of mathematical formulas, we collected 1138 different mathematical formulas from 500 published research papers written in English and Chinese. We represent every mathematical formula as a binary tree according to its LaTeX description, normalize the structure of the binary tree, and obtain the structure code. We save the preorder traversal sequence, taken after variable names have been normalized, to the formula information table that is named by the structure code and the formula elements of the first two levels of the binary tree. At the same time, we save the information of these documents to the document information table.
The computational experiments were done on a Pentium 2.0 GHz machine with 2.0 MB memory, Windows XP SP3, and ACCESS 2007. The precision, recall, and F values are used to evaluate the retrieval performance of the algorithm. They are computed from three counts: the number of mathematical formulas retrieved correctly in the retrieval results, the number of mathematical formulas that should be retrieved but do not appear in the retrieval results, and the number of mathematical formulas that should not be retrieved but appear in the retrieval results.
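For reference, the sketch below computes these three measures from the three counts using the standard definitions, with F taken as the harmonic mean of precision and recall. That this is exactly the formula used in the experiments is an assumption, and the function and parameter names are hypothetical.

```python
from typing import Tuple


def precision_recall_f(correct: int, missed: int, spurious: int) -> Tuple[float, float, float]:
    """Standard precision, recall, and F value (assumed to match the paper's measures).

    correct:  formulas retrieved correctly
    missed:   formulas that should be retrieved but do not appear in the results
    spurious: formulas that should not be retrieved but appear in the results
    """
    retrieved = correct + spurious
    relevant = correct + missed
    precision = correct / retrieved if retrieved else 0.0
    recall = correct / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


p, r, f = precision_recall_f(correct=9, missed=1, spurious=3)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```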
To verify the performance of the proposed method, some mathematical formulas are modified according to Table 4.
In the experiments, retrieval was performed 2016 times; the average precision is 96.35%, the average recall is 95.38%, the average F value is 96.86%, and the retrieval time is 378 ms.
The experimental results show that the proposed method obtains high retrieval accuracy. The key reason is that the method realizes semantic retrieval. If the formula to be retrieved has the same meaning as the target formula, their binary trees have the same structure after structural normalization. Even if the target formula has more than one binary tree representation, after the variable names are normalized at least one preorder traversal sequence of its binary tree is the same as that of the formula to be retrieved. The retrieval speed of the proposed approach is also fast. The key reason is that the method first looks for the table named by the structure code and the formula elements of the first two levels of the binary tree; only if that table exists in the mathematical formula database does it search the table for the preorder traversal sequence of the formula to be retrieved.
6. Conclusion
Based on the binary tree representation of mathematical formulas, a mathematical formula retrieval method for LaTeX documents is introduced in this paper. Experimental results show that the algorithm not only realizes semantic retrieval of mathematical formulas but also achieves high retrieval precision and fast retrieval speed. The results achieved in offline retrieval suggest that the proposed method will work in the online case as well. A disadvantage of the current retrieval system is that it cannot retrieve a mathematical formula in LaTeX documents when the formula is solved. How to retrieve mathematical formulas in PDF and WORD documents will be the subject of our future research.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This study is partly supported by the National Natural Science Foundation of China (no. 61304149), the Polish-Norwegian Research Programme (Project no. Pol-Nor/200957/47/2013), the Natural Science Foundation of Liaoning Province in China (no. 201202003), and the Program for New Century Excellent Talents in University (no. NCET-11-1005).
References
J. Zobel and A. Moffat, “Exploring the similarity space,” ACM SIGIR Forum, vol. 32, no. 1, pp. 18–34, 1998.
A. Si, H. V. Leong, and R. W. Lau, “CHECK: a document plagiarism detection system,” in Proceedings of the ACM Symposium on Applied Computing, pp. 70–77, 1997.
J.-P. Bao, J.-Y. Shen, X.-D. Liu, and Q.-B. Song, “Survey on natural language text copy detection,” Journal of Software, vol. 14, no. 10, pp. 1753–1760, 2003.
J.-J. Zhao and X.-G. Hu, “A way to judge plagiarism in academic papers based on word-frequency statistics of paragraphs,” Computer Technology and Development, vol. 19, pp. 231–233, 2009.
N. Kang, A. Gelbukh, and S. Han, “PPChecker: plagiarism pattern checker in document copy detection,” in Text, Speech and Dialogue, vol. 4188 of Lecture Notes in Computer Science, pp. 661–667, 2006.
D. Martín-Albo, V. Romero, and E. Vidal, “An experimental study of pruning techniques in handwritten text recognition systems,” in Pattern Recognition and Image Analysis, pp. 559–566, Springer, New York, NY, USA, 2013.
J. M. Jin, H. Y. Jiang, and Q. R. Wang, “Mathematical expression recognition system: MatheReader,” Chinese Journal of Computers, vol. 29, no. 11, pp. 2018–2026, 2006.