Abstract
Traditional full-text retrieval models handle mathematical expressions poorly, because these are special objects that differ from ordinary text. To address this defect, a multimodal retrieval and ranking method for scientific documents based on hesitant fuzzy sets (HFS) and XLNet is proposed. The method integrates multimodal information, such as mathematical expression images and context text, as keywords to retrieve scientific documents. In the image modality, the images of mathematical expressions are recognized, and hesitant fuzzy set theory is introduced to calculate the hesitant fuzzy similarity between the query expressions and the mathematical expressions in candidate scientific documents. In the text modality, XLNet is used to generate word vectors of the mathematical expression context to obtain the similarity between the query text and the mathematical expression context of the candidate scientific documents. Finally, the multimodal evaluations are integrated, and a hesitant fuzzy set is constructed at the document level to obtain the final scores of the scientific documents and the corresponding ranked output. The experimental results show that the recall and precision of this method are 0.774 and 0.663, respectively, on the NTCIR dataset, and the average normalized discounted cumulative gain (NDCG) value of the top-10 ranking results is 0.880 on the Chinese scientific document (CSD) dataset.
1. Introduction
Scientific literature retrieval and ranking is an important way for researchers to obtain scientific and technological information. As an important part of scientific documents, mathematical expressions and contextual texts with mathematical semantics are the primary basis for scientific document retrieval and ranking. However, the traditional full-text retrieval model, which is designed for one-dimensional text, is not effective for the two-dimensional structure of mathematical expressions. Research on mathematical expression retrieval and ranking has made some progress, and methods and prototype systems [1–6] with mathematical retrieval functions have been proposed.
In terms of mathematical expression retrieval, WikiMirs3.0 [7] constructed a hybrid index composed of a formula index and a context index to make more comprehensive use of mathematical information. In addition, it calculated the importance of each formula in the document so that formulas could be distinguished from one another. Zhang and Youssef [8] proposed a multidimensional similarity index based on a vector model that evaluates five factors: system distance, data type level, matching depth, query coverage, and whether the match is a formula. From these five factors, the similarity between the query expression and the matching expression parsed from MathML can be calculated.
In research on retrieval and ranking that fuses mathematical expressions with textual information, MIaS [9] used the LRO (Leave Rightmost Out) method to split the original query, formed by combining keywords and mathematical expressions, into subqueries and merged the results using appropriate weighting to obtain results more relevant to the original topic. Zai and Tian [10] used FDS [11, 12] to parse the formulas and retrieved relevant documents using the obtained operators. A word embedding model then produces vectors for the input words and for the keywords in the documents, and the cosine distance between these vectors gives their similarity, which enables a more reasonable and comprehensive retrieval and ranking. The textual information of a mathematical expression is usually contained in the context of the expression. Kristianto [13] proposed the concept of mathematical expression dependency, using rich semantic information to obtain better accuracy and improve the retrieval results of the mathematical search system.
Multimodality refers to any combination of two or more modalities. Piergiovanni and Ryoo [14] proposed a joint multimodal representation space method, using adversarial formulations for unmatched text and video data to improve the joint embedding space. Frome et al. [15] proposed a deep visual semantic embedding model that uses the semantic information in labeled image data and unlabeled text to identify visual objects. Jin et al. [16] proposed a generalized deep multimodal hashing framework for scalable image-text and video-text retrieval that simultaneously explored feature representation learning, inter-modality similarity preserving, intra-modality semantic label preserving, and hash function learning with different types of loss functions. Shen et al. [17] proposed a novel unsupervised hashing method (multi-view discrete hashing) to learn compact hash codes from multi-view data. The method jointly learned the hash codes and cluster labels via factorization techniques and spectral analysis, and an efficient alternating algorithm was developed to optimize the model. The generated hash codes not only reflected the underlying semantics from multiple views but also had high discrimination. Lu et al. [18] proposed an Online Multi-modal Hashing with Dynamic Query-adaption (OMH-DQ) method that was designed to adaptively preserve the multimodal feature information in hash codes. Moreover, the online module was parameter-free, which avoided time-consuming and inaccurate parameter adjustment in the unsupervised query hashing process.
In the image recognition of mathematical expressions, the INFTY system [19] for mathematical documents used optical character recognition techniques to analyze the structure of mathematical expressions and converted printed mathematical expressions into LaTeX and XML markup formats. Deng et al. [20] explored an image-text generation technique and applied it to mathematical expression recognition, using a convolutional neural network (CNN) to extract image features and a recurrent neural network (RNN) for encoding and decoding.
The above-mentioned research on the recognition and retrieval of mathematical expressions has achieved certain results. However, a single-modal retrieval model has great limitations because mathematical expressions in scientific documents often exist in multiple forms, such as embedded descriptions and images. Based on this, this study proposes a multimodal retrieval method for scientific documents based on HFS [21, 22] and XLNet [23]. The method combines mathematical expression images and contextual text to improve the accuracy of retrieval results. The input form of mathematical expressions is no longer limited: mathematical expressions can be entered as images or as text, which increases the flexibility and practicability of retrieval. In addition, the context of a mathematical expression is closely related to the expression itself in scientific documents, and combining the mathematical expression with its context makes the retrieval and ranking of scientific documents more reasonable.
The contributions of this study can be summarized as follows:
(1) Multimodal retrieval is introduced into the retrieval task of scientific documents, and the complementarity between the image modality and the text modality is used to retrieve scientific documents.
(2) Mathematical expressions and their context are combined for retrieval and ranking, and XLNet is used to generate word vectors, so that a richer semantic representation of the mathematical expression context can be obtained.
(3) The hesitant fuzzy set is used to calculate the hesitant fuzzy measure of scientific documents, taking the attributes of the documents into account. In addition, Chinese scientific documents (CSD) were added to the retrieval dataset.
2. Model Framework
The multimodal retrieval and ranking process of scientific documents based on HFS and XLNet is shown in Figure 1.
First, in the query module, mathematical expression images and text keywords are provided as input.
The processing module of the image modality is used to calculate the similarity between the mathematical expressions in the query images and those in the candidate scientific documents. The LaTeX forms of the input mathematical expressions are obtained by recognizing the input images, and FDS is used to parse the recognition results. Then, hesitant fuzzy set theory is introduced to calculate the similarity between the mathematical expressions, and the results are returned to the document processing module.
The processing module of the text modality is used to calculate the similarity between mathematical expression contexts. The text in the context of mathematical expressions in the dataset is extracted and used to pretrain XLNet. XLNet is then used to calculate the similarity between the query text and the mathematical expression context of the candidate scientific documents.
The document processing module is used to output the documents in order. The document attributes are designed, the scores of the documents are calculated by the hesitant fuzzy set, and the ranking results are output in descending order of similarity.
3. Similarity Measure of Multimodal Mathematical Expressions
3.1. Mathematical Expression Image Model’s Similarity Measure
3.1.1. Mathematical Expression Image Recognition
The recognition model, which builds on the ViT and Transformer models proposed in the literature [24–26] for image tasks and sequence problems, is shown in Figure 2.
The model consists of a ViT [24] encoder with a deep residual network (ResNet) [25] backbone and a Transformer [26] decoder. The encoder is used for feature extraction, and the decoder converts the mathematical expression information in the image into LaTeX form. The experimental results show that the model reaches a Bilingual Evaluation Understudy (BLEU) score of 0.88.
3.1.2. Mathematical Expression Image Similarity
The hesitant fuzzy set proposed by Torra [21, 22] is used to measure the similarity between query expressions and candidate expressions. In a hesitant fuzzy set, the membership of an element is not a single number but a set of several possible membership degrees. The results can therefore be evaluated from multiple aspects, which avoids the errors caused by relying on a single evaluation value and reflects more objectively the hesitation that people show when making a judgment.
Definition 1 (hesitant fuzzy set). Let $X$ be a nonempty set. A hesitant fuzzy set $A$ on $X$ is defined as
$$A = \{\langle x, h_A(x) \rangle \mid x \in X\},$$
where $h_A(x)$ represents the set of possible membership degrees of $x$, which is a subset of the interval [0, 1] [21, 22]. Here, $x$ denotes the evaluation attributes, of which there may be one or more. Each group of evaluation attributes contains multiple evaluation indicators.
The similarity of the mathematical expressions parsed by FDS [11, 12] is calculated by the hesitant fuzzy set. The evaluation attribute of a mathematical expression is defined as a triple [27] consisting of the structural attribute, the operator attribute, and the operand attribute of the expression. The structural, operator, and operand characteristics of the expression are evaluated separately. Each evaluation attribute contains several evaluation indicators. By setting a membership function for each indicator, the hesitant membership degree between the query expression and each result expression is evaluated for each attribute.
Based on these attributes, the set of hesitant fuzzy evaluation attributes and the set of hesitant fuzzy elements are constructed, with one hesitant fuzzy membership function for each evaluation attribute.
(1) Structural Attribute
Definition 2. Following the subformula weight distribution method [28] of the traditional tree index structure, the flag, length, and operator level of the subexpression are used in place of the structural complexity, length, and depth of nodes in the traditional method. The structural membership function depends on the lowest form of the flag bit of the current subexpression, the flag of the subexpression in the expression, the length of the subexpression, the length of the entire expression, and the level of operators in the subexpression. When the subexpression appears several times in the query results, the average is taken as its attribute value.
(2) Operator Attribute. Here, the BM25 algorithm is used as the membership function of the operator index. The formula can be decomposed into three components. The first component is computed from the total number of expressions in the database and the number of expressions that contain the operator. The second component is the weight of the query operator in the database; it depends on the frequency of the operator in the database and on two empirical parameters. The third component is the weight of the query operator itself; it depends on the frequency of the operator in the user's query, which is usually set to 1 for shorter queries, and on a further empirical parameter.
The evaluation of the operand attribute is similar to that of the operator attribute, so the description is not repeated.
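The three-component structure described for the operator attribute matches the classic Okapi BM25 weighting. The following is a minimal sketch under that assumption, not the authors' implementation: the function name, the parameter defaults (k1, b, k2), the length normalization, and the use of the operator count inside the candidate expression as the frequency term are all assumptions made for illustration. The raw score is not rescaled to [0, 1] here.

```python
import math

def bm25_operator_score(query_ops, expr_ops, doc_freq, n_exprs, avg_len,
                        k1=1.2, b=0.75, k2=100.0):
    """Okapi BM25 score of one candidate expression for a list of query operators.

    query_ops: operators of the query expression (list of str)
    expr_ops:  operators of the candidate expression (list of str)
    doc_freq:  operator -> number of expressions in the database containing it
    n_exprs:   total number of expressions in the database
    avg_len:   average number of operators per expression
    k1, b, k2: empirical parameters (hypothetical default values)
    """
    # Length normalization used in the second component.
    length_norm = k1 * ((1 - b) + b * len(expr_ops) / avg_len)
    score = 0.0
    for op in set(query_ops):
        n = doc_freq.get(op, 0)
        f = expr_ops.count(op)     # frequency of the operator in the candidate
        qf = query_ops.count(op)   # frequency of the operator in the query
        idf = math.log((n_exprs - n + 0.5) / (n + 0.5))   # first component
        tf_part = f * (k1 + 1) / (f + length_norm)        # second component
        qf_part = qf * (k2 + 1) / (qf + k2)               # third component
        score += idf * tf_part * qf_part
    return score

example_score = bm25_operator_score(
    ["+", "frac"], ["+", "frac", "="],
    {"+": 50, "frac": 10, "=": 80}, n_exprs=100, avg_len=3)
```

In this toy database the operator "+" occurs in half of all expressions, so its first component is zero and only the rarer "frac" contributes to the score.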
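(3) Similarity Calculation. The similarity between the query expression and a candidate expression is calculated from their hesitant fuzzy elements, using the number of evaluation values and the jth elements of the two hesitant fuzzy elements being compared.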
Let be , and some of the retrieval results and the corresponding hesitant fuzzy sets are shown in Table 1.
Definition 3. Let the set of mathematical expressions corresponding to the document set be .
The mathematical expression similarity calculation algorithm is as follows:
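A minimal sketch of how such a hesitant fuzzy similarity could be computed is given here. It is an illustration under stated assumptions rather than the algorithm of this study: it uses one minus the normalized Hamming distance between hesitant fuzzy elements, pads the shorter element with its smallest value, and uses hypothetical attribute names and membership values.

```python
def pad(values, length):
    """Sort membership degrees in descending order and extend them to `length`
    by repeating the smallest degree (a pessimistic extension)."""
    ordered = sorted(values, reverse=True)
    return ordered + [ordered[-1]] * (length - len(ordered))

def hfe_distance(h1, h2):
    """Normalized Hamming distance between two non-empty hesitant fuzzy elements."""
    n = max(len(h1), len(h2))
    a, b = pad(h1, n), pad(h2, n)
    # The jth largest values of the two elements are compared pairwise.
    return sum(abs(x - y) for x, y in zip(a, b)) / n

def hfs_similarity(query, candidate):
    """Similarity of two hesitant fuzzy sets that share the same attributes:
    one minus the mean element distance over the attributes."""
    distance = sum(hfe_distance(query[k], candidate[k]) for k in query) / len(query)
    return 1.0 - distance

# Hypothetical membership degrees for the three expression attributes.
query_hfs = {"structure": [1.0], "operator": [1.0, 1.0], "operand": [1.0]}
candidate_hfs = {"structure": [0.8], "operator": [0.6, 0.4], "operand": [0.5]}
similarity = hfs_similarity(query_hfs, candidate_hfs)
```

Candidate expressions can then be sorted by this value in descending order before being passed to the document-level scoring.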

3.2. Mathematical Expression Context Similarity Measure
XLNet [23] is a generalized autoregressive pretraining model. The text in the documents is extracted, one-third of which is annotated to train XLNet, so that a richer semantic representation of the mathematical expression text can be obtained. The main structure is shown in Figure 3 (assuming the factorization order is 3 ⟶ 2 ⟶ 4 ⟶ 1).
The same keyword may have different meanings in different contexts, and textual information that explains a mathematical expression often appears around the expression. An example is the document “Parasitic capacitance.html.” The expression in this document is , and its contexts are “When two conductors at different potentials are close to one another, they are affected by each others’ electric field and store opposite electric charges like a capacitor” and “where C is the capacitance between the conductors.” The words “potentials,” “electric charges,” and “capacitance” may have different meanings in other contexts, so the vectors constructed for them also differ.
This study introduces the XLNet [23] language model to generate semantically rich word vectors. XLNet addresses a weakness of BERT: during training, BERT predicts the masked words independently of one another, so the dependency between masked words is not taken into account. The XLNet model achieves bidirectional context modeling on the basis of an autoregressive (AR) language model. When calculating text similarity, XLNet takes the semantic information of the word vectors fully into account, which improves the accuracy of the text similarity calculation.
The TF-IDF algorithm is used to extract keywords and their weights from the context of mathematical expressions. An analysis of a large number of scientific documents shows that the context of a mathematical expression is used to explain the expression and its symbols. The context is therefore closely related to the expression itself, and extracting it is very important for the retrieval of mathematical expressions. The contexts and keywords corresponding to two mathematical expressions are shown in Table 2.
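A minimal sketch of TF-IDF keyword extraction over expression contexts follows. The whitespace tokenization, the unsmoothed logarithmic IDF, the function name, and the sample contexts are assumptions made for illustration; the source does not specify the exact variant used.

```python
import math
from collections import Counter

def tfidf_keywords(contexts, index, top_k=3):
    """Return the top_k (term, weight) pairs for contexts[index].

    contexts: list of non-empty context strings, one per mathematical expression
    index:    position of the context whose keywords are wanted
    """
    docs = [c.lower().split() for c in contexts]
    # Document frequency: in how many contexts each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    tf = Counter(docs[index])
    n_docs, length = len(docs), len(docs[index])
    weights = {t: (c / length) * math.log(n_docs / df[t]) for t, c in tf.items()}
    # Highest weight first; ties are broken alphabetically for determinism.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

sample_contexts = [
    "the capacitance between the conductors",
    "the charge on the capacitor",
    "the field of the conductors",
]
keywords = tfidf_keywords(sample_contexts, index=0, top_k=2)
```

A word such as "the" occurs in every context, so its weight is zero and it never ranks as a keyword, whereas terms specific to one context receive the highest weights.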
4. Calculation of the Similarity of Scientific Documents
Retrieval and ranking of scientific documents is a comprehensive measurement over multiple attributes, including mathematical expressions and keywords. Different scientific documents have different meanings, even if they contain the same formula. Therefore, in this study, hesitant fuzzy sets are used to evaluate scientific documents from multiple aspects to obtain the final ranking.
The attribute of a scientific document is defined as a five-tuple consisting of the similarity attribute of the mathematical expression, the keyword similarity attribute, the relative position attribute of the expression, the frequency attribute of the expression, and the frequency attribute of the keyword. The mathematical expressions and keywords of scientific documents are evaluated with these attributes.
Definition 4. Each document has a corresponding keyword set, and word vectors are generated for the keywords of the document and for the query keyword.
Definition 5. The expression similarity attribute is the similarity between the mathematical expression of the query and the mathematical expressions in the candidate scientific document.
Definition 6. The keyword similarity attribute is the similarity between the query keyword and the keywords retrieved from the context in the candidate scientific document.
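As an illustration of the keyword similarity, the sketch below compares word vectors with cosine similarity and keeps the best-matching context keyword. Both choices are assumptions, as is every name in the code, and the short vectors are placeholders rather than real XLNet output.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def keyword_similarity(query_vec, context_vecs):
    """Similarity between the query keyword and the closest context keyword."""
    if not context_vecs:
        return 0.0
    return max(cosine_similarity(query_vec, v) for v in context_vecs)

# Placeholder two-dimensional vectors standing in for word embeddings.
best_match = keyword_similarity([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]])
```

Taking the maximum rewards a document as soon as one context keyword is close to the query; averaging over the keywords would be an equally plausible design.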
Definition 7. The position attribute expresses the position of the expression in the document. It is computed from the position at which the query expression first appears in the document and the total number of characters contained in the document.
Definition 8. The expression frequency attribute expresses the frequency of the query expression in the document. It is computed from three quantities: the feature weight coefficient of the number of mathematical expressions in the document, which is obtained by counting the number of expressions in all documents in the database; the number of expressions in the document that match the query expression; and the total number of expressions contained in the document.
Definition 9. The keyword frequency attribute expresses the frequency of the query keyword in the document. It is computed from three quantities: the feature weight coefficient of the number of keywords in the document, which is obtained by counting the number of keywords in all documents in the database; the number of keywords in the document that match the query keyword; and the total number of keywords contained in the document.
Definition 10. The scoring function calculates the score of each result document for the input query expressions and keywords over the five evaluation attributes of the document. It is computed from the jth largest elements of the two hesitant fuzzy elements being compared and the number of evaluation values included in each evaluation attribute. The attributes of the document are shown in Table 3.
The sorting algorithm of retrieval result documents is as follows:
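The ranking step can be sketched as follows. This is an assumption-laden illustration, not the algorithm of this study: each document carries one hesitant fuzzy element per attribute, the score is taken to be the mean of the per-attribute mean membership degrees (equivalent to one minus the normalized Hamming distance to an ideal element of all ones), and the attribute keys and sample values are hypothetical.

```python
def document_score(attributes):
    """Score one document from its five hesitant fuzzy evaluation attributes.

    attributes: attribute name -> non-empty list of membership degrees in [0, 1]
    """
    per_attribute = [sum(values) / len(values) for values in attributes.values()]
    return sum(per_attribute) / len(per_attribute)

def rank_documents(documents):
    """Return (file name, score) pairs in descending order of score."""
    scored = [(name, document_score(attrs)) for name, attrs in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical membership degrees for two candidate documents.
candidates = {
    "b.html": {"expr_sim": [0.2], "kw_sim": [0.2], "position": [0.2],
               "expr_freq": [0.2], "kw_freq": [0.2]},
    "a.html": {"expr_sim": [0.9, 0.8], "kw_sim": [0.7], "position": [0.6],
               "expr_freq": [0.5], "kw_freq": [0.4]},
}
ranking = rank_documents(candidates)
```

The output has the same shape as the file name and score pairs reported for the retrieval system.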

5. Experimental Process and Result Analysis
5.1. Experimental Data
For the image recognition part of mathematical expressions, we use the IM2LATEX-100K dataset for training and testing. The IM2LATEX-100K dataset contains 103,556 images of different mathematical expressions. The label data consist of the LaTeX format of the mathematical expressions.
For the scientific document retrieval and ranking part, the public dataset NtcirMathIRWikipediaCorpus (NTCIR) is used, from which 31,742 documents containing 518,929 mathematical expressions are extracted. In addition, Chinese scientific documents (CSD) are added to expand the dataset; they comprise 10,372 documents and 121,495 mathematical expressions.
5.2. System Experiments
5.2.1. Image Recognition of Mathematical Expressions
In this study, the image recognition model [24–26] is used to recognize mathematical expression images, and extensive experiments are conducted on different types of mathematical expression images. Measured by BLEU, the model reaches a score of 0.88.
Five different types of mathematical expression images are selected to illustrate the recognition algorithm, and the recognition results are shown in Table 4 (the content of each image is given there as text).
5.2.2. Ablation Study
Ten groups of formulas and keywords, listed in Table 5, are selected as queries for retrieval. The proposed method includes three main parts, and the performance improves step by step as each part is added. The baseline experiment is image expression retrieval. The final reranking stage of the proposed method has the best performance. The average recall rates of the proposed method are 77.4% and 77.8%, and the average precision rates are 66.3% and 69.2%. All results are shown in Table 6.
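For reference, the precision and recall of a single query can be computed as in the sketch below, using the standard set-based definitions. The document identifiers are hypothetical, and averaging over the ten queries is left to the caller.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query.

    retrieved: identifiers of the documents returned by the system
    relevant:  identifiers of the documents judged relevant to the query
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Hypothetical result list and relevance judgments.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```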
5.2.3. Performance on NTCIR Dataset
In this section, the proposed method is compared with some traditional and existing methods on the NTCIR dataset. FDS + Word Embedding [10] combines FDS and word embedding to retrieve scientific documents: FDS is used to parse expressions, and word embedding is used to generate the word vectors of keywords in scientific documents. It is hereinafter referred to as Method 1. SearchOnMath [29] is a mathematical formula retrieval tool that aims at accurately matching mathematical expressions. However, SearchOnMath implements pure mathematical expression retrieval and does not consider the important information of the scientific document itself. It is hereinafter referred to as Method 2. MIaS [4] is based on the full-text search engine Apache Lucene and processes text and math separately; the text is tokenized and stemmed to unify inflected word forms. It is hereinafter referred to as Method 3.
In this study, NDCG is used to evaluate the ranking results. It is obtained by normalizing the DCG (discounted cumulative gain) of the search results:
$$\mathrm{NDCG}_n = \frac{\mathrm{DCG}_n}{\mathrm{IDCG}_n},$$
where $n$ is the number of search results, $\mathrm{DCG}_n$ accumulates the relevance score of each result with a discount that grows with its rank, and $\mathrm{IDCG}_n$ is the ideal value, which indicates that the search results are all related to the query expression.
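A minimal NDCG sketch is given below. It uses the common exponential gain with a base-2 logarithmic discount, which is an assumption because the exact gain form used in this study is not recoverable here; the relevance grades are hypothetical expert scores.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with gain 2**rel - 1 and discount log2(rank + 1)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(relevances, n=10):
    """NDCG of the first n results; the ideal ranking sorts the relevance
    scores in descending order."""
    ideal = sorted(relevances, reverse=True)[:n]
    ideal_dcg = dcg(ideal)
    return dcg(relevances[:n]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical relevance grades of one ranked result list.
perfect = ndcg([3, 2, 1])
reversed_order = ndcg([1, 2, 3])
```

A ranking that already lists results in descending order of relevance scores exactly 1, and any other order of the same results scores lower.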
The query formulas and keywords in Table 5 are taken as queries, and the expert-rated top-10 ranking results of the proposed method and the other methods are shown in Figure 4. Method 2 starts with a higher value than the proposed method, but as the number of retrieved expressions increases, the proposed method scores consistently higher than Method 2. The average NDCG of the proposed method is higher than that of the other three methods, and its average NDCG (n = 10) is 0.865 on the NTCIR dataset. The experimental results show that the ranking performance of the proposed method is better and the retrieval results are more reasonable.
5.2.4. Performance on CSD Dataset
In this section, the proposed method is compared with Method 1 on the NTCIR dataset expanded with Chinese scientific documents (CSD), which comprise 10,372 documents and 121,495 mathematical expressions. The experimental results are shown in Figure 5.
It can be seen that the NDCG of the proposed method is higher than that of the comparison method. The average NDCG (n = 10) of the proposed method is 0.88 on the CSD dataset. The results of the proposed method are therefore more reasonable, and the retrieval and ranking performance is improved.
5.2.5. Retrieval System
A large number of experiments are conducted for different expressions, and the first ten search results are selected for display. When the input formula image shows $P(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}$ and the keyword is “Poisson,” some of the search results are shown in Table 7.
First, the method recognizes the LaTeX form of the formula as “P\left({X = k} \right) = \frac{{{\lambda ^k}}}{{k!}}{e^{-\lambda }}” and finds a collection of documents containing similar formulas. The XLNet model is then used to obtain the word vectors of “Poisson” and of the context keywords of the expressions in the documents, and the similarity between them is calculated. Finally, based on the keyword and formula information, the similarity of the documents is calculated again using the hesitant fuzzy set, and the documents are sorted and output. In Table 7, FileName is the name of the document in which the expression is located, and Score is the document score.
6. Conclusion
Based on a retrieval and ranking mode that combines mathematical expression images and text, this study proposes a multimodal retrieval and ranking method for scientific documents based on HFS and XLNet. The method obtains the LaTeX structure information of mathematical expressions through image recognition and thereby overcomes the single-modal limitation of scientific document retrieval. The similarity between mathematical expressions is obtained by hesitant fuzzy set evaluation, which avoids the reliance of traditional mathematical expression evaluation on a single criterion. In combination with the context of the mathematical expressions, words similar to the query keywords are obtained with XLNet, which enriches retrieval that would otherwise rely on mathematical expressions alone. Finally, the similarity over the attributes of the mathematical expressions and the keywords in the documents is calculated through the hesitant fuzzy set, which makes the ranking of the retrieval results of scientific documents more reasonable.
The method also has some shortcomings. In the future, the following points will be considered for improvement:
(1) Only mathematical expressions whose recognition results are in LaTeX form are analyzed; other forms of mathematical expressions (such as MathML) will be analyzed.
(2) The evaluation attributes of documents will be further improved, and more evaluation attributes of document similarity will be added.
(3) Only images and texts are analyzed; an attempt will be made to extend the multimodality more widely and to apply voice or video to retrieval.
Data Availability
Our data are still needed for the next stage of our research, so they cannot be provided directly. The data can be made available upon request via email to the corresponding author.
Conflicts of Interest
The authors declare no conflicts of interest.
Acknowledgments
This work is supported by Hebei Natural Science Foundation, China (No. F2019201329), and the Science and Technology Project of Hebei Education Department (No. QN2018214).