Abstract

Question answering (QA) systems have attracted considerable attention in recent years. They receive the user’s questions in natural language and respond to them with precise answers. Most of the work on QA was initially proposed for the English language, but some research studies have recently been performed on non-English languages. Answer selection (AS) is a critical component in QA systems. To the best of our knowledge, there is no research on AS for the Persian language. Persian is a (1) free word-order, (2) right-to-left, (3) morphologically rich, and (4) low-resource language. Deep learning (DL) techniques have shown promising accuracy in AS. Although DL performs very well on QA, it requires a considerable amount of annotated data for training. Many annotated datasets have been built for the AS task; most of them are exclusively in English. In order to address the need for a high-quality AS dataset in the Persian language, we present PASD, the first large-scale native AS dataset for the Persian language. To show the quality of PASD, we employed it to train state-of-the-art QA systems. We also present PerAnSel: a novel deep neural network-based system for Persian question answering. Since Persian is a free word-order language, in PerAnSel we parallelize a sequential method and a transformer-based method to handle the various word orders of the Persian language. We then evaluate PerAnSel on three datasets: PASD, PerCQA, and WikiFA. The experimental results indicate strong performance on the Persian datasets, outperforming state-of-the-art answer selection methods on the PASD, PerCQA, and WikiFA datasets in terms of MRR.

1. Introduction

Question answering (QA) systems are a branch of artificial intelligence that employs machine learning techniques with the aim of automatically answering questions asked by humans. In general, humans have used several ways to find answers to questions, such as asking experts and searching through text-based documents. Given the volume of digital and nondigital text resources, it is time-consuming to investigate all of these resources to answer a question [1]. Recently, advances in machine learning and deep learning techniques, high computing speed, and web resources have encouraged researchers to take advantage of the computer’s ability to find answers among web resources [2].

Information retrieval (IR) systems were the initial type of QA systems; traditional search engines were actually IR systems. These systems do not find the exact answer to the question; instead, they provide the user with relevant web pages, which may include answers, and the user must find the exact answers in the returned pages. In contrast, QA systems seek to find the exact answer to the question. Generally, QA systems can be divided into two categories: (1) knowledge-based QA systems and (2) information retrieval-based (IR-based) QA systems. Knowledge-based systems [3] deploy structured resources such as massive knowledge graphs to find exact answers. In these graphs, the nodes are entities—objects, events, situations, or abstract concepts—and the edges connect pairs of entities and show the relationship of interest between them. While knowledge-based QA systems have shown great performance for some specific domains, building and updating knowledge graphs is a time-consuming process. IR-based QA systems [4] attempt to find the answer to a question inside unstructured documents such as web pages. These systems eliminate the need to build and update knowledge graphs; instead, they have to deal with new challenges such as machine reading comprehension (MRC). MRC systems scan unstructured documents and extract meaning from raw text [5].

IR-based QA systems consist of four general components: (1) question processing, (2) information retrieval, (3) answer extraction, and (4) answer selection. The question processing component extracts the required information from the user’s question and applies the necessary modifications to the question, if needed. The information retrieval component retrieves the passages relevant to the user’s question from the documents. The answer extraction component then extracts the exact answer to the question from the retrieved passages. The answer selection component tries to detect the best answer for the user’s question. Nowadays, most QA systems concentrate on factoid questions, that is, questions that can be answered with facts expressed in a few words [6].

Many QA systems have been developed for the English language [6]. Recently, some research studies have been performed for other languages [7–9]. Most of the works on QA for non-English languages have focused on the question processing [10] and answer extraction [11] components. While it has been shown that the performance of the answer selection component has a significant impact on the overall performance of a QA system [12], only a limited number of research studies have addressed this component.

To the best of our knowledge, there is no research on answer selection methods for the Persian language. About 110 million people in Iran, Tajikistan, Afghanistan, and six other countries speak Persian. Persian is a free word-order, morphologically rich, low-resource, and right-to-left language [13]. The standard word order of Persian is subject-object-verb (SOV), although all other orders (SVO, OSV, VSO, etc.) are acceptable. In addition, Persian is low-resource; that is, there are not enough resources for training learning algorithms for this language. Because Persian is low-resource, in this article we generate the first large-scale native dataset for answer selection in Persian. In addition, because Persian is a free word-order language, we present a novel method to address the answer selection problem in QA systems for the Persian language. In this method, we parallelize a sequential method, containing convolutional neural networks (CNNs) [14] and recurrent neural networks (RNNs) [15], with transformer-based methods such as bidirectional encoder representations from transformers (BERT) [16] to handle the various word orders of the Persian language. Moreover, to handle the rich morphology of Persian, we use the BERT language model; Özçift et al. [17] demonstrated that BERT can cope with morphologically rich languages. The following research questions are explored in this article:
(i) Does using a native dataset for the answer selection task show better performance than using a translated dataset for the Persian language?
(ii) Does our novel method perform better on the native dataset than state-of-the-art methods for the Persian language?
(iii) Is there any difference between methods for the standard word order (SOV) and other word orders?
(iv) Does multilingual BERT show better performance than monolingual BERT on the Persian language?
(v) Does using the output of the question processing component improve the performance of the answer selection component for the Persian language?

Since there is no large-scale native answer selection dataset for training and evaluating answer selection models for the Persian language, in this article we generate a large-scale native dataset for Persian called PASD (Persian Answer Selection Dataset). PASD contains about 20,000 questions and 100,000 question-answer pairs. In addition, we translate the WikiQA [18] dataset into Persian, producing WikiFA, in order to evaluate the translation approach for the Persian language.

Our method called PerAnSel is a novel method that uses two deep learning methods in parallel for the Persian language. PerAnSel consists of two components: (1) SOVWO (subject-object-verb word order) and (2) OWO (other word orders). SOVWO utilizes 1-D CNN and LSTM (long short-term memory) networks in order to handle standard word order (SOV). OWO utilizes transformer-based models in order to handle nonstandard word orders (VSO, OSV, etc.).

The contributions of this article are as follows:
(i) Due to the lack of a large-scale native dataset for the Persian language, we provide a large-scale native dataset for the answer selection task in Persian.
(ii) We propose a novel method, called PerAnSel, for answer selection in QA systems for the Persian language. PerAnSel uses sequential models such as LSTM and 1-D CNN to process sentences with SOV word order. These models suit SOV sentences because SOV is the standard word order of Persian and most sentences are stated in this order. PerAnSel deploys a transformer-based language model to process sentences with other word orders. The transformer-based model is composed of fully connected neural networks and an attention mechanism [19], which enable it to address the morphological richness of the Persian language [17].
(iii) In order to address the answer selection problem for the Persian language, we use transformer-based models and sequential models in parallel.
(iv) Inspired by Ref. [20], we present a question processing method for the Persian language. The experiments show that this improves the accuracy of QA systems.

Our dataset (PASD) and all the code implemented in this article are freely available for public use at https://github.com/BigData-IsfahanUni/PerAnSel. First, to evaluate the quality of the proposed dataset, we implemented some state-of-the-art models and fine-tuned them on the PASD dataset. After investigating the quality of PASD, we evaluated the PerAnSel model using the PASD dataset. Using PerAnSel, we achieved an MRR (mean reciprocal rank) [21] score that is better than state-of-the-art models.

This article is organized as follows: In Section 2, related works are described. In Section 3, the process of generating the translated and native datasets is explained. In Section 4, the proposed method for answer selection is presented. In Section 5, the baseline models, implementation details, and evaluation metrics are described. In Section 6, the experimental results, the discussion of the answer selection and question processing methods, and the error analysis are presented. Finally, the article is concluded in Section 7.

2. Related Works

In this section, a comprehensive survey of existing answer selection studies is provided. These studies are classified into two groups: (1) works that build a dataset for answer selection and (2) works that propose answer selection methods. We first present the works performed on the English language and then the methods proposed for other languages.

2.1. English

English is one of the most widely used languages in the world [22], and many works on QA systems have focused on it.

2.1.1. Datasets

One of the early datasets for the answer selection task is TrecQA. This dataset was created from the TREC-8 to TREC-13 QA tracks, using the TREC-8 to TREC-12 tracks for the training set and the TREC-13 track for the dev set and test set. TrecQA comes in two versions: the raw version and the clean version. The raw version [23], which is the first version of this dataset, contains 1229 questions for the training set, 82 questions for the dev set, and 100 questions for the test set. In the clean version [24], the questions that have no candidate answers, or whose candidate answers are all correct or all incorrect, are removed. After applying these changes, 1229 questions remain for the training set, 65 questions for the dev set, and 68 questions for the test set. To generate the training set, two approaches were used: (1) manual judgement and (2) automatic judgement. In the manual judgement approach, crowdworkers were employed to annotate 94 questions, and this training set was named TRAIN. In the automatic judgement approach, automatic methods were leveraged to annotate 1229 questions, and this set was named TRAIN-ALL.

To create the WikiQA dataset, Yang et al. [18] employed the Bing search engine query logs. They believed that the questions searched in search engines are more similar to the real-world questions of users. Based on this, they extracted some questions from the Bing query logs and detected the Wikipedia pages the questions were related to. Eventually, they generated candidate answers from the sentences of the summary section of the related Wikipedia page. Some questions in this dataset include only incorrect candidate answers; these items are included in the original version of the dataset but are ignored in most research studies. This dataset contains 2118 questions for the training set, 296 questions for the dev set, and 633 questions for the test set.

The InsuranceQA dataset [25] is the first released answer selection dataset in the insurance field and was collected from the Insurance Library website (http://www.insurancelibrary.com). The questions are composed by real users, and the answers are high-quality answers prepared by professional users. A unique feature of this dataset is the large number of candidate answers for each question: 500 candidate answers are considered per question, where the incorrect candidate answers are the correct answers to other questions. This dataset contains 12889 questions for the training set, 2000 questions for the dev set, and 2000 questions for the test set.

The SelQA dataset [26] contains annotated questions on various topics from Wikipedia. Its creators eliminated the limitations on the number of questions and the scope of topics that existed in other datasets and proposed a new annotation scheme to create a large corpus. This dataset contains 5529 questions for the training set, 785 questions for the dev set, and 1590 questions for the test set.

2.1.2. Methods

The methods proposed to solve the answer selection for the English language can be categorized into six groups: (1) feature-based, (2) Siamese-based, (3) attention-based, (4) compare-aggregate-based, (5) language model-based, and (6) other methods.

Feature-based methods utilize feature engineering on questions and candidate answers to solve the answer selection task. These methods select the final answer based on the words shared between the question and the candidate answers [27]. Since feature-based methods use exact matches between the question’s and candidate answers’ words, they cannot distinguish synonymous words, and even using lexical resources such as WordNet [28] could not fix this shortcoming. Later, dependency trees and edit distance algorithms [29, 30] were employed for feature extraction; in these methods, the candidate answers are ranked in increasing order of the edit distance between the question’s dependency tree and the candidate answer’s dependency tree.

Siamese-based models are based on the Siamese neural network architecture. A Siamese neural network employs a shared-weight neural network to process two different input vectors and generate an output vector representation for each input [31]. In the answer selection problem, the two inputs are a question sentence and a candidate answer sentence. Once the output vectors are generated for the question and the candidate answer, the generated output vector representations are compared, and their relevance is calculated. Yu et al. [32] applied the Siamese architecture with deep learning to the answer selection task; their model used a convolutional neural network (CNN) as the shared-weight neural network and logistic regression to compute the relevance between the question and the candidate answer. He et al. [33] presented the multi-perspective convolutional neural network (MPCNN) model. They used a CNN with multiple window sizes and multiple types of pooling as the shared-weight neural network and employed multiple distance functions, such as cosine distance, Euclidean distance, and element-wise difference, to calculate the relevance. They showed that this model generates high-quality representation vectors for the question and the candidate answer. In this regard, Rao et al. [34] presented a novel pairwise ranking approach and implemented the MPCNN model with it; the authors argued that using pairwise ranking rather than pointwise ranking leads to higher-quality output vector representations for the question and the candidate answer. Kamath et al. [35] used a simple recurrent neural network (RNN) as the shared-weight neural network and employed logistic regression to calculate the similarity between the question and the candidate answer; they showed that integrating the question classification and answer selection components eliminates the need for a heavyweight neural network to solve the answer selection task.

Rather than processing the question and the candidate answer separately based on the Siamese neural network architecture, attention-based models, inspired by the attention mechanism [19], use context-sensitive interaction between the question and the candidate answer to calculate the similarity. Yang et al. [36] leveraged an RNN to implement the attention mechanism for the answer selection task. He et al. [37] showed that using CNNs instead of RNNs in attention-based models leads to higher-quality output vector representations for the question and the candidate answer. Finally, Mozafari et al. [38] showed that using the attention mechanism, convolutional neural networks, and pairwise ranking at the same time improves the quality of the output vector representations.

The compare-aggregate-based models follow the Compare-Aggregate framework [39]. In this framework, context-sensitive interaction between smaller units such as words or tokens is used, and by aggregating the calculated interaction values, the relevance between the question and the candidate answer is computed. He and Lin [40] presented the first method that uses the compare-aggregate approach for answer selection; they performed word-level matching instead of sentence-level matching and used a CNN to aggregate the interaction values. Wang et al. [41] showed that performing word-level matching in both directions of the input word order and using a BiLSTM (bidirectional LSTM) to aggregate the matching values produces more meaningful output vector representations than the He and Lin [40] method.

Recently, language model-based models have been widely used, and their results have shown that their performance is better than that of the prior methods. These models use pretrained language models instead of convolutional or recurrent neural networks. This enables the model to gain sufficient knowledge of the source language and to understand the meaning of the question and the candidate answer better. Yoon et al. [42] proposed one of the first models that use a language model to solve the answer selection task; in their research, the ELMo (embeddings from language models) language model [43] was employed. Mozafari et al. [44] showed that using recurrent neural networks on top of language models such as BERT [16] produces higher-quality output vectors than using the language model output vector alone. Laskar et al. [45] showed that using heavier language models such as RoBERTa (robustly optimized BERT pretraining approach) [46] enhances answer selection models’ performance. However, Mozafari et al. [47] showed that the heaviness of the language model is not a criterion for a high-performance answer selection model; they indicated that the DistilBERT language model [48], a lighter model than BERT, performs better. Shonibare [49] showed that various rankings, such as pairwise and triplet rankings, can improve answer selection models that utilize language models. Han et al. [50] also showed that utilizing the passages of the candidate answers along with the questions and candidate answers increases model performance.

Some methods do not fall into the earlier categories; in these methods, the authors investigate novel paths to solve answer selection. Shen et al. [51] implemented the KABLSTM model, which employs knowledge graphs through a proposed context-knowledge interactive learning architecture. Jin et al. [52] proposed a new ranking method and used a multitask learning framework.

2.2. Other Languages

For the Chinese language, several datasets are available. Some of them are closed-domain and were created for medical purposes, whereas others are open-domain. Several datasets are also available for languages such as Portuguese and Arabic. Both native construction and translation have been used for generating these datasets.

2.2.1. Datasets

The cMedQA dataset [53] is a closed-domain medical dataset for the Chinese language. This dataset consists of online medical questions and answers from the XunYiWenYao website (http://www.xywy.com). This dataset contains 50,000 questions for the training set, 2,000 questions for the dev set, and 2,000 questions for the test set.

Zhang et al. [54] extended the cMedQA dataset and doubled the number of questions. This new dataset contains 100,000 questions for the training set, 4,000 questions for the dev set, and 4,000 questions for the test set.

The cEpilepsyQA dataset [55], like the cMedQA datasets, consists of medical questions from the XunYiWenYao website. The difference lies in how the negative candidate answers are selected: in this dataset, the negative candidate answers are more similar to the correct answer. This dataset contains 3920 questions for the training set, 490 questions for the dev set, and 490 questions for the test set.

The DBQA dataset [56] is an open-domain dataset. During dataset construction, annotators were asked to extract a sentence from a document and generate a question for that sentence. This dataset contains 8772 questions for the training set, 5779 questions for the dev set, and 2500 questions for the test set.

The MilkQA dataset [57] is a closed-domain dataset prepared for the Portuguese language. The questions are about dairy; they were asked by people with various backgrounds and answered by Embrapa’s customer service experts. This dataset contains 2307 questions for the training set, 50 questions for the dev set, and 300 questions for the test set.

The WikiQAar dataset [58] is an Arabic dataset produced by translating the WikiQA dataset into Arabic. The number of questions in this dataset is the same as the WikiQA dataset.

The CQA-MD dataset [59] is a closed-domain Arabic dataset for community question answering in the domain of medical forums. This dataset is collected from WebTeb (http://www.webteb.com), Al-Tibbi (http://www.altibbi.com), and the medical corner of Islamweb (http://consult.islamweb.net). It contains 1031 questions for the training set, 250 questions for the dev set, and 250 questions for the test set.

Currently, there is only one work on building a native answer selection dataset for the Persian language. Jamali et al. [60] created the PerCQA (Persian Community Question Answering) dataset, a dataset for community question answering, based on questions and answers posed by users in the Ninisite (https://www.ninisite.com) forum. PerCQA contains 692 questions for the training set, 99 questions for the dev set, and 198 questions for the test set. To the best of our knowledge, there is currently no large-scale native QA dataset for answer selection in Persian, neither as a monolingual nor as a cross-lingual dataset. In this article, we present the first large-scale native dataset for the Persian language, called PASD. This dataset contains 17567 questions for the training set, 1000 questions for the dev set, and 1000 questions for the test set. Every question in the PASD dataset has five candidate answers.

2.2.2. Methods

There are also some research studies performed on non-English languages such as Chinese and Arabic. For example, Zhang et al. [54] proposed a multiscale attentive network to capture the interaction between questions and candidate answers. Zhang et al. [61] took advantage of the Siamese neural network architecture and proposed a hybrid model combining convolutional neural networks and recurrent neural networks. Chen et al. [55] presented character-level embeddings of Chinese texts and used the co-attention mechanism and a fusion layer to capture the interaction between the user’s question and the candidate answers. Almiman et al. [62] presented a weighted ensemble model for the Arabic language, which combines the outputs of three classification models to produce the final prediction score. To the best of our knowledge, there is currently no method for the answer selection task for the Persian language. In this article, we also present a method for answer selection for this language, called PerAnSel.

Table 1 provides a review of the datasets, and Table 2 provides a summary of the models.

3. Dataset

State-of-the-art models in machine learning tasks deploy deep learning algorithms. Deep learning algorithms require a considerable amount of data for training. In order to use deep learning algorithms in the answer selection task, a large-scale dataset consisting of annotated data is required. As mentioned earlier, no research has been conducted on answer selection in the Persian language, and there is no large-scale native dataset for the answer selection task in Persian. In this section, we describe two datasets for the answer selection task in Persian: (1) WikiFA and (2) PASD. To create the PASD dataset and implement our model, we need the BERT language model. In the following, we describe this language model and several of its derivatives.

BERT [16] is a transformer-based language model published by Google. It was a revolution in the NLP (natural language processing) community for various tasks, including text classification, question answering, and natural language inference. BERT’s key technical innovation is applying the bidirectional training of the Transformer to language modeling. Devlin et al. [16] employed the encoder of the Transformer [63] to learn language representations. Transformer encoders consist of self-attention components instead of LSTMs. Unlike the LSTM, the self-attention mechanism is fast to train because all the words are processed simultaneously. In transformer encoders, self-attention layers process the input in parallel. Algorithm 1 outlines the computation performed by the BERT language model.

Input: a sentence/pair of sentences
Output: new embeddings for input tokens
for all encoders do
  for all input tokens x_i do
   for all self-attention layers h = 1, …, H do
    q_i^h ← x_i · W_Q^h
    k_i^h ← x_i · W_K^h
    v_i^h ← x_i · W_V^h
    z_i^h ← softmax(q_i^h · (K^h)^T / sqrt(d_k)) · V^h
   end for
   ẑ_i ← [z_i^1; …; z_i^H] · W_O
   x_i′ ← MLP(ẑ_i)
  end for
  inputs ← {x_1′, …, x_|S|′}
end for
return inputs

Assume $x_i$ indicates the $d$-dimensional embedding vector of the $i$-th input token, and $X$ indicates the $|S| \times d$ matrix whose rows are these vectors. There are $H$ self-attention layers (heads) in each transformer encoder. The $h$-th self-attention layer generates the vector $z_i^h$ as its output. This vector is produced using three vectors—Query $q_i^h$, Key $k_i^h$, and Value $v_i^h$—which are the result of multiplying $x_i$, the embedding vector of the $i$-th token, by $W_Q^h$, $W_K^h$, and $W_V^h$. $W_Q^h$, $W_K^h$, and $W_V^h$ are learnable parameters, which are learned during the training phase. The following equations show these operations; in (4), $\sigma$ denotes the softmax function:

$$q_i^h = x_i W_Q^h \qquad (1)$$
$$k_i^h = x_i W_K^h \qquad (2)$$
$$v_i^h = x_i W_V^h \qquad (3)$$
$$z_i^h = \sigma\!\left(\frac{q_i^h (K^h)^\top}{\sqrt{d_k}}\right) V^h \qquad (4)$$

The outputs of $z_i^1$ to $z_i^H$ are concatenated, and the vector $z_i$ is produced. By multiplying $z_i$ by the matrix $W_O$, the final vector $\hat{z}_i$ is produced as the final output of all the self-attention layers. $W_O$ is a learnable matrix. The following equation shows the multiplication:

$$\hat{z}_i = \left[z_i^1; \ldots; z_i^H\right] W_O \qquad (5)$$

The generated vector $\hat{z}_i$ is transferred to a multilayer perceptron, and $x_i'$ is produced. This is the new embedding vector for the $i$-th token. $W_1$, $b_1$, $W_2$, and $b_2$ are learnable parameters. This multilayer perceptron is shown in the following equation:

$$x_i' = \mathrm{GELU}(\hat{z}_i W_1 + b_1)\, W_2 + b_2 \qquad (6)$$

The $x_i'$ vectors are transferred to the next encoder. This operation is repeated for the number of encoders.

MBERT (multilingual BERT) [16] is a BERT-base language model trained on Wikipedia documents in 104 languages using a masked language modeling (MLM) objective. The model has 177M learnable parameters. DistilmBERT [48] is a distilled version of the BERT-base multilingual model, trained on Wikipedia in the same 104 languages. It has 134M parameters compared to 177M parameters for MBERT and, on average, is twice as fast. ParsBERT [64] is a BERT-base language model trained on a massive amount of public Persian corpora, including the Persian Wikipedia dumps (https://dumps.wikimedia.org/fawiki/), MirasText (https://github.com/miras-tech/MirasText), and six other manually crawled text sources, with more than 3.9M documents, 73M sentences, and 1.3B words. ALBERT-Persian (AlbertFA, A Lite BERT) [65] is an ALBERT-base model for the Persian language, trained, like ParsBERT, on the Wikipedia dumps, MirasText, and other crawled text data.

3.1. WikiFA

We build WikiFA by translating an English dataset into Persian. We create this dataset in order to evaluate the translation approach for the Persian language. We use machine translation to translate the instances in WikiQA [18] into Persian. To this end, we deploy the Google Translate API and translate each record in WikiQA into Persian. Assume that each record in WikiQA is in the form (QID, Q_EN, DID, DTitle, AID, A_EN, Label), where QID is the question id, Q_EN is the English question, DID is the document id, DTitle is the document title, AID is the candidate answer id, A_EN is the English candidate answer, and Label is the candidate answer’s label. For each record, we add (QID, Q_FA, DID, DTitle, AID, A_FA, Label) to WikiFA, where Q_FA is the translation of Q_EN into Persian, and A_FA is the translation of A_EN into Persian. Figure 1 shows the production process of the WikiFA dataset.
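For illustration, the translation step can be sketched in Python as follows. The record layout mirrors the description above, and the unofficial googletrans package is used as a stand-in for the Google Translate API actually employed; both the field names and the use of googletrans are assumptions for illustration only.

from googletrans import Translator  # unofficial Google Translate client (assumption)

translator = Translator()

def to_persian(text):
    # Translate an English field into Persian.
    return translator.translate(text, src="en", dest="fa").text

def build_wikifa_record(record):
    # record: (q_id, q_en, d_id, d_title, a_id, a_en, label), as described above
    q_id, q_en, d_id, d_title, a_id, a_en, label = record
    return (q_id, to_persian(q_en), d_id, d_title, a_id, to_persian(a_en), label)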

3.2. PASD

There are some machine reading comprehension datasets for Persian [66, 67]. We build PASD by using the PersianQuAD dataset [67]. PersianQuAD is the first large-scale native machine reading comprehension dataset for question answering for the Persian language. It contains about 20000 questions posed by native annotators on a set of Persian Wikipedia articles. To build PersianQuAD, the annotators were shown the paragraphs of the Persian Wikipedia articles; then, they were asked to pose some questions on the paragraph and highlight the answer within the paragraph text. In order to use a question answering dataset to create an answer selection dataset, two challenges should be addressed:
(1) In the question answering dataset, the answer to each question is within the paragraph, while for the answer selection dataset, candidate answers must be proper sentences.
(2) In the question answering dataset, only the exact answer is specified for each question, while the answer selection dataset also requires incorrect candidate answers.

To address the first challenge, we retrieve the sentence that contains the answer as the answer sentence. The answer_start value indicates the character index at which the exact answer starts in the paragraph. To detect the answer sentence, the paragraph is first tokenized into its sentences. Then, by accumulating the lengths of the sentences, the sentence whose span contains the answer_start value is taken as the answer sentence. Algorithm 2 describes this process.

Input: paragraph, answer_start
Output: the answer sentence
(1) sentences ← SentenceTokenizer(paragraph) {The paragraph is tokenized into its sentences}
(2) total ← 0
(3) for all sent in sentences do
(4)  if answer_start < total + Length(sent) then
(5)   return sent
(6)  end if
(7)  total ← total + Length(sent)
(8) end for
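For illustration, Algorithm 2 can be rendered in Python as follows; the regular-expression sentence splitter stands in for a proper Persian sentence tokenizer and is a simplifying assumption.

import re

def find_answer_sentence(paragraph, answer_start):
    # Return the sentence of the paragraph that contains the character
    # offset answer_start (a simplified re-implementation of Algorithm 2).
    for match in re.finditer(r"[^.!?؟]+[.!?؟]?", paragraph):
        if match.start() <= answer_start < match.end():
            return match.group().strip()
    return None  # answer_start lies beyond the paragraph text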

To address the second challenge, that is, to specify incorrect candidate answers for each question, one could use random sentences from the corresponding paragraph as incorrect candidate answers. However, this leads to low-quality incorrect answers. To produce a high-quality answer selection dataset, incorrect answers should be similar to correct answers, both semantically and grammatically.

In this article, we present a retrieval-based approach to produce appropriate incorrect answers for each question. We first downloaded the Persian Wikipedia documents (https://dumps.wikimedia.org/fawiki/20201220/fawiki-20201220-pagesarticles-multistream.xml.bz2), which are used for building the PersianQuAD dataset. We extracted individual paragraphs from the documents by the wikiextractor library (https://github.com/attardi/wikiextractor). We then used the information retrieval component to retrieve the most relevant paragraphs to each question in PersianQuAD dataset. As for the retriever, we used the whoosh library (https://whoosh.readthedocs.io/en/latest) and implemented a passage retrieval component. It receives the Persian Wikipedia paragraphs and a question in the PersianQuAD dataset as inputs, and returns the top 10 paragraphs related to the question. Figure 2 shows the procedure of retrieving relevant paragraphs to each question in the PersianQuAD dataset, using the passage retrieval component.
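The following Python sketch illustrates such a retriever with the whoosh library; the schema, field names, and index directory are our own assumptions rather than the exact configuration used to build PASD.

import os
from whoosh import index
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser

def build_index(paragraphs, index_dir="wiki_index"):
    # Index the Persian Wikipedia paragraphs extracted by wikiextractor.
    schema = Schema(pid=ID(stored=True), content=TEXT(stored=True))
    os.makedirs(index_dir, exist_ok=True)
    ix = index.create_in(index_dir, schema)
    writer = ix.writer()
    for pid, paragraph in enumerate(paragraphs):
        writer.add_document(pid=str(pid), content=paragraph)
    writer.commit()
    return ix

def retrieve(ix, question, k=10):
    # Return the top-k paragraphs most relevant to the question.
    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse(question)
        return [hit["content"] for hit in searcher.search(query, limit=k)]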

To extract the answer of the question from the retrieved paragraphs, we used the answer extraction component. We fine-tuned the MBERT model [16] on the PersianQuAD dataset and prepared a model to find the exact answers (https://github.com/BigData-IsfahanUni/PersianQuAD). By passing the question and the returned paragraphs to the model, it finds the exact answer in each paragraph. After finding the exact answers in the paragraphs, we asked two human annotators to determine whether the extracted answers can be considered incorrect answers to the questions. Finally, we select four incorrect answers for each question. Figure 3 shows the procedure of extracting candidate answer sentences using an MBERT QA model. The distribution of interrogative words in the PASD dataset is similar to that of the PersianQuAD dataset. Table 3 shows the statistics of the PASD dataset based on this distribution.
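The candidate extraction step described above can be sketched with the Hugging Face question-answering pipeline as follows; the model identifier is a placeholder for the fine-tuned MBERT model (not its published name), and the sketch reuses the find_answer_sentence helper shown after Algorithm 2.

from transformers import pipeline

# "path/to/mbert-finetuned-persianquad" is a placeholder, not a published model id.
qa_model = pipeline("question-answering", model="path/to/mbert-finetuned-persianquad")

def extract_incorrect_candidates(question, paragraphs):
    # Run the QA model on each retrieved paragraph and return the sentence
    # containing each predicted answer span as a candidate (incorrect) answer.
    candidates = []
    for paragraph in paragraphs:
        prediction = qa_model(question=question, context=paragraph)
        candidates.append(find_answer_sentence(paragraph, prediction["start"]))
    return candidates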

Finally, we asked human annotators to determine the expected answer type (EAT) for each question in the PASD dataset. We used the coarse-grained EAT classes, which are commonly used as EATs [20]: HUM, LOC, ENTY, and NUM. HUM class shows that the question is looking for a person or an organization as an exact answer. In this regard, LOC is looking for a location, ENTY is looking for a product or an object, and NUM is looking for a date or a time.

Overall, in comparison with PersianQuAD, whose records include a question and an exact answer, the records of PASD contain a question, an exact answer, an answer sentence, an annotated answer sentence, and an EAT. Moreover, each question has one correct answer and four incorrect answers. PASD is generated for use in answer selection systems, while PersianQuAD is appropriate for MRC systems. We report the statistics of the PASD and WikiFA datasets in Tables 4 and 5, respectively.

4. The Proposed Method

In this section, we present the PerAnSel method for the answer selection task for the Persian language. As mentioned earlier, an IR-based QA system consists of four main components: (1) question processing, (2) information retrieval, (3) answer extraction, and (4) answer selection. First, the system receives a question from the user. In the first step, we extract the EAT [20] from the question and pass it to the answer selection component. In the second step, a retriever is used to retrieve the most relevant paragraphs to the question. In the third step, an answer extraction method is utilized to extract the candidate answers to the question from the retrieved paragraphs. Finally, in the fourth step, PerAnSel selects the best answer from the pool of candidate answers. Figure 4 shows the architecture of the QA system and the PerAnSel method. Algorithm 3 shows the process of our system. The details of each step are explained in the following sections.

Input: question q
Output: the best answer a*
(1) EAT ← QuestionProcessing(q)
(2) paragraphs ← InformationRetrieval(q)
(3) candidates ← AnswerExtraction(q, paragraphs)
(4) q_rep ← SentenceRepresentation(q)
(5) scores ← ∅
(6) for all a in candidates do
(7)  a_EAT ← Preprocessing(EAT, a)
(8)  a_rep ← SentenceRepresentation(a_EAT)
(9)  scores[a] ← RelevanceMeasurement(q_rep, a_rep)
(10) end for
(11) return argmax_a scores[a]
4.1. Question Processing

This component extracts the EAT from the question. The EAT shows the type of answer expected for the question [35]. For example, the EATs for the questions "who is the best soccer player in history?" and "where is the highest mountain in the world?" are Person and Location, respectively.

We implement a method based on the BERT language model to detect the EAT of the question. In this method, the question is passed to the BERT kernel as an input sentence. Then, the [CLS] token output vector is transferred to a fully connected network with one hidden layer; the output layer has one neuron per EAT class and indicates the EAT of the question. Figure 5 shows the architecture of this method.
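A minimal PyTorch sketch of this classifier is given below, assuming MBERT as the kernel; the hidden layer size is our own choice, since the article does not report the exact dimensions.

import torch.nn as nn
from transformers import AutoModel

class EATClassifier(nn.Module):
    # BERT kernel plus a fully connected network over the [CLS] output vector.
    def __init__(self, kernel="bert-base-multilingual-cased", hidden=256, num_eats=4):
        super().__init__()
        self.bert = AutoModel.from_pretrained(kernel)
        self.fc = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_eats),  # HUM, LOC, ENTY, NUM
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0]  # [CLS] token output
        return self.fc(cls_vector)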

4.2. Information Retrieval

As mentioned earlier, the QA systems find the answer to each question in the web pages. To this end, some methods are proposed such as ad-hoc IR methods [68] and neural IR methods [69]. Recently, neural IR methods have been mostly used in QA systems. These methods encode the question and each paragraph using neural networks and generate a dense vector representation for the question and the paragraph. Then, the similarity between these inputs is measured. Finally, the most relevant paragraphs are returned.

4.3. Answer Extraction

To find candidate answers to the question, a machine reading comprehension method is used. To this end, a BERT language model for QA can be used. This method encodes the question and relevant paragraphs using the BERT model. Then, the output vector of each token of the relevant paragraph is passed to a fully connected network, and a score is measured for each token. Finally, based on the scores, the start span token and end span token are specified. The sentence that contains these tokens is returned as the candidate answer.

4.4. Answer Selection

The answer selection component selects the best answer among a set of candidate answers to the question. In this article, we propose the PerAnSel method, an answer selection method presented for the Persian language. PerAnSel is a Siamese-based method that uses pairwise ranking [33] and consists of three main components: (1) preprocessing, (2) sentence representation, and (3) relevance measurement. The preprocessing component gives higher priority to the candidate answer sentences that have the same EAT as the question. The sentence representation component generates a meaningful vector for the question and the candidate answers. The relevance measurement component measures the relevance between the question and the candidate answer. The sentence representation component consists of two main components: (1) SOVWO and (2) OWO. In the following sections, we describe these components.

4.4.1. Preprocessing

In this component, we deploy Hooshvare NER (https://github.com/hooshvare/parsner) to determine the NEs (Named Entities) in the candidate answer sentences. For example, the sentence "Messi was the best player of LaLiga in 2015" includes three entity types: person, organization, and time. The annotated sentence is shown in the following equation:

$$\text{[Messi]}_{\text{person}} \text{ was the best player of } \text{[LaLiga]}_{\text{organization}} \text{ in } \text{[2015]}_{\text{time}} \qquad (7)$$

The answer selection component uses the EAT of the question and gives higher priority to the candidate answer sentences that have the same EAT as the question within their NEs. To this end, the NEs in the candidate answer sentence should be mapped to the corresponding class in EATs. Table 6 shows the mapping between EATs and the corresponding NEs in Hooshvare NER. This component then replaces every token of the candidate answer sentence whose NE type matches the EAT with the SPECIALTOKEN token. Algorithm 4 indicates the preprocessing step in the answer selection.

Input: EAT, Candidate Answer (a)
Output: answerEAT
(1) answerEAT ← NER(a) {The NEs of the candidate answer are tagged}
(2) tokens ← Tokenizer(answerEAT) {The answer is tokenized to its tokens}
(3) for all token in tokens do
(4)  if NEType(token) = EAT then
(5)   token ← SPECIALTOKEN
(6)  end if
(7) end for
(8) answerEAT ← Detokenize(tokens)
(9) return answerEAT
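A Python sketch of Algorithm 4 follows; the NER interface and the EAT-to-NE mapping are simplified assumptions (the actual mapping is given in Table 6).

# Hypothetical mapping from EAT classes to NE tags (see Table 6 for the real one).
EAT_TO_NE = {"HUM": {"PER", "ORG"}, "LOC": {"LOC"}, "NUM": {"DAT", "TIM"}, "ENTY": {"PRO"}}

def preprocess(eat, answer_tokens, answer_tags):
    # answer_tokens / answer_tags: token list and NE tag list produced by the NER model.
    out = []
    for token, tag in zip(answer_tokens, answer_tags):
        if tag in EAT_TO_NE.get(eat, set()):
            out.append("SPECIALTOKEN")  # mask tokens whose NE type matches the question's EAT
        else:
            out.append(token)
    return " ".join(out)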
4.4.2. Sentence Representation

We prepare a method called PERSEL (PERsian SELection) to generate a dense vector representation for the question and the answer candidate. As shown in Figure 6, PERSEL consists of the SOVWO and OWO methods. In this method, we generate a vector $v_{\mathrm{SOVWO}}$ by using the SOVWO method and a vector $v_{\mathrm{OWO}}$ by using the OWO method. Then, the sentence representation is generated based on Equation (8):

$$\mathrm{SentRep}(s) = \alpha \cdot v_{\mathrm{SOVWO}} + \beta \cdot v_{\mathrm{OWO}} \qquad (8)$$

$\alpha$ and $\beta$ show the coefficients of the SOVWO and OWO methods, respectively, for the Persian language. These coefficients are learned during the training phase. Algorithm 5 shows the process of the sentence representation component.

Input: sent
Output: SentRepoutput
(1) v_SOVWO ← SOVWO(sent)
(2) v_OWO ← OWO(sent)
(3) SentRepoutput ← α · v_SOVWO + β · v_OWO
(4) return SentRepoutput
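The combination in Equation (8) and Algorithm 5 can be written as a small PyTorch module; the module below operates on precomputed SOVWO and OWO representations, which are assumed to have the same dimensionality.

import torch
import torch.nn as nn

class PERSEL(nn.Module):
    # Combines the SOVWO and OWO sentence representations with learnable coefficients.
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # coefficient of SOVWO
        self.beta = nn.Parameter(torch.tensor(0.5))   # coefficient of OWO

    def forward(self, v_sovwo, v_owo):
        # v_sovwo, v_owo: representations produced by the SOVWO and OWO components
        return self.alpha * v_sovwo + self.beta * v_owo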

(1) SOVWO. We examine SOVWO to show the performance of sequential models on sentences with SOV word order. This method is appropriate for the standard word order (SOV), because most sentences of the Persian language are stated in this order. As shown in Figure 7, the SOVWO method consists of a 1-D CNN subcomponent and an LSTM subcomponent. For the CNN subcomponent, the window size is 4, the padding value is 3, the number of filters is 300, and the pooling function is max-pooling. Moreover, the hidden state of the LSTM subcomponent is a fixed-size vector.

In the SOVWO method, the input sentence is first tokenized. We then represent each token by its corresponding word embedding vector from the pretrained fastText 300-dimensional vectors [70]. Afterward, we stack the word embedding vectors of the input sentence’s tokens and generate an |S| × 300 matrix to represent the input sentence, where |S| is the number of tokens. Finally, this matrix is transferred as the input sentence representation to the CNN and the LSTM subcomponents. The output of the CNN subcomponent is a vector, and the output of the LSTM subcomponent is another vector. By concatenating the output vectors of the two subcomponents, the SOVWO output vector is generated for the input sentence. Algorithm 6 shows the process of the SOVWO method.

Input: sent
Output: SOVWOoutput
(1) tokens ← tokenizer(sent)
(2) E ← fastText(tokens) {|S| × 300 embedding matrix}
(3) v_CNN ← CNN(E)
(4) v_LSTM ← LSTM(E)
(5) SOVWOoutput ← Concat(v_CNN, v_LSTM)
(6) return SOVWOoutput
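A PyTorch sketch of the SOVWO component follows, using the hyperparameters named above (window size 4, padding 3, 300 filters, max pooling) over pretrained fastText embeddings; the LSTM hidden size is an assumption, since the article does not report it.

import torch
import torch.nn as nn

class SOVWO(nn.Module):
    # 1-D CNN and LSTM over fastText word embeddings; outputs their concatenation.
    def __init__(self, embedding_dim=300, n_filters=300, window=4, lstm_hidden=300):
        super().__init__()
        self.conv = nn.Conv1d(embedding_dim, n_filters, kernel_size=window, padding=3)
        self.lstm = nn.LSTM(embedding_dim, lstm_hidden, batch_first=True)

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, 300) fastText vectors of the tokenized sentence
        conv_out = torch.relu(self.conv(embeddings.transpose(1, 2)))
        v_cnn = conv_out.max(dim=2).values        # max pooling over the time dimension
        _, (h_n, _) = self.lstm(embeddings)
        v_lstm = h_n[-1]                          # final LSTM hidden state
        return torch.cat([v_cnn, v_lstm], dim=1)  # SOVWO output vector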

(2) OWO. We examine OWO to deploy the power of fully connected neural networks and the attention mechanism for sentences with nonstandard word orders. This method is appropriate for all word orders, such as SVO and OSV. As shown in Figure 8, this method utilizes an LSTM and a fully connected neural network; the hidden state of the LSTM is a fixed-size vector.

In the OWO method, the kernel is the BERT language model. The input sentence is tokenized using WordPiece or SentencePiece (we use WordPiece for MBERT, DistilmBERT, and ParsBERT, and SentencePiece for AlbertFA). By passing these tokens to the BERT language model, an output vector is generated for each token. By concatenating these vectors, a matrix is produced, each row of which is the output vector of an input token (|S| is the length of the input sentence). Afterward, we pass this matrix to the LSTM and take its final hidden state. The [CLS] token output vector is passed to a fully connected neural network with one hidden layer. Finally, we concatenate the LSTM output vector and the output of the fully connected network to generate the OWO output vector. Algorithm 7 shows the process of the OWO method.

Input: sent
Output: OWOoutput
(1) tokens ← tokenizer(sent)
(2) H ← BERT(tokens) {|S| × d matrix of token output vectors}
(3) v_LSTM ← LSTM(H)
(4) v_CLS ← H_[CLS]
(5) v_FC ← FC(v_CLS)
(6) OWOoutput ← Concat(v_LSTM, v_FC)
(7) return OWOoutput
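A corresponding PyTorch sketch of the OWO component is given below, assuming MBERT as the kernel; the LSTM hidden size and the fully connected layer size are our own placeholders.

import torch
import torch.nn as nn
from transformers import AutoModel

class OWO(nn.Module):
    # BERT token outputs -> LSTM; [CLS] output -> fully connected network; concatenate.
    def __init__(self, kernel="bert-base-multilingual-cased", lstm_hidden=300, fc_hidden=300):
        super().__init__()
        self.bert = AutoModel.from_pretrained(kernel)
        size = self.bert.config.hidden_size
        self.lstm = nn.LSTM(size, lstm_hidden, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(size, fc_hidden), nn.ReLU())

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        _, (h_n, _) = self.lstm(hidden)            # LSTM over all token output vectors
        v_cls = self.fc(hidden[:, 0])              # fully connected network over [CLS]
        return torch.cat([h_n[-1], v_cls], dim=1)  # OWO output vector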
4.4.3. Relevance Measurement

This component measures the relevance between the question and the candidate answers. It is composed of a fully connected neural network and generates a value that specifies the relevance. To perform this, we concatenate the outputs of the sentence representation component for the question and the candidate answer. Then, we pass this vector to a fully connected neural network with one hidden layer and a single output neuron, followed by a sigmoid function. Algorithm 8 shows the process of the relevance measurement component.

Input: q_rep, a_rep
Output: score
(1) v ← q_rep ⊕ a_rep {⊕ is the concatenation operator}
(2) h ← FC_hidden(v)
(3) o ← FC_output(h)
(4) return σ(o) {σ is the sigmoid function}
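A minimal PyTorch sketch of the relevance measurement component, with an arbitrarily chosen hidden layer size, is given below.

import torch
import torch.nn as nn

class RelevanceMeasurement(nn.Module):
    # Fully connected network over the concatenated question/answer representations.
    def __init__(self, rep_size, hidden=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * rep_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, q_rep, a_rep):
        score = self.net(torch.cat([q_rep, a_rep], dim=1))
        return torch.sigmoid(score)  # relevance between question and candidate answer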

5. Experiments

5.1. Baseline Models

As mentioned in Section 4, we proposed a method called PerAnSel for the answer selection task for the Persian language. We consider four kernels for the OWO method: ParsBERT [64] and AlbertFA [65] as Persian kernels, and DistilmBERT [48] and MBERT [16] as multilingual kernels. We compare this method to two baseline methods: (1) ASBERT and (2) CETE.

ASBERT [49] focuses on ranking methods; it employs Siamese and triplet networks to encode input sentences with the BERT language model for the answer selection task. CETE [45] focuses on language models; it utilizes language models such as ELMo, BERT, and RoBERTa to encode sentences for the answer selection task.

5.2. Implementation Details

In order to implement the PerAnSel method, we used the PyTorch framework in Python 3.7. We trained and evaluated the models in the Google Colab (https://colab.research.google.com) environment on an NVIDIA Tesla T4 with 16 GB of memory. The batch size is 8 for the question classifier and 4 for the answer selection method. The activation function is GELU for the language models and ReLU for the fully connected networks, LSTMs, and CNNs.

To train the models, we train the proposed model for 4 epochs for the question classifier and 2 epochs for the answer selection method. WarmupLinearSchedular [71] is used to schedule the learning rate: the learning rate increases linearly from a low initial value and remains constant thereafter, which reduces volatility in the early stages of training. The AdamW optimizer is used to train all models. Table 7 shows the number of training parameters of the methods, and Table 8 shows the training time of the models.

In order to evaluate the question classifier, we used the accuracy metric. Accuracy shows the proportion of correct predictions to the total number of predictions, as given in Equation (9). To evaluate the answer selection method, we used the MRR metric. MRR is a measure for evaluating methods that generate a list of possible responses to a set of queries, ordered by relevancy [21]. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first relevant answer: 1 for first place, 1/2 for second place, 1/3 for third place, and so on. The mean reciprocal rank is the average of the reciprocal ranks over all queries. In our system, the queries are the questions, and the responses are the candidate answers. Equation (10) shows this metric:

$$\mathrm{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}} \qquad (9)$$

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{\mathrm{rank}_j} \qquad (10)$$

In Equations (9) and (10), $Q$ denotes the set of questions in the dataset, and $\mathrm{rank}_j$ is the rank of the first correct answer to question $q_j$.
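The MRR computation described above can be sketched as follows; each entry of ranked_labels is the list of candidate-answer labels for one question, ordered by the system's predicted relevance score, with 1 marking a correct answer.

def mean_reciprocal_rank(ranked_labels):
    # ranked_labels: one list per question, candidate labels sorted by predicted score.
    total = 0.0
    for labels in ranked_labels:
        for rank, label in enumerate(labels, start=1):
            if label == 1:            # first correct (relevant) answer
                total += 1.0 / rank
                break
    return total / len(ranked_labels)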

6. Results and Discussion

6.1. Answer Selection

In this article, we present PASD, the first large-scale native answer selection dataset for Persian. We also present the PerAnSel method to solve the answer selection problem for the Persian language, with three variants of its sentence representation: (1) SOVWO, (2) OWO, and (3) PERSEL. For the variants that use BERT (OWO and PERSEL), we examined four versions of BERT (ParsBERT, AlbertFA, DistilmBERT, and MBERT) in each model. Hence, we build eight BERT-based answer selection systems according to the core answer selection method and the BERT version examined. Table 9 shows the description of the systems.

We also implement two baseline systems: (1) ASBERT and (2) CETE. We train each of the answer selection systems using the training set of each dataset and evaluate them on the test set, according to the MRR measurement described in Section 5.2. Table 10 and Figure 9 show the performance of the answer selection systems on the WikiFA, PASD, and PerCQA [60] datasets. We also report the learned coefficients α and β of the PERSEL method in Table 11.

We derive the following observations from the results:
(i) The SOVWO method has the worst performance among the proposed methods. This is because of the model's lack of knowledge of the language and of the answer selection task: the method consists of CNN and LSTM networks with no prior knowledge, and its training parameters start from random weights.
(ii) The performance of the OWO and PERSEL methods improves as the kernel is switched to ParsBERT, AlbertFA, DistilmBERT, and MBERT, respectively. This is because of the quality and the volume of the information used to train these language models.
(iii) The PERSEL method has the best performance. We hypothesize that this is because the method supports all kinds of word orders, such as SOV, SVO, and OSV: the SOVWO component processes the SOV word order, and the BERT component processes the other word orders.
(iv) The OWO and PERSEL methods have better performance than the CETE method. This is because using only the [CLS] token output of the BERT language model performs worse than using the output vectors of all tokens.
(v) OWO-MBERT and PERSEL-MBERT have better performance than the ASBERT method. We hypothesize that this can be attributed to the fact that using pairwise ranking together with the Siamese architecture performs better than the mere use of the Siamese architecture.
(vi) Experimental results on the WikiFA and PASD datasets show that performance on the native dataset (PASD) is better than on the translated dataset (WikiFA). This is because the language quality of the dataset significantly impacts the accuracy and performance of the answer selection system.
(vii) Although PASD and PerCQA are both native datasets, the experimental results show that the models perform better on PASD than on PerCQA. We hypothesize that this can be attributed to the fact that deep learning models require a considerable amount of annotated data for training to reach acceptable performance.
(viii) Experimental results on the WikiFA dataset show that, unlike on PASD and PerCQA, the learned coefficient α (SOVWO) is less than β (OWO). We hypothesize that this is because the words of translated sentences appear in more varied orders than those of native sentences, which are mostly in the SOV word order.
(ix) α and β are closer together for the PerCQA dataset than for the PASD dataset. This is because the language of PerCQA is informal Persian, while the language of PASD is formal Persian. In the PASD dataset, native annotators tend to compose sentences in the standard word order (SOV), so the effect of SOVWO is more significant than that of OWO.

6.2. Question Classifier

In Section 3, we presented the PASD dataset to be used in the answer selection task. In Section 3.2, we enhanced the dataset for question processing and also presented a question classifier, which uses PASD as the training set and classifies the questions. In this section, we evaluate the question classifier both intrinsically and extrinsically. In the intrinsic evaluation, we measure the performance of the question classifier in terms of accuracy. In the extrinsic evaluation, we measure the impact of the question classifier on the answer selection task. Table 12 shows the accuracy of the question classifier with the four kernels examined, trained on the PASD dataset.

Table 12 shows that the best accuracy is obtained by using MBERT as the kernel of the question classifier. This can be attributed to the quality and the volume of the information used to train the language models. Table 12 also indicates that monolingual language models such as ParsBERT and AlbertFA have lower accuracy than multilingual language models such as DistilmBERT and MBERT. Moreover, the superiority of MBERT over DistilmBERT can be attributed to its larger number of learnable parameters.

In order to measure the impact of the question classifier component on the answer selection task, as mentioned in Section 4.4.1, we utilize the output of the question processing component in the answer selection systems. Table 13 shows the performance of the answer selection systems using the question classifier component on the PASD, PerCQA, and WikiFA datasets. As the question classifier kernel, we used MBERT, which showed the best performance.

Here we observe the following:
(i) The performance of BERT-based methods is better than that of non-BERT methods.
(ii) Combining the question classifier with the PERSEL method performs best.
(iii) The performance of the model on the WikiFA dataset is reduced by adding the question processing component. We hypothesize that this can be attributed to the fact that detecting the EAT of automatically translated sentences in WikiFA is more challenging than for native sentences, because the syntactic and semantic structures of translated sentences are of low quality.

6.3. Error Analysis

In this section, we analyze errors of the question classifier and answer selection method and indicate which interrogative words these methods are compatible with. Table 14 shows the error analysis on question classifier, and Table 15 shows the error analysis results on the PERSEL method on the PASD dataset.

According to Tables 14 and 15, and Figure 10, we observe the following:
(i) Table 14 shows that most errors are related to the interrogative word why. This is because there is no EAT corresponding to why questions; in other words, the exact answer to a why question is a multiword expression, which does not match any EAT. Also, answering this type of question requires reasoning and logic.
(ii) Table 15 shows that using the question processing component is very effective in answering some questions: the MRR for six interrogative words (what, how, when, where, who, which) is improved compared with a system that does not use the question classifier.
(iii) Figure 10 demonstrates that the MRR measure for each interrogative word is improved, except for the word why. This is because the exact answer to a why question is a multiword expression, which does not match any EAT.

7. Conclusion

In this article, we presented the first large-scale native answer selection dataset for the Persian language, called PASD. We also proposed an answer selection model called PerAnSel for the answer selection task in Persian QA systems. Evaluating PerAnSel on the Persian language shows its superiority over the state-of-the-art methods. The Persian language is a free word-order language: the standard word order in Persian is SOV, but other word orders are also correct. In PerAnSel, we parallelize a sequential and a transformer-based method to handle the various word orders of the Persian language. The results show that sequential models such as LSTM and 1-D CNN work better on the standard word order (SOV), while transformer-based models such as the BERT language model, composed of fully connected networks and an attention mechanism, work well for the other word orders in the Persian language. As future work, we can mention the use of generative methods to generate datasets [72]; in addition to translated and native datasets, an automatically produced dataset generated by such methods could be employed.

Data Availability

Data used to support the findings of this study are available from the GitHub Repository at https://github.com/BigData-IsfahanUni/PerAnSel and https://github.com/PerCQA/PerCQA-Dataset.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

J.M. conceived the study and helped with the methodology and software; J.M., A.K., and P.M. validated the study. J.M. did the formal analysis; A.K. and P.M. investigated the study; J.M., A.K., and P.M. provided the resources; J.M. and A.K. curated the study; J.M. prepared the original draft; A.K. and P.M. reviewed and edited the manuscript; J.M. visualized the study; A.K., P.M., and M.N. supervised the study; and M.N. administered the project. All authors have read and agreed to the published version of the manuscript.

Acknowledgments

The authors thank all the members of the Big Data Research Group, University of Isfahan. They also thank the colleagues at the University of Isfahan for helping them to prepare the PASD.