Abstract

Representation of language is the first and critical task for Natural Language Understanding (NLU) in a dialogue system. Pretraining an embedding model and fine-tuning it for intent classification and slot filling is a popular and well-performing approach, but it is time-consuming and inefficient for low-resource languages. Concretely, out-of-vocabulary words and transfer to different languages are two tough challenges for multilingual pretrained and cross-lingual transfer models. Furthermore, quality-assured parallel data are necessary for current frameworks. To step over these challenges, and different from existing solutions, we propose a novel approach, the Hypergraph Transfer Encoding Network (HGTransEnNet). The proposed model leverages off-the-shelf high-quality pretrained word embedding models of resource-rich languages to learn high-order semantic representations of low-resource languages in a transductive clustering manner via hypergraph modeling, which does not need parallel data. The experiments show that the representations learned by HGTransEnNet for a low-resource language are more effective than those of state-of-the-art language models, which are pretrained on large-scale multilingual or monolingual corpora, in intent classification and slot-filling tasks on Indonesian and English datasets.

1. Introduction

Pretrained language models, such as ELMo [1], BERT [2], RoBERTa [3], and XLNet [4], play vital roles in modern neural NLP systems, as they learn widely applicable and informative representations of words and sentences [5–7]. With such high-quality semantic representations, the performance of models on most downstream tasks, such as text generation [8] or text classification [9, 10], keeps rising. Recently, multilingual BERT [11] and the Bilingual Generative Transformer (BGT) [12] for low-resource languages have drawn attention in both the research literature and industry. However, pretraining a specific and reliable word embedding model from scratch for a low-resource language requires a large-scale corpus and expensive computing costs. Meanwhile, it is unwise and redundant to spend so much effort on each low-resource language, as there are hundreds of low-resource languages in the world. Consequently, to our knowledge, the popular strategies for embedding low-resource languages fall into two branches: (i) utilizing a multilingual pretrained word embedding model [2, 11] and fine-tuning it directly with annotated training data; (ii) cross-lingual transfer [13] from a resource-rich language based on a multilingual pretrained embedding model, using designed methods such as aligning vectors [14, 15] or mixing codes [16].

Figure 1: The blue circles denote resource-rich natural language sentences (i.e., the English corpus), each of which has its encoded representation from the pretrained embedding model. The green circles denote low-resource natural language sentences (i.e., the Indonesian corpus). By learning the high-order semantic representation from the hypergraph, the representation of the Indonesian corpus is generated.

Under these two solutions, we still have to face and overcome corresponding challenges. For the first branch, despite the large scale of multilingual pretrained models, many words of a low-resource language are still not included in the vocabulary, which leads to the out-of-vocabulary problem. Fine-tuning the multilingual pretrained model is one way to update and adapt it to the relevant data. However, it is technically challenging, as the hyper-parameters for fitting need to be picked carefully [17]. Otherwise, the model will either lose valuable information learned from the original multilingual embedding model or merge the new corpus into the embedding latent space poorly. Moreover, large amounts of irrelevant and useless language embeddings also make the word representation model sparse [11]. The second challenge comes from the cross-lingual branch. These methods not only suffer from the obstacles of the multilingual pretrained model but also accumulate more representation loss during training or fine-tuning [17]. They also rely heavily on large amounts of parallel corpus, which is expensive to collect and hard to quality-control [13].

In this study, we address the above problems via a Hypergraph-based Transfer Encoding Network (HGTransEnNet), which learns a high-order representation of semantic information (rather than a fine word-level representation) for a low-resource language, benefiting from a high-quality monolingual pretrained word embedding model in a transductive clustering manner. The proposed HGTransEnNet is built upon a cross-lingual hypergraph structure, as illustrated in Figure 1, which is fed with a bilingual but nonparallel corpus. The hypergraph structure collects knowledge and explores and learns the co-relationships of high-order semantic representations shared between the low-resource language and the resource-rich language within the same domain. We conduct extensive experiments on an annotated Indonesian dialogue dataset and an English dialogue dataset (MultiWOZ [18]). Our approach achieves better performance than existing methods on all domains in terms of intent classification and slot-filling tasks. We further investigate the model's performance under different scales of training data, on rare domains, on out-of-vocabulary words, and on other languages through extensive comparison experiments.

This study is organized as follows. In the abstract, we briefly describe the main issues to be addressed, the structure of our proposed framework, and the experimental results. In the Introduction, we discuss the two challenges in low-resource language representation learning and how our model addresses them. We divide the related work into three subsections (Low-Resource Language, Spoken Language Understanding, and Preliminary on Hypergraph Learning) to place our study on a solid theoretical foundation. We then disassemble and explain the structure of the proposed model in detail in the Hypergraph Transfer Encoding Network section. Finally, in the Experiments section, we first introduce the datasets, compared methods, and metrics, and then use figures and tables to demonstrate the effectiveness and robustness of our proposed HGTransEnNet. The main contributions of this study are summarized as follows:
(i) We propose a hypergraph-based framework for representing high-order semantic information by transferring and learning from resource-rich language data.
(ii) Our framework not only effectively solves embedding for low-resource languages but also has the potential to overcome the out-of-vocabulary problem for intent classification and slot-filling tasks.
(iii) The proposed method outperforms state-of-the-art related methods in intent classification and slot-filling tasks on the Indonesian dialogue dataset (ID-WOZ), which will be released as one of our resource contributions, as well as on other widely adopted multilingual dialogue datasets (Multilingual WOZ 2.0, Multilingual NLU).

2. Related Work

2.1. Low-Resource Language

Many works have made efforts to represent low-resource languages [19–21]. Cross-lingual transfer learning has become a popular topic aiming to discover the underlying co-relationships between the source and target languages. Reference [20] proposes to integrate English syntactic knowledge into a state-of-the-art model and shows that it is reasonable to leverage English knowledge to improve low-resource language understanding. Reference [22] conducts cross-lingual word embedding mapping using zero supervision signals. Reference [14] proposes a self-learning framework that learns the mapping between source and target word embeddings from a small word dictionary. Reference [13] utilizes multilingual embeddings obtained from training Machine Translation systems [15] in Thai and Spanish. Reference [23] investigates aligning cross-lingual sentence-level representations by leveraging large monolingual and bilingual corpora and achieves state-of-the-art performance on several cross-lingual tasks. In line with these methods, encoding semantic information directly within the same cross-lingual latent space can avoid semantic misunderstanding. However, relying on aligned parallel sentence pairs can suffer from noise and imperfect alignments [16]. What is more, it is quite challenging to collect a sufficiently large, high-quality bilingual parallel corpus with fine-grained annotation. To our knowledge, our approach is designed with a high-order structure, hypergraph modeling, to overcome these training and collection problems. Simply with the help of an easily obtained monolingual corpus and an off-the-shelf pretrained language model, our proposed framework achieves comparable and reliable performance on semantic classification tasks (i.e., intent classification) on a low-resource language dataset.

2.2. Spoken Language Understanding

A dialogue system comprises several complex components, mainly separated into three parts: Natural Language Understanding (NLU) [24–28] (a.k.a. Spoken Language Understanding, SLU), Dialogue Management (DM) [29, 30], and Natural Language Generation (NLG) [31, 32]. Some trials on end-to-end modeling [31, 33–37] have also been considered. The differences among languages mainly manifest in the first and third parts [38, 39], the former of which is the foremost challenge tackled in this work. In recent years, concise concepts have been extracted efficiently from the growing data on the Internet. A family of methods called Swarm Intelligence (SI) is widely used in the construction of automatic text summarization frameworks, and several NLU models based on SI perform well in both single-document and multi-document summarization [40, 41]. Research in SLU has not only been applied to dialogue models but has gradually expanded to the field of chatbots. Kabiljo et al. [42] propose an ADA (Academic Digital Assistant) chatbot supported by natural language understanding to deal with the impact of COVID-19 [43] on the education system. Matic et al. [44] thoroughly investigate the structure of common chatbots and introduce corresponding metamodels; their study also designs mapping rules between common natural language understanding models, which can make chatbot architectures more flexible. Gupta et al. [45] propose a novel health care chatbot based on the RASA framework, which can initially predict the disease of a patient and give treatment suggestions through dialogue without any hassle. SLU typically involves identifying the intent and extracting semantic constituents from a natural language query, two tasks often referred to as intent detection and slot filling [28]. The performance on these two tasks usually reflects the quality of semantic representation for understanding spoken language. Therefore, we mainly demonstrate the ability of our framework by conducting experiments on these two tasks.

2.3. Preliminary on Hypergraph Learning

Hypergraph learning has been widely applied in many tasks, such as identifying nonrandom structure in the structural connectivity of cortical microcircuits [46], identifying high-order brain connectome biomarkers for disease diagnosis [47], and studying the co-relationships between functional and structural connectome data [48]. Hypergraph learning was first introduced in [49], in which each node represents one case, each hyperedge captures the correlation among a subset of nodes, and learning is conducted on the hypergraph as a propagation process. In this method, transductive inference on the hypergraph aims to minimize the label differences between vertices that are connected by more and stronger hyperedges. Hypergraph learning is then conducted as a label propagation process on the hypergraph to obtain the label projection matrix [50], or as spectral clustering [51]. Other applications of hypergraph learning include video object segmentation [52], image ranking [53], and landmark retrieval [54]. Hypergraph learning has the advantage of modeling high-order correlations, but mining and learning the co-relations among different languages for semantic understanding on a hypergraph have not been well investigated.

3. Hypergraph Transfer Encoding Network

In this section, we introduce the detailed structure of our proposed Hypergraph Transfer Encoding Network (HGTransEnNet), as shown in Figure 2. In the first stage, the encoding hypergraph is constructed from the resource-rich language dataset and the low-resource language dataset, which yields the initial vertex feature matrix (denoted as X0) and the hypergraph incidence matrix (denoted as H). In the second stage, the semantic representation of the low-resource language sentences is learned from the pretrained resource-rich language model by the designed hypergraph encoding convolutional layers (denoted as HGEnConv) in a transductive learning manner. Finally, we obtain the encoded features of the low-resource language for the downstream intent classification and slot-filling tasks. Next, we introduce each step of the proposed framework in detail.

3.1. Encoding Hypergraph Construction

As in a standard hypergraph, our encoding hypergraph is defined as G = (V, E, W), where V and E denote the set of vertices and the set of hyperedges, respectively. Each hyperedge e is assigned a weight w(e) by the diagonal matrix W. We let each vertex denote the semantic feature of a sentence, covering both the resource-rich language (i.e., English) and the low-resource language (i.e., Indonesian). The crucial components of the construction stage in HGTransEnNet are the sentence vertex feature matrix X ∈ R^{N×d} and the incidence matrix H ∈ R^{N×M}, where N denotes the number of vertices |V|, d denotes the dimension of the vertex feature, and M denotes the number of hyperedges |E|. As shown in Figure 2, we first group the English data and the Indonesian data based on the same intent class (y ∈ Y) within the same domain, since the same semantic class should share a similar combination of latent space patterns. In this grouping manner, the sets of hyperedges are generated, denoted as the blue and red oval dotted frames in Figure 2. The initial vertex feature matrix X^{(0)} ∈ R^{N×d} is formed by the N = N_en + N_id sentences, i.e., the union of English data and Indonesian data, and its structure can be formulated as

X^{(0)} = [X_en; X_blk],  X_en = [x_en^1; ...; x_en^{N_en}],  X_blk = [x_blk^1; ...; x_blk^{N_id}],  (1)

where x_en and X_en ∈ R^{N_en×d} denote the pretrained encoded feature vector of an English sentence and the informative feature matrix, respectively; N_en and N_id denote the numbers of English and Indonesian sentences, respectively; and x_blk and X_blk ∈ R^{N_id×d} stand for the blank-semantic feature vector of a target Indonesian sentence and the target Indonesian feature matrix, respectively. Note that the informative encoded feature matrix of the English sentences is produced by the pretrained English-BERT-Base-cased model [2, 55], provided by the popular repository bert-as-service. The feature matrix of the Indonesian sentences is initialized randomly from a word bank but shares the same dimension d; it is generated and updated by several hypergraph encoding convolutional layers. In our approach, to leverage as much informative encoding from the high-quality pretrained English-BERT as possible, we set up the co-relation between each English vertex and Indonesian vertex in the incidence matrix H based on the same classification of the data (e.g., the same intent), denoted by the blue and red oval wireframes. Our incidence matrix H ∈ R^{|V|×|E|} has entries defined in Eq. (2):

If a vertex v is connected by a hyperedge e, the corresponding entry of the incidence matrix is 1; otherwise it is 0:

h(v, e) = 1 if v ∈ e, and h(v, e) = 0 otherwise.  (2)

The degree of a vertex is defined as d(v) = Σ_{e∈E} w(e) h(v, e), and the degree of a hyperedge is defined as δ(e) = Σ_{v∈V} h(v, e). The diagonal matrices of the hyperedge degrees and the vertex degrees, denoted as De and Dv, respectively, are generated as

Dv = diag(d(v1), ..., d(vN)),  De = diag(δ(e1), ..., δ(eM)),  (3)

where W ∈ R^{1×|E|} denotes the weight vector of the hyperedges and w(e) denotes the weight of each hyperedge. We set w(e) = 1, i.e., all co-relationships between sentences share the same static weight.
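The construction above can be sketched in a few lines of NumPy. The grouping key (shared intent classes) follows the description; the function names and the toy labels are illustrative, not the authors' implementation.

import numpy as np

def build_incidence(intent_labels, intent_classes):
    # One hyperedge per intent class; h(v, e) = 1 if sentence v carries intent e.
    H = np.zeros((len(intent_labels), len(intent_classes)))
    for v, label in enumerate(intent_labels):
        H[v, intent_classes.index(label)] = 1.0
    return H

def degree_matrices(H, w=None):
    # Diagonal degree matrices: d(v) = sum_e w(e) h(v, e), delta(e) = sum_v h(v, e).
    w = np.ones(H.shape[1]) if w is None else w   # unit hyperedge weights, as in the paper
    Dv = np.diag(H @ w)
    De = np.diag(H.sum(axis=0))
    return Dv, De

# Toy example: 3 English + 2 Indonesian sentences grouped by two shared intents.
labels = ["book_hotel", "find_taxi", "book_hotel", "book_hotel", "find_taxi"]
H = build_incidence(labels, ["book_hotel", "find_taxi"])
Dv, De = degree_matrices(H)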

3.2. Hypergraph Transferring Learning

The vertex feature matrix X^{(0)} ∈ R^{N×d}, as denoted in Figure 2, is composed of the semantically informative vectors X_en ∈ R^{N_en×d} and the semantically blank initialized vectors X_blk ∈ R^{N_id×d}, representing the English sentences and the Indonesian sentences, respectively. The hyperedges are the bridge for transferring and learning the representation of the low-resource language from the existing pretrained model. The hyperedge feature matrix is the product of the vertex feature matrix and a learnable parameter matrix Θ^{(l)}. Then, the transfer operation via the incidence matrix H fills and updates the blank vectors from the informative ones. Since there is no unique mathematical definition of translation on a hypergraph from the spatial perspective, we take the widely adopted classical spectral hypergraph convolution [56] as the base structure of the hypergraph encoding convolutional layer (HGEnConv(·)). The hypergraph Laplacian L, i.e., the normalized positive semi-definite Laplacian matrix of the resulting hypergraph, is obtained by

L = I − Dv^{−1/2} H W De^{−1} H^T Dv^{−1/2},

where I ∈ R^{N×N} is an identity matrix, H is the incidence matrix, and W is the weight matrix of the hyperedges. Therefore, the convolutional operation HGEnConv(·) of HGTransEnNet can be formulated as

X^{(l+1)} = σ(Dv^{−1/2} H W De^{−1} H^T Dv^{−1/2} X^{(l)} Θ^{(l)}),

where X^{(l)} ∈ R^{N×d_l} is the signal of the N vertices fed into layer l, X^{(l+1)} is the output of layer l, σ(·) denotes a nonlinear activation function such as ReLU(·), and Θ^{(l)} denotes the learnable parameters of layer l. Finally, by adding the fine-tuned bias parameters, HGEnConv(·) finishes fusing and generating the high-order representation of the Indonesian sentences.
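A minimal NumPy sketch of one HGEnConv propagation step, following the classical spectral hypergraph convolution in [56]. The activation, the toy incidence matrix, and the fixed Θ are illustrative assumptions (the actual layer also adds a learned bias and trains Θ).

import numpy as np

def hg_en_conv(X, H, w, Theta, activation=np.tanh):
    # One spectral hypergraph convolution step:
    #   X_next = sigma(Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta)
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(H @ w))   # vertex degrees d(v) = sum_e w(e) h(v, e)
    De_inv = np.diag(1.0 / H.sum(axis=0))         # hyperedge degrees delta(e)
    P = Dv_inv_sqrt @ H @ np.diag(w) @ De_inv @ H.T @ Dv_inv_sqrt
    return activation(P @ X @ Theta)

# Toy example: 5 sentence vertices (the last two rows are "blank" Indonesian vertices)
# connected by 2 intent hyperedges; propagation fills the blank rows from English ones.
H = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]], dtype=float)
w = np.ones(2)                                    # static unit weights w(e) = 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(3, 8)), np.zeros((2, 8))])
X_next = hg_en_conv(X, H, w, Theta=rng.normal(size=(8, 8)))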

3.3. Semantic Encoding Aggregation

After a few layers (i.e., k layers) of the transferring hypergraph convolutional operation HGEnConv(·), a set of output feature maps X_in = {X_1, X_2, ..., X_k}, each in R^{N×d}, is obtained, representing different scales of transferring. We design an encoding aggregation function EnAggr(·), controlled by a transfer ratio, to regulate the rate of transferring.

The transfer ratio represents the proportion of encoding aggregation: the higher the transfer ratio, the larger the coupling scale, meaning that the Indonesian representation enjoys more fusion. The representation of the Indonesian sentences X_blk is extracted from the global representation matrix X_in. We treat different feature channels (a.k.a. attributes) as different semantic factors that represent and affect the final intent classification and slot-filling tasks. Considering that the feature channels of each sentence describe different attributes within the same-dimensional latent embedding space, we select the most remarkable or the average value of each attribute via a column-wise aggregating operation (i.e., AttrAggr(·)) to preserve as much attribute information as possible, which is defined as follows.

AttrAggr(·) is the column-wise aggregation function, which can be max-pooling, mean-pooling, etc., and d indicates the number of aggregated attributes. The final encoded representation of the Indonesian data, X_out ∈ R^{N_id×d}, is produced by Algorithm 1; a minimal Python rendering is sketched after the algorithm. Then, if the task needs word-level representations (e.g., slot filling), the embeddings of the individual tokens of each sentence can be extracted accordingly. Finally, we feed the output of our framework to a BiLSTM with an attention mechanism and a CRF layer [57] to train the intent classification and slot-filling models. Note that in this work we mainly focus on the representation of language; the BiLSTM-CRF model could be exchanged for other classification models, so we leave these extensions as future work.

Algorithm 1: Semantic encoding aggregation.
Input: X_in = {X_1, X_2, ..., X_k}, ratio ∈ [0, 1]
Output: X_out
(1) X_out, X_blk ← InitializeTensor(Φ)
(2) X' ← EnAggr(X_in, ratio)
(3) for X_i ∈ X' do
(4)   X_blk(i) ← Extract(X_i)
(5)   X_blk ← VerticalStack(X_blk, X_blk(i))
(6) end for
(7) X_dblk ← AttrAggr(X_blk)
(8) X_out ← Transpose(X_dblk)
(9) return X_out
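A minimal Python rendering of Algorithm 1 under two assumptions: EnAggr keeps the first ⌈ratio·k⌉ feature maps, and the stacking/transpose bookkeeping is compressed into a single stack-and-aggregate step. Both are illustrative choices rather than the authors' exact implementation.

import numpy as np

def en_aggr(feature_maps, ratio):
    # Assumed EnAggr rule: keep the first ceil(ratio * k) feature maps.
    k = max(1, int(np.ceil(ratio * len(feature_maps))))
    return feature_maps[:k]

def semantic_encoding_aggregation(feature_maps, id_rows, ratio=1.0, attr_aggr=np.max):
    # Steps (2)-(8) of Algorithm 1: select maps, extract the Indonesian rows,
    # stack them, then aggregate each attribute (column) across the selected maps.
    selected = en_aggr(feature_maps, ratio)                      # step (2)
    blocks = [X[id_rows] for X in selected]                      # steps (3)-(4): Extract
    stacked = np.stack(blocks)                                   # step (5): stacked copies, kept 3-D
    return attr_aggr(stacked, axis=0)                            # step (7): AttrAggr -> (|id_rows|, d)

# Toy example: k = 3 feature maps over 5 vertices; the last two rows are Indonesian.
rng = np.random.default_rng(1)
maps = [rng.normal(size=(5, 8)) for _ in range(3)]
X_out = semantic_encoding_aggregation(maps, id_rows=[3, 4], ratio=1.0, attr_aggr=np.mean)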

4. Experiments

4.1. Datasets and Evaluations

We take the English dataset MultiWOZ [18] as the resource-rich language corpus and our Indonesian dataset, named ID-WOZ, as the low-resource language corpus. For collection and annotation, we adopt the Wizard-of-Oz [58] dialogue-collection approach, which has been shown to be effective for obtaining a high-quality corpus at relatively low cost and with little time effort. Following the success of MultiWOZ [18], we build a large-scale corpus of natural human-human conversations on a similar scale. Based on the given templates for various domains, users and wizards generate conversations using heuristic-based rules to prevent information overflow. We design and develop a collection-annotation pipeline platform with a user-friendly interface for building the dataset. At the annotation stage, we divided the well-trained annotators (80 local people, 70 of whom speak Indonesian as their native language and 10 of whom are bilingual citizens, plus 2 main organizers) into two groups to produce dialogues and annotations. A quarter of the annotators (i.e., 20) are trained, following the guidance we provide, to play the wizard role. After collecting 1 k dialogues initially (about one week), and while the conversation collection is still ongoing, the second group of annotators (i.e., 62) joins in to produce the fully labeled corpus, including domains, actions, intents, and slots. A brief example is shown in Figure 3. The dataset consists of nine domains, namely plane, taxi, wear, restaurant, movie, hotel, attraction, hospital, and police, and we organize a group of annotators to label the corpus with actions, slots, and intents. As the hospital and police domains in MultiWOZ contain very few dialogues (5% of the total) and only appear in the training set, we ignore them in our main experiments, following [59]. Therefore, we only adopt the four domains restaurant, hotel, taxi, and attraction, which are shared by the MultiWOZ and ID-WOZ datasets, in our main experiments. Their statistics are shown in Table 1. We use the F1 score as the evaluation metric, which is the harmonic mean of precision (P) and recall (R) and is widely adopted for intent classification and slot-filling tasks. Given a set of training data and corresponding test data, we split the training data into 5 folds; five-fold cross-validation is employed to investigate the optimal parameter setting within the training datasets. To verify the stability of the proposed method, we run the experiments ten times for each parameter setting and compare the mean performance.
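As a concrete illustration of this evaluation protocol, the following is a minimal sketch of repeated five-fold cross-validation with a micro-averaged F1 score; the estimator factory build_model is a placeholder, not part of the paper's pipeline.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

def cross_validate_f1(X, y, build_model, n_splits=5, n_repeats=10, seed=0):
    # Five-fold cross-validation repeated n_repeats times; returns the mean micro-F1.
    scores = []
    for r in range(n_repeats):
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed + r)
        for train_idx, val_idx in kf.split(X):
            model = build_model()                      # placeholder estimator factory
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[val_idx])
            scores.append(f1_score(y[val_idx], pred, average="micro"))
    return float(np.mean(scores))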

4.2. Compared Methods and Implementation

We first unify the sentence length by padding each sentence to the length of the longest one, i.e., 64 tokens. The following related approaches for word embedding are compared with our network (a brief sketch of querying two of these embedding sources is given after the list):
(i) Random Initialization. In this simple method, we set the dimension of the word embeddings to be the same as English-BERT-Base-cased (i.e., R^{1×768}) and randomly generate the word embedding bank.
(ii) Machine Translation (MT). To embed reasonably in the absence of an Indonesian pretrained model, we compare with a machine translation preprocessing step and then utilize English-BERT-Base-cased (i.e., R^{1×768}) to encode the corpus. Note that mostly the last two layers of BERT are used as the embedding output.
(iii) Multilingual-BERT (ML-BERT) [2]. Released in the popular Multilingual-BERT repository, it covers 104 languages with 12 layers, 768 hidden units, 12 heads, and 110 M parameters, so the Indonesian corpus can be represented directly by this pretrained model (i.e., R^{1×768}).
(iv) Indonesian fastText [60, 61]. This work released pretrained word vector models for 157 languages based on monolingual corpora, including Indonesian, pretrained on Common Crawl and Wikipedia using fastText. The Indonesian word vector model is trained using CBOW with position weights, with dimension 300 (i.e., R^{1×300}), character n-grams of length 5, a window of size 5, and 10 negatives.
(v) Cross-lingual Transfer [14]. Multiple ideas for cross-lingual transfer have appeared in recent years; we reproduced several of them and report the performance of this practical and reliable method. It is capable of learning Indonesian word embeddings while still requiring a small amount of bilingual data, which we will also release (i.e., R^{1×768}).
(vi) Indonesian-BERT (ID-BERT). To compare with the state-of-the-art embedding method, we pretrain a specific BERT for Indonesian from scratch with about 3.3 billion tokens from a document-level corpus of Indonesian websites, covering news reports, research essays, daily articles, and other text genres. The size of its vocabulary is 0.9 M, which is much larger than that of Multilingual-BERT (0.12 M); we believe this vocabulary size is sufficient to cover most situations. Training took one week on a Google Cloud TPU v3-8, and the Indonesian-BERT-Base (Cased, L = 12, H = 768, A = 12) was eventually obtained (i.e., R^{1×768}).
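As referenced above, here is a hedged sketch of querying two of these embedding sources, the released Indonesian fastText vectors and a bert-as-service server; the model files, server setup, and example inputs are assumptions rather than the exact configuration used in the experiments.

import fasttext                                  # pip install fasttext
from bert_serving.client import BertClient       # pip install bert-serving-client

# (iv) Indonesian fastText vectors (300-d), e.g. the released cc.id.300.bin file.
ft = fasttext.load_model("cc.id.300.bin")
vec = ft.get_word_vector("makan")                # shape (300,)

# (ii)/(iii) BERT sentence encodings via a bert-as-service server started
# separately with an English or multilingual BERT-Base checkpoint.
bc = BertClient()                                # connects to localhost by default
sent_vecs = bc.encode(["I want to book a hotel near the beach."])   # shape (1, 768)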

In Figure 3, the user on the left plays the wizard role, pretending to be the assistant chatbot, and the user on the right (in green) plays the customer. One dialogue may contain several different domains.

After embedding words with the above models as well as our proposed network, each sentence is represented by a word embedding tensor of dimension R^{64×d}. Besides feeding the original word embeddings of sentences into our network, we also follow the sentence encoder method [5] in our experiments: we average-pool the word embedding feature map into a one-dimensional vector in R^d and stack the BiLSTM and CRF layers as the base model [28] to finish the task.
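The padding and average-pooling step can be sketched as follows; the zero-padding choice and the pad_and_pool helper are illustrative assumptions, and the BiLSTM-CRF classifier itself is omitted.

import numpy as np

def pad_and_pool(token_embeddings, max_len=64):
    # Pad a (num_tokens, d) matrix to (max_len, d) with zeros and also return
    # the average-pooled 1-D sentence vector used by the sentence-encoder setup.
    num_tokens, d = token_embeddings.shape
    padded = np.zeros((max_len, d))
    padded[:min(num_tokens, max_len)] = token_embeddings[:max_len]
    sentence_vec = token_embeddings[:max_len].mean(axis=0)    # in R^d, padding excluded
    return padded, sentence_vec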

Compared with the pretrained ML-BERT and ID-BERT, our framework classifies the intents and slots of sentences more accurately, especially for several complex classes, such as hotel_name, area, destination, request_area, and inform_departure.

4.3. Results and Discussion

The comparison results of all the methods are summarized in Table 2. Based on these quantitative results, we have the following analysis.

4.3.1. Intent Classification

Our proposed framework HGTransEnNet outperforms the others on this task across the board in F1 score, achieving 87.21%, 85.03%, 91.26%, and 91.44% (BiLSTM with attention mechanism) and 87.69%, 85.23%, 91.38%, and 91.59% (BiLSTM with attention mechanism and CRF) for the restaurant, hotel, taxi, and attraction domains, respectively. Our method achieves average gains of 0.98% (p = 1.371 × 10^{-2}), 0.91% (p = 3.075 × 10^{-3}), 1.58% (p = 7.864 × 10^{-3}), and 0.45% (p = 2.609 × 10^{-2}) over the nearest baseline models, i.e., cross-lingual transfer, Multilingual-BERT, Indonesian-fastText, and Indonesian-BERT, respectively.

4.3.2. Slot Filling

Our model also outperforms the other methods on slot filling, reaching 76.94%, 77.28%, 87.58%, and 88.22% (BiLSTM-Attention) and 77.08%, 77.86%, 87.47%, and 89.01% (BiLSTM-Attention-CRF) in F1 score for the restaurant, hotel, taxi, and attraction domains, respectively. Because our model mainly focuses on general high-order semantic representation while slot filling is a subtler task that relies more on word-level information, the margins here are relatively smaller: gains of 1.74% (p = 1.715 × 10^{-2}), 1.68% (p = 1.047 × 10^{-2}), 2.29% (p = 5.913 × 10^{-3}), and 1.00% (p = 4.183 × 10^{-3}) over the nearest baseline models, i.e., cross-lingual transfer, Multilingual-BERT, Indonesian-fastText, and Indonesian-BERT, respectively.

4.3.3. Validation and Analysis

As shown in Figure 4, we select the top three methods for the intent classification and slot-filling tasks in the hotel and taxi domains and draw bar charts for a qualitative analysis. For intent classification, the model relies on understanding the general semantic information of a given sentence and classifying it into different classes. Our scores on the inform_type and request_area intents in the hotel domain and the request_arrive and request_leave intents in the taxi domain are higher than those of the other two methods, even though these intents are obscure and hard to detect. We also observe that the dedicated ID-BERT achieves slightly higher precision (P) than our approach (by 1.04%), because it has the richest background knowledge of Indonesian; this also indicates that our model has a slightly weaker learning ability without the support of large-scale datasets. However, we outperform the others in recall (R) by about 2.17%, which reflects that our model can learn the high-order general semantic representation from the English language model in a transductive clustering manner. The slot-filling task requires the model to detect slot values and recognize slot classes at the same time, a process in which the representation of every word is critical. Our proposed framework HGTransEnNet outperforms the others on this task across the board in F1 score. For complex types of slots, such as hotel_name and area in the hotel domain and destination and departure in the taxi domain, our model leverages the semantic knowledge of language through the grouping manner and outperforms the others (Tables 3 and 4).

4.4. Analysis on Training Data Amount

One of the strengths of our proposed approach is learning more informative semantic representations for a low-resource language. The main limitation of most low-resource languages is the lack of a high-quality annotated corpus. Therefore, to investigate the performance of the model under different amounts of training data, we conduct a series of incremental experiments. As shown in Figure 5, we feed the model annotated data batch by batch, i.e., 1 k, 2 k, 4 k, 8 k, 16 k, and the full-scale dataset, selecting the restaurant and taxi domains as examples. In Figure 5, the two leftmost sub-graphs correspond to intent classification and the two rightmost sub-graphs to slot filling. We compare three approaches representing three popular strategies, namely machine translation, cross-lingual transfer [14], and HGTransEnNet (our method), denoted by the blue, yellow, and green lines, respectively. The red line represents the benchmark performance of the ID-BERT that we pretrained earlier. Based on the quantitative results shown in Figure 5, we have the following observations:

The red line denotes the performance of the pretrained ID-BERT, regarded as a reliable benchmark, and the green line stands for our proposed HGTransEnNet, showing its strength over the other compared methods.
(1) For the machine translation method, the main issue is translation quality. We conduct a BLEU [62] test on the entire MultiWOZ, and the translation performance is 28.46 (BLEU-5) over 30 k sentences. However, when translating dialogue messages, one incorrect word can cause a complete misunderstanding. We pick several examples and show them in Figure 6. The top half of the examples illustrates mistakes implied in the slot-filling task: the translation method makes mistakes on "name", "area", "address", etc., which are quite challenging to fix. The bottom half shows that a tiny translation mistake can produce a wrong intent classification for the whole sentence and lead to a totally different progression of the dialogue. With the help of the pretrained English-BERT, the translation method can approach the benchmark, but cannot surpass it.
(2) When the scale of the fed annotated low-resource language data grows, the strength of cross-lingual transfer becomes more obvious. It avoids the misunderstandings caused by translation and mitigates the shrinking effect of the English corpus, which makes it achieve the best performance, even better than the benchmark ID-BERT, when the Indonesian data reach around 16 k for the restaurant and taxi domains.
(3) The F1 scores reach 86.44%, 76.00%, 89.45%, and 85.21% on the intent classification and slot-filling tasks, respectively.
(4) Based on Figure 5, we conclude that our method approaches the benchmark performance with a smaller collected corpus. Thanks to the designed representation transferring and aggregation modules, our network stably outperforms the compared approaches and reaches higher results at the same scale of training data. However, both the traditional pretrained models and our proposed hypergraph model perform poorly when the amount of data is extremely small. The main reason is that dialogue utterances are short, and the hypergraph structure can barely exploit sufficient contextual correlation to understand the semantic information.

4.5. Performance on Rare Domains

We also conduct extensive experiments on the other, rarer domains (i.e., plane, police, movie, hospital, and wear), which reflect the local cultural background in Indonesia. The MultiWOZ dataset contains the police and hospital domains, but at a small scale; the other three domains are specific to our collected ID-WOZ. We use our model, trained on the main domains with large-scale data, to generate the sentence encodings and then fine-tune it for the rare domains. In the plane, movie, and wear domains, the dedicated ID-BERT achieves slightly higher performance on the slot-filling task, because in these domains the training data are not only relatively limited but also lack an English corpus in the same domain. Still, generally speaking, our approach shows its ability in the small-scale data situation, outperforming the others across the board in intent classification and in most cases in slot filling. Overall, the results on rare domains show that our approach is capable of transferring semantic representations to Indonesian and outperforms the pretrained embedding model ID-BERT.

4.6. Ablation Study for Adjacency

Different from the incidence matrix, the adjacency matrix is another popular way to associate nodes. In a broad sense, this approach can be viewed as a graph convolutional network [63] in which the incidence matrix H is collapsed so that each pair of co-occurring sentences is connected directly and the resulting matrix is symmetric. To compare with GCN, we conduct a comparison experiment on all domains. In most domains, its performance is close to, or even worse than, that of the random initialization approach, so we only report a few results in Table 5; there is little to discuss for this implementation strategy. The result demonstrates that without the ability to group and learn high-order information, the model performs poorly.
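For reference, this adjacency-based ablation can be reproduced by collapsing the incidence matrix into a pairwise adjacency (a clique expansion) and applying a plain GCN-style propagation; the thresholding rule and normalization below are illustrative choices, not necessarily the exact setup used for Table 5.

import numpy as np

def clique_expansion(H):
    # Connect two sentence vertices iff they share at least one hyperedge.
    A = (H @ H.T > 0).astype(float)
    np.fill_diagonal(A, 0.0)
    return A

def gcn_propagate(A, X):
    # One symmetric-normalized GCN-style propagation step for the ablation.
    A_hat = A + np.eye(A.shape[0])                # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return (d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]) @ X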

4.7. Analysis on Out-of-Vocabulary

To verify how our approach performs in the out-of-vocabulary situation, we conduct a further validation experiment, reported in Tables 6 and 7. When a language embedding model encounters unfamiliar words, a reasonable solution is to group the new words into existing, semantically similar groups. Therefore, we treat the Indonesian words as out-of-vocabulary words of the English-BERT-Base-cased model. We argue that it does not matter that different vertices have different syntax or come from different languages, as long as they share the same group: there are general semantic latent attributes that represent the same classification features. The main purpose of a pretrained language model is to embed natural language words or sentences into a specific latent space, where words or sentences with similar semantic or syntactic information share similar latent mappings and have small spatial distances. On this analysis scale, our proposed HGTransEnNet is able to imitate a similarly sophisticated embedding latent space. We select the top 1 k frequent words from the corpus by TF-IDF [65] and calculate the Euclidean distance between them and the corresponding Indonesian word embeddings. We never feed the model a parallel corpus, and even though each word is embedded into a high-dimensional vector (R^{1×768}), the mean distance is 11.6327 (the distance between most synonyms in English within the English-BERT model is around 9.8722). The scatter plot also demonstrates an obvious clustering effect among words sharing similar semantics. This proves that, in the latent space, our model is capable of embedding the language into similar semantic clusters.
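A minimal sketch of this analysis, assuming the top words are ranked by summed TF-IDF weight and that the English and Indonesian embedding matrices are row-aligned by matched words; both assumptions are for illustration only.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_k_words(corpus, k=1000):
    # Rank the vocabulary by summed TF-IDF weight and keep the top k words.
    tfidf = TfidfVectorizer()
    weights = np.asarray(tfidf.fit_transform(corpus).sum(axis=0)).ravel()
    vocab = np.array(tfidf.get_feature_names_out())
    return vocab[np.argsort(weights)[::-1][:k]].tolist()

def mean_euclidean_distance(emb_en, emb_id):
    # Mean Euclidean distance between row-aligned (k, 768) embedding matrices.
    return float(np.linalg.norm(emb_en - emb_id, axis=1).mean())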

4.8. Performance on Other Languages

Besides the exploration above, we also conduct several thorough experiments on the performance of approaches in other languages.

4.8.1. Multilingual Datasets

We briefly introduce the experimental datasets here. Multilingual WOZ 2.0 [66] is expanded from the restaurant WOZ dataset by including two more languages, German and Italian, with 1200 dialogues each. Following the settings of [64], we use 600 dialogues for training, 200 for validation, and 400 for testing. The corpus contains four slot types: food, price, area, and request. The multilingual task-oriented natural language understanding dialogue dataset (denoted as "Multilingual NLU") proposed by [67] contains English, Spanish, and Thai across three domains (alarm, reminder, and weather), with 12 intent types and 11 slot types.

4.8.2. Compared Methods

We adopt two more related methods in this experiment besides those introduced in Section 4.2:
(i) Zhang et al. [16] proposed a code-mixing approach to tackle the cross-lingual dependency parsing task. By adopting the code-mixing transfer method, it leverages syntactic knowledge transferred to the target language. We therefore utilize this transferring idea to implement the multilingual dialogue NLU tasks as one of the compared methods, with Multilingual-BERT as the pretrained embedding model.
(ii) Liu et al. [67] designed a zero-shot adaptation method for cross-lingual task-oriented dialogue systems, noted as Attention-Informed Mixed-Language Training (MLT). It leverages a few task-related parallel word pairs generated by the attention layer of the trained English model and existing bilingual dictionaries. We implement it with Multilingual-BERT as the pretrained embedding model.

4.8.3. Analysis on Other Languages

From the results summarized in Table 8, we have the following observations. Considering the differences among these languages in grammar, syntax, cultural background, language family, etc., our proposed transferring representation method outperforms the others across the board. Note that, although the quality of the annotated dataset also affects every model's performance, the language itself still plays a more critical role in cross-lingual intent classification and slot-filling tasks. The domains, quality, and data scales of Multilingual WOZ 2.0 and Multilingual NLU are distinct, yet all models perform reliably on German, Italian, and Spanish. Based on the trained model, we analyze the Euclidean distance between different languages in detail. (1) Since our transferring method embeds corpora of multiple languages into one single latent space, their latent distances can be compared. As shown in Figure 7, Spanish is the closest language to English, while Italian, German, and Thai are gradually farther away. This is also reflected in Table 8: all methods perform better on Spanish, Italian, and German than on Thai. (2) The visualization [68] of our model's attention encodings for different languages is shown in Figure 8. The upper half of Figure 8 comes from our collected datasets, i.e., English and Indonesian, and shows a quite reasonable attention bias: the model allocates more weight to several domain-specific slots, such as "restaurant name", "address", or "food name", in both English and Indonesian. The lower half shows the model's representations for other languages, in which the attention weights also make sense: for instance, although the model may make a few slot-filling mistakes, more weight is focused on related keywords such as "time" and "location". We can see that our approach is able to capture semantic and syntactic information in different languages.

5. Conclusion and Future Work

This study presents a Hypergraph Transfer Encoding Network for the tasks of intent classification and slot filling in low-resource languages, in which the encoding hypergraph is constructed from both a low-resource language dataset and a resource-rich language dataset. The semantic representation of the low-resource language is generated by the designed hypergraph encoding convolutional layers (HGEnConv), which learn the high-order semantic representation in a transductive clustering manner from the pretrained resource-rich language model. In addition, we construct a well-annotated Indonesian dataset named ID-WOZ, consisting of multiple domains, to fairly evaluate the baselines and our proposed HGTransEnNet. Experiments on MultiWOZ and ID-WOZ demonstrate the superior performance of our model over state-of-the-art neural models on the intent classification and slot-filling tasks, and our method also facilitates exploring the out-of-vocabulary problem at the level of semantic representation. As stated before, representation learning for low-resource languages is still a highly data-dependent task; traditional pretraining models and cross-lingual models rely heavily on large parallel corpora or multi-language datasets. Future work will consider zero-shot learning, attention mechanisms, and high-order relationships in small-sample data to encode embeddings for low-resource languages, and we will also explore the capability of our model on other tasks.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest.