Abstract

Conversational technologies are transforming the landscape of human-machine interaction. Chatbots are increasingly being used in several domains to substitute human agents in performing tasks, answering questions, giving advice, and providing social and emotional support. Therefore, improving user satisfaction with these technologies is imperative for their successful integration. Researchers are leveraging Artificial Intelligence (AI) and Natural Language Processing (NLP) techniques to impart emotional intelligence capabilities to chatbots. This study provides a systematic review of research on developing emotionally intelligent chatbots. We employ a systematic approach to gather and analyze 42 articles published in the last decade. The review is aimed at providing a comprehensive analysis of past research to discover the problems addressed, the techniques used, and the evaluation measures employed by studies in embedding emotion in chatbot conversations. The study’s findings reveal that most studies are based on an open-domain generative chatbot architecture. Researchers mainly address the issue of accurately detecting the user’s emotion and generating emotionally relevant responses. Nearly 57% of the studies use an enhanced Seq2Seq encoder-decoder architecture as the conversational model. Almost all the studies use both automatic and manual evaluation measures to evaluate the chatbots, with BLEU being the most popular metric for objective evaluation.

1. Introduction

The advancement of conversational technologies has led to a massive increase in the integration of chatbots in several domains. A chatbot is a dialog system that interacts with humans in natural language via text and voice or as an embodied agent with multimodal communication [1]. Chatbots are attractive to organizations because they provide proactive service and immediate assistance to consumers and cut operational costs [2]. They are used extensively to automate several tasks such as tracking deliveries, making reservations, requesting flight information, and placing orders. Their 24/7 availability and quick response to general queries make them an appealing solution for organizations. More recently, chatbots are also being used to provide social and emotional support in healthcare and personal lives [3].

Chatbots are the fastest-growing communication channel worldwide across multiple domains [4]. The enormous benefits of integrating chatbots in service and social disciplines lead organizations to invest heavily in this technology. However, research indicates that users are still uncomfortable with chatbot communications and prefer interacting with a human agent [2]. Moreover, a review on chatbot usability and user acceptance shows that people prefer natural communication over machine-like interactions and believe that a human can understand them better [5]. The study also reveals that user satisfaction is imperative to successfully integrating and adopting chatbots. Therefore, improving user engagement and satisfaction with chatbot interactions has become crucial to provide a better experience and encourage users to embrace the technology [6].

In the last few years, Artificial Intelligence (AI) and Natural Language Processing (NLP) technologies have been driving the development of chatbots to enable advanced conversational capabilities [7]. Chatbots have evolved from utilizing pattern matching and rule-based models to using AI-powered deep learning technologies that markedly improve natural conversation [8]. The advancement in AI and NLP has enabled the development of chatbots that generate dynamic responses that do not exist in the database, thus making the conversation natural. However, despite these technologies, the responses generated by the chatbots are often dull and repetitive, which leads to user disengagement and frustration [9].

Understanding emotion and responding accordingly is the essence of effective communication [10]. Hence, the emerging trend in chatbot development is to create empathetic and emotionally intelligent agents capable of detecting user sentiments and generating appropriate responses [11]. Salovey and Mayer [12] proposed the term emotional intelligence, which refers to identifying, incorporating, comprehending, and controlling emotions. Emotions play a significant role in making or breaking a conversation. Users get frustrated when chatbot responses are irrelevant [13], while a chatbot that verbalizes emotions can enhance the user’s mood [14]. Moreover, users often anthropomorphize chatbots, which in turn influences their interaction and behavior [15]. Chatbots that mimic human behavior and emotions lead to increased rapport, higher motivation, and better engagement [16]. Therefore, researchers are investigating ways to improve a chatbot’s empathetic and emotional capabilities [17]. Ongoing research focuses on conversational agents capable of perceiving the user’s emotion and responding appropriately with emotional cues to better engage users.

1.1. Problem Statement

Investigation into the development of emotionally intelligent chatbots is a recent trend as researchers continue to find better ways to generate human-like empathetic conversations. Although chatbots have existed for decades, the use of AI-driven techniques in empathetic conversational systems is relatively new. This area of research is confronted with several challenges, such as accurately recognizing the user’s emotion and emotional state, keeping track of the conversation history, and generating appropriate responses that are not dull and repetitive. Moreover, emotionally intelligent chatbots that generate diverse responses require a massive dataset [1]. Therefore, it is imperative to gain insights into the datasets used by empirical studies. Furthermore, the performance of a chatbot is measured by various evaluation strategies, making it vital to study the evaluation measures suitable for emotionally intelligent chatbots. Thus, it is essential to study the state-of-the-art techniques in developing emotionally intelligent chatbots and report the findings to the research community to further the development in this field.

While several systematic reviews on chatbots exist in the literature, these studies differ from our study in their objectives. Some reviews examine chatbot applications and usage in a variety of domains, such as healthcare [3], neuropsychiatric disorders [18], education [19], business sectors [20], and personal assistants [21], with no focus on emotional aspects of the conversation or technical aspects of chatbot development. The study by Mohamad Suhaili et al. [22] provides deeper insights into the technical aspects of chatbot development; however, the review is focused on service-oriented chatbots.

There are only a handful of studies that have reviewed empathetic chatbots. A systematic review by Rapp et al. [5] focuses on the human-computer interaction (HCI) perspective of chatbot usage by investigating the usability and user acceptance of human-like chatbots. Our study is differentiated from this study by focusing on the technical aspect of empathetic chatbot development rather than the emotional or psychological aspect of user interaction. The study by Wardhana et al. [23] provides a review of empathetic chatbot development based on the chatbot type, model, and inference techniques. Another study by Pamungkas [24] provides a survey on the approaches to building an empathetic chatbot. Ma et al. [25] survey empathetic dialog systems based on three aspects which include affective dialog, personalization, and knowledge. Notwithstanding their recognized contributions, these studies do not perform a thorough analysis using a systematic approach. Moreover, the past reviews have not provided insights into the challenges and techniques of emotion generation addressed by empirical studies. Furthermore, the studies do not provide researchers with the datasets and evaluation measures that are used in the development of emotion-aware chatbots. Our review offers a novel contribution to the study of emotionally intelligent chatbot development by providing an in-depth analysis of the challenges in emotion generation, techniques used, and evaluation criteria of empathetic chatbots. To the best of our knowledge, there is no systematic review that investigates the development of emotionally intelligent chatbots and their challenges, techniques, and evaluations.

Considering the above factors and the research gap, the objective of this paper is to provide a systematic literature review of the most relevant studies that investigate the development of chatbots enriched with emotional capabilities. We aim to use a methodical approach to discover, categorize, and present our findings on several aspects of emotionally intelligent chatbots and discover the gaps relevant to computer science researchers interested in advancing research in this field. Our study, in particular, is aimed at comparing and contrasting the overall characteristics among the studies, such as chatbot language, the domain of study, and trends. We examine the main problems tackled by researchers in developing emotion-aware conversational agents. We also investigate the techniques and approaches employed by studies in developing chatbots. Lastly, we study the evaluation measures used by the studies to evaluate their solutions. To that effect, our study analyzes contributions in this field and is aimed at answering the following research questions:

RQ1: What are the general characteristics of the studies?

RQ2: What problems are addressed by the studies?

RQ3: What approaches and techniques are employed in chatbot development?

RQ4: What evaluation measures are used to evaluate chatbot performance?

The remaining sections of this study are structured as follows. Section 2 presents the background information on chatbots with an overview of chatbots, chatbot architecture, and the role of emotional intelligence. Section 3 details the methodology of the systematic review and the phases involved. Section 4 presents the findings of the study. Section 5 presents the discussion of the results. The conclusion, limitations, and further research avenues are presented in Section 6.

2. Background Information

This section presents an overview of chatbots, outlining the various classifications used in the literature to describe them. We discuss the significance of emotional intelligence in chatbots, followed by a general chatbot architecture covering the different types of chatbots and the methods of integrating emotional intelligence into chatbot technology. The following subsections introduce the concepts of chatbot development, in particular incorporating emotion in a chatbot, in order to better understand the terminologies and classifications used in the review.

2.1. Overview of Chatbots

Chatbots, also known as conversational agents, are dialog systems that interact with humans in natural language via text and voice or as embodied agents with multimodal communication [1]. A chatbot’s primary function is to respond to user requests provided in textual-based or voice-based input. The chatbot processes the user input and generates an appropriate response.

There has been a surge in chatbot development in the last few years, with bot applications manifesting their presence in various domains [26]. Businesses deploy chatbots to provide efficient customer services by responding to customer queries and automating tasks [20]. Chatbots are used for teaching and learning activities, student advising, and administrative tasks [19]. In the healthcare sector, chatbots have become pervasive for psychiatric care, evaluation of medical diagnoses, and raising awareness [18, 27]. Chatbots are also popular as social companions [11]. Social chatbots are not designed to accomplish a specific task but rather to engage with humans to fulfill their need for communication and social belonging [28]. Chatbots offer a cost-effective means of delivering services to consumers by eliminating repetitive and time-consuming human-agent communication while enabling the agents to focus on high-end complex tasks [2].

Several taxonomies are used in the literature to classify chatbots. Hussain et al. [29] categorized chatbots based on their purpose as task-oriented and non-task-oriented. The primary function of a task-oriented chatbot is to respond to domain-specific user queries and often perform tasks such as reserving a ticket. A non-task-oriented chatbot interacts with humans in open-ended conversations that are not restricted to a specific domain, which is why such chatbots are also called open-domain chatbots. The primary function of these chatbots is to act as virtual companions where the dialog is open-ended.

Adamopoulou and Moussiades [9, 11] classified chatbots based on their response generation method as rule-based, retrieval-based, and generative chatbots. A rule-based chatbot selects a response based on a predefined set of rules. The responses are not dynamic and often repetitive. The strength of a rule-based chatbot lies in its ability to provide precise answers. However, it cannot handle lexical errors and works well only when the input message is well formed. Moreover, a rule-based chatbot answers user queries without keeping track of previous responses and is ideal for a question-answer system.

A retrieval-based chatbot fetches responses from a sizeable predefined corpus using keyword matching or machine learning techniques to get the most appropriate response. Personal assistants such as Alexa, Siri, and Google Assistant are retrieval-based as they respond to user requests by retrieving information from a broad range of sources [26]. On the other hand, a generative chatbot generates responses using machine learning techniques, thereby constructing diverse responses by learning from the corpus. The responses are generated by translating input utterances to output data using statistical machine translation and predictive analytics techniques, thus making the conversation natural. A limitation of the generative model is that it requires massive training data. This limitation has led to the development of generative chatbots mainly for open domains since domain-specific conversational data is not readily available [30, 31]. A recent trend is using a hybrid approach by integrating retrieval-based and generative models to create task-oriented chatbots that possess human-like conversational skills to provide a better user experience [31].
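
To make the retrieval-based approach concrete, the sketch below selects a canned response by TF-IDF keyword similarity. It is an illustrative toy example, not taken from any reviewed study; the corpus, function names, and the choice of scikit-learn are our own assumptions.

```python
# Minimal sketch of retrieval-based response selection using TF-IDF
# similarity (illustrative only; the corpus below is hypothetical).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each entry pairs a known user query with a canned response.
corpus = [
    ("where is my order", "Your order is on its way and should arrive soon."),
    ("how do i reset my password", "You can reset your password from the account settings page."),
    ("i want to cancel my booking", "I can help you cancel the booking. Could you share the booking ID?"),
]

queries = [q for q, _ in corpus]
vectorizer = TfidfVectorizer()
query_matrix = vectorizer.fit_transform(queries)

def retrieve_response(user_input: str) -> str:
    """Return the stored response whose query is most similar to the input."""
    input_vec = vectorizer.transform([user_input])
    scores = cosine_similarity(input_vec, query_matrix)[0]
    best_match = scores.argmax()
    return corpus[best_match][1]

print(retrieve_response("can you tell me where my order is?"))
```

A generative model, by contrast, would learn to produce the response text itself rather than selecting it from a fixed set.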

The ongoing quest for developing chatbots that mimic humans is evident from the inception of the technology. Two of the earliest chatbots, ELIZA and PARRY, were based on pattern matching technology to imitate human responses [26]. Both chatbots used a rule-based approach for generating responses based on keywords, limiting the conversation to a predefined set of responses. In 1995, ALICE [32] was developed using Artificial Intelligence Markup Language (AIML) and was more sophisticated in generating human-like responses. Nevertheless, these primitive dialog systems could not keep up with the growing expectations of users in both conversational style and prediction of the user’s intent. Chatbots these days are AI-driven and powered by Natural Language Processing (NLP) technologies that are capable of offering sophisticated solutions to meet the language and content expectations of end-users [26].

2.2. Emotionally Intelligent Chatbots

Despite the proliferation of chatbots in our daily lives, recent studies have shown that customers still prefer interacting with humans rather than bots [2]. This resistance is attributed to the poor conversational skills of chatbots which make the interaction unnatural and machine-like leading to frustration and communication breakdown [5]. Furthermore, end-users might be more willing to interact with chatbots if they are enriched with human-like interpersonal qualities [2]. Notwithstanding the limitation of chatbot conversational skills and high end-user expectations, conversational agents are still a desirable solution for reducing operational costs. Therefore, it has become critical for businesses to bridge the gap between customer expectations and chatbot technology.

Emotions play an integral part in an effective conversation. A study by Xu et al. [33] reveals that nearly 40% of customers’ interaction with agents on social media is emotional rather than informational. Several studies have shown that emotionally intelligent conversations lead to a good user experience resulting in fewer communication breakdowns [5]. A qualitative study by Svikhnushina and Pu [34] revealed that users are more likely to engage with emotion-aware chatbots and are eager to have a natural conversational experience with a virtual counterpart. Another study by Ghandeharioun et al. [14] disclosed that emotionally enriched responses by a chatbot could lift a user’s mood, thus enhancing customer experience and improving customer relationships. Xiao et al. [31] supported these findings by showing that users are more engaged with chatbots capable of sensing and verbalizing emotions in the conversation. It is evident from these studies that perceiving emotions and responding with an appropriate empathetic reply is crucial to enhancing user satisfaction with chatbot conversations.

A vast amount of ongoing research is dedicated to integrating emotional capabilities in chatbots to enhance their conversational skills. AI-driven chatbots can detect user sentiments in a conversation, thus triggering the chatbot to comprehend the user’s emotional state and generate an appropriate response. The following subsection presents an overview of an AI-driven chatbot architecture.

2.3. AI-Driven Chatbot Architecture

Chatbots are composed of several essential components, each playing an indispensable role and working together in a robust system that effectively serves its purpose. These components may be incorporated into text-based or voice-based agents [1]. In most cases, these components are organized in a pipeline based on their order of usage. Figure 1 presents the main components of a typical chatbot architecture.

2.3.1. Natural Language Processing (NLP)

The first component is the Natural Language Processing (NLP) unit that processes the raw user input into a structured form using tokenization, lemmatization, and stemming techniques. Some chatbots apply these techniques to incoming user requests as a preprocessing strategy [22]. An additional Automatic Speech Recognition (ASR) component may exist in voice-based agents that extracts text from the audio stream. In addition, the architecture may contain a nonverbal information extraction component, which can detect nonverbal information, like the user’s emotions [1].
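
As an illustration of such preprocessing, the following minimal sketch tokenizes, stems, and lemmatizes an utterance with NLTK; the example utterance and the choice of NLTK are our own assumptions, and the required resource downloads may vary by NLTK version.

```python
# Minimal preprocessing sketch using NLTK (tokenization, stemming,
# lemmatization); resource downloads shown for completeness.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

utterance = "I was feeling really frustrated with the delayed deliveries"
tokens = nltk.word_tokenize(utterance.lower())      # split into word tokens
stems = [stemmer.stem(t) for t in tokens]           # crude suffix stripping
lemmas = [lemmatizer.lemmatize(t) for t in tokens]  # dictionary-based base forms

print(stems)   # e.g., 'deliveri' for 'deliveries'
print(lemmas)  # e.g., 'delivery' for 'deliveries'
```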

2.3.2. Natural Language Understanding (NLU)

The structured data collected by the NLP unit is passed on to the Natural Language Understanding (NLU) component, which processes the data using various strategies. Usually, in this component, data structures are parsed to understand the user’s intent and all particulars associated with that intent [35].

2.3.3. Dialog Manager

The dialog manager component examines the understandable structured data, maintains the dialog framework such as the semantic frame, and encodes the data to determine what action should be taken next. The dialog manager may request clarification from users if the semantic structure is incomplete to ensure that the dialog context is relevant and that all ambiguities are resolved [11]. The dialog manager relies on external or internal sources of data. Internal data sources might be embedded as templates or rules in Artificial Intelligence Markup Language (AIML) to decipher user requests and retrieve responses. Additionally, the chatbot may construct its database internally from scratch or utilize existing databases tailored to its domain and functions. Alternatively, chatbots may use third-party APIs to obtain external data sources [22].
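
The sketch below illustrates the clarification behavior described above with a minimal frame-based dialog manager: it tracks the required slots of an intent and asks a follow-up question when the semantic frame is incomplete. The intent, slot names, and prompts are hypothetical.

```python
# Minimal sketch of a frame-based dialog manager that asks clarifying
# questions until the semantic frame for an intent is complete.
REQUIRED_SLOTS = {
    "book_flight": ["origin", "destination", "date"],
}

CLARIFICATIONS = {
    "origin": "Where will you be flying from?",
    "destination": "Where would you like to fly to?",
    "date": "What date would you like to travel?",
}

def next_action(intent: str, frame: dict) -> str:
    """Decide the next dialog action from the current semantic frame."""
    for slot in REQUIRED_SLOTS.get(intent, []):
        if slot not in frame:
            return CLARIFICATIONS[slot]      # ask for the missing slot
    return f"Confirming your {intent.replace('_', ' ')}: {frame}"

print(next_action("book_flight", {"origin": "Dubai"}))
# -> "Where would you like to fly to?"
```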

2.3.4. Natural Language Generator (NLG)

Finally, the response generation component, the NLG, depends on how the chatbot generates responses. It may use a retrieval-based, rule-based, or generative model. Retrieval- and rule-based models are simple in design and require only enough intelligence to select the best matching response; however, they have limited usability and flexibility [22]. In comparison, generative models are highly flexible and can handle a variety of domains; however, they can be highly complex and expensive and require an extra degree of intelligence.

Researchers who study emotionally intelligent chatbots have adopted the general chatbot architecture. They implement a neural-based approach and use models that enforce emotion-aware characteristics, such as emotion embedding and reinforcement learning models, in addition to encoder-decoder architectures that use Sequence-to-Sequence learning [24].

2.4. Deep Learning in Chatbot Conversations

Artificial neural networks are machine learning algorithms that may be supervised or unsupervised. Deep learning, a branch of machine learning, can mimic how the human brain develops patterns and employs them for making decisions [29]. There has been an increase in the use of deep learning neural networks in conversational modeling, particularly Recurrent Neural Networks (RNNs), Sequence-to-Sequence (Seq2Seq) networks, and Long Short-Term Memory (LSTM) networks [22].

An RNN is a class of artificial neural network with recurrent connections. The network saves the output from a layer and feeds that saved output back together with the new input to forecast the following output. In the context of natural language, an RNN captures the inherent sequential nature of words, where the meaning of a word is understood through its relationship to the previous words in the sentence. Due to this approach, RNNs are well suited for chatbots since understanding the user input and producing contextually relevant responses is essential [29].

Research in emotionally intelligent chatbots employs an encoder-decoder architecture with Seq2Seq learning. The Seq2Seq model utilizes RNNs as its architecture, with an encoder processing the input and a decoder producing the output. This model was initially introduced in 2014 as a variation of Ritter’s generative model, incorporating advancements in deep learning to enhance accuracy [29]. The Seq2Seq model is applied to chatbots to transform an input utterance into an output response. It is currently regarded as the industry’s best practice for generating responses because Seq2Seq maximizes the likelihood of the response and is capable of processing a large amount of data to generate the optimal response [24].
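
A minimal PyTorch sketch of such an encoder-decoder is shown below: a GRU encoder compresses the input sequence into a final hidden state, and a GRU decoder generates the response token by token with greedy decoding. The vocabulary size, dimensions, and special-token index are placeholder assumptions, not values from any reviewed study.

```python
# Minimal PyTorch sketch of a Seq2Seq encoder-decoder with GRUs.
import torch
import torch.nn as nn

VOCAB, EMB, HID = 1000, 64, 128  # hypothetical sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src):                  # src: (batch, src_len)
        _, hidden = self.rnn(self.embed(src))
        return hidden                        # final state summarizes the input

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, tok, hidden):          # tok: (batch, 1) previous token
        output, hidden = self.rnn(self.embed(tok), hidden)
        return self.out(output.squeeze(1)), hidden

encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, VOCAB, (1, 7))        # dummy input utterance
hidden = encoder(src)
tok = torch.zeros(1, 1, dtype=torch.long)    # assume index 0 is <sos>
for _ in range(5):                           # greedy decoding of 5 tokens
    logits, hidden = decoder(tok, hidden)
    tok = logits.argmax(dim=-1, keepdim=True)
    print(tok.item())
```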

Despite its approximation of a good response, the Seq2Seq function fails to meet the chatbot’s true purpose of simulating human-to-human communication [35]. Therefore, the LSTM, a type of RNN, is designed to overcome the long-term dependency problem of RNNs. LSTMs contain memory cells and three gates (input, forget, and output) that control what information is stored, discarded, and emitted, allowing the network to retain previous information for long periods. The LSTM or Gated Recurrent Unit (GRU) is the dominant variant of RNNs used to learn the conversational dataset in these models. An LSTM network outperforms the traditional RNN and other sequence learning networks and has replaced these models in learning from experience.

Some studies have implemented LSTM with reinforcement learning tasks to get more generic responses and enable the chatbot to attain long-term conversation effectiveness [35]. In addition to this model, research has shown that the Conditional Variational Autoencoder (CVAE) model can also improve the diversity of responses. In CVAE, a latent variable is used to learn a distribution over possible conversational intents, and greedy decoders are used to generate responses [36].
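
To illustrate the CVAE idea of sampling a latent variable that drives response diversity, the following sketch applies the reparameterization trick and conditions the decoder's initial state on both the sampled latent code and a target emotion embedding. It is a generic illustration, not the exact model of any study cited above; the dimensions and emotion set are assumptions.

```python
# Minimal sketch of the CVAE idea for diverse responses: a latent variable z
# is sampled with the reparameterization trick and conditions the decoder
# together with an emotion label. Dimensions are illustrative.
import torch
import torch.nn as nn

HID, LATENT, N_EMOTIONS = 128, 32, 6

class CVAEHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.to_mu = nn.Linear(HID, LATENT)
        self.to_logvar = nn.Linear(HID, LATENT)
        self.emotion_emb = nn.Embedding(N_EMOTIONS, LATENT)

    def forward(self, context, emotion):
        mu, logvar = self.to_mu(context), self.to_logvar(context)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        # A decoder would be initialized from z plus the target emotion embedding.
        decoder_init = z + self.emotion_emb(emotion)
        return decoder_init, kl

head = CVAEHead()
context = torch.randn(1, HID)                       # encoder summary of the dialog
decoder_init, kl = head(context, torch.tensor([2]))
print(decoder_init.shape, kl.item())
```

Because z is resampled at each generation, the same input context can yield several different, emotionally consistent responses.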

2.5. Emotionally Intelligent Chatbot Technology

It is crucial to select preprocessing steps carefully when building an emotionally intelligent chatbot, as different preprocessing techniques suit different contexts. For example, the NLP process is primarily used to collect, tokenize, and parse information. Parsing is a technique that implements algorithms in which the input is deconstructed according to a predefined rule, such as left-to-right or bottom-up [37].

Embedding techniques are commonly used in emotionally intelligent chatbot technologies. The embedding model transforms the input text data into a numerical form that is easily understood by the machine [24]. Various embedding methods exist, such as character embedding, word embedding, and sentence embedding. Word embedding is a compact vector representation of words in a lower-dimensional space. In contrast, representations such as bag of words and Term Frequency-Inverse Document Frequency (TF-IDF) represent words and phrases with sparse matrices that grow massive as the size of the input increases [22]. Word2Vec and BERT are two popular embedding models used in neural networks, which can also be used for emotion and semantic embedding. These models strive to maximize conditional probabilities for better word matching [37].
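
As a concrete example of dense word embeddings, the sketch below trains a small Word2Vec model with gensim on a toy corpus; in practice, embeddings would be trained on (or loaded from) a large conversational corpus or a pretrained model. The corpus and hyperparameters are illustrative, and the parameter names follow gensim 4.

```python
# Minimal sketch of training word embeddings with gensim's Word2Vec
# on a toy, hypothetical corpus.
from gensim.models import Word2Vec

sentences = [
    ["i", "feel", "happy", "today"],
    ["this", "delay", "makes", "me", "angry"],
    ["thank", "you", "for", "the", "quick", "reply"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=1)
vector = model.wv["happy"]             # dense 50-dimensional word vector
print(vector.shape)
print(model.wv.most_similar("happy", topn=2))
```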

In terms of semantic relations among linguistic concepts, the Valence, Arousal, and Dominance (VAD) [38] space is widely used as the primary source of structure since it accounts for about 70% of the variance in meaning. VAD ratings have also been used in empathetic tutoring, sentiment analysis, and other affective computing applications. The three standard dimensions of emotion are Valence (the pleasantness of a stimulus), Arousal (the intensity of emotion produced by the stimulus), and Dominance (the degree of power produced by the stimulus). There are three levels of emotion intensity in these words: very low (e.g., dull), moderate (e.g., watchdog), and very high (e.g., insanity) [39].
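
A minimal lexicon-based affect scorer in the VAD space might look like the sketch below, which averages per-word Valence, Arousal, and Dominance ratings over an utterance. The tiny dictionary and its values are purely illustrative placeholders, not entries from any published lexicon.

```python
# Minimal sketch of lexicon-based affect scoring with a tiny, hypothetical
# VAD dictionary (values are illustrative only).
VAD = {
    "happy": (0.9, 0.6, 0.7),
    "angry": (0.2, 0.9, 0.6),
    "dull":  (0.4, 0.1, 0.3),
    "delay": (0.3, 0.5, 0.3),
}
NEUTRAL = (0.5, 0.5, 0.5)   # fallback for out-of-lexicon words

def utterance_vad(utterance: str):
    """Average the VAD scores of the words in an utterance."""
    scores = [VAD.get(w, NEUTRAL) for w in utterance.lower().split()]
    return tuple(sum(dim) / len(scores) for dim in zip(*scores))

print(utterance_vad("the delay made me angry"))
```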

The artificial neural-based approach is extensively used to develop emotionally intelligent chatbots. Artificial neural network-based chatbots apply both the retrieval-based and generative approaches for producing responses. However, the research trend is heading towards generative approaches [24] as they offer diverse responses. This paper explores the research studies investigating AI technologies to generate emotionally intelligent responses to report state-of-the-art techniques.

3. Research Methodology

This study explores existing literature on the development of emotionally intelligent chatbots by adopting the systematic review framework of Kitchenham and Charters [40]. This framework was chosen because it defines the guidelines for conducting reviews in the technical field instead of other frameworks like Tranfield et al. [41] that are more oriented towards qualitative studies in the medical field. A rigorous theoretical framework is essential to guiding the comprehensive data collection and inquiry methods required for our investigation. Moreover, the methodical process ensures the reliability of our findings. The systematic literature review guidelines by Kitchenham and Charters [40] outline a thorough method for collecting, analyzing, and documenting findings from secondary data sources. We aim to answer our research questions following this methodology to uncover the latest trends and technologies to develop emotionally intelligent chatbots.

The review process is divided into three phases: planning the review, conducting the review, and reporting the results. Each phase is further subdivided into several steps, each of which is described in the sections below.

3.1. Planning the Review

In recent years, a vast amount of research has been conducted to improve user satisfaction with chatbot conversations by detecting user sentiments and generating appropriate emotional responses. Therefore, it is crucial to provide researchers with the current state of the art regarding emotionally intelligent chatbots, including the techniques used to embed emotions in computer-generated responses, the datasets used, and the evaluation processes adopted to measure the performance of the chatbots.

To begin our systematic review, we start with the planning phase, which defines the search strategy and the inclusion/exclusion criteria and identifies the data sources used for selecting the articles of the study. Finally, we describe the quality assessment checklist for assessing the quality of the articles and set a threshold for their inclusion.

3.1.1. Search Strategy

The primary aim of the search criteria is to investigate the latest advances in the development of emotionally intelligent chatbots. To that effect, we conducted a preliminary search of existing literature and systematic reviews to understand our study’s context, keywords, and scope. We used the Population, Intervention, Comparison, Outcome, and Context (PICOC) method outlined by Petticrew and Roberts [42] as a guideline to define our research directions. In this regard, our study’s population relates to the main keywords and their derivatives with similar connotations for emotionally driven chatbots, such as conversational agents and virtual or digital assistants for chatbots, and empathy or feelings for emotion. We used these keywords to define the search string for the search process presented in Section 3.2.1. The intervention in our study refers to the search context [42]. We used the identified keywords to filter studies that meet our objectives: emotional, chatbot, conversational agent, and virtual assistant. In the comparison step of PICOC, we consider all possible approaches, models, development methods, algorithms, and evaluation metrics in developing emotionally intelligent chatbots. The outcome determines our data coding requirements and results, including the knowledge of techniques used in developing emotionally intelligent chatbot solutions, the problems addressed, the datasets, and the evaluation metrics used. Finally, we define the context as only empirical studies related to emotionally intelligent chatbot development.

3.1.2. Inclusion/Exclusion Criteria

Selecting articles for the review led us to outline essential criteria that define the characteristics of the included studies. Table 1 summarizes the inclusion/exclusion criteria applied for selecting the articles. First, empirical studies related to the development of chatbots with emotion-embedded responses were included. Second, only peer-reviewed journal and conference papers were included in the study, thereby excluding books, book chapters, and reviews. Third, only articles published in the English language were included to eliminate the bias that may result from poor translation. Finally, the study period was set between 2011 and 2022, as chatbot development with the integration of AI techniques has emerged in recent years. This roughly ten-year period is sufficient to capture the research trend in emotionally intelligent chatbots.

3.1.3. Data Sources

Various data sources were considered to retrieve relevant publications for this study, ranging from general to computer science topics. Accordingly, the search utilized the following six digital databases: Scopus, IEEE Xplore, ProQuest, ScienceDirect, ACM Digital Library, and EBSCO. Furthermore, we also used a manual snowballing method to identify additional relevant studies by exploring references of all selected primary studies.

3.1.4. Quality Assessment Checklist

Quality assessment is crucial in systematic reviews to ensure the validity of the results and reduce the bias that may be caused by the inclusion of less robust studies [43]. Furthermore, the quality assessment also provides more detailed inclusion/exclusion criteria [40].

To ensure a rigorous assessment of the included articles in our review, we developed a quality assessment checklist consisting of eleven questions presented in Table 2. We considered the elements essential to our data extraction and coding phases, such as relevance to our study, clear identification of the problem statement, and validity of the results. Furthermore, we also considered the source’s credibility, which we evaluated using the ranking of the journal/conference and the number of citations of the study.

3.2. Conducting the Review

In this phase, we implemented the plan by searching for and retrieving the articles. The articles were retrieved in January 2022. The articles were further screened using the inclusion/exclusion criteria and quality assessment checklist described in Section 3.1.

3.2.1. Search Process

An extensive range of search strategies was used to retrieve the studies from the identified databases to raise the probability of identifying highly relevant studies. We used logical operators AND and OR by combining the keywords identified in the planning process. Furthermore, the search was performed on the title, abstract, and keywords to ensure that relevant studies were not left out. The following is the search query syntax used in all the identified databases: (“chat bot” OR “chatbot” OR “talkbot” OR “talk bot” OR “personal assistant” OR “virtual assistant” OR “digital assistant” OR “conversational agent”) AND (“emotional” OR “emotion” OR “emotions” OR “empathy” OR “sentiment” OR “feeling”).

In addition to the automated search, we also performed the manual snowballing search as detailed in the planning process. The results of the search are presented in Table 3. A total of 2219 results were retrieved with the highest number of studies from Scopus because it is generic and sources publications from all domains.

After retrieving the search results, we performed a bibliometric analysis of the results to analyze the research areas. Figure 2 shows the visualization of the terms in the results, constructed using VOSviewer [44]. The diagram presents the significance and interconnections between the frequently occurring terms extracted from the abstract, title, and keyword search results. The size of the shape and the label associated with a term determine its importance. The color of the terms determines the clusters in the visualization. Each cluster represents terms related to each other in that group. Moreover, the distance between the clusters represents the relatedness of the clusters.

The visualization of the terms in the extracted studies reveals several clusters. This shows that there are various dimensions of studies on empathetic chatbots from the perspective of usage, applications, usability and user experience, and chatbot development. The clusters are tightly overlapped, indicating that several aspects of the studies are interrelated. The clusters show that the current research trends on emotionally intelligent chatbots are on chatbot response generation, chatbot effectiveness, chat evaluation, and usability. Considering only the clusters with highly weighted terms, four main clusters can be seen in the visualization. The first and central cluster (red) includes the following keywords: emotional intelligence, emotional, conversational agent, research, and emotional response. This cluster implies that research is active in this area and related to chatbot empathetic response generation. The second cluster (purple) contains keywords such as effectiveness, conversational agent, framework, patient, problem, and technique, which entails that the area of research in this cluster is about chatbot usage and effectiveness. In the third cluster (green), the significant keywords are input utterance, factor, generation model, and human evaluation, which implies that research is related more to evaluating chatbot technology. Finally, in the fourth cluster (light blue), the main keywords are consistency, performance, human, affect, and technique, indicating that research in this cluster is about chatbot performance. Examining these clusters provides an idea of where research findings are located for better analysis and discussion of the studies.

3.2.2. Article Selection

In this phase, we applied the inclusion/exclusion criteria to screen the retrieved articles for eligibility following the PRISMA [45] framework. This framework provides a detailed guideline and structured approach to screen the documents. The steps of the screening process are outlined in Figure 3.

First, we removed the duplicate records. Then, we applied the inclusion/exclusion criteria to ensure that only relevant articles were included. Each author independently performed a title and abstract screening of the studies to remove irrelevant articles. At this stage, most of the articles were excluded as they did not match the inclusion criteria. As discovered in the network analysis of the search terms, most of the articles were related to usability and user acceptance of chatbots. These articles were excluded as they did not contribute to the context of our study. Next, we performed a full-text screening of the remaining articles to assess relevance and eligibility. Each author performed this step independently by equally dividing the studies to be reviewed. In cases where eligibility was unclear, the authors discussed and resolved the discrepancy. Finally, a quality assessment was performed on the articles remaining after the full-text screening. In this step, each author screened the articles initially screened by the other to reduce bias and ensure that each article was reviewed twice. Finally, 42 studies were included in the systematic literature review.

3.2.3. Quality Assessment

Each author performed the quality assessment independently using the assessment checklist presented in Table 2, on a scale from 0 to 1, where 1 represents that the criterion is wholly met, 0.5 partially met, and 0 not met. Regarding the number of citations, we assigned one point to an article having at least two citations per year. Table 4 presents the detailed quality assessment of the included articles, showing that all included articles are of good quality. It must be noted that the quality assessment is a means to determine whether the selected article is relevant to the contribution of this study, with no attempt to criticize any of the studies and their findings.

3.2.4. Data Analysis and Coding

The objective of this phase is to accurately record all findings of the study by collecting metadata from the primary studies included in the review. The metadata relates to the research questions of our study. We conducted a thorough data analysis of all the relevant features identified in the planning phase to accomplish this task. The metadata analysis includes various characteristics that are essential to answering our research questions, such as the characteristics of the study in terms of publication type and year. We also examine the technical aspects of the study, such as the chatbot’s language of development, the emotions detected and used, the problems addressed, the technique used for the development and evaluation measures of the chatbot, and the dataset used for evaluation.

3.3. Reporting the Review

The final phase of the systematic review presents the study results. After a detailed and in-depth analysis of the metadata extracted from the full-text review, we present our findings in Section 4 to answer each research question.

4. Results

This section presents the results obtained from the meta-analysis and in-depth review of the included articles with reference to our research questions. We analyzed 42 journal and conference papers published in the span of 10 years to determine the state-of-the-art technologies used to develop emotionally intelligent chatbots. The following subsections present the results of each research question.

4.1. RQ1: What Are the General Characteristics of the Studies?

This subsection presents the general characteristics of the reviewed articles. We analyzed the distribution of the studies by the year of publication, the region of study, the source type (journals vs. conference papers), the interface language, the chatbot type, and the domain of study. These characteristics provide an overview of the development trend of emotionally intelligent chatbots.

4.1.1. Source of the Articles

Figure 4 shows the distribution of studies by source type. Most of the studies included in the review are peer-reviewed journal articles, while conference papers constitute 40% of the studies. The overall distribution of the papers is balanced. Moreover, all the sources of the studies were verified in the quality assessment phase to ensure content validity and reduce bias resulting from inaccurate or poorly reported results.

4.1.2. Publication Year

Figure 5 presents the distribution of the reviewed articles by year of publication. It is evident from the graph that interest in emotion-aware chatbots has grown over time. The graph also reveals a sharp increase in the investigation of emotionally intelligent chatbots in 2018. This may be attributed to the technological advancement of conversational technologies and a sudden surge of chatbot usage in 2016, referred to as the chatbot “tsunami” by Grudin and Jacques [26]. The Sequence-to-Sequence model [46] published by Google became a basis for most neural conversational agents, leading to the proliferation of generative chatbot studies.

4.1.3. Chatbot Type

Figure 6 presents the distribution of studies by chatbot type. A chatbot may be text-based, voice-based, or multimodal. The majority of the work in developing emotionally intelligent dialog systems is on text-based chatbots. Text-based chatbots are more favored than other forms due to the increased use of messaging technologies. Moreover, the chatbots need to be trained to produce an appropriate response. It is easier to train chatbots to generate text rather than speech, as the training data for text is more readily available than speech data.

4.1.4. Domain of Study

Figure 7 shows that most emotionally intelligent chatbots have been developed for the open domain. These chatbots specialize in natural and emotionally rich conversations without focusing on specific topics. Development is concentrated in this area because domain-specific conversational datasets are scarce. Moreover, the conversational dataset must be labeled with emotion as a preprocessing step before it can be used to train the model.

4.1.5. Chatbot Language and Region of Study

Figure 8 reveals that English and Chinese are the two most predominant interface languages used to develop emotionally intelligent chatbots. Chang and Hsing [47] support our finding and claim that, due to the popularity of social media in China, the Chinese language will soon be one of the most prevalent languages online. Figure 9 further shows that most of the studies originated from China. This finding shows that China has played a leading role in developing empathetic and emotion-aware chatbots since 2018. Moreover, one of the first emotionally intelligent chatbots, XiaoIce, developed by Microsoft, is vastly used in China to provide emotional support to users [48]. These findings reveal the popularity of chatbots in Chinese culture and that China is taking a lead role in developing AI technologies.

4.2. RQ2: What Problems Are Addressed in the Chatbot Development?

After conducting an in-depth analysis of the studies, we identified seven main problems addressed by all the studies. Figure 10 shows an overview of the main problems and how studies have addressed the problems. This section describes each problem and the approaches employed to resolve the problem.

4.2.1. Response Diversity

The studies highlight the limitation of the Seq2Seq model [46] as it produces dull and meaningless responses. Therefore, several studies tackle the challenge of generating diverse responses that are emotionally relevant. Asghar et al. [39] argued that neural conversational models do not capture the complexity of emotions and often result in short and ambiguous responses. They used a heuristic search algorithm to ensure diversity in generated responses. Multiple studies employed a CVAE-based model to generate varied emotional responses [36, 49–52]. Yao et al. [52] argue that a chatbot must generate diverse responses for the same input to simulate a human-like conversation. Their model uses a latent space variable and six emotion categories to generate multiple emotionally consistent responses for the same input.

Similarly, Liu et al. [36] also generate several responses and select the most appropriate one based on grammar, meaning, and emotional score. Zhang et al. [53] argued that an intervention mechanism is needed to improve response diversity. They consider the input emotion and model a responder state and topic preference to generate diverse responses.

4.2.2. Content Relevance

Several studies focus on content relevance to achieve a natural conversation in human-computer dialog systems. Both Srinivasan et al. [54] and Sun et al. [55] used reinforcement learning with a reward function to ensure that the responses were content-specific and emotionally relevant. Several studies embedded topics and emotions in the decoder to generate responses that are both topically and emotionally appropriate [53, 55–57]. Huo et al. [49] augmented the encoder-decoder with a topic-aware decoder to enhance the content relevance of the response. They differentiated words in the output as emotion-related words, keywords, and familiar words. In two separate studies, Wei et al. [58] and Wei et al. [59] focused on generating emotionally intelligent and content-relevant responses by embedding semantics and emotions in the input.

4.2.3. Poor Emotion Capture

A large number of studies focus on accurately detecting the emotion of the input message. While some studies predicted the input emotion using a classifier [60, 61], several studies argue that emotions are complex and cannot be captured by a coarse-grained emotion label. To that effect, some studies predict the emotion by applying the principle of Valence and Arousal (VA) to embed affective meaning for each word in the input message [47, 62, 63]. Other studies built on the previous work and embedded each input word with a three-dimensional emotion embedding based on Valence, Arousal, and Dominance (VAD) [38] to achieve a more fine-grained emotion detection [36, 39, 51, 64, 65]. Li et al. [66, 67] argue that words in messages are usually connected and show that capturing the connections of words enables a deeper understanding of the user’s emotion.

Some studies focus on detecting the user’s emotional state rather than just predicting the sentiment from a single utterance. Lin et al. [68] identify the user’s emotional state by employing a tracker that determines the various emotional aspects of the input. They use multiple decoders to respond to each emotional category and generate an appropriate response. Hasegawa et al. [69] argue that natural conversation is achieved only when the user’s emotional state is predicted from historical conversational utterances rather than a single utterance. They generate the response based on a predicted target emotion using past utterances. Similarly, Li et al. [50] also utilize conversational data to generate more relevant responses.

On the other hand, Li et al. [66, 67] argue that it is crucial to understand the reason behind the user’s emotion and develop a chatbot that elicits the emotional cause by asking appropriate questions. They generate the response based on the chat history and the identified cause. Qiu et al. [70] track the user’s emotional state using a transition network. They model the dynamic emotion flow to predict emotions based on past utterances and generate the most appropriate response.

4.2.4. Irrelevant Emotional Responses

Several studies argue that the responses generated by an NLP chatbot are often not emotionally relevant and attempt to alleviate the problem by controlling the emotion exhibited in the response. Several studies control the generated response by embedding a target emotion in the response generator module [8, 61, 67–69, 71, 72]. Zhou et al. [61] use internal and external memory to generate explicit emotional words in the response. Niu and Bansal [72] conditioned the response generator to generate polite, rude, or neutral responses.

Several other studies argued that conditioning the response generator on a predefined label leads to poor response quality [59] and that the output emotion cannot be assumed to be the same as the input emotion. To this effect, some studies attempted to generate more dynamic responses. Zhang et al. [73] generate multiple responses for six emotional categories and select the most appropriate response based on rankings. Similarly, Colombo et al. [64] use two Seq2Seq models to generate several responses and rank them based on emotion to get the most appropriate response. Zhou et al. [74] add an additional emotion classifier for the responses over multiple emotional distributions, generating two types of responses, one for the specified emotion and one unspecified.
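
As a generic illustration of conditioning a decoder on a target emotion (not the exact architecture of any study cited above), the sketch below concatenates a target-emotion embedding with the previous word embedding at every decoding step. All sizes and token indices are placeholders.

```python
# Minimal sketch of emotion-conditioned greedy decoding: the target
# emotion embedding is concatenated with the word embedding at each step.
import torch
import torch.nn as nn

VOCAB, EMB, EMO_EMB, HID, N_EMOTIONS = 1000, 64, 16, 128, 6

word_emb = nn.Embedding(VOCAB, EMB)
emotion_emb = nn.Embedding(N_EMOTIONS, EMO_EMB)
cell = nn.GRUCell(EMB + EMO_EMB, HID)
project = nn.Linear(HID, VOCAB)

hidden = torch.zeros(1, HID)                 # would come from the encoder
prev_token = torch.tensor([0])               # assume index 0 is <sos>
target_emotion = torch.tensor([3])           # e.g., a "joy" label

for _ in range(5):                           # greedy emotional decoding
    step_input = torch.cat(
        [word_emb(prev_token), emotion_emb(target_emotion)], dim=-1)
    hidden = cell(step_input, hidden)
    prev_token = project(hidden).argmax(dim=-1)
    print(prev_token.item())
```

Changing `target_emotion` steers the same decoder toward differently toned responses for the same input context.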

4.2.5. Lack of Emotionally Labeled Conversational Datasets

One of the challenges of developing a chatbot using machine learning is that it requires a massive dataset for training. While several conversational datasets are available for the open domain, datasets labeled with emotions are not readily available. Therefore, several studies resorted to classifying conversational data using a dynamic classifier as a preprocessing technique. A few studies tackled the challenge of the lack of a publicly available labeled corpus of conversational data. Rashkin et al. [17] developed an empathetic dataset of 25,000 labeled conversations and tested it against well-known neural models. Zhou and Wang [75] generated a labeled dataset from Twitter using emojis as labels to depict the emotion of the input. Their dataset consisted of 64 emotional labels. Song et al. [76] argued that an emotionally labeled dataset of conversations is usually imbalanced, which leads to incorrect predictions. To alleviate the issue, they explicitly embedded emotional words in the input to increase the strength of the emotion. On the other hand, Srinivasan et al. [54] used reinforcement learning to address the unavailability of supervised training data.

4.2.6. Poor Language Model

Two studies addressed the problem of a weak language model for emotional responses. Ghosh et al. [77] extended the LSTM language model trained on a conversational speech corpus to generate text enriched with emotion. In another study, Casas et al. [60] attempted to understand the context and implicit emotions expressed in the input data to generate empathetic responses. To that effect, they developed an enhanced language model for empathetic responses.

4.2.7. Other

While the previously discussed studies addressed challenges in enhancing an emotionally intelligent chatbot to perceive emotion and generate appropriate responses, some studies focused on other novel areas. We classified these into three main areas.

(1) Domain-Specific Chatbot. Some studies addressed problems specific to a particular domain. For example, Adikari et al. [78] stated that previous chatbots in the healthcare sector mainly focused on question-answer systems. They developed a rule-based chatbot that detects patient emotion using NLP techniques and generates a response using a template. In another study based on the healthcare domain, Wang et al. [79, 80] developed a chatbot that provides timely responses to users seeking emotional support. Hu et al. [8] claim that previous chatbots in customer care focused solely on grammar and syntax. They highlight the significance of emotional intelligence in customer care and develop a chatbot that integrates tones in responses by embedding target tones (empathetic or passionate) in the output.

(2) Voice/Multimodal Chatbot. A few studies investigated emotionally intelligent voice-based and multimodal chatbots. Griol et al. [81] enhance communication in virtual educational environments by integrating emotion recognition in social interaction with multiple modalities. The study utilizes user profile data and context information from the dialog history to generate emotionally appropriate responses. Hu et al. [82] claim that emotion recognition in vocal responses is novel and explore emotion regulation in voice-based conversations. Their model comprehends the input emotion using acoustic cues and generates emotional responses by integrating emotional keywords in the generated response.

(3) Bilingual Chatbot Interface. Wang et al. [79] use a bilingual decoding algorithm that captures the contextual information and generates emotional responses in two languages. The model employs two decoders to generate primary and secondary language responses.

It is essential to note that some of the problems identified are also applicable to chatbots that are not emotionally intelligent; however, the development of these chatbots faces additional complexities. For example, all chatbots are confronted with the challenge of generating diverse and relevant responses. However, the additional challenge for emotionally intelligent chatbots is to ensure that the diverse and relevant response matches the emotion of the interlocutor. On the other hand, several challenges are specific to empathetic chatbots such as accurate detection of emotion, generation of emotional response, and lack of emotionally labeled datasets.

4.3. RQ3: What Approaches and Techniques Are Employed in Chatbot Development?

This section discusses the various approaches and techniques used in the studies to develop an emotionally intelligent chatbot. Figure 11 presents a taxonomy that classifies the major adopted models and divides them into four categories relating to response generation techniques: the Seq2Seq model, the rule-based model, the CVAE-based model, and other models. The studies further used three different approaches to detect emotion in the input and response: lexicon-based, machine-based, and a hybrid method that combines both types of learning. Lexicon-based learning and machine-based learning are two distinct emotion detection techniques used in emotionally intelligent chatbots: one captures the emotion using a dictionary, and the other captures the emotion by training a classifier. In contrast, the hybrid model adopts both techniques in emotion detection. The taxonomy diagram reveals that lexicon-based learning is the most used method by studies that address the problem of capturing emotions accurately, whereas the machine learning approach detects emotion at a more coarse-grained level.

4.3.1. Response Generation Models

(1) Seq2Seq-Based Model. Nearly 50% of the studies developed an emotionally intelligent chatbot using a Seq2Seq model, in which a query is represented by one sequence of words and the response by another sequence. Studies have been conducted to extend the model, improve the performance of Seq2Seq, and address the limitation of dull and meaningless responses by generating an appropriate emotional response.

(2) CVAE Model. Some studies adopt the CVAE approach to develop an emotionally intelligent chatbot that generates diverse and affective responses and overcomes the limitations of the Seq2Seq model. CVAE allows a more diverse response generator, but syntactic and grammatical quality is compromised to a certain extent.

(3) Rule-Based Model. Only two reviewed studies use a rule-based approach to develop emotionally intelligent chatbots, combining lexicons and machine learning in a hybrid fashion to achieve the desired results. The first study extracts individual emotions from patient conversations using NLP techniques based on the psychological emotion model proposed by Plutchik, setting up an emotion dictionary from a variety of pretrained word representations such as Word2Vec, GloVe, and bag-of-words models. Furthermore, they use AI techniques and multiple classifiers to detect the group’s emotions. Both kinds of emotions, the group and individual emotions, are used to capture the emotion expression sequence. A rule-based system is used to generate responses based on negative emotions expressed by patients to predict and generate an automated personalized empathetic alert [78]. The second study uses a rule-based approach to detect and predict emotions and to build a statistical response generator based on an utterance’s tags. The training data were automatically obtained from Twitter, and a classifier is trained to predict and generate specific emotions based on the conversational history [69].

(4) Other Approaches. Ten studies utilize approaches that do not fall into the previous categories. Chen et al. [6] use an encoder-decoder architecture in which the semantic and multiresolution emotional contexts are encoded. In addition, they implement 2-CNN-based semantics with an emotional discriminator used to capture fine-grained emotion using the NRC emotion vocabulary for response generation. Wu et al. [83] use encoder-decoders that create emotion-labeled datasets to generate various emotional responses. The model by Lin et al. [68] consists of an emotion detector that uses a transformer encoder and an empathetic listener. The model utilizes an independently parameterized transformer decoder with a metalistener to fuse the listeners’ information and produce an empathetic response. Casas et al. [60] used the pretrained DeepMoji model with the DailyDialog dataset to build an emotion classifier on a labeled training set to predict emotional states in text-based messages. Furthermore, Griol et al. [81] combine information from user profiles with emotional content extracted from the user’s utterances and apply an emotion recognizer in the dialog manager to choose an adapted system response. Rashkin et al. [17] use a generative pretrained transformer and an emotion classifier trained on the DailyDialog (DD) dataset to predict emotional states using an encoder-decoder model. Finally, Sun et al. [55] use a topic class embedding based on an LDA vector to generate a topic keyword, while an emotion embedding vector generates the emotion keyword through reinforcement learning to produce accurate emotional responses.

4.3.2. Input Emotion Detection Models

(1) Lexicon-Based Learning. In several studies, lexicon-based learning models are primarily used to detect and embed emotions in emotionally intelligent chatbots. Asghar et al. [39] and Zhong et al. [65] adopt the 3D semantically augmented affective space VAD (Valence, Arousal, and Dominance) [38], paired with an external cognitively engineered affective dictionary, to implement emotion embedding techniques that enhance emotion diversity. Furthermore, one study combines a bidirectional Seq2Seq model with a reinforcement learning framework that provides rewards and appends VAD emotion values to the embeddings, enabling better emotion detection and allowing the model to overcome these limitations and generate an appropriate emotional response [54]. Other studies apply emotion embedding using a VA vector based on two dimensions of emotion, Valence and Arousal: Valence measures the positivity or negativity of an emotion, whereas Arousal measures its intensity or level of activation. These studies are based on neural networks trained on a dialog corpus that reflects a positive emotion elicitation strategy [62, 63].
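
The lexicon-based idea can be illustrated with a small sketch that averages per-word VAD scores into a sentence-level affect vector, which could then be appended to the word embeddings before encoding; the dictionary entries below are invented placeholders, not values from the actual VAD lexicon.

import numpy as np

# Toy VAD lexicon: word -> (valence, arousal, dominance) in [0, 1]. Placeholder values only.
VAD = {
    "happy":  (0.90, 0.60, 0.70),
    "sad":    (0.15, 0.30, 0.25),
    "angry":  (0.20, 0.85, 0.60),
}
NEUTRAL = (0.5, 0.5, 0.5)   # fallback for out-of-lexicon words

def sentence_affect(tokens):
    """Average the VAD vectors of the tokens to obtain a 3-dimensional affect feature."""
    vecs = [VAD.get(t.lower(), NEUTRAL) for t in tokens]
    return np.mean(vecs, axis=0)

affect = sentence_affect("I am so happy today".split())
print(affect)   # this vector can be appended to each word embedding before encoding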

Chang and Hsing [47] propose a two-layered BiLSTM-based model in which word embeddings are constructed by encoding forward and backward sequences of characters into a continuous latent space. The model captures emotion enriched with semantic representations, providing more fine-grained emotion detection. Furthermore, Ghosh et al. [77] used the Linguistic Inquiry and Word Count (LIWC) dictionary-based text analysis program. Each word is assigned an LIWC category, with categories selected based on their association with social, affective, and cognitive processes. They use the text analysis program to identify keywords within a text and extract emotions and features. Another study applied the LDA model to derive a topic dictionary and identify the emotion-related topic, a technique that avoids the need for a supervised labeled dataset [56].
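
A topic dictionary of the kind described in [56] could be derived, for example, with an off-the-shelf LDA implementation; the sketch below (the corpus and parameters are illustrative, not taken from the cited study) keeps the top words of each topic as topic keywords.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["i failed my exam and feel terrible",
        "the weather is lovely and i feel great",
        "my flight was delayed again so annoying"]   # toy corpus

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Build a topic dictionary: topic id -> top keywords by word weight.
vocab = vectorizer.get_feature_names_out()
topic_dict = {k: [vocab[i] for i in comp.argsort()[-3:][::-1]]
              for k, comp in enumerate(lda.components_)}
print(topic_dict)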

(2) Machine-Based Learning. Many studies rely solely on a machine learning approach for emotion classification. Several studies augment the Seq2Seq model with a GRU to improve emotion detection and the consistency of generated responses [48, 58, 59, 61]. Some studies train a dynamic classifier and a BiLSTM on the dataset to better capture emotion [53, 72]. In addition, a Seq2Seq attention model based on a deep RNN is paired with a GRU to attend to the target emotion [59]. The GRU-RNN extends a gated neural generator with three additional cells (refinement, adjustment, and output cells) to capture, control, and produce appropriate sentences [53]. Another study used a multilayer encoder-decoder extended with a Generative Adversarial Network (GAN), in which the discriminator output is used as a reward for reinforcement learning, pushing the system to generate dialogs that are most similar to human dialogs [7]. Hu et al. [8] implemented a tone-aware LSTM-based model that adds an indicator vector to control the tone of generated conversations, allowing target tones such as empathy and passion to be embedded in chatbot responses. Moreover, Niu and Bansal [72] developed a model consisting of a two-layer BiLSTM decoder followed by a convolution layer, trained with reinforcement rewards on polite and rude labels and employing an LSTM-CNN politeness classifier to generate polite responses.
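
As a generic example of the classifier side of these systems, the following sketch (architecture details are illustrative assumptions, not a reproduction of any cited model) shows a BiLSTM sentence encoder whose concatenated final hidden states feed a linear layer for emotion classification.

import torch
import torch.nn as nn

class BiLSTMEmotionClassifier(nn.Module):
    """BiLSTM over word embeddings; final states -> emotion logits."""
    def __init__(self, vocab_size, num_emotions, emb_dim=128, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hid_dim, num_emotions)

    def forward(self, tokens):
        _, (h, _) = self.lstm(self.emb(tokens))       # h: (2, batch, hid_dim)
        sent = torch.cat([h[0], h[1]], dim=-1)        # forward + backward final states
        return self.fc(sent)                          # emotion logits

clf = BiLSTMEmotionClassifier(vocab_size=10000, num_emotions=6)
logits = clf(torch.randint(0, 10000, (8, 20)))        # batch of 8 padded utterances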

(3) Hybrid Model. Several studies apply a hybrid model to overcome the limitations of adopting only one approach. For example, in addition to using the VAD lexicon vector representation as an emotion embedding technique, studies use a Bidirectional LSTM (BiLSTM) affective classifier that trains a Seq2Seq encoder-decoder network to label sentences according to their emotional content [64]. Likewise, Peng et al. [57] pair the VA lexicon-based emotion model with variants of autoencoders that produce sentences containing a given sentiment or tense, using an emotion classifier to increase the intensity of emotional expression and to capture and intensify emotion even in sentences that carry no explicit sentiment. Similarly, Song et al. [76] paired the LDA topic model with a BiLSTM classifier, and Huang et al. [71] trained a BiLSTM combined with the LIWC dictionary on their dataset.
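
In a hybrid setup, a lexicon-derived affect vector such as the VAD feature sketched earlier can simply be concatenated with the learned sentence representation before classification; the fragment below (hypothetical names, building on the previous sketches rather than on any cited system) illustrates the combination.

import torch
import torch.nn as nn

class HybridEmotionClassifier(nn.Module):
    """Concatenate a learned sentence representation with a 3-dim VAD lexicon feature."""
    def __init__(self, encoder, enc_dim=256, num_emotions=6):
        super().__init__()
        self.encoder = encoder                       # e.g., a BiLSTM encoder without its output layer
        self.fc = nn.Linear(enc_dim + 3, num_emotions)

    def forward(self, tokens, vad_features):
        sent = self.encoder(tokens)                  # (batch, enc_dim) learned representation
        combined = torch.cat([sent, vad_features], dim=-1)
        return self.fc(combined)                     # logits informed by both sources

# Trivial stand-in encoder so the sketch runs: mean of word embeddings.
class MeanEncoder(nn.Module):
    def __init__(self, vocab_size=10000, enc_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, enc_dim)
    def forward(self, tokens):
        return self.emb(tokens).mean(dim=1)

clf = HybridEmotionClassifier(MeanEncoder())
logits = clf(torch.randint(0, 10000, (4, 15)), torch.rand(4, 3))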

4.4. RQ4: What Evaluation Measures Are Used to Evaluate Chatbot Performance?

This section describes the datasets used for evaluating the chatbot performance and the different evaluation metrics used by the studies.

4.4.1. Datasets

A conversational dataset is required to evaluate the performance of a chatbot. Moreover, the dataset must be labeled with emotional tags to feed the encoder with emotional input and train the decoder to generate appropriate output.

Most studies have used conversational datasets from various sources, including social media and online websites, as shown in Table 5. The most popular dataset is Weibo, followed by Twitter; both are open-domain conversational datasets. Only one study used a domain-specific conversational dataset for the healthcare domain [78]. Since none of these datasets are labeled with emotions, researchers have used a machine learning, lexicon-based, or hybrid approach to label the conversations with emotions. Several selected studies have used NLPCC2013, NLPCC2014, and NLPCC2017 as corpora for labeling. These corpora can only be used in an open domain where the chatbot is not task-oriented. The limited availability of public conversational, emotionally labeled datasets for training and evaluating classifier systems poses a significant challenge [17]. Rashkin et al. [17] address this by proposing a new methodology for empathetic dialog generation and introducing a novel dataset of conversations grounded in emotional contexts. Table 5 provides details about the various datasets.

4.4.2. Evaluation Measures

This section describes the methods used by the reviewed studies to measure the overall performance of emotionally intelligent chatbots in generating emotional responses. Almost all the studies used both automatic and manual evaluation methods to measure the effectiveness of their solution. In the automatic method, a test set is used to evaluate the model by comparing the generated responses with the reference responses using well-known metrics. The studies also used automated methods to measure the accuracy of emotion classification. Moreover, most studies use automated metrics to compare results against a baseline and other standard models; several studies compared their models against a Seq2Seq baseline. The manual method, on the other hand, employs humans to rate the responses against specified criteria.

(1) Automatic Evaluation. Table 6 summarizes the metrics used in the automated method, covering both response generation and input classification. BLEU (Bilingual Evaluation Understudy) is the most common metric for evaluating emotionally intelligent chatbot responses. It originated as a precision-based metric for automatically comparing machine translation output against human reference translations. BLEU estimates the overlap between the generated and target responses and thus measures how well the emotional response has been generated. However, BLEU correlates poorly with human judgment and is therefore not well suited to measuring conversation generation quality [61].
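
For reference, BLEU between a generated and a target response can be computed with NLTK as in the following sketch (the token lists are illustrative); smoothing is commonly applied because chatbot responses are short.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["i", "am", "so", "sorry", "to", "hear", "that"]       # target response
candidate = ["i", "am", "sorry", "to", "hear", "that"]             # generated response

smooth = SmoothingFunction().method1
bleu2 = sentence_bleu([reference], candidate,
                      weights=(0.5, 0.5),           # BLEU-2: unigram and bigram precision
                      smoothing_function=smooth)
print(round(bleu2, 3))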

Perplexity is another way to evaluate how well a selected model generates an emotional response: the lower the perplexity score, the better the generation performance. Other measures are Distinct-1 and Distinct-2, which quantify the diversity of the response; heavily repeated words are penalized, and sentences containing many distinct n-grams are rewarded. These metrics depend exclusively on the properties of the generated sentence and require no ground-truth reference [22].
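
Both measures are straightforward to compute; the sketch below derives Distinct-n as the ratio of unique to total n-grams and perplexity as the exponential of the average per-token negative log-likelihood (the per-token log-probabilities would come from the model and are invented here).

import math

def distinct_n(tokens, n):
    """Number of unique n-grams divided by the total number of n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def perplexity(token_log_probs):
    """Exponential of the average negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

response = "i am glad you are glad you are happy".split()
print(distinct_n(response, 1), distinct_n(response, 2))
print(perplexity([-2.1, -0.7, -1.3, -0.9]))   # hypothetical per-token log-probabilities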

Accuracy, F1-score, precision, and recall are the most common metrics for measuring emotion classification. Accuracy is the proportion of correctly predicted outcomes out of the total number of predictions [22]. The F1-score is used to assess machine learning models (or classifiers) as an alternative to accuracy; it measures how well the classifier balances precision, i.e., detecting or capturing the precise emotion, and recall, i.e., retrieving all of its instances. Finally, in a dialog setting, accuracy indicates how often the data are aligned with the topic discussed, whereas recall measures the number of replies that the chatbot can group into appropriate topics through human-computer interaction [22].
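
These classification metrics are typically computed with standard tooling; a minimal example with scikit-learn, using invented gold and predicted emotion labels, follows.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["joy", "anger", "sadness", "joy", "fear", "anger"]      # gold emotion labels
y_pred = ["joy", "anger", "joy",     "joy", "fear", "sadness"]    # classifier output

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                   average="macro", zero_division=0)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")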

(2) Human Evaluation. Using human evaluators is another way to measure the performance of emotionally intelligent chatbots. Although automatic evaluation is more efficient and has less overhead than human evaluation, it does not consider whether the generated emotional response is appropriate and natural. Human evaluation is usually measured on a Likert scale. Several studies recruited Amazon Mechanical Turk (MTurk) participants for evaluation, and multiple studies used Fleiss' kappa test to measure inter-annotator agreement and rating consistency [39]. Table 7 summarizes the evaluation criteria used for human evaluation.
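
Fleiss' kappa can be computed directly from a subjects-by-categories count table; the short implementation below follows the standard formula (the rating table is invented for illustration).

import numpy as np

def fleiss_kappa(table):
    """table[i, j] = number of raters assigning category j to item i (equal raters per item)."""
    n = table.sum(axis=1)[0]                          # raters per item
    p_j = table.sum(axis=0) / table.sum()             # overall proportion of each category
    P_i = (np.square(table).sum(axis=1) - n) / (n * (n - 1))   # per-item agreement
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# 4 responses rated by 5 judges into 3 categories (e.g., empathy levels 1-3).
ratings = np.array([[5, 0, 0],
                    [2, 3, 0],
                    [0, 4, 1],
                    [1, 1, 3]])
print(round(fleiss_kappa(ratings), 3))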

5. Discussion

5.1. Chatbot Interface Language

Chinese and English are the most popular interface languages used by researchers to develop emotionally intelligent chatbots, with the conversational datasets for these languages retrieved from Twitter and Weibo. Only one study proposed the development of a bilingual chatbot [79]. In a multicultural environment, where a chatbot must be able to converse in the user's preferred language, such a capability is essential. This remains an open avenue for further research and exploration.

5.2. Dataset Availability

A vast majority of research studies focus on developing an emotionally intelligent chatbot for an open domain, whereas only a few have focused on the closed domain, using a rule-based approach for generating responses. Only one of the reviewed studies sourced a domain-specific dataset, for healthcare [78]. A generative chatbot that synthesizes human-like natural responses requires a massive dataset for training [39]. The unavailability of domain-specific conversational datasets is the main reason for the research gap in this field. A ripe area for exploration is the development of domain-specific datasets for fields such as education and business, as they could enable appealing solutions such as empathetic customer service and advising chatbots.

Moreover, the conversational datasets used for open-domain chatbots are not emotionally labeled. The reviewed studies applied extensive preprocessing to the datasets retrieved from Twitter and other sources to extract conversations and label them. However, an issue with this approach is that the resulting dataset is usually imbalanced and the classification is prone to errors. Rashkin et al. [17] addressed this challenge by developing a dataset of emotionally labeled conversations consisting of 25k conversational utterances. This is another area of research that needs further exploration: researchers may investigate the development of more emotionally labeled datasets to be used as the gold standard in the open domain.

5.3. Encoder-Decoder Model

Several studies enhance previously adopted models for developing emotionally intelligent chatbots, i.e., extending Seq2Seq to overcome its dull and meaningless responses. Many studies develop the model using a bidirectional classifier trained on an emotionally labeled dataset [64]. However, such neural-network-based conversational models cannot capture the complexity of emotions and tend to produce short, unclear responses. More recent work has utilized the CVAE model to alleviate this problem: studies demonstrate that CVAE increases the diversity of responses and overcomes the dullness and meaninglessness of Seq2Seq, although it negatively affects the syntax of the responses [36]. A further area for exploration is enhancing the CVAE model to make it more robust to syntax errors.

5.4. Emotion Detection and Embedding

The primary focus of most studies was to accurately detect the input emotion or the user's emotional state and generate appropriate affective responses. Several studies indicate that emotions are complex and cannot be captured accurately by a classifier [47, 62, 63]. Adopting a lexicon-based learning approach and using VAD vector spaces, in which each word carries an emotion embedding, can overcome the inability of classifiers to detect fine-grained emotion [39]. The taxonomy diagram (Figure 11) shows that studies addressing the challenge of emotion capture mainly use lexicon-based approaches. Only four studies have attempted to capture the user's emotional state from multiple historical utterances. Connecting the meanings and emotions of previous utterances is essential to comprehend the user's emotional state and sustain a continuous conversation. This is still an underexplored area and requires further investigation.

5.5. Voice-Based/Multimodal Chatbots

All of the chatbots included in the review are text-based. Another area for exploration and further research is the development of voice-based and multimodal chatbots that are domain-specific.

5.6. Hybrid Chatbots

Finally, no studies investigate generative emotionally intelligent chatbots that are task-oriented. Task-oriented chatbots are usually rule-based because they must provide precise information, but they consequently suffer from machine-like responses. A task-oriented emotionally intelligent chatbot could assist the user in accomplishing a task, such as making a reservation, placing an order, or providing advising information, while embedding empathy in the conversation to reduce user frustration and provide a good user experience. Moreover, such a chatbot could trigger human intervention when required by determining the user's emotional state [6, 47, 74].

6. Conclusion

This section includes a summary of the paper and its significance, limitations, and new directions for future research.

Recent technological advances have made chatbots increasingly feasible for delivering information across various domains. Consequently, a growing number of chatbots are now available for public use, and more attention is being paid to the development of emotionally intelligent chatbots. Developing chatbots that can generate emotional responses to user requests is challenging yet crucial to their successful adoption.

In this study, we conducted a systematic literature review exploring a spectrum of topics regarding the development of emotionally intelligent chatbots: the techniques for embedding and generating emotional responses, the challenges, the datasets used, and the evaluation processes used to measure chatbot performance. The study was based on publications available from 2011 to 2022 in six digital databases: Scopus, IEEE Xplore, ProQuest, ScienceDirect, ACM Digital Library, and EBSCO. We used a systematic approach to gather and assimilate our findings. This study is aimed at generating evidence-based guidelines for researchers and developers to gain insights into emotionally intelligent chatbot development research. Thus, researchers and practitioners in related fields will gain a deeper understanding of emotionally intelligent chatbots based on the findings presented in the discussion section.

Our study shows that Chinese is the most commonly used interface language in developing emotionally intelligent chatbots, and the Weibo and Twitter datasets are the most popular datasets used to develop open-domain AI-powered chatbots. Most chatbots are developed for the open domain because of the availability of conversational datasets. However, these datasets are not emotionally labeled; therefore, a common preprocessing step is to label the dataset using a classifier-based, lexicon-based, or hybrid approach. Furthermore, we identified that lexicon-based approaches, such as the VAD vector, provide fine-grained emotion detection, while classifiers are also used to detect emotion and to generate the diverse responses that are the ultimate objective of these systems. Most studies use both automatic and human evaluation measures. BLEU and perplexity are the most commonly used metrics in automatic evaluation. Human evaluations are essential to test the quality of the responses; several studies sourced participants from MTurk or used other human judges to evaluate response diversity and emotional relevance, and statistical measures such as Fleiss' kappa are used to determine the reliability of the human ratings.

This study may have limitations due to several factors. First, there was a limited amount of time to conduct the study. Moreover, although six bibliographic databases were used to retrieve relevant studies, the scarcity of research on this relatively new and emerging topic means that readers may notice specific areas left unexplored. Furthermore, due to limited resources, the retrieval of studies may not have been exhaustive, which may compromise the comprehensiveness of the review.

Data Availability

The search keywords and databases used in the systematic review are provided in the paper. Table 8 lists all the papers included in the systematic literature review. Furthermore, the data encoding of the papers analyzed during the current study is available from the corresponding author upon reasonable request.

Conflicts of Interest

All authors declare that they have no conflicts of interest.