Abstract

In recent years, the growing popularity of smart speakers (e.g., Google Home and Alexa) has facilitated young children’s interaction with internet-based devices and provided them with more opportunities to access online information. This review summarizes the current state of the research by examining smart speakers’ core characteristics, children’s conceptualization of and interaction with smart speakers, and the influences on children’s learning and habits. Our review shows that (a) natural language processing technology and a central computing system (the Internet) contribute to the uniqueness of smart speakers; (b) although children tend to attribute human characteristics (e.g., smart and friendly) to smart speakers, they might judge these voice assistant devices as neither explicitly living nor nonliving in ontological terms; (c) children’s overattribution of certain knowledge to smart speakers (e.g., asking questions about personal information) does not necessarily mean that they believe the device to be omniscient; and (d) in terms of promoting children’s learning, smart speakers might not be more effective than a real human, and interaction with smart speakers may not be conducive to children’s maintenance of civilized social norms. Implications for children’s conceptualization of and interaction with smart speakers and for the design of child-oriented smart agents are also discussed.

1. Introduction

Following touchscreen devices, smart speakers (e.g., Google Home from Google, Alexa from Amazon, and Tmall Genie from Alibaba) have become popular among children in the past five years. According to the latest survey released by Common Sense Media in 2020, 41% of young children aged 0-8 in the US have a smart speaker in their homes, while the proportion was only 9% in 2017 [1]. However, whenever new mass media or technological devices (e.g., wireless radio, television, smartphones, or tablets) are widely introduced into public life [2, 3], parents and educators display strong concern: how will these emerging devices affect our children’s development? In recent years, as children have increasingly gained access to smart speakers, this emerging internet-based device has not escaped the attention and doubt of researchers and educators. Although some researchers have focused on this field, existing reviews have mainly concentrated on children’s voice search [4] or outlined potential directions for how smart speakers might influence children’s development [5-8]. A comprehensive understanding of how young children understand and interact with smart speakers is of great importance for parents and educators who want to help children interact with this emerging technology rationally. In this review, drawing on the latest empirical findings, we summarize the current state of research on children and smart speakers and address several questions of interest to researchers as well as parents: (1) what the core features of smart speakers are; (2) how children aged 3-12 years conceptualize smart speakers; (3) how children interact with smart speaker devices; and (4) how smart speaker devices impact children’s learning and daily behavior.

To find studies that fit the scope of this review, the literature was identified by searching for “child AND (smart speaker OR voice assistant OR conversational agent OR voice search)” in Web of Science, ACM, and Google Scholar. To be included, studies had to address how children aged 3-12 understand and interact with smart speakers. Studies that focused on children with atypical development (e.g., children with ASD) or that described a new program design for smart speakers were excluded. Once relevant studies were identified, their reference sections were examined for other relevant studies. Overall, 40 research articles, conference presentations, or published proceedings were identified.

2. Core Features of Smart Speakers

According to “the basic model of human behavior with technologies” proposed by Yan (2020) [9], how users interact with technology determines their activities and effects (see Figure 1). Therefore, the mode of human-computer interaction that a technology allows should be regarded as the core feature of the device. First, in terms of interaction, smart speakers widely adopt voice-driven interfaces based on natural language processing technology [6]. Unlike earlier speech recognition systems, which plagued users with their poor flexibility, natural language processing no longer requires users to use specific words or follow certain patterns when giving commands or asking questions. Therefore, smart speakers can provide children who have limited literacy skills and immature fine motor skills with unprecedented opportunities to interact with emerging internet-based devices [6, 10, 11].

Second, compared to other entities that also utilize voice interaction (e.g., social bots based on a specific corpus), smart speakers operate in a networked environment. This kind of device is therefore backed by a centralized processing system (i.e., the Internet), which makes it possible to process a much richer set of user commands. Furthermore, within the same wireless local area network, users can indirectly control other connected devices (such as smart TVs, lighting, and air conditioning) through smart speakers. This expands the potential capabilities of smart speakers [6] and may also prompt young children to perceive smart speakers differently from other devices [12].

3. How Children Conceptualize Smart Speakers

3.1. Is It Alive? Children’s Perception of Smart Speaker’s Ontological Categories

The progression from touchscreens to the natural speech processing used by smart speakers implies more than a change in the mode of human-computer interaction. Because speech is the most natural way for people to communicate with others, and smart speakers now employ this form, which was previously exclusive to human beings, children’s understanding of whether smart speakers are alive may become blurred while they communicate with the devices [5]. By observing how children aged 3-12 interacted with smart speakers, researchers found that the majority of children believed that smart speakers were smart, friendly, and trustworthy. While interacting, children often asked questions about the device’s identity and personality and even joked with the device as if it were a real person (e.g., “Google, can I eat you?”; [11, 13-16]). When asked to draw the smart speaker, elementary school children tended to display anthropomorphic features in their drawings (e.g., eyes, limbs, and facial expressions; [17]). During communication, children showed nonverbal behaviors such as nodding, smiling, shrugging, and frowning, which the smart speaker could not actually recognize [18]. Hence, some researchers believed that children anthropomorphized smart speakers [19].

However, the view “young children anthropomorphize smart speakers” implies the default assumption that young children place smart speakers in the nonliving ontological category, which might not hold true for artificially intelligent entities. Although previous studies have found that children as young as age 3 can distinguish between living and nonliving kinds and that such discrimination is clear and stable [20], the emergence of robots and other artificially intelligent entities in recent years has increasingly challenged these existing ontological understandings. Although they lack an anthropomorphic or animal-like appearance and are unable to move autonomously, smart speakers have been equipped with the most prominent feature that was previously exclusive to human beings: speech. Kahn et al. [12] proposed that these artificially intelligent devices, such as smart speakers and other entities based on embodied social computational systems, might be perceived as a hybrid of living and nonliving that does not belong to any existing ontological category but forms a completely new ontological category (NOC). After observing 3- to 6-year-old children’s 40-minute interaction with a Google Home Mini and then examining their ontological reasoning, Xu and Warschauer [21] found that although the majority of children attributed human-like capacities to smart speakers (93% of children thought the Google Home Mini was smart, 86% thought it had memory, and 64% thought it had feelings), only 18% of children claimed it was indeed “human,” while 25% believed “it was neither an artifact nor a living object.” Likewise, Festerling and Siraj [14] suggested that children would interpret smart speakers as a superimposition of genuinely humanoid and nonhumanoid speakers. When smart speakers could not respond immediately or presented incorrect answers, children tended to believe that the device was controlled by a real human; however, when smart speakers displayed instant, standardized responses and a lack of common sense, children perceived the same smart speaker as a machine.

3.2. Omniscient God? How Children Perceive Smart Speaker’s Scope of Knowledge

Connected to the Internet, smart speakers can provide young children with unprecedented opportunities to access online information. Rücker and Pinkwart [22] suggested that children were likely to view a single internet-based device as equivalent to the Internet and to believe that the device itself was a giant database into which they could save, and from which they could extract, all kinds of information, while failing to realize the limits on the scope of knowledge of both the device and the Internet. Given that children sometimes asked smart speakers questions that were beyond the Internet knowledge domain (e.g., asking where their mom is [11, 23]), some researchers speculated that children would perceive smart speakers to be omniscient [11, 23].

However, considering the precise definition of “omniscient” (knowing everything about everything), using the evidence “children sometimes asked smart speaker questions beyond its capacity” to infer the conclusion “children would believe internet-based devices such as smart speaker are omniscient” may be an overgeneralization. According to constructivist theories of cognitive development, children’s understanding of novel entities and phenomena relies strongly on their existing concepts [24-26]. This suggests that when children try to understand the knowledge state of smart speakers or the Internet, they may make inferences based on their existing schema of a real human [27-29]. Given that even young children have realized the limits of people’s knowledge through daily interaction, it might be difficult for children to believe that a truly omniscient agent or entity exists [30]. Indeed, some empirical studies found that children did not believe that the Internet and internet-based devices would know everything [31, 32]. When children explored the boundaries of what smart speakers or the Internet know by raising questions, the frequency of seeking facts (e.g., technology, history, language, spelling, and translation) increased with age, while the proportion of requests for private information and other requests beyond the scope of online information decreased [11, 16]. Thus, it is reasonable to speculate that children’s inappropriate questions to smart speakers might reflect exploration or trial and error. Children may conceptualize the knowledge state of smart speakers through firsthand observation and experimentation as well as secondhand testimony, similar to how they understand an individual’s knowledge state [33].

4. How Children Interact with Smart Speakers

Unlike screen media, the natural language processing interface used in smart speakers allows young children to interact with the device in a way that closely resembles real human interaction. Ultimately, however, smart speakers are machines rather than real persons. They cannot reply to children’s commands in a personalized way that incorporates young children’s background knowledge and emotional state, which might give rise to a unique pattern of human-computer interaction.

How do children interact with a smart speaker? Existing classifications differ according to the perspective and observations on which they are based. Some researchers concentrated on the goal of children’s use of smart speakers and classified children’s activity into “Exploration,” “Information Seeking,” and “Functional Tasks and Commands” [23]. Other researchers have focused on smart speakers’ functions and divided the interaction into “sending and reading text messages, making phone calls, and sending and receiving emails,” “answering basic questions,” “setting timers, alarms, reminders, schedules, etc.,” “controlling other media and other connected devices,” and “telling jokes or stories” [6]. However, these classifications focus only on children’s one-way commands to smart speakers and downplay the continuous, alternating interaction between users and devices. To highlight the dynamic interaction process between children and smart speakers, we summarize the interaction from two perspectives: how children command smart speakers and how they react to smart speakers’ responses.

Previous studies have suggested that when children interacted with smart speakers, requesting information was the most common activity [11, 16, 23, 34, 35]. Based on children’s purpose, their questions could be categorized as “exploring the identity of the device” and “using the smart speaker to seek other information” [23]. These included asking about the personal preferences and characteristics of voice assistants [13, 14], their background information [14], and their location [23]; asking about children’s own personal information related to their family; and asking for facts about science and technology, culture, language (e.g., word spelling), modus operandi (e.g., recipes and navigation), and other objective knowledge [11]. Apart from asking questions, users can also operate the smart speaker for other extended functions and indirectly control other smart devices connected to the same local wireless network. Although a series of observations have suggested that children could use smart speakers to send messages [23], make phone calls [23], listen to music, and command the device to tell jokes [11, 36], the frequency of children’s use of these extended functions is much lower than that of seeking knowledge or information from the smart speaker [11, 23, 35]. This may be because children primarily view smart speakers as a source of information rather than an entertainment tool [35]. Another possible explanation is that most of the relevant studies are based on usage in real family settings, where children in full parental care are unlikely to need to command smart speakers themselves.

In terms of children’s responses to the smart speaker, a series of observations found that when the smart speaker was able to accurately recognize and respond to children’s speech and instructions, children tended to respond actively to the smart speaker’s questions and commands, showing active participation similar to that shown toward a real person and displaying basic features of interpersonal dialog such as shared attention and coordinated reciprocity [10, 37]. In addition, when responding to questions from the smart speaker, children not only actively used verbal communication but also often displayed nonverbal expressions such as shaking their heads and shrugging [18]. These results suggest that smart speakers can trigger children’s natural interpersonal responses as a real human might do.

However, due to children’s (especially preschoolers’) limited verbal expression skills, when they communicate with voice-based electronic devices such as smart speakers, approximately half of their commands cannot be accurately recognized by the device, resulting in a breakdown in human-machine communication [11]. At this point, children actively employ a range of conversation repair strategies, including changing the topic of the question and the type of question asked [35], repeating instructions, increasing the volume, emphasizing pronunciation and adding pauses between words or syllables, adding background information, replacing words, and reorganizing word order [10, 13, 23, 38]. Moreover, compared to adult users, children showed less frustration when faced with human-machine communication failures, were less likely to attribute communication failures to the device, and made more attempts to use various repair strategies [10]. In addition, the repair strategies that users employ may reflect how they explain smart speaker failures. For example, the strategies “repeat command” and “raise volume,” which are most commonly used by children, may indicate that children tend to attribute the breakdown to the smart speaker’s misunderstanding of characteristic words and its lack of attention during communication [10].

5. How Smart Speakers Affect Child Development

5.1. Is a Smart Speaker Helpful for Children’s Early Learning?

By observing smart speakers telling stories to children, researchers found that children used both verbal and nonverbal expressions to actively respond to smart speakers’ questions, showing interaction patterns similar to parent-child communication [18]. Hence, some parents and educators expect smart speaker devices to boost children’s learning and development as well as interactions with real people do.

When interacting with children, smart speakers can only use synthesized sounds to answer children’s questions and lack all nonverbal communication cues (e.g., fixation, facial expressions, body posture, gestures, and emotional feedback). They have no physical body and cannot move, let alone respond to the user’s actions. On the one hand, according to Social Presence Theory [39] and Social Agency Theory [40, 41], the social characteristics of media or agents should have an important impact on learners’ communication and learning outcomes. When smart speakers are involved in children’s learning process, they may not be regarded as humanoid social partners and cannot create a sense of social presence similar to that of interaction with a real person. On this view, smart speakers would not be expected to facilitate children’s learning as well as real people do. On the other hand, the social features that smart speakers lack but real people have are likely to increase learners’ cognitive load, occupying attention that is needed for processing the learning material [42]. Therefore, compared with real people, it is also possible that smart speakers promote children’s learning by allowing more cognitive resources to be invested in the content itself and the learning process. Xu and her colleagues [43] compared how a smart speaker and a real person impacted 3- to 6-year-olds’ content comprehension and language expression during storybook reading. The results indicated that a real person was a more effective companion than a smart speaker in supporting children’s story comprehension. This finding supports Social Presence Theory and Social Agency Theory, which suggest that mechanized smart speakers cannot be as effective as a real human in facilitating children’s learning engagement. In addition, when communicating with children, parents and educators often use rhetorical questions and guided interactions, which can serve as a scaffold for children to explore and further process information on their own. In dialog with a smart speaker, however, when children ask questions, the smart speaker always presents answers directly. Although some researchers have attempted to build scaffolding questions into smart speakers to help children understand [37], the number and role of such scaffolds are still limited [36]. These results again suggest that smart speakers might not facilitate children’s learning and cognitive processing as well as real people do.

Based on these differences between smart speakers and real people, some researchers attempted to increase the social features of smart speakers to improve their facilitation of children’s learning but failed to achieve the desired results. For example, Yuan et al. [38] used the Wizard of Oz method (in which a real person simulates the inner workings of a smart speaker, thereby excluding the confounding effect of misrecognition) to add “personalization (changing language details for different children such as calling a child’s name)” and “anthropomorphization (giving the device a human identity)” to smart speakers and asked children aged 5-12 years to interact with a “question-only smart speaker,” “naming personalization of the device,” and “smart speaker with personalization and anthropomorphization.” The results showed that although children aged 5-12 years preferred smart speakers with personalization, the personalization and anthropomorphization of the speaker had no effect on the effectiveness of children’s questions. Other researchers have added subjective words related to feelings and emotions to the statements provided by smart speakers, but compared to devices that only provided objective facts, the effectiveness of communication between 5- to 6-year-olds and smart speakers was not significantly improved [44]. Some researchers suggest that “Social Agency Theory” requires a certain threshold: smart speakers are too different from a real person in physical appearance; thus, adding only verbal human-like features is not enough to lead children to perceive the smart speaker as a “real person” [44, 45].

Overall, although the idea of using devices such as smart speakers to assist young children’s learning is very appealing, given the shortcomings that still exist in the design and flexibility, smart speakers cannot yet fully replace the role of parents and educators [46]. Although smart speakers will not hinder children’s learning, they should not be viewed as a substitute for parental companionship and social connection; instead, parents should teach their children to realize that it is only a tool or a machine [8].

It is also worth noting that, although smart speakers may not facilitate children’s deep cognitive processing as well as real people do, they may offer other benefits. When researchers examined the speech patterns used by young children, they found that children aged 3-6 years expressed clearer and more comprehensible language during conversations with smart speakers than with real people [43]. This result indicates that smart speakers might have potential advantages in developing young children’s verbal communication skills, especially in promoting clear articulation and improving speech intelligibility [5, 47]. However, this facilitation is probably a byproduct of the immaturity of natural language processing technology, which has a low recognition rate for young children’s speech and vocabulary.

5.2. Is a Smart Speaker Harmful for Children’s Social Interaction?

Regardless of how close the speech used by smart speakers is to real human communication, a smart speaker is still ultimately a mechanical device; thus, the behavioral patterns that young children develop in their interactions with smart speakers may not apply to real social interaction (e.g., politeness, which is of great value in real human social interaction but is meaningless in communication with smart speakers). Although it has been found that new manners learned by children from one specific smart speaker were easily transferred to another smart speaker, these behavior patterns were not easily transferred to real settings, especially when children interacted with strangers [48]. However, empirical research is still lacking on the more realistic question of whether behavior patterns that are suppressed in real situations but could play a role in interactions with smart speakers will be reinforced in real social situations. First, studies have found that children often used a loud voice and even intimidation tactics when smart speakers failed to recognize their language [23, 49]. Some parents are concerned about whether children would transfer these aggressive verbal expressions into conversations with real people, thus impeding the acquisition and maintenance of good language habits and norms of daily behavior [15, 36]. Second, children’s interactions with smart speakers are immediate (children give commands and smart speakers respond immediately), which is contrary to “delayed gratification” [5]. Thus, too much interaction with and commanding of smart speakers may be detrimental to children’s development of self-control skills. Third, because smart speakers are equipped with many characteristics of a good partner (e.g., patient listening, no ridicule, and keeping secrets), children may perceive the device as a trustworthy partner with whom they can build emotional and attachment relationships [5, 15, 23, 36, 50], which may reduce children’s social needs in the real world and thus be detrimental to their social development.

However, as voice-based, internet-connected devices for shared use in the home, smart speakers are more likely than other screen-based media (e.g., smartphones and tablets) to be supervised by parents when used by children [51]. Several studies even suggested that parents made full use of smart speakers to actively cultivate children’s daily habits [47]. Therefore, how smart speakers affect children’s behavioral habits in real contexts and whether children would transfer behavioral patterns from their interactions with smart speakers to real contexts still require direct evidence from empirical studies, especially tracking studies based on real use situations.

6. Implications

The rapid rise in popularity of smart speakers among children has led businesses to think about how to design products to make them more appealing [52]. Although most current smart speakers are square or cylindrical, some surveys suggest that an anthropomorphic shape can appeal to children [53] and meet the expectations of some children [15], and some smart speakers have accordingly been designed with anthropomorphic shapes. However, given that an entity’s appearance plays an important role in children’s ontological classification [20], it is possible that smart speakers with anthropomorphic shapes might blur children’s ontological categorization and make it easier for children to transfer poor communication habits formed in human-computer interaction to real interpersonal interactions. Moreover, smart speakers with anthropomorphic shapes might make it easier for children to form emotional attachments and establish social connections with mechanized devices, thereby hindering children’s interactions in real social contexts. Thus, whether smart speakers with anthropomorphic shapes should be widely adopted remains to be seen. Second, studies have found that, unlike smartphones or tablets, smart speakers were more likely to be viewed by young children as an information source than as an entertainment tool [35], suggesting that children have a tendency to learn facts from these devices. Thus, it might be important to strengthen the quality control of information conveyed by smart speakers or to embed filtering programs that shield children from inappropriate online information.

As for parents, although smart speakers allow young children to interact with the device independently through natural speech, parents should increase their joint media engagement with their children, considering that young children lack a sophisticated understanding of smart speakers and that social interaction could help children improve their understanding of emerging technologies [54]. In addition, considering that in the process of interacting with smart speakers, young children may tend to form attachments or master-servant relationships with devices [12] or display behavioral patterns that are not encouraged in real society (e.g., raising their voice to repair a breakdown in human-machine communication), parents should pay attention to children’s interaction with the device and guide them properly. Perhaps in the future, smart speakers may identify children’s voiceprints, record their interactions automatically, and then permit parents to review those interactions, which would make it more convenient for parents to monitor their children continuously.

7. Future Directions

While parents and educators have begun to pay attention to how children interact with smart speakers, there are many questions that need to be further discussed.

First, most children in existing studies either had no experience with smart speaker use prior to participating in the experiment (e.g., [11, 37]), or this vital context was not clearly reported in the paper (e.g., [10, 44]). As a result, the behaviors that children exhibited when interacting with smart speakers may largely reflect their exploration of an unfamiliar device. That is, the behavior patterns might not be relatively stable patterns based on children’s current level of cognitive development and experience but rather part of an exploration process in the face of a new device. Since behavior is a reflection of the current cognitive state, children could update their cognitive schema continually and then further adjust their behavior patterns [35]. Therefore, when conclusions are based on observations of how children who are unfamiliar with the devices interact with smart speakers, children’s understanding of smart speakers may be underestimated.

Second, the gradual entry of smart speakers into children’s lives may have a lasting and subtle impact on their learning and development. The existing findings are mainly based on observations of less than one month, and their conclusions are therefore limited to short-term effects. Future studies can examine the long-term effects of smart speakers on children’s development through delayed tests.

Third, when children judge real people’s knowledge state and decide whether to ask them questions, their choices are influenced by the child’s own level of cognitive development (e.g., theory of mind; [55-57]) and informants’ past performance [58]. Future studies may examine whether boundary conditions that influence children’s judgments of the knowledge states of real people could also be applied to children’s understanding of the knowledge and capability states of smart speakers [7].

8. Conclusion

Based on the core features of smart speakers, this review systematically summarizes how children aged 3-12 understand and interact with smart speakers and the impact of smart speaker devices on children’s learning and daily behavior. We conclude that natural language processing and network information processing, on which smart speakers are based, play a crucial role in children’s smart speaker use and understanding. Children tend to assign human characteristics to smart speakers but do not simply classify them as fully alive or inanimate. Although children ask questions beyond the capabilities of the smart speaker, whether they consider it “omniscient” remains to be explored. In the process of guiding children’s reading and speech development, smart speakers are not yet sufficient to facilitate development as well as a real human might. Moreover, children’s interactions with smart speakers may negatively affect their cognitive development and the formation or maintenance of good social norms.

Data Availability

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Ethical Approval

Ethics approval statement is not applicable to this article as no human participants were involved in the current study.

Conflicts of Interest

The researchers declare that there is no conflict of interest involved in this work.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (#31771236 to FW and #71974072 to WW).