I See Your Gesture: A VR-Based Study of Bidirectional Communication between Pedestrians and Automated Vehicles
Automated vehicles (AVs) can detect pedestrians reliably but still have difficulty predicting pedestrians’ intentions from their implicit body language. This study examined the effects of explicit hand gestures and receptive external human-machine interfaces (eHMIs) on the interaction between pedestrians and AVs. Twenty-six participants interacted with AVs in a virtual environment while wearing a head-mounted display. The participants’ movements in the virtual environment were visualized using a motion-tracking suit. The first independent variable was the participants’ opportunity to use a hand gesture to increase the probability that the AV would stop for them. The second independent variable was the AV’s response “I SEE YOU,” displayed on an eHMI when the vehicle yielded. Accordingly, one-way communication (gesture or eHMI) and two-way communication (gesture and eHMI combined) were investigated. The results showed that the participants decided to use hand gestures in 70% of the trials. Furthermore, the eHMI improved the predictability of the AV’s behavior compared to no eHMI, as inferred from self-reports and hand-use behavior. A postexperiment questionnaire indicated that two-way communication was the most preferred condition and that the eHMI alone was preferred over the gesture alone. The results further indicate limitations of hand gestures regarding false-positive detection and confusion if the AV decides not to yield. It is concluded that bidirectional human-robot communication has considerable potential.
1. Introduction

In current traffic, pedestrians and drivers use hand gestures and other bodily signals to inform, acknowledge, draw attention, or clarify situations [1–3]. These communication modes will no longer be available once automated vehicles (AVs) have taken over the driving task. Keferböck and Riener argued that it is important to substitute today’s pedestrian-vehicle communication with AVs that can detect pedestrians’ gestures and actively communicate via external human-machine interfaces (eHMIs).
So far, a large number of studies have examined the effectiveness of eHMIs that display the AV’s state and intentions (for reviews, see [5–7]). For example, a VR-based study by De Clercq et al. found that pedestrians felt safer crossing in front of an AV with an eHMI (e.g., text or a front brake light) than without one. A variety of other studies likewise show that eHMIs improve performance or enhance subjective clarity for pedestrians relative to control conditions without an eHMI [9–13].
Although eHMIs have demonstrated their value in various experimental studies, it is questionable whether the solution to the interaction between AVs and pedestrians should be sought in eHMIs alone. eHMIs have several drawbacks. In some concepts, the eHMI covers a large surface area of the AV [14, 15], which would entail technical complexity and high cost. Another point is that, in real traffic, the eHMI may not be noticed or understood.
In addition, it may be questioned whether eHMIs are the only way to achieve communication between pedestrians and AVs. Schieben et al. provided a framework for communication between AVs and other road users and showed that various forms of interaction are conceivable, of which eHMIs are one. Other communication strategies include the use of the infrastructure, the design of the vehicle shape, and the AV movements themselves (the latter is also known as implicit communication [18–22]). Moreover, Schieben et al. make clear that eHMIs do not necessarily have to show the AV’s state and intentions; informing other road users about the AV’s perception of the environment and its cooperation capabilities is a fruitful alternative. The current paper proposes a communication strategy that places responsibility on both the pedestrian and the AV. More specifically, we investigate whether pedestrians prefer to make their intention clear using an explicit gesture, and whether the AV should be made more intelligent by recognizing and responding to this gesture.
Current AVs are already capable of detecting pedestrians and other vulnerable road users. However, predicting pedestrians’ intentions remains an ongoing difficulty [23, 24]. Vinkhuyzen and Cefkin noted that a “challenge … is the limitations of the technology in making observational distinctions that socially acceptable driving necessitates.” Because AVs may have difficulty in reading the natural body language that may signal crossing intent, it may be necessary to require pedestrians to use more explicit bodily communication, such as a hand gesture. It can be expected that the camera systems of future AVs will be able to detect a hand gesture.
The use of driver gestures has previously been studied for the control of in-vehicle information systems [26, 27] and maneuver-based AV control [28, 29]. Hand gestures have been found effective for letting manually driven vehicles stop for pedestrians. However, to the best of our knowledge, no human factors studies have examined pedestrians expressing their intentions towards AVs through hand gestures.
The present study aimed to examine how pedestrians experience the use of hand gestures to increase the probability that the AV will stop for them. In addition to pedestrian-to-AV gesturing, this study examined how a subsequent response from the AV via its eHMI affects the pedestrians’ experience. In real-life applications, our hand gesture concept would require the AV’s computer vision systems to recognize these gestures. Various studies have already been performed in this area, such as the detection of gestures made by police officers [31–33] or cyclists. In the present lab-based study, we used a motion suit, which allowed us to measure the pedestrians’ bodily state.
2. Materials and Methods
Twenty-six participants (4 females and 22 males) were recruited among students and PhD candidates at the TU Delft. They had a mean age of 26.0 years (SD = 3.7 years). All participants were living in the Netherlands at the time of the study but had nationalities from different parts of the world (i.e., Europe, Asia, North America, South America, and Africa). The participants were offered a compensation of €10 for participating in the study and signed a written informed consent form before starting the experiment. The research was approved by the Human Research Ethics Committee of the TU Delft.
The experiment was conducted on an Alienware desktop computer running 64-bit Windows 10, with an Intel Core i7-9700K CPU @ 3.60 GHz, an NVIDIA GeForce RTX 2080 8 GB graphics card, and 16 GB of RAM. The virtual environment was developed and run using Unity version 2018.4.6f1. The scripts and environment were adapted from De Clercq et al. and Kooijman et al. The participants wore an Oculus Rift CV1 to experience the virtual environment. An Xsens Link motion suit was used to record the participant’s body movements, which were mapped onto an avatar in the virtual environment. By means of the Oculus Rift and motion suit, the participants could look and walk around while being able to see their own body from a first-person perspective. The motion suit was connected to a transmitter via which the data were sent to the desktop computer. Data received from the Xsens Link were handled by the software Xsens MVN Analyze Version 2019.0.0 Build 1627.
The experiment had a within-subject design, with the following independent variables:
(1) Opportunity to use hand gestures (two levels: yes and no).
(2) The eHMI message “I SEE YOU” when the AV yielded (two levels: eHMI upon yielding of the AV and no eHMI upon yielding of the AV).
(3) The yielding behavior of the approaching AV (two levels: yielding [if the participant made a gesture] and no yielding).
The first two independent variables formed a total of four conditions that were offered in blocks of ten trials. The third independent variable was varied within these blocks and was contingent on the participant’s use of hand gestures. More specifically, participants experienced ten trials per condition, five of which involved an AV that yielded if the participant used a hand gesture and five of which involved an AV that never yielded. The four blocks of 10 trials were randomized per participant.
The four conditions were as follows:
(i) Baseline, in which the participant was not allowed to use a hand gesture and the vehicle did not respond via the eHMI. Even if the participant did use a hand gesture, the AV would not yield in response. The AV yielded in a random 5 of the 10 trials; in the other 5 trials, the AV did not yield.
(ii) eHMI, which was identical to Baseline except that the AV displayed “I SEE YOU” on its eHMI when it started to decelerate.
(iii) Hand, in which the participant was allowed to use a hand gesture. By gesturing, the participant could increase the likelihood that the approaching vehicle would stop for them. The gesture resulted in yielding only if the AV was programmed to yield during that specific trial; if the gesture was used in a trial in which the AV was not programmed to yield, the AV maintained speed. Thus, if the participant raised their hand in all 10 trials, the AV yielded in 5 trials; if the participant never raised their hand, the AV yielded in 0 of the 10 trials. The eHMI was off in all trials.
(iv) Combination, which was identical to the Hand condition except that the AV responded by displaying “I SEE YOU” on its eHMI (see Figure 1) when it started to decelerate.
In summary, Baseline was identical to eHMI, and Hand was identical to Combination, except for the AV’s acknowledgment “I SEE YOU” in the eHMI and Combination conditions. In the Hand and Combination conditions, the participant could use a hand gesture to let the AV yield in 5 of 10 trials. In the Baseline and eHMI conditions, the AV always yielded in those 5 trials. Each participant interacted with a virtual AV in a total of 40 trials.
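The design above can be sketched as a trial-schedule generator. The condition labels match the paper, but the function, field names, and randomization details are our own illustration, not the experiment software:

```python
import random

def build_schedule(seed=None):
    """Build one participant's 40-trial schedule: 4 randomized blocks
    of 10 trials, with 5 yielding and 5 nonyielding trials shuffled
    within each block."""
    rng = random.Random(seed)
    conditions = ["Baseline", "eHMI", "Hand", "Combination"]
    rng.shuffle(conditions)  # block order randomized per participant
    schedule = []
    for condition in conditions:
        yields = [True] * 5 + [False] * 5
        rng.shuffle(yields)  # yielding trials random within the block
        for will_yield in yields:
            schedule.append({
                "condition": condition,
                "gesture_allowed": condition in ("Hand", "Combination"),
                "ehmi_on_yield": condition in ("eHMI", "Combination"),
                # In Hand/Combination, a programmed-to-yield trial only
                # produces yielding if the participant also gestures
                # in the 30-50 m window.
                "programmed_to_yield": will_yield,
            })
    return schedule
```

Note that `programmed_to_yield` encodes the third independent variable: in Baseline/eHMI it directly determines yielding, whereas in Hand/Combination it marks the five trials in which a gesture would make the AV yield.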
The AV had a constant approach speed of 50 km/h during all trials. In the Baseline and eHMI conditions, yielding AVs started to decelerate 50 m from the pedestrian and came to a standstill about 7 m from the pedestrian (distance measured to the center of the AV, parallel to the road).
In all trials, the target vehicle was preceded by a lead vehicle that always maintained speed. Upon approach, the time gap between the lead vehicle and the target vehicle was 1.3 s. Figure 2 shows the distance between the AVs and the pedestrian, distinguishing between yielding and nonyielding target vehicles.
As noted above, participants were permitted to use a hand gesture in two of the four conditions (i.e., Hand and Combination). Hand gesture use was not obligatory, to test if the participants were willing to adopt this novel communication mode. The distance thresholds for the hand gesture were 30 and 50 meters from the pedestrian’s location. A hand gesture used while the AV was before the 50 m threshold or after the 30 m threshold would not cause the AV to yield. If the hand was raised while the AV was between these distance thresholds, the eHMI would turn on (in the Combination condition) and the AV would immediately start the deceleration (in the Hand and Combination conditions). If the AV yielded, it did so with a deceleration that depended on the distance to the pedestrian crossing and in such a way that the AV would come to a standstill before the pedestrian crossing.
Hand usage (a binary variable: no or yes) was identified for each simulation timestep based on whether a position or angle of either arm exceeded a threshold value. The participants were not informed about these thresholds. The threshold values were predetermined during the development and pilot testing of the experiment. More specifically, hand usage was operationalized as whether any of the following criteria was met:
(1) The angle between the upper arm and the participant’s body (defined as an upright vector) was greater than 45°. An angle of 0° corresponds to the upper arm hanging down towards the ground, 90° to the upper arm held horizontally, and 180° to the upper arm pointing at the ceiling. An angle greater than 45° was regarded as indicating that the participant had raised their arm.
(2) The angle between the forearm and the direction of the upper arm was greater than 60°. An angle of 0° corresponds to a fully stretched arm; an angle greater than 60° was regarded as indicating a bent arm.
(3) The position of the hand was higher above the ground than that of the elbow. When standing or walking, the hands can be expected to be lower than the elbows; a hand higher than the elbow was regarded as indicating a raised arm.
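The three criteria can be sketched as a per-timestep classifier. The vector conventions (z up, shoulder-to-elbow and elbow-to-wrist directions) are our assumptions, not the Xsens segment definitions:

```python
import math

def hand_raised(upper_arm, forearm, hand_height, elbow_height):
    """Classify one timestep as 'hand raised' using the three criteria
    from the text. upper_arm and forearm are 3D direction vectors
    (shoulder->elbow and elbow->wrist) in world coordinates with z up;
    heights are in meters above the ground."""
    def angle_deg(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        cos = max(-1.0, min(1.0, dot / (na * nb)))
        return math.degrees(math.acos(cos))
    down = (0.0, 0.0, -1.0)
    # Criterion 1: upper arm elevated more than 45 deg from hanging down.
    if angle_deg(upper_arm, down) > 45:
        return True
    # Criterion 2: elbow bent more than 60 deg from a fully stretched arm.
    if angle_deg(forearm, upper_arm) > 60:
        return True
    # Criterion 3: hand higher above the ground than the elbow.
    return hand_height > elbow_height
```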
Before starting the experiment, participants read and signed an informed consent form. The form mentioned that participants would encounter 40 trials divided into four blocks of 10 trials. It further mentioned that, during some parts of the experiment, the researcher would inform them that they could raise their hand to communicate to the AV that they wanted to cross. Participants were informed that each trial consisted of two AVs driving towards the pedestrian crossing. Participants were instructed to let the first AV pass and to take a step forward when they thought it was a good time to do so. They could express this crossing intention before or after the target vehicle had passed.
After signing the form, participants completed a digital preexperiment questionnaire consisting of several demographic questions and four Likert-scale questions related to trust in automated vehicles and hand gestures.
During the experiment, the participants stood on the curb in front of a pedestrian crossing. The participants were instructed not to cross the road but only to take one step forward when they felt safe to cross. In this way, the participants had the task of making a crossing decision and were not merely observers of the approaching cars. Participants were asked not to express their crossing intention before the first AV (i.e., the lead vehicle) had passed.
Before the Hand and Combination conditions, the participant was informed by the researcher as follows: “for the following ten interactions, you are allowed to use a hand gesture if you want to, but you do not have to.” Participants were told that the gesture involved raising their hand to show their intention to cross the road.
After each trial, the participant was asked two questions: “on a scale from 0 to 10, how difficult was it for you to predict the behavior of the car, where 0 is not difficult at all, and 10 is very difficult?” and “on a scale from 0 to 10, how sure were you that the car would see you, where 0 is not sure at all and 10 is completely sure?”.
The participants ended the experiment with a postexperiment questionnaire containing the same trust-related questions used in the preexperiment questionnaire. The postexperiment questionnaire also asked participants to rank the four conditions, depicted as follows:
(i) Baseline. “No communication,” accompanied by a screenshot of the car without eHMI.
(ii) eHMI. “Communication via eHMI,” accompanied by a screenshot of the car with the eHMI depicting “I SEE YOU.”
(iii) Hand. “No communication after hand gesture,” accompanied by a screenshot of the car without eHMI.
(iv) Combination. “Communication via eHMI after hand gesture,” accompanied by a screenshot of the car with the eHMI depicting “I SEE YOU.”
The participants completed the ranking four times: (1) based on the extent to which they felt safe to cross the road; (2) based on the extent to which they were sure the car had seen them; (3) based on the extent to which they believed their decision was affected by the fact that no eye contact with the driver was possible; (4) based on their general preference when interacting with AVs.
Analyses were performed for the following variables:
(i) Hand gesture usage, defined as whether the hand was raised in the 30–50 m distance interval.
(ii) Hand-release time, defined as the first moment the hand was released after it had been raised, expressed relative to the moment the AV passed the 50 m distance threshold. If the hand was never raised, no hand-release time was determined for that trial.
(iii) Responses to the posttrial questions (difficulty and sureness, on a scale from 0 to 10).
(iv) Responses to the postexperiment questions.
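The hand-release time of item (ii) can be extracted from a per-timestep hand-usage signal as sketched below; the function and variable names are illustrative, not from the analysis script:

```python
def hand_release_time(timestamps, hand_raised_flags, t_50m):
    """Return the hand-release time: the first moment the hand goes
    from raised to released after having been raised, expressed
    relative to the moment (t_50m) the AV passed the 50 m mark.
    Returns None if the hand was never raised (or never released)."""
    was_raised = False
    for t, raised in zip(timestamps, hand_raised_flags):
        if raised:
            was_raised = True
        elif was_raised:
            return t - t_50m
    return None
```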
Statistical comparisons were performed by judging nonoverlapping confidence intervals, which were computed using a method for within-subject designs. Furthermore, paired-samples t-tests were used to compare participants’ scores between the experimental conditions. An alpha value of 0.05 was used.
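The paper’s exact CI procedure is not cited here; one common method for within-subject designs is Cousineau normalization with the Morey correction, sketched below together with the paired t statistic (whether this matches the cited procedure is an assumption):

```python
import math

def within_subject_means_and_sems(data):
    """Condition means and within-subject SEMs via Cousineau
    normalization with the Morey correction. data is a list of
    per-participant lists, one score per condition."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    # Remove between-participant offsets while keeping the grand mean.
    norm = [[x - sum(row) / k + grand for x in row] for row in data]
    means, sems = [], []
    for j in range(k):
        col = [row[j] for row in norm]
        m = sum(col) / n
        var = sum((x - m) ** 2 for x in col) / (n - 1)
        # Morey (2008) correction factor sqrt(k / (k - 1)).
        sems.append(math.sqrt(var / n) * math.sqrt(k / (k - 1)))
        means.append(sum(row[j] for row in data) / n)
    return means, sems

def paired_t(a, b):
    """Paired-samples t statistic with n - 1 degrees of freedom."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    m = sum(d) / n
    var = sum((x - m) ** 2 for x in d) / (n - 1)
    return m / math.sqrt(var / n)
```

Confidence intervals are then obtained by multiplying each SEM by the critical t value for n − 1 degrees of freedom.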
Participants performed a total of 1040 trials (26 participants × 40 trials per participant). Of those 1040 trials, 16 were discarded due to anomalies in the experiment or data recording.
Figure 3 shows the percentage of trials with hand gesture usage per condition. In the two conditions where a hand gesture was allowed (Hand and Combination), the participants used a hand gesture on average in, respectively, 72.1% (SD = 24.6%) and 68.7% (SD = 26.0%) of the trials. Figure 3 also shows that there were false positives for several participants in the Baseline and eHMI conditions.
Figure 4 shows at which moments the participants made hand gestures. In about 50% of the trials, participants had their hand raised before the hand gesture could be picked up by the AV (i.e., while the distance between AV and pedestrian was >50 m). Participants lowered their hand as the AV started to decelerate (Figure 4, top), especially in the Combination condition. The mean hand-release time, measured from the moment the AV passed the 50 m mark (at 5.80 s into the trial), was 3.66 s (SD = 1.46 s) for the Hand condition and 2.85 s (SD = 1.54 s) for the Combination condition, a significant effect according to a paired-samples t-test, t(22) = 3.68 (three participants did not produce hand-release data because they did not raise their hand).
In comparison, this difference in hand use was not seen in the nonyielding trials, in which the eHMI was always off (Figure 4, bottom). More specifically, the mean hand-release time, measured from the moment the AV passed the 50 m mark (at 5.80 s into the trial), was 2.91 s (SD = 1.25 s) for the Hand condition and 2.76 s (SD = 1.09 s) for the Combination condition, a nonsignificant effect, t(24) = 0.64 (one participant did not produce hand-release data because the hand was not raised).
Figure 5 shows the mean responses regarding participants’ difficulty in predicting the AV’s behavior and their sureness of being seen by the AV. A distinction is made between trials in which the AV did and did not yield. The eHMI made it easier for participants to predict the AV’s behavior (Figure 5(a)) and assured them of being seen (Figure 5(b)). As can be seen from the nonoverlapping confidence intervals, the effects were generally statistically significant; for example, for yielding AVs, a paired-samples t-test between the Baseline and Hand conditions’ difficulty scores indicated a significant effect: t(25) = 3.94.
The results further showed that, in the case of nonyielding AVs, the Hand condition made it more difficult for participants to predict the AV’s behavior compared to the Baseline condition, a significant difference according to a paired-samples t-test, t(25) = 2.96.
The results of the pre- and postexperiment questionnaire regarding trust in AVs are provided in Table 1. It can be seen that, after the experiment, participants exhibited a higher trust in the idea that hand gestures can be used to interact with AVs as compared to that before the experiment. Before the experiment, participants were skeptical towards the idea that self-driving vehicles will respond to hand gestures (M = 5.15 on a scale from 1 to 10).
The postexperiment ranking of the four experimental conditions is provided in Table 2. The Combination condition was ranked highest regarding safety, the sureness of being seen, and general preference. This was followed by the eHMI condition, Hand condition, and Baseline condition. The eHMI condition received higher rankings than the Hand condition. In other words, hand gestures alone (i.e., without receptive eHMI) were not highly rated. The presence of the eHMI, in the eHMI condition as well as in the Combination condition, appeared to alleviate the perceived effect of the lack of eye contact.
In the present study, we examined the efficacy of one-way and two-way communication in pedestrian-AV interactions in a VR environment. The results showed that pedestrians’ hand gesture use was moderately high, at about 70%. In other words, most participants were willing to use a hand gesture to make the AV stop. Possibly, some participants did not want to make the effort of raising their hand since no benefit could be obtained by doing so. In comparison, in real traffic, pedestrians may gain time or increase their safety if an approaching vehicle stops for them. It is also possible that participants in our experiment were trying out the use and nonuse of hand gestures.
The experiment further showed that the eHMI, which provided the confirmatory message “I SEE YOU,” improved the perceived predictability of the AV’s behavior as compared to no eHMI, as demonstrated by the relatively sharp decline in the percentage of hand gesture usage for the Combination condition in Figure 3. The latter effect can be explained by the fact that the eHMI turned on if and only if the vehicle yielded. The eHMI further caused pedestrians to lower their hands early (see results for the Combination condition vs. the Hand condition). In other words, the eHMI’s affirmative message prevented pedestrians from holding their hand in the air for an unnecessarily long time.
In case the AV did not yield, hand gesture use made it subjectively more difficult to predict the AV’s behavior compared to the Baseline condition. A possible explanation is that the use of a hand gesture did not guarantee that the AV would yield (i.e., in 50% of the trials, the AV was programmed not to yield, regardless of hand gesture usage). These findings suggest that pedestrians may have difficulty in future traffic if only a portion of AVs is responsive to their hand signals. Such a situation is plausible, as future traffic will likely consist of AVs of different brands with different computer vision abilities. In addition, approaching vehicles may be unable to stop, for example, because traffic rules forbid this or because of traffic behind them.
Based on an analysis of communication between road users in today’s traffic, Lee et al. concluded that road users rarely use explicit communication such as hand gestures. They also pointed out that “there may be limited requirement for automated vehicles to adopt explicit communication solutions.” The present study was undertaken from a different point of view. In our study, the pedestrians provided an explicit hand gesture, which was detected and used by the simulated AV. The underlying idea was that future AVs will have difficulty in reading pedestrians’ implicit body language and that explicit gestures are therefore needed. Moreover, future AVs may have to detect explicit gestures of vulnerable road users for safety reasons (e.g., detection of an extended arm of cyclists [34, 36]) and to comply with traffic rules (e.g., being responsive to signals used by traffic police [31–33]). Furthermore, as others [8, 10, 37] have shown as well, the present study demonstrated that an eHMI makes the invisible visible. That is, it may be hard for pedestrians to detect the initiation of deceleration of an approaching vehicle; an eHMI makes such information salient, thereby improving subjective clarity.
Our study showed that false positives are of some concern. Pedestrians may be confused and raise their hand even for AVs that cannot stop for them, or the AV may detect a hand gesture even if the pedestrian did not intend to gesture to that particular AV. In other words, the hand gesture feature we introduced could add complexity to future traffic. More generally, the introduction of solutions to problems (in this case, the lack of communication in traffic) can create new problems (in our case, false positives and occasional confusion), a phenomenon Sheridan referred to as “fixes to fixes” (p. 146).
It would be interesting to examine whether the eHMI should be isomorphic to the participant’s gesture in order to establish a more efficient dialogue between pedestrian and AV. Several researchers have already proposed eHMIs in the form of a gesture; for example, Fridman et al. and Hudson et al. tested an upraised hand, whereas Mahadevan et al. proposed an animated hand above the vehicle. It would be worth determining whether such gesture-based eHMIs are more effective than the present text-based eHMI.
In the present study, the AV in the Baseline and eHMI conditions yielded in a random 50% of the trials. This appears to be a realistic percentage relative to contemporary manual driving. An observational study by Sucha et al., for example, found that “36% of the drivers failed to yield to pedestrians in situations where they were obliged to.” A study in China by Zhuang and Xu found that 63.5% of drivers did not even change speed when a pedestrian stood on the curb, and only 3.5% of the vehicles yielded. In our study, the AV could either yield or not yield, with no other behaviors possible. Further research could include other types of AV behaviors, such as showing its intention via lateral movement, slowing down but not stopping in order to let the pedestrian cross, or adapting to the pedestrian’s behavior (e.g., braking if the pedestrian walks up to the curb and not braking if the pedestrian shows hesitant behavior). Further research could also focus on the long-term consequences of the use of hand gestures. The results of our pre- versus postexperiment questions showed that, at the end of the experiment, the participants had gained trust in the idea of communicating with AVs through hand gestures. In the longer term, hand gestures may invite misuse, where pedestrians raise their hand and cross the street without waiting for the AV to confirm. Aside from regulations, there would be a need for research into a standardized set of hand gestures for pedestrian-AV interaction, considering cultural differences in gesture use [44, 45]. Standardized gestures would make it easier for AV developers to train their computer vision systems to recognize them. For research into standardization, large groups of users from diverse target groups, including children and older persons, would be needed.
6. Conclusions and Outlook
This lab-based study concludes that pedestrian gestures can be used to let an approaching AV stop. Furthermore, an eHMI on the AV depicting the message “I SEE YOU” makes the encounter clearer for the pedestrian. The present study further demonstrated the value of bidirectional communication: if the AV confirms that it has seen the pedestrian, the pedestrian knows that the hand can be lowered again.
In the past, driving involved keeping the hands on the steering wheel. In modern times, such as with most Level 2 automated driving systems, drivers still have their hands on the steering wheel continuously and may perform shared control [46, 47]. In the future, automated driving systems may still require some human involvement, but this may take the form of intermittent control of maneuvers and prediction-level interventions through gestures or touchscreens [28, 48–50]. The present study suggests that gestures on behalf of the pedestrian could also have a role in future traffic.
Data Availability

The data and MATLAB script used to reproduce the figures in this paper are available via https://doi.org/10.4121/14406944.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments

The authors thank Dr. Pavlo Bazilinskyy for discussions in the early stages of this research. This research was supported by grant 016.Vidi.178.047 (How should automated vehicles communicate with other road users?), which was financed by the Netherlands Organisation for Scientific Research (NWO).
A. Rasouli, I. Kotseruba, and J. K. Tsotsos, “Agreeing to cross: how drivers and pedestrians communicate,” in Proceedings of the 2017 IEEE Intelligent Vehicles Symposium, Los Angeles, CA, USA, June 2017.View at: Google Scholar
F. Keferböck and A. Riener, “Strategies for negotiation between autonomous vehicles and pedestrians,” in Proceedings of the Mensch und Computer 2015 Workshopband, pp. 525–532, Stuttgart, Germany, September 2015.View at: Google Scholar
D. Dey, A. Habibovic, A. Löcken et al., “Taming the eHMI jungle: a classification taxonomy to guide compare and assess the design principles of automated vehicles’ external human-machine interfaces,” Transportation Research Interdisciplinary Perspectives, vol. 7, Article ID 100174, 2020.View at: Publisher Site | Google Scholar
A. Rasouli and J. K. Tsotsos, “Autonomous vehicles that interact with pedestrians: a survey of theory and practice,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, pp. 900–918, 2019.View at: Google Scholar
K. De Clercq, A. Dietrich, J. P. Núñez Velasco, J. de Winter, and R. Happee, “External human-machine interfaces on automated vehicles: effects on pedestrian crossing decisions,” Human Factors: The Journal of the Human Factors and Ergonomics Society, vol. 61, no. 8, pp. 1353–1370, 2019.View at: Publisher Site | Google Scholar
Y. B. Eisma, S. van Bergen, S. M. ter Brake, M. T. T. Hensen, W. J. Tempelaar, and J. C. F. de Winter, “External human-machine interfaces: the effect of display location on crossing intentions and eye movements,” Information, vol. 11, p. 13, 2020.View at: Google Scholar
Y. E. Song, C. Lehsing, T. Fuest, and K. Bengler, “External HMIs and their effect on the interaction between pedestrians and automated vehicles,” in Intelligent Human Systems Integration, W. Karwowski and T. Ahram, Eds., Springer, Cham, Germany, 2018.View at: Google Scholar
F. Weber, R. Chadowitz, K. Schmidt, J. Messerschmidt, and T. Fuest, “Crossing the street across the globe: a study on the effects of eHMI on pedestrians in the US Germany and China,” in HCI in Mobility Transport and Automotive Systems. HCII 2019. Lecture Notes in Computer Science, H. Krömker, Ed., Springer, Cham, Germany, 2019.View at: Google Scholar
D. Schlackl, K. Weigl, and A. Riener, “eHMI visualization on the entire car body: results of a comparative evaluation of concepts for the communication between AVs and manual drivers,” in Proceedings of the Conference on Mensch und Computer, Magdeburg, Germany, September 2020.View at: Google Scholar
M. Cefkin, J. Zhang, E. Stayton, and E. Vinkhuyzen, “Multi-methods research to examine external HMI for highly automated vehicles,” in HCI in Mobility Transport and Automotive Systems. HCII 2019. Lecture Notes in Computer Science, H. Krömker, Ed., pp. 46–64, Springer, Cham, Germany, 2019.View at: Google Scholar
A. Schieben, M. Wilbrink, C. Kettwich, R. Madigan, T. Louw, and N. Merat, “Designing the interaction of automated vehicles with other traffic participants: design considerations based on human needs and expectations,” Cognition, Technology & Work, vol. 21, no. 1, pp. 69–85, 2019.
D. Moore, R. Currano, G. E. Strack, and D. Sirkin, “The case for implicit external human-machine interfaces for autonomous vehicles,” in Proceedings of the 11th International Conference on Automotive User Interfaces and Interactive Vehicular Applications, pp. 295–307, Utrecht, Netherlands, September 2019.
M. Clamann, M. Aubert, and M. L. Cummings, “Evaluation of vehicle-to-pedestrian communication displays for autonomous vehicles,” in Proceedings of the Transportation Research Board 96th Annual Meeting, Washington, DC, USA, January 2017.
P. Pandey and J. V. Aghav, “Pedestrian–autonomous vehicles interaction challenges: a survey and a solution to pedestrian intent identification,” in Advances in Data and Information Sciences, M. Kolhe, S. Tiwari, M. Trivedi, and K. Mishra, Eds., Springer, Singapore, 2020.
C. A. Pickering, K. J. Burnham, and M. J. Richardson, “A research study of hand gesture recognition technologies and applications for human vehicle interaction,” in Proceedings of the 2007 3rd Institution of Engineering and Technology Conference on Automotive Electronics, Coventry, UK, June 2007.
H. Detjen, S. Geisler, and S. Schneegass, “Maneuver-based control interventions during automated driving: comparing touch, voice, and mid-air gestures as input modalities,” in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC), Toronto, Canada, 2020.
I. Vijayan, M. H. Laur, and J. P. Absmeier, “Automated vehicle operation based on gesture to pedestrian,” 2017, U.S. Patent Application No. 14/987,188.
F. Guo, J. Tang, and C. Zhu, “Gesture recognition for Chinese traffic police,” in Proceedings of the 2015 International Conference on Virtual Reality and Visualization (ICVRV), pp. 64–67, Xiamen, China, November 2015.
Z. Fang, W. Zhang, Z. Guo, R. Zhi, B. Wang, and F. Flohr, “Traffic police gesture recognition by pose graph convolutional networks,” in Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), pp. 1833–1838, Las Vegas, NV, USA, November 2020.
H. Kretzschmar and J. Zhu, “Cyclist hand signal detection by an autonomous vehicle,” 2015, U.S. Patent No. 9,014,905.
T. Singer, J. Kobbert, B. Zandi, and T. Q. Khanh, “Displaying the driving state of automated vehicles to other road users: an international virtual reality-based study as a first step for the harmonized regulations of novel signaling devices,” IEEE Transactions on Intelligent Transportation Systems, 2020.
T. B. Sheridan, Humans and Automation: System Design and Research Issues, Wiley-Interscience, Hoboken, NJ, USA, 2002.
L. Fridman, B. Mehler, L. Xia, Y. Yang, L. Y. Facusse, and B. Reimer, “To walk or not to walk: crowdsourced assessment of external vehicle-to-pedestrian displays,” in Proceedings of the Transportation Research Board Annual Meeting, Washington, DC, USA, January 2019.
C. R. Hudson, S. Deb, D. W. Carruth, J. McGinley, and D. Frey, “Pedestrian perception of autonomous vehicles with external interacting features,” in Advances in Human Factors and Systems Interaction. AHFE 2018, I. Nunes, Ed., Springer, Cham, Germany, 2019.
K. Mahadevan, S. Somanath, and E. Sharlin, “Communicating awareness and intent in autonomous vehicle-pedestrian interaction,” in Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal, Canada, April 2018.
A. Sripada, P. Bazilinskyy, and J. C. F. de Winter, “Automated vehicles that communicate implicitly: examining the use of lateral position within the lane,” 2021, paper submitted for publication.
B. Färber, “Communication and communication problems between autonomous vehicles and human drivers,” in Autonomous Driving, Springer, Berlin, Germany, 2016.
S. Gupta, M. Vasardani, and S. Winter, “Conventionalized gestures for the interaction of people in traffic with autonomous vehicles,” in Proceedings of the 9th ACM SIGSPATIAL International Workshop on Computational Transportation Science, San Francisco, CA, USA, October 2016.
C. Wang, M. Krüger, and C. B. Wiebel-Herboth, ““Watch out!”: prediction-level intervention for automated driving,” in Proceedings of the 12th International Conference on Automotive User Interfaces and Interactive Vehicular Applications, Washington, DC, USA, September 2020.
M. Kauer, M. Schreiber, and R. Bruder, “How to conduct a car? A design example for maneuver based driver-vehicle interaction,” in Proceedings of the IEEE Intelligent Vehicles Symposium, La Jolla, CA, USA, 2010.