Abstract

Augmented Reality (AR) and Mixed Reality (MR) technologies have great potential for supporting mobile applications. However, nonvisual interaction modalities are undervalued and underused in AR/MR applications. Visual displays can be ineffective or inappropriate in some situations, such as when walking or driving, while nonvisual modalities are becoming increasingly important in mobile user experiences. In this paper, we report two studies investigating nonvisual interaction modalities, namely audio and haptic displays, with mobile AR/MR applications. In the first study, we investigate a range of design factors for haptic and audio displays, including rhythm, amplitude, and their combination, in representing tourism information to users with a mobile phone. The results show a main effect of Interaction Modality, with identification rates highest for information represented in a combined Haptic-Audio display. In the second study, we investigate target location tasks in 3D space using spatial audio feedback and a head-mounted display. We evaluate several design factors including audio feedback device, volume, rhythm, and the target’s horizontal and vertical position. The results show that vertical positions are very difficult to locate with spatial audio alone, and that overall our participants preferred audio cues with loud volume and fast rhythm. Finally, we propose practical audio and haptic display design guidelines for AR/MR applications.

1. Introduction

Although the concepts of Augmented Reality (AR) and Mixed Reality (MR) have existed since the 1960s, only in the past 20 years have technological advances made AR/MR a flourishing area of research [1]. Powered by mobile devices or head-mounted displays (HMDs), AR/MR technology can augment people’s perception with information delivered via various modalities such as vision, audio, and even touch, smell, and taste, thereby enriching the user's experience of reality and the surrounding environment.

AR/MR technology has been widely used in education, engineering, entertainment, advertising, television broadcasting, and other fields. For example, AR/MR travel applications offer visitors a variety of ways to explore and experience the world, such as supporting navigation through unfamiliar environments with various forms of feedback [2] and reconstructing historical buildings and experiences with 3D models.

At the moment, for most AR/MR applications on the market, the visual display still dominates the interaction between the user and the device. However, screen space is usually very limited, and the visual display can easily become cluttered with confusing information and widgets. Furthermore, watching the screen while mobile is not always viable: when walking or cycling, for example, it can be inconvenient or dangerous to force the user to read information on the screen, and when one or both hands are occupied, it may also be difficult to access visual information. In addition, the visibility of a mobile screen can be compromised by sunlight, motion, or illegible text.

Currently, mobile AR/MR applications are highly dependent on visual displays, and the nonvisual interaction channel has been underestimated and underutilized. This reliance on visual displays can pose a problem because AR/MR services are used in a wide range of contexts; for example, users cannot always devote their visual attention to the mobile application interface. Consequently, audio, haptic, and other nonvisual interaction modes become increasingly important in AR/MR applications.

When traditional visual displays are not the best choice in mobile computing settings, haptic displays can be an important alternative interaction mode. In addition, the natural effect of sound in actions involving mechanical shocks and vibrations suggests the use of auditory displays as a haptic interface enhancement. Spatial audio has been shown to allow users to explore multiple simultaneous sources and increase the level of immersion [3]. Audio can also be used to draw attention to objects that are not currently in the gaze direction.

Despite its potential for enhancing user experience and driving new applications, there is relatively little research and evaluation of nonvisual AR/MR interaction modalities for tourism applications. Therefore, in this paper, we evaluate the effectiveness of using haptic and auditory displays as outputs to enhance the AR/MR experience with a view to their use in tourism applications. We report two evaluations of nonvisual displays using haptic and audio modalities in mobile AR and head-mounted MR.

2. Related Work

Mobile AR and MR applications are becoming popular in many areas, including tourism [4]. Tourists are looking for new and different experiences, and they like to use new technologies to explore places and attractions and learn more about them [5]. As tourism patterns change, more and more tourists are looking for applications and tools that facilitate such new experiences [6]. Kansa and Wilde studied characteristics of information and service design by exploring the demands and motivations of visitors [7]. They proposed methods to deliver location-based services based on “low barrier of entry” principles of web architecture, and they also found that improving transparency can strengthen service innovation capability.

The attributes of AR make it particularly appropriate for visualizing spatial environments, encouraging its use for urban exploration in tourism [8]. For example, AR techniques can be used to reconstruct artifacts in real environments. Smartphones now have the technical features necessary for implementing AR in a small device, such as a powerful processor, a rear camera, GPS, a compass, and many other sensors. The AR technology brought by smartphones has enormous potential for the travel industry [9]. However, current AR applications mainly focus on bringing digital content to the screen, relying on visual displays; since visual displays are not appropriate for all usage scenarios, alternative interaction methods become very important when a visual display is not feasible.

Combinations of audio and haptic feedback have been used in many tour guide projects and applications. For instance, PocketNavigator [10] is a pedestrian navigation application with a haptic compass, which uses vibration patterns to guide users to specified places. Giachritsis et al. [11] proposed a method for developing intuitive navigation patterns by representing basic directions for landmarks and actions. Different amplitudes and rhythms of haptic and audio feedback have also been shown to be effective in helping visitors identify different tourist attractions [2]. Srikulwong and O’Neill [12] investigated the use of wearable haptic displays in pedestrian navigation; in a field evaluation, they found that users’ navigation accuracy with a tactile-based system was not much different from that with a visual-based system, while route completion time with the tactile-based directional display was much faster.

Haptic and audio feedback have also been tested and used in other applications. McGookin and Brewster [13] investigated people’s perception of auditory displays by encoding information about theme park rides into “earcons”. Each earcon used three attributes (timbre, intensity, and register) to encode three pieces of information about a ride: its type, its intensity, and its cost.

We can use many different attributes to design haptic and audio displays. For haptic icons, Ternes and MacLean [14] investigated rhythms combined with frequency and amplitude and found that the two main characteristics users rely on to distinguish haptic rhythms are length and unevenness. Ryu et al. [10] reported investigations identifying frequency detection thresholds and the usable amplitude range on mobile devices. They also proposed a psychophysical magnitude function mapping vibration frequency and amplitude to perceived intensity, which could be used to predict the perceived intensity of a mobile device’s vibration. Information transfer with a combined tactile-audio signal set was also investigated by Chen et al. [15], whose results show that audio-assisted signals can effectively disambiguate haptic signals.

In both virtual and mixed reality environments, audio feedback can be beneficial for spatial perception. For instance, by encoding distance information in an audio display, the localization accuracy of AR medical applications can be significantly improved [16]. Audio feedback can also affect the user’s understanding of visual space, and deliberately misaligned audio and visual information may lead to better perception of space in virtual scenes [17]. In addition, audio design can serve as the primary interface for conveying game information and create engaging gaming experiences in location-based AR games [18]. Auditory interfaces for aviation applications have also been studied by NASA; for example, studies found that using spatialized sound to guide head direction could greatly reduce visual acquisition time [19].

However, there is very little research on nonvisual displays to help people identify or locate attractions, and AR/MR designers have very few design guidelines to draw on. Thus, in this paper, we investigate nonvisual display designs for AR/MR devices and applications.

3. Study 1: Nonvisual Display Design for Information Representation

By far the most common and convenient mobile device is the smartphone. Therefore, rather than using specially developed hardware, we deliberately used a standard consumer smartphone and its standard features.

Rhythm can be a very effective aid in nonvisual displays [18], and amplitude has also been used to present information in haptic and audio displays [19]. Smartphones support variation in the rhythm and amplitude of haptic and audio feedback well; thus, we used rhythm and amplitude in our nonvisual feedback designs to present information and investigated the effects of different interaction designs.

In this study, 3 different rhythms were used to express 3 historical themes of particular interest: water (Figure 1(a)), architecture (Figure 1(b)), and human (Figure 1(c)). The combinations of intervals and pulses distinguish the rhythms; in Figure 1, they are presented on a single line using standard musical notation. In addition to the three historical themes, we also represented 3 historical periods of attractions, using 3 amplitude levels: ancient (low amplitude), medieval (middle), and Georgian (high). The experimental mobile application we developed runs on the Android platform (Figure 2).
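As a rough illustration of how such cues could be generated on a standard smartphone, the sketch below encodes themes as vibration rhythms and periods as vibration amplitudes. It uses the modern Android VibrationEffect API rather than the Immersion TouchSense SDK used in the study, and the pulse timings are illustrative assumptions, not the exact rhythms notated in Figure 1.

```java
import android.content.Context;
import android.os.VibrationEffect;
import android.os.Vibrator;

// Hypothetical encoding of Study 1's cues: rhythm = historical theme,
// amplitude = historical period. Requires the VIBRATE permission and API 26+.
public class NonvisualCueSketch {
    // Segment durations in ms; even-indexed segments are silent, odd-indexed vibrate.
    // These timings are assumptions, not the rhythms of Figure 1.
    static final long[] RHYTHM_WATER        = {0, 200, 200, 200, 200, 200};
    static final long[] RHYTHM_ARCHITECTURE = {0, 400, 100, 400, 100, 400};
    static final long[] RHYTHM_HUMAN        = {0, 100, 100, 100, 400, 600};

    // Amplitude levels from the study: low = 33%, mid = 66%, high = 100% of maximum (255).
    static final int AMP_ANCIENT  = (int) (0.33 * 255);
    static final int AMP_MEDIEVAL = (int) (0.66 * 255);
    static final int AMP_GEORGIAN = 255;

    /** Plays one theme rhythm at one period amplitude. */
    static void playCue(Context ctx, long[] rhythm, int amplitude) {
        Vibrator vibrator = (Vibrator) ctx.getSystemService(Context.VIBRATOR_SERVICE);
        int[] amplitudes = new int[rhythm.length];
        for (int i = 0; i < rhythm.length; i++) {
            amplitudes[i] = (i % 2 == 1) ? amplitude : 0;   // vibrate only on odd segments
        }
        vibrator.vibrate(VibrationEffect.createWaveform(rhythm, amplitudes, -1));  // -1 = no repeat
    }
}
```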

3.1. Experimental Evaluation
3.1.1. Experiment Settings

We used an LG Optimus p970 smartphone, which provided haptic and audio display from its vibration actuator and speakers, respectively. This Android smartphone supports control of vibration amplitude using Immersion TouchSense® Haptic Feedback Technology (https://www.immersion.com). It has a 4-inch screen display and weighs 109 g.

3.1.2. Independent Variables

The independent variables were Interaction Modality (Haptic, Audio, Haptic-Audio), Rhythm (the 3 rhythms shown in Figure 1), and Amplitude (Low: 33% vibration intensity for Haptic, 0.3 volume for Audio; Mid: 66% intensity, 0.6 volume; High: 100% intensity, 1.0 volume).

3.1.3. Participants

Thirty volunteers were recruited. All participants were graduate students aged 21 to 29 years (19 males and 11 females), and all had prior experience with smartphones.

3.2. Experimental Design

A repeated-measures mixed design was adopted in the experiment. The 30 participants were allocated into 3 subgroups, one for each Interaction Modality. Within each subgroup, each participant had to identify the combinations of 3 different Rhythm types and 3 Amplitude levels for that Interaction modality.

Before the experiment, a brief introduction was given, followed by a 10-minute exploration by the participants. In the testing phase, the participant held the smartphone in her hands and was asked to identify all 9 combinations of 3 Rhythm types and 3 Amplitude levels (Figure 3).

The participant was required to write down which historical theme and period were represented by the nonvisual displays. The presentation order was randomized, and each combination was presented for 5 seconds. The participant also had to complete a questionnaire about the 3 modality types.
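The trial schedule for each participant can be summarized by the following minimal sketch, which enumerates the 9 Rhythm × Amplitude combinations and shuffles them; a simple uniform shuffle is assumed here, as the study does not specify its randomization method.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative Study 1 trial schedule: all 9 Rhythm x Amplitude combinations,
// presented once each in a randomized order, 5 seconds per presentation.
public class Study1Trials {
    enum Rhythm { WATER, ARCHITECTURE, HUMAN }   // 3 historical themes
    enum Amplitude { LOW, MID, HIGH }            // ancient, medieval, Georgian periods

    public static void main(String[] args) {
        List<String> trials = new ArrayList<>();
        for (Rhythm r : Rhythm.values()) {
            for (Amplitude a : Amplitude.values()) {
                trials.add(r + " x " + a);
            }
        }
        Collections.shuffle(trials);             // randomized presentation order
        for (String trial : trials) {
            System.out.println(trial + " (presented for 5 s)");
        }
    }
}
```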

3.3. Results

We recorded correct responses and calculated the percentages of correct identifications. After the trials we also collected subjective feedback from the participants.

3.3.1. Identification Rate

The overall average correct response rate was 76.6%. The highest rate was 86.7% for haptic and audio combined. The identification rate for haptic alone was 70% and for audio was 73.3% (Figure 4).

We performed a repeated-measures ANOVA for Modality × Amplitude × Rhythm. A significant main effect was found for Interaction Modality. Neither Amplitude nor Rhythm had a significant effect, and no significant interaction effects were found. The overall identification rates for Amplitude and Rhythm were 76.7% and 76.6%, respectively.

3.3.2. Participant Feedback

Participants were asked whether it was difficult for them to tell the difference between rhythms and amplitudes. Across all interaction modalities, more than half of the participants found it difficult to distinguish between different amplitudes (Figure 4). In contrast, few participants found it difficult to distinguish between different rhythms or between different combinations of rhythm and amplitude (Figure 5).

3.4. Discussion

The effectiveness of the haptic and audio displays to represent tourist information was evaluated in Study 1. A main effect of Interaction modality was found. The average identification rate was over 75%. Although the haptic and audio displays working together achieved the best performance, they can also be effective when used alone. As audio feedback often cannot be used effectively in travel and leisure scenarios, application designers could consider using more haptic displays, for example, in some quiet places (such as a museum or classical concert) and noisy places (such as busy streets or rock concerts).

Our results are in agreement with those in [12], where participant feedback indicated that it was more difficult to distinguish different amplitude levels than different rhythms. This suggests that designers should rely more on rhythm and, where possible, reduce the number of amplitude levels, as suggested in [12].

4. Study 2: Evaluation of Spatial Audio for a Mixed Reality (MR) Application

Besides mobile-device-based AR (e.g., on smartphones), head-mounted MR devices such as the Microsoft Hololens also have great potential for tourism applications. With MR head-mounted displays, virtual 3D graphics can appear where the user’s gaze is directed, whereas audio information can come from all directions. Thus, audio displays could play an essential part in MR tourism applications, enhancing the immersive experience and guiding users’ attention to certain points of interest in the real environment.

Previous studies have shown that spatial audio can enhance visual target acquisition in 3D environments [19]. This motivated us to investigate further how various design factors (e.g., audio feedback device, volume, rhythm, and target location) affect target search and location tasks.

4.1. Experimental Design
4.1.1. Audio Display Device

Audio output devices for MR applications now come in different designs, such as earbuds, which deliver closed audio feedback with noise isolation, and open speakers on MR HMDs such as the Microsoft Hololens, which mix real and virtual audio without putting earphones in the ear. Thus, it is desirable to better understand different forms of audio display design and their effects on audio-guided object location tasks with MR devices. In this study, both in-ear headphones and the open speakers on the Hololens were used. Figure 7 shows a participant performing a task using the open speakers on the Hololens.

4.1.2. Audio Rhythm and Volume

As mentioned in Study 1, Rhythm and Volume are considered to be very effective in audio displays for recognition tasks [5, 20], but their effectiveness has not been evaluated in MR applications. Thus, in Study 2 we investigated different Rhythms and Volumes. We used minims (half notes) for the slow rhythm and crotchets (quarter notes) for the fast rhythm, played with a grand piano sound, as illustrated in Figure 6.
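For concreteness, the inter-onset interval of each audio cue follows directly from the note value and the tempo. The sketch below assumes an example tempo of 120 BPM, which is not stated here.

```java
// Converts note values to inter-onset intervals in milliseconds.
// The 120 BPM tempo is an assumed example, not a value from the study.
public class RhythmTiming {
    static long intervalMs(double beatsPerNote, double bpm) {
        return Math.round(beatsPerNote * 60_000.0 / bpm);
    }

    public static void main(String[] args) {
        double bpm = 120.0;                      // assumed tempo
        long crotchet = intervalMs(1.0, bpm);    // fast rhythm: 500 ms between onsets
        long minim    = intervalMs(2.0, bpm);    // slow rhythm: 1000 ms between onsets
        System.out.println("crotchet (fast): " + crotchet + " ms, minim (slow): " + minim + " ms");
    }
}
```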

4.1.3. Horizontal and Vertical Position

To cover a range of positions around the user, we used 9 different target positions. Horizontally, the target appeared on the user’s left side, back, or right side, and for each Horizontal Position there were 3 Vertical Positions: top (30 degrees above the user’s head height), middle (the same height as the user’s head), and bottom (30 degrees below the user’s head).
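A minimal sketch of how these 9 positions can be computed in a head-centered coordinate frame is shown below. The 3 m target distance comes from the experiment settings, while the specific azimuth angles chosen for Left, Back, and Right are assumptions, since only the side is stated.

```java
// Sketch of the 9 Study 2 target positions on a 3 m sphere around the head.
// Frame: x = right, y = up, z = forward; azimuth values for Left/Back/Right are assumed.
public class TargetPositions {
    static final double DISTANCE_M = 3.0;   // target distance from the head

    /** Returns {x, y, z} for a target at the given azimuth and elevation (degrees). */
    static double[] targetPosition(double azimuthDeg, double elevationDeg) {
        double az = Math.toRadians(azimuthDeg);
        double el = Math.toRadians(elevationDeg);
        double x = DISTANCE_M * Math.cos(el) * Math.sin(az);
        double y = DISTANCE_M * Math.sin(el);
        double z = DISTANCE_M * Math.cos(el) * Math.cos(az);
        return new double[] { x, y, z };
    }

    public static void main(String[] args) {
        double[] azimuths   = { -90.0, 180.0, 90.0 };  // Left, Back, Right (assumed angles)
        double[] elevations = { 30.0, 0.0, -30.0 };    // Top, Middle, Bottom
        for (double az : azimuths) {
            for (double el : elevations) {
                double[] p = targetPosition(az, el);
                System.out.printf("az=%6.1f el=%5.1f -> (%.2f, %.2f, %.2f)%n",
                        az, el, p[0], p[1], p[2]);
            }
        }
    }
}
```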

4.1.4. Independent Variables

There were 5 independent variables:
(i) Audio Display Device (2 levels): In-ear headphone, Hololens speaker
(ii) Volume (2 levels): Low (50% of full volume), High (100% of full volume)
(iii) Rhythm (2 levels): Slow (minim), Fast (crotchet)
(iv) Horizontal Position (3 levels): Left, Back, and Right
(v) Vertical Position (3 levels): Top, Middle, and Bottom

4.1.5. Experiment Settings

We used a Microsoft Hololens to provide the 3D visual and spatial audio displays. The audio playback was provided either by the built-in stereo speakers of the Hololens or by Sony MH755 in-ear headphones. The experimental application was implemented using the Unity3D engine 5.5 and Microsoft Visual Studio 2015 Update 3 on Windows 10. The spatial audio on the Hololens was rendered by the Microsoft head-related transfer function (HRTF) spatializer plugin in Unity. The virtual 3D target was a white sphere 20 cm in diameter, positioned 3 meters from the user’s head. The middle vertical position was at the same height as the user’s head. During the experiment, a cursor (a white dot) indicated the center of the user’s gaze direction.

4.1.6. Participants and Procedure

We recruited 16 volunteers (12 males and 4 females), aged 21 to 35 years, with a mean age of 27.13.

Each participant was given a brief introduction to the experiment, followed by a 10-minute self-guided exploration trying all possible combinations of the audio feedback and target positions. In the test phase, participants were asked to search for and select the target as quickly as possible while remaining in the same location. The target was displayed as a 20 cm diameter white sphere, and a cursor was rendered as a white dot at the center of the user’s gaze, following the user’s head motion. When the cursor aligned with the target, the target disappeared and then reappeared in its next position. The order of Audio Display Device, Volume, and Rhythm was counterbalanced; the order of Horizontal and Vertical Positions was randomized.

After the trials, the participant answered a questionnaire about her preferences for different Device, Volume, and Rhythm (from 0: strongly dislike to 10: strongly like), as well as the difficulty of searching for the target in different positions (from 0: very difficult to 10: very easy). Participants were also asked to provide their comments about different audio displays in MR applications. The whole experiment took about 30 minutes.

A repeated-measures within-participants design was used, with 2 sessions, one for each Audio Display Device. In each session, there were 3 trials for each combination of Volume × Rhythm × Horizontal Position × Vertical Position (thus 108 trials per session). Trials with the same Volume and Rhythm were presented contiguously, and the 8 orders of the Device, Volume, and Rhythm combinations were counterbalanced across the 16 participants. Each participant performed 216 test trials in total.
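The sketch below reproduces the trial-count arithmetic of this design, 108 trials per session and 216 per participant; identifiers are illustrative, and the actual block ordering and counterbalancing scheme are not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative enumeration of the Study 2 trial structure:
// 2 Volumes x 2 Rhythms x 3 Horizontal x 3 Vertical positions, 3 repetitions each.
public class Study2Trials {
    public static void main(String[] args) {
        String[] volumes    = { "Low", "High" };
        String[] rhythms    = { "Slow", "Fast" };
        String[] horizontal = { "Left", "Back", "Right" };
        String[] vertical   = { "Top", "Middle", "Bottom" };
        int repetitions = 3;

        List<String> sessionTrials = new ArrayList<>();
        for (String v : volumes)
            for (String r : rhythms)
                for (String h : horizontal)
                    for (String vp : vertical)
                        for (int rep = 0; rep < repetitions; rep++)
                            sessionTrials.add(v + "/" + r + "/" + h + "/" + vp);

        int sessions = 2;   // one session per audio display device
        System.out.println("Trials per session: " + sessionTrials.size());                 // 108
        System.out.println("Trials per participant: " + sessionTrials.size() * sessions);  // 216
    }
}
```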

4.2. Results
4.2.1. Movement Time

We recorded the movement time for the user to locate the target. A repeated-measures analysis of variance (ANOVA) for Device × Horizontal Position × Vertical Position × Volume × Rhythm was used to analyze the movement time.

Main effects were found for Horizontal Position, Vertical Position, and Rhythm. Neither Device nor Volume had a significant main effect. Interaction effects were found for Device × Volume, Device × Rhythm, and Vertical Position × Volume. The mean movement times are shown in Figures 8, 9, and 10.

Post hoc Bonferroni pairwise comparisons showed that the movement time for targets at the Back was significantly longer than for the Left and Right, with no significant difference between Left and Right. They also showed that the movement time for targets in the Middle was significantly shorter than for the Top and Bottom, with no significant difference between Top and Bottom. Movement time with the Slow rhythm was significantly longer than with the Fast rhythm.

4.2.2. Movement Angle

We also recorded the total head movement angle for the user to locate the target. Again, a repeated-measures analysis of variance (ANOVA) for Device × Horizontal Position × Vertical Position × Volume × Rhythm was used to analyze the movement angle.

Main effects were found for Horizontal Position and Vertical Position. No main effect was found for Device, Volume, or Rhythm. Interaction effects were found for Device × Volume and Device × Rhythm. The mean head movement angles are shown in Figures 11 and 12.

Post hoc Bonferroni pairwise comparisons showed that the movement angle for targets at the Back was significantly larger than for the Left and Right, with no significant difference between Left and Right. They also showed that the movement angle for targets in the Middle was significantly smaller than for the Top and Bottom, with no significant difference between Top and Bottom.

4.2.3. User Preference

We also collected user preference data using questionnaires. Overall, users preferred the Hololens speakers with high volume and fast rhythm (Figure 13(a)). They also reported that it was easy to locate the horizontal position of the target, while the vertical position was very difficult to determine; thus the Top and Bottom positions were disliked by our participants (Figure 13(b)).

4.3. Discussion
4.3.1. Movement Time

The fast rhythm could help users to locate the 3D virtual target faster than the slow rhythm (Figure 9(a)), while the volume and audio display devices had no significant effect on the users’ performance (Figure 9(b)). This might be because the fast rhythm could provide more frequent feedback of the target location to guide the users. In addition, the fast rhythm might also encourage the users to finish the task quickly.

Although the different audio feedback devices and volumes showed no significant main effect, we found interaction effects for Device × Volume (Figure 9(c)) and Device × Rhythm (Figure 9(d)). With the slow rhythm and low volume, users located the target more slowly with the Hololens speaker than with the in-ear earphones. These interaction effects suggest that stronger spatial audio cues (i.e., loud volume or fast rhythm) may be required if a Hololens-type speaker is used to provide the spatial audio feedback. Some participants commented that the audio from the in-ear earphones was clearer than that from the Hololens speaker and helped to block out noise from the physical environment, while the audio from the speakers was more “natural” and helped users to blend the virtual and real information.

Another interesting finding was the interaction effect for Vertical Position × Volume (Figure 10), indicating that with low volume users located the target more slowly when it was at the top position and faster when it was at the bottom. This may be because users tended to associate low volume with a spatially low (i.e., bottom) audio source.

Not surprisingly, different target positions had a strong effect on users’ head movement time. For example, targets at the back required the longest movement time, and targets in the middle required the shortest (Figure 8). The users’ comments strongly suggested that it was much easier to distinguish the horizontal positions than the vertical positions. The differences in the audio feedback between the top, middle, and bottom positions were small; thus users had to look up and down to locate the target. This is consistent with Microsoft’s caveat that the elevation of spatial sound may be perceived less accurately than its azimuth (https://developer.microsoft.com/en-us/windows/holographic/spatial_sound). Our results strongly suggest that spatial audio alone, whether from the Hololens speaker or from the in-ear earphones, was not sufficient for locating the target’s vertical position; thus additional cues from other modalities (e.g., visual) should be provided.

4.3.2. Movement Angle

For the head movement angle during the target locating process, only target positions (i.e., horizontal and vertical position) had a main effect (Figure 11). The spatial audio characteristics (i.e., Device, Volume, or Rhythm), on the other hand, had no main effect (Figures 12(a) and 12(b)). Similar to movement time, we also found interaction effects for Device × Volume (Figure 12(c)) and Device × Rhythm (Figure 12(d)), suggesting more head movement was required to locate the target with the slow rhythm and low volume using the Hololens speaker.

5. Conclusion and Future Work

In this paper, we investigated the design of audio and haptic displays for AR/MR tourism applications. We examined several design factors, including rhythm, amplitude, and audio feedback devices, as well as target positions in MR applications. Our studies suggest a set of interesting findings for nonvisual displays in AR/MR tourism applications.

The haptic and audio displays together could achieve the best performance for nonvisual display in the context of a mobile tourism application. Users preferred the Hololens speaker with loud volume and fast rhythm to locate horizontal targets in head-mounted MR applications.

Based on our findings, we have synthesized some practical design guidelines for nonvisual displays with AR/MR applications.
(1) For mobile phones, designers should leverage rhythm more than amplitude for haptic and audio displays and consider reducing the levels of amplitude.
(2) For head-mounted MR scenarios, designers should design spatial audio for targets distributed horizontally and provide additional cues for locating targets distributed vertically.
(3) For head-mounted MR devices, open speakers are good for immersive MR experiences, and sufficient volume could improve target locating performance.
(4) For head-mounted MR applications, audio feedback with fast rhythm could be used to enhance user performance if target locating speed is essential; otherwise slow rhythm could deliver a more comfortable user experience.

Nonvisual displays could be very useful for tourism AR/MR applications and could encourage a more exploratory and playful tourism experience. Future work will include investigation of other design factors and field studies in real tourism scenarios.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

Gang Ren’s research is partly supported by Natural Science Fund Project of Fujian Province (no. 2017J01784), Social Sciences Planning Projects of Fujian Province (no. FJ2016C095), and Xiamen Overseas Scholar Project (no. XRS2016 314-10). Eamonn O’Neill’s research is partly supported by CAMERA, the RCUK Centre for the Analysis of Motion, Entertainment Research and Applications (EP/M023281/1).