Abstract

Electronic travel aids (ETAs) have been in focus since technology allowed designing relatively small, light, and mobile devices for assisting the visually impaired. Since visually impaired persons rely on spatial audio cues as their primary sense of orientation, providing an accurate virtual auditory representation of the environment is essential. This paper gives an overview of the current state of spatial audio technologies that can be incorporated in ETAs, with a focus on user requirements. Most currently available ETAs either fail to address user requirements or underestimate the potential of spatial sound itself, which may explain, among other reasons, why no single ETA has gained widespread acceptance in the blind community. We believe there is ample space for applying the technologies presented in this paper, with the aim of progressively bridging the gap between accessibility and accuracy of spatial audio in ETAs.

1. Introduction

Spatial audio rendering techniques have various application areas ranging from personal entertainment, through teleconferencing systems, to real-time aviation environments [1]. They are also used in health care, for instance, in motor rehabilitation systems [2], electronic travel aids (ETAs, i.e., devices which aid in independent mobility through obstacle detection or help in orientation and navigation) [3], and other assistive technologies for visually impaired persons [4].

In the case of ETAs, the hardware has to be portable, lightweight, and user-friendly, allow for real-time operation, and be able to support long-term operation. All these requirements pose a challenge to designers and developers, one for which state-of-the-art technology comes in handy in the form of high-tech mobile devices, smartphones, and so on. Furthermore, if ETAs are designed for the visually impaired (the term Electronic Travel Aid was coined and is almost exclusively used to describe systems developed to help visually impaired persons navigate their surroundings safely and efficiently; nevertheless, visually impaired persons are not strictly the only group who might benefit from ETAs: for instance, nonvisual interaction focused on navigation is of interest to firefighters operating in smoke-filled buildings [5]), even more aspects have to be considered. Beyond the aforementioned, the devices should have a special user interface as well as alternative input and output solutions, where feedback in the form of sound can enhance the functionality of the device. Most developments of ETAs for the visually impaired aim at safety during navigation, such as avoiding obstacles, recognizing objects, and extending the auditory information by spatial cues [6, 7]. Since visually impaired persons rely on spatial audio cues as their primary sense of orientation [8], providing them with an accurate virtual auditory representation of the environment is essential.

ETAs evolved considerably over the past years, and a variety of virtual auditory displays [9] were proposed, using different spatial sound techniques and sonification approaches, as well as basic auditory icons, earcons, and speech [10]. Available ETAs for the visually impaired provide various information that ranges from simple obstacle detection with a single range-finding sensor, to more advanced feedback employing data generated from visual representations of the scenes, acquired through camera technologies. The auditory outputs of such systems range from simple binary alerts indicating the presence of an obstacle in the range of a sensor, to complex spatial sound patterns aiming at sensory substitution and carrying almost as much information as a graphical image [7, 11].

A division can also be made between local mobility aids (environmental imagers or obstacle detectors, with visual or ranging sensors) that present only the nearest surroundings to the blind traveler and navigation aids (usually GPS- or beacon-based) that provide information on path waypoints [12] or geographical points of interest [13]. While the latter group focuses on directions towards the next waypoint, meaning that a limited spatial sound rendering could be used (e.g., just presenting sounds in the horizontal plane) [14], the former group primarily provides information on obstacles (or the lack of them) and near scene layouts (e.g., walls and shorelines), supporting an accurate spatial representation of the scene [6].

Nevertheless, most of these systems are still in their infancy and at a prototype stage. Moreover, no single electronic assistive device has gained widespread acceptance in the blind community, for several reasons: limited functionality, poor ergonomics, modest scientific/technological value, limited end-user involvement, high cost, and a potential lack of commercial/corporate interest in pushing high-quality electronic travel aids [3].

While many excellent recent reviews on ETA solutions are available (see, e.g., [3, 4, 6, 7]), to our knowledge none of these works critically discusses or analyzes in depth the important aspect of spatial audio delivery. This paper gives an overview of existing solutions for delivering spatial sound, focusing on wearable technologies suitable for use in electronic travel aids for the visually impaired. The analysis reported in this paper indicates a significant potential to achieve accurate spatial sound rendering through state-of-the-art audio playback devices suitable for visually impaired persons and advances in the customization of virtual auditory displays. This review was carried out within the European Horizon 2020 project Sound of Vision (http://www.soundofvision.net), which focuses on creating an ETA for the blind that translates 3D environment models, acquired in real time, into corresponding real-time auditory and haptic representations [15].

The remainder of the paper is organized as follows. Section 2 reviews the basics of 3D sound localization, with a final focus on blind localization. Section 3 introduces the available state-of-the-art software solutions for customized binaural sound rendering, while Section 4 presents the available state-of-the-art hardware solutions suitable for the visually impaired. Finally, in Section 5 we discuss current uses and future perspectives of spatial audio in ETAs.

2. Basics of 3D Sound Localization

Localizing a sound source means determining the location of the sound’s point of origin in the three-dimensional sound space [16]. Location is defined according to a head-related coordinate system, for instance, the interaural polar system. In the interaural polar coordinate system the origin coincides with the interaural midpoint; the azimuth angle θ ranges from −90° at the left ear to +90° at the right ear, while the elevation angle ϕ sweeps the circle around the interaural axis, with negative values below the horizontal plane and positive values above. The third dimension, distance r, is the Euclidean distance between the sound source and the origin. In the following we will refer to the three planes that divide the head into halves as the horizontal plane (upper/lower halves), the median plane (left/right halves), and the frontal plane (front/back halves).
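
As an illustration of this coordinate convention, the following sketch converts interaural polar coordinates to Cartesian coordinates. The axis assignment (x towards the right ear, y towards the front, z upwards) is an illustrative assumption, not a convention prescribed in this paper.

import numpy as np

def interaural_polar_to_cartesian(azimuth_deg, elevation_deg, distance_m):
    """Convert interaural polar coordinates to Cartesian coordinates.

    Assumed axes (illustrative convention): x points towards the right ear,
    y towards the front of the head, z upwards; the origin is the
    interaural midpoint."""
    theta = np.radians(azimuth_deg)    # -90 deg (left ear) ... +90 deg (right ear)
    phi = np.radians(elevation_deg)    # rotation around the interaural axis
    x = distance_m * np.sin(theta)                 # lateral displacement
    y = distance_m * np.cos(theta) * np.cos(phi)   # front/back
    z = distance_m * np.cos(theta) * np.sin(phi)   # up/down
    return x, y, z

# Example: a source 1.5 m away, 30 deg towards the right ear, 20 deg above the horizon
print(interaural_polar_to_cartesian(30.0, 20.0, 1.5))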

Spatial cues for sound localization can be categorized according to polar coordinates. As a matter of fact, each coordinate is thought to have one or more dominant cues in a certain frequency range associated with a specific body component, in particular the following:
(i) Azimuth and distance cues at all frequencies are associated with the head.
(ii) Elevation cues at high frequencies are associated with the pinnae.
(iii) Elevation cues at low frequencies are associated with the torso and shoulders.

Based on well-known concepts and results, the most relevant cues for sound localization are now discussed [17].

2.1. Azimuth Cues

At the beginning of the twentieth century, Lord Rayleigh studied the means through which a listener is able to discriminate, at a first level, the horizontal direction of an incoming sound wave. Following his Duplex Theory of Localization [18], azimuth cues can be reduced to two basic quantities thanks to the active role of the head in the differentiation of incoming sound waves, namely, the following:
(i) Interaural Time Difference (ITD), defined as the temporal delay between the sound waves arriving at the two ears.
(ii) Interaural Level Difference (ILD), defined as the ratio between the instantaneous amplitudes of the same two sounds.

ITD is known to be frequency-independent below approximately 500 Hz and above approximately 3 kHz, with the low-frequency ITD being roughly 3/2 of the high-frequency ITD, and slightly variable at middle-range frequencies [19]. Conversely, frequency-dependent shadowing and diffraction effects introduced by the human head cause ILD to depend greatly on frequency.

Consider a low-frequency sinusoidal signal (up to approximately 1.5 kHz). Since its wavelength is greater than the head dimensions, ITD is no more than a phase lag between the signals arriving at the ears and is therefore a reliable cue for horizontal perception in the low-frequency range [16]. Conversely, the considerable shielding effect of the human head on high-frequency waves (above approximately 1.5 kHz) makes ILD the most relevant cue in that spectral range.
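
A common back-of-the-envelope model of how ITD depends on azimuth is Woodworth's spherical-head formula, ITD(θ) ≈ (a/c)(θ + sin θ). The sketch below evaluates it for a nominal head radius; both the formula and the radius value are textbook approximations used here only for illustration.

import numpy as np

def woodworth_itd(azimuth_deg, head_radius_m=0.0875, c=343.2):
    """Approximate far-field ITD (seconds) for a rigid spherical head
    using Woodworth's formula: ITD = (a / c) * (theta + sin(theta)),
    with theta the azimuth (radians) away from the median plane."""
    theta = np.radians(np.abs(azimuth_deg))
    return head_radius_m / c * (theta + np.sin(theta))

for az in (0, 30, 60, 90):
    print(f"azimuth {az:3d} deg -> ITD = {woodworth_itd(az) * 1e6:6.1f} us")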

Still, the information provided by ITD and ILD can be ambiguous. If one assumes a spherical geometry of the human head, a sound source located in front of the listener and a second one located at the mirror-image position behind the frontal plane provide in theory identical ITD and ILD values. In practice, ITD and ILD will not be identical at these two positions because the human head is clearly not spherical, and all subjects exhibit slight asymmetries with respect to the median plane. Nonetheless their values will be very similar, and front-back confusion is indeed often observed experimentally [20]: listeners erroneously locate sources at the rear instead of the front (or, less frequently, vice versa).

2.2. Elevation Cues

Directional hearing in the median vertical plane is known to have lower resolution than that in the horizontal plane [21]. For the record, the smallest change of position of a sound source producing a just-noticeable change of position of the auditory event (known as “localization blur”) along the median plane was found to be never less than approximately 4°, reaching a much larger threshold (≈17°) for unfamiliar speech sounds, as opposed to a localization blur of approximately 1° in the frontal part of the horizontal plane for a vast class of sounds [16]. Such poor resolution is due to
(i) the need for high-frequency content (above 4-5 kHz) for accurate vertical localization [22, 23];
(ii) mild interaural differences between the signals arriving at the left and right ear for sources in the median plane.

If a source is located outside the horizontal plane, ITD- and ILD-based localization becomes problematic. As a matter of fact, sound sources located at all possible points of a conic surface pointing towards the ear of a spherical head produce the same ITD and ILD values. These surfaces, which generalize the aforementioned concept of front-back confusion for elevation angles, are known as cones of confusion and represent a potential difficulty for accurate perception of sound direction.

Nonetheless, it is undisputed that vertical localization ability is provided mainly by the pinnae [24]. Even though localization in any plane involves the pinna cavities of both ears [25], determination of the perceived vertical angle of a sound source in the median plane is essentially a monaural process [26]. The external ear plays an important role by introducing peaks and notches in the high-frequency spectrum of the incoming sound, whose center frequency, amplitude, and bandwidth depend greatly on the elevation angle of the sound source [27, 28], to a far lesser extent on azimuth [29], and are almost independent of the distance between source and listener beyond a few centimeters from the ear [30, 31]. Such spectral effects are physically due to reflections on pinna edges as well as resonances and diffraction inside pinna cavities [26, 29, 32].

In general, both pinna peaks and notches are thought to play an important role in vertical localization of a sound source [33, 34]. Contrary to notches, peaks alone are not sufficient vertical localization cues [35]; however, the addition of spectral peaks improves localization performance for upper directions with respect to notches alone [36]. It is also generally considered that a sound source has to contain substantial energy in the high-frequency range for an accurate judgement of elevation, because wavelengths significantly longer than the size of the pinna are not affected. Since wavelength λ and frequency f are related as λ = c/f (here c is the speed of sound, typically c = 343.2 m/s in dry air at 20°C), we could roughly state that pinnae have relatively little effect below f = 3 kHz, corresponding to an acoustic wavelength of λ ≈ 11 cm.

While the role of the pinna in vertical localization has been extensively studied, the role of torso and shoulders is less understood. Their effects are relatively weak if compared to those due to the head and pinnae, and experiments to establish the perceptual importance of the relative cues have produced mixed results in general [23, 37, 38]. Shoulders disturb incident sound waves at frequencies lower than those affected by the pinna by providing a major additional reflection, whose delay is proportional to the distance from the ear to the shoulder when the sound source is directly above the listener. Complementarily, the torso introduces a shadowing effect for sound waves coming from below. Torso and shoulders are also commonly seen to perturb low-frequency ITD, even though it is questionable whether they may help in resolving localization ambiguities on a cone of confusion [39].

However, as Algazi et al. remarked [38], when a signal is low-passed below 3 kHz, elevation judgement in the median plane is very poor compared to a broadband source but progressively improves as the source is moved away from the median plane, with performance being more accurate in the back than in the front. This result suggests the existence of low-frequency cues for elevation that, although weak overall, are significant away from the median plane.

2.3. Distance and Dynamic Cues

Distance estimation of a sound source (see [40] for a comprehensive review on the topic) is even more troublesome than elevation perception. At a first level, when no other cue is available, sound intensity is the first variable taken into account: the weaker the intensity, the farther the source is perceived to be. Under anechoic conditions, sound intensity reduction with increasing distance can be predicted through the inverse square law: the intensity of an omnidirectional sound source decays by approximately 6 dB for each doubling of distance [41]. Still, a distant blast and a whisper a few centimeters from the ear could produce the same sound pressure level at the eardrum. A certain familiarity with the involved sound is thus a second fundamental requirement [42].
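
As a worked example of the inverse square law mentioned above, the short sketch below computes the free-field level drop of an omnidirectional source relative to a reference distance; the 1 m reference is an arbitrary illustrative choice.

import math

def level_drop_db(distance_m, ref_distance_m=1.0):
    """Free-field level attenuation (dB) of an omnidirectional source
    relative to a reference distance, following the inverse square law."""
    return 20.0 * math.log10(distance_m / ref_distance_m)

for d in (1.0, 2.0, 4.0, 8.0):
    print(f"{d:4.1f} m -> {level_drop_db(d):5.1f} dB")  # roughly 6 dB per doubling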

However, the apparent distance of a sound source is systematically underestimated in an anechoic environment [43]. On the other hand, if the environment is reverberant, additional information can be given by the direct to reflected energy ratio, or DRR, which functions as a stronger cue for distance than intensity: a sensation of changing distance occurs if the overall intensity is constant but the DRR is altered [41]. Furthermore, distance-dependent spectral effects also have a role in everyday environments: higher frequencies are increasingly attenuated with distance due to air absorption effects.

Literature on source direction perception is generally based on a fundamental assumption, namely, that the sound source is sufficiently far from the listener. In particular, the previously discussed azimuth and elevation cues are distance-independent when the source is in the so-called far field (approximately more than 1.5 m from the center of the head), where sound waves reaching the listener can be assumed to be planar. On the other hand, when the source is in the near field some of the previously discussed cues exhibit a clear dependence on distance. As the sound source gradually approaches the listener’s head in the near field, low-frequency gain is emphasized, ITD slightly increases, and ILD dramatically increases across the whole spectrum for lateral sources [20, 30, 44]. The following conclusions were drawn:
(i) Elevation-dependent features are not correlated with distance-dependent features.
(ii) ITD is roughly independent of distance even when the source is close.
(iii) Low-frequency ILDs are the dominant auditory distance cues in the near field.

It should be then clear that ILD-related information needs to be considered in the near field, where dependence on distance cannot be approximated by a simple inverse square law.

Finally, it has to be remarked that, when switching from a static to a dynamic environment where the source and/or the listener move with respect to each other, both source direction and distance perception improve. The tendency to turn the head towards the sound source in order to minimize interaural differences, even without visual aid, is commonly observed and helps disambiguate front/back confusion [45]. Active motion helps especially in azimuth estimation and, to a lesser extent, in elevation estimation [46]. Furthermore, thanks to the motion parallax effect, slight translations of the listener’s head in the horizontal plane can help discriminate source distance [47, 48]: if the source is near, its angular direction will change drastically after the translation (reflected in the interaural differences), while for a distant source this will not happen.

2.4. Sound Source Externalization

Real sound sources are typically externalized, that is, perceived as located outside one’s own head. However, when virtual 3D sound sources are presented through headphones (see the next section), in-the-head localization may occur and have a major impact on localization ability. Alternatively, listeners may perceive the direction of the sound source and be able to make accurate localization judgements, yet accompanied by the perception of the source being much closer to the head than intended (e.g., on the surface of the skull [49]). However, when relevant constraints are taken into account, such as the use of individually measured head-related transfer functions as explained in Section 3, virtual sound sources can be externalized almost as efficiently as real sound sources [50, 51]. Externalization is, along with other attributes such as coloration, immersion, and realism, one of the key perceptual attributes recently proposed for the evaluation of virtually rendered sound sources that go beyond the basic issue of localization [52].

In-the-head localization is mainly introduced by the loss of accuracy in interaural level differences and spectral profiles in virtually rendered sound sources [49]. Another extremely important factor lies in the interaural and spectral changes triggered by natural head movements in real-life situations: correctly tracked head movements can indeed substantially enhance externalization in virtual sonic environments, especially for sources close to the median plane (hardest to externalize statically in anechoic conditions, due to minimal interaural differences [53]), and even relatively small movements of a few degrees can efficiently reduce in-the-head localization [54]. Furthermore, it has recently been shown that externalization can persist once coherent head movement with the virtual auditory space is stopped [55].

Finally, factors related to sound reverberation contribute to a strong sense of externalization, as opposed to dry anechoic sound. The introduction of artificial reverberation [56] through image-source model-based early reflections, wall and air absorption, and late reverberation can significantly contribute to sound image externalization in headphone-based 3D audio systems [57], as well as congruence between the real listening room and the virtually recreated reverberating environment [58].

2.5. Auditory Localization by the Visually Impaired

A number of previous studies showed that sound source localization by visually impaired persons can differ from that of sighted persons. It has first to be highlighted that previous investigations on visually impaired subjects indicated neither better auditory sensitivity [59–61] nor lower auditory hearing thresholds [62] compared to normally sighted subjects. On the other hand, visually impaired subjects acquire the ability to use auditory information more efficiently thanks to the plasticity of the central nervous system, as, for instance, in speech discrimination [63], temporal resolution [64], or spatial tuning [65].

Experiments with real sound sources suggest that visually impaired (especially early blind) subjects map the auditory environment with equal or better accuracy than sighted subjects in the horizontal plane [62, 66–68] but are less accurate in detecting elevation [67] and show an overly compressed auditory distance perception beyond the near field [69]. However, unlike sighted subjects, visually impaired subjects can correctly localize sounds monaurally [66, 70], which suggests a trade-off in localization proficiency between the horizontal and median planes [71]. By comparing behavioral and electrophysiological indices of spatial tuning within the central and peripheral auditory space in congenitally blind and normally sighted but blindfolded adults, it was found that blind participants displayed localization abilities superior to those of sighted controls, but only when attending to sounds in the peripheral auditory space [72]. Still, it has to be taken into account that early blind subjects have no possibility of learning the mapping between auditory events and visual stimuli [73].

While localizing, adapting to the coloration of the signals is a relevant component for both sighted and blind subjects. The improved obstacle sense of the blind is also mainly due to an enhanced sensitivity to echo cues [74], which allows for so-called echolocation [75, 76]. Thanks to this obstacle sensing ability, which can be improved by training, distance perception in blind subjects may be enhanced [68, 76–78]. In addition, some blind subjects are able to determine the size, shape, or even texture of obstacles based on auditory cues [70, 77, 79, 80].

Switching to virtual auditory displays, that is, the focus of this paper, a detailed comparative evaluation of blind and sighted subjects [81] confirmed some of the previously discussed results in the literature on localization with real sound sources. Better performance in localizing static frontal sources was obtained in the blind group, owing to a decreased number of front-back reversals. In the case of moving sources, blind subjects were more accurate in determining movements around the head in the horizontal plane. Sighted participants, however, performed better when listening to ascending movements in the median plane and in identifying sound sources in the back. In-the-head localization rates and the ability to detect descending movements were almost identical for the two groups. In a further experiment [82], horizontal and vertical localization errors were measured for a pool of blind subjects; improvements in localization by blind persons were observed mainly in the horizontal plane and in the case of a broadband stimulus.

Finally, although visual information corresponding to auditory information significantly aids localization and the creation of correct spatial mental mappings, it has to be remarked that visually impaired subjects can benefit from off-site representations in order to gain spatial knowledge of a real environment. For instance, results of recent studies showed that interactive exploration of virtual acoustic spaces [83–85] and audio-tactile maps [86] can provide relevant information for the construction of coherent spatial mental maps of a real environment in blind subjects and that such mental representations preserve topological and metric properties, with performance comparable or even superior to an actual navigation experience.

3. Binaural Technique

The most basic method for simulating sound source direction over loudspeakers is panning, which usually refers to amplitude panning using two channels (stereo panning). In this case, only level information is used, as a balance between the channels, and the virtual source is shifted towards the louder channel; ILD and spectral cues are determined by the actual speaker locations. In traditional stereo setups, where the loudspeakers and the listener form a triangle, sources can be correctly simulated on the line ideally connecting the two speakers. However, although traditional headphones also use two channels, correct directional information is not maintained, due to the different arrangement of the speakers with respect to the listener and to the loss of crosstalk between the channels.
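
To make the amplitude panning idea concrete, the sketch below computes constant-power gains for a stereo pair from a pan position; the sine/cosine panning law is one common textbook choice among several, used here only as an illustration.

import numpy as np

def constant_power_pan(pan):
    """Constant-power stereo panning gains.

    pan = -1.0 (hard left) ... 0.0 (center) ... +1.0 (hard right).
    Returns (left_gain, right_gain) with left**2 + right**2 == 1."""
    angle = (pan + 1.0) * np.pi / 4.0        # map [-1, 1] to [0, pi/2]
    return np.cos(angle), np.sin(angle)

left, right = constant_power_pan(0.5)        # source halfway towards the right
print(left, right, left**2 + right**2)       # power sums to 1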

Spatial features of virtual sound sources can be more realistically rendered through headphones by processing an input sound with a pair of filters, each simulating all the linear transformations undergone by the acoustic signal during its path from the sound source to the corresponding listener’s eardrum. These filters are known in the literature as head-related transfer functions (HRTFs) [87], formally defined as the frequency-dependent ratio between the sound pressure level (SPL) at the eardrum and the free-field SPL at the center of the head as if the listener were absent:

H(θ, ϕ, ω) = P_E(θ, ϕ, ω) / P_C(ω),

where (θ, ϕ) indicates the angular position of the source relative to the listener, ω is the angular frequency, P_E is the SPL at the eardrum, and P_C is the free-field SPL at the center of the head. The HRTF contains all of the information relative to sound transformations caused by the human body, in particular by the head, external ears, torso, and shoulders.

HRTF measurements are typically conducted in large anechoic rooms. Usually, a set of loudspeakers is arranged around the subject, pointing towards him/her and spanning an imaginary spherical surface. The listener is positioned so that the center of the interaural axis coincides with the center of the sphere defined by the loudspeakers and their rotation (or, equivalently, the subject’s rotation). A probe microphone is inserted into each ear, either at the entrance or inside the ear canal. The measurement technique consists in recording and storing the signal arriving at the microphones. Consequently, these signals are processed in order to remove the effects of the room and the recording equipment (especially speakers and microphones), leaving only the HRTF [87, 88].

By processing a desired monophonic sound signal with a pair of individual HRTFs, one per channel, and by adequately accounting for headphone-induced spectral coloration (see Section 4), authentic 3D sound experiences can take place. Virtual sound sources created with individual HRTFs can be localized almost as accurately as real sources and efficiently externalized [50], provided that head movements can be made and that the sound is sufficiently long [89]. As a matter of fact, localization of short broadband sounds without head movements is less accurate for virtual sources than for real sources, especially with regard to vertical localization accuracy [90], and front/back reversal rates are higher for virtual sources [89].
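
In practice, HRTF-based rendering amounts to convolving the mono signal with the head-related impulse responses (HRIRs) for the desired direction. The sketch below illustrates this with scipy; the HRIRs used here are placeholder arrays standing in for measured or modeled responses.

import numpy as np
from scipy.signal import fftconvolve

def render_binaural(mono, hrir_left, hrir_right):
    """Render a mono signal to binaural stereo by convolving it with an
    HRIR pair (left/right) for the desired source direction."""
    left = fftconvolve(mono, hrir_left, mode="full")
    right = fftconvolve(mono, hrir_right, mode="full")
    return np.stack([left, right], axis=0)

fs = 44100
mono = np.random.randn(fs)                        # 1 s of noise as a test signal
hrir_left = np.zeros(256); hrir_left[0] = 1.0     # placeholder HRIRs
hrir_right = np.zeros(256); hrir_right[20] = 0.5  # (delayed, attenuated right ear)
binaural = render_binaural(mono, hrir_left, hrir_right)
print(binaural.shape)                             # (2, 44100 + 255)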

Unfortunately, the individual HRTF measurement technique requires the use of dedicated research facilities. Furthermore, the process can take up to several hours, depending on the measurement system used and on the desired spatial grid density, and is uncomfortable and tedious for subjects. As a consequence, most practical applications use nonindividual (or generic) HRTFs, for instance, measured on dummy heads, that is, mannequins constructed from average anthropometric measurements. Several generic HRTF sets are available online. The most popular are based on measurements using the KEMAR mannequin [91] or the Neumann KU-100 dummy head (see the Club Fritz study [92]). Alternatively, an HRTF set can be taken from one of many public databases of individual measurements (see, e.g., [93]); many of these databases were recently unified in a common HRTF format known as Spatially Oriented Format for Acoustics (SOFA) (https://www.sofaconventions.org/).
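
As an example of working with such databases, the sketch below reads an HRIR set from a SOFA file using the netCDF4 library (SOFA files are netCDF containers). The variable names follow the SimpleFreeFieldHRIR convention, and the file path is a placeholder; this is a sketch, not a full SOFA reader.

import numpy as np
from netCDF4 import Dataset

def load_sofa_hrirs(path):
    """Read HRIRs and source positions from a SOFA file (netCDF container).

    Variable names follow the SimpleFreeFieldHRIR convention:
    Data.IR has shape (measurements, receivers, samples)."""
    with Dataset(path, "r") as sofa:
        hrirs = np.array(sofa.variables["Data.IR"][:])
        fs = float(np.array(sofa.variables["Data.SamplingRate"][:]).squeeze())
        positions = np.array(sofa.variables["SourcePosition"][:])  # azimuth, elevation, distance
    return hrirs, fs, positions

# hrirs, fs, positions = load_sofa_hrirs("subject_003.sofa")  # placeholder path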

On the other hand, while nonindividual HRTFs represent the cheapest means of providing 3D perception in headphone reproduction, especially in the horizontal plane [94, 95], listening to nonindividual spatial sounds is more likely to result in evident sound localization errors such as incorrect perception of source elevation, front-back reversals, and lack of externalization [96] that cannot be fully counterbalanced by additional spectral cues, especially in static conditions [46]. In particular, individual elevation cues cannot be characterized through generic spectral features.

For the above reasons, different alternative approaches to HRTF-based synthesis have been proposed throughout the last decades [37, 97]. These are reviewed below, ordered by increasing level of customization.

3.1. HRTF Selection Techniques

HRTF selection techniques typically use specific criteria in order to choose the best HRTF set for a particular user from a database. Seeber and Fastl [98] proposed a procedure according to which one HRTF set is selected based on multiple criteria such as spatial perception, directional impression, and externalization. Zotkin et al. [99] selected the HRTF set that best matched an anthropometric data vector of the pinna. Geronazzo et al. [100] and Iida et al. [101] selected the HRTF set whose extracted pinna notch frequencies were closest to the hypothesized frequencies of the user according to a reflection model and an anthropometric regression model, respectively.
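
A minimal sketch of anthropometry-based selection in the spirit of Zotkin et al. [99]: normalize a set of pinna measurements and pick the database subject whose measurements are closest to those of the target listener. The specific measurements, units, and equal weighting are illustrative assumptions.

import numpy as np

def select_hrtf_by_anthropometry(target, database):
    """Pick the database entry whose anthropometric vector is closest
    (in normalized Euclidean distance) to the target listener.

    target:   1-D array of pinna measurements (e.g., cavum concha height,
              pinna height, pinna width), all in the same units.
    database: 2-D array, one row of the same measurements per subject."""
    database = np.asarray(database, dtype=float)
    target = np.asarray(target, dtype=float)
    mean, std = database.mean(axis=0), database.std(axis=0) + 1e-12
    d = np.linalg.norm((database - mean) / std - (target - mean) / std, axis=1)
    return int(np.argmin(d))

# Example with made-up measurements (cm) for three database subjects
db = [[1.9, 6.3, 2.9], [1.7, 6.0, 3.1], [2.1, 6.6, 2.8]]
print(select_hrtf_by_anthropometry([1.8, 6.1, 3.0], db))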

Similarly, selection can be targeted at detecting a subset of HRTFs in a database that fits the majority of a pool of listeners. Such an approach was pursued, for example, by So et al. [102] through cluster analysis and by Katz and Parseihian [103] through subjective ratings. The choice of the personal best HRTF among this reduced set is left to the user. Different selection approaches were instead undertaken by Hwang et al. [104] and Shin and Park [105], who modeled median-plane HRIRs as linear combinations of basis functions whose weights were then interactively self-tuned by the listeners themselves.

Results of localization tests included in the majority of these works show a general decrease of the average localization error as well as of the front/back reversal and inside-the-head localization rates using selected HRTFs rather than generic HRTFs.

3.2. Analytical Solutions

These methods try to find a mathematical solution for the HRTF, taking into account the size and shape of the head and torso in particular. The most recurring head model in the literature is that of a rigid sphere, where the response related to a fixed observation point on the sphere’s surface can be described by means of an analytical transfer function [106]. Brown and Duda [37] proposed a first-order approximation of this transfer function for sources in the far-field as a minimum-phase analog filter. Near-field distance dependence can be accounted for through an additional filter structure [107].
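
A sketch of a first-order head-shadow approximation along the lines of Brown and Duda [37]: a single pole-zero frequency response whose high-frequency gain depends on the angle between the source and the ear. The parameter values (head radius, the 0.1 to 2 gain range, the 150° width) are those commonly quoted for this class of models and should be treated as illustrative.

import numpy as np

def head_shadow_response(f_hz, theta_ear_deg, a=0.0875, c=343.2):
    """First-order spherical head-shadow approximation (Brown-Duda style).

    H(w) = (1 + j*alpha*w / (2*w0)) / (1 + j*w / (2*w0)), with w0 = c / a.
    theta_ear_deg is the angle between the source direction and the ear axis;
    alpha(theta) sweeps from about 2 (ipsilateral boost) to about 0.1 (deep shadow)."""
    w = 2.0 * np.pi * np.asarray(f_hz, dtype=float)
    w0 = c / a
    alpha = 1.05 + 0.95 * np.cos(np.radians(theta_ear_deg) * 180.0 / 150.0)
    return (1.0 + 1j * alpha * w / (2.0 * w0)) / (1.0 + 1j * w / (2.0 * w0))

freqs = np.array([200.0, 1000.0, 5000.0, 10000.0])
print(np.abs(head_shadow_response(freqs, 0)))     # towards the ear: high-frequency boost
print(np.abs(head_shadow_response(freqs, 150)))   # far side of the head: high-frequency shadow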

Although the spherical head model provides a satisfactory approximation to the low-frequency magnitude of a measured HRTF [108], it is far less accurate in predicting ITD, which is actually variable around a cone of confusion by as much as 18% of the maximum interaural delay [109]. ITD estimation accuracy can be improved by considering an ellipsoidal head model that can account for the ITD variation and be adapted to individual listeners [110]. It has to be highlighted, however, that ITD estimation from HRTFs is a nontrivial operation, given the large variability of objective and perceptual ITD results produced by different common calculation methods for the same HRTF dataset [111, 112].

A spherical model can also approximate the contribution of the torso to the HRTF. Coaxial superposition of two spheres of different radii, separated by a distance accounting for the neck, results in the snowman model [113]. The far-field behavior of the snowman model was studied in the frontal plane both by direct measurements on two rigid spheres and by computation through multipole reexpansion [114]. A filter model was also derived from the snowman model [113]; its structure distinguishes the two cases where the torso acts as a reflector or as a shadower, switching between the two filter substructures as soon as the source enters or leaves the torso shadow zone, respectively. Additionally, an ellipsoidal model for the torso was studied in combination with the usual spherical head [38]. Such a model is able to account for different torso reflection patterns; listening tests confirmed that this approximation and the corresponding measured HRTF gave similar results, with larger correlations away from the median plane.

A drawback of these techniques is that since they do not consider the contribution of the pinna, the generated HRTFs match measured HRTFs at low frequencies only, lacking spectral features at higher frequencies [115].

3.3. Structural HRTF Models

According to the structural modeling approach, the contributions to the HRTF of the user’s head, pinnae, torso, and shoulders, each accounting for some well-defined physical phenomena, are treated separately and modeled with a corresponding filtering element [37]. The global HRTF model is then constructed by combining all the considered effects [116]. Structural modeling opens to an interesting form of content adaptation to the user’s anthropometry, since parameters of the rendering blocks can be estimated from physical data, fitted, and finally related to anthropometric measurements.

Structural models typically assume a spherical or ellipsoidal geometry for both the head and torso, as discussed in the previous subsection. Effective customizations of the spherical head radius given the head dimensions were proposed [117, 118], resulting in a close agreement with experimental ITDs and ILDs, respectively. Alternatively, ITD can be synthesized separately using individual morphological data [119]. An ellipsoidal torso can also be easily customized for a specific subject by directly defining control points for its three axes on the subject’s torso [114]. Furthermore, a great variety of pinna models is available in the literature, ranging from simple reflection models [120] and geometric models [121] to more complex physical models that treat the pinna either as a configuration of cavities [122] or as a reflecting surface [29]. Structural models of the pinna, simulating its resonant and reflective behaviors in two separate filter blocks, were also proposed [123–125].
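
A minimal sketch of the filtering element of such a pinna block: one resonance (a peaking equalizer) followed by one reflection-related notch, realized as second-order IIR sections. The center frequencies, gains, and Q values are placeholders; in an actual structural model they would be driven by source elevation and by the listener's pinna anthropometry.

import numpy as np
from scipy.signal import iirnotch, lfilter

def peaking_eq(f0, gain_db, q, fs):
    """Second-order peaking equalizer (RBJ audio-EQ-cookbook form),
    used here as a crude pinna resonance."""
    A = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
    a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
    return b / a[0], a / a[0]

def pinna_block(x, fs, peak_hz=4000.0, notch_hz=7000.0):
    """Crude structural pinna block: one resonance followed by one notch."""
    b_p, a_p = peaking_eq(peak_hz, gain_db=6.0, q=3.0, fs=fs)
    b_n, a_n = iirnotch(notch_hz, Q=10.0, fs=fs)
    return lfilter(b_n, a_n, lfilter(b_p, a_p, x))

fs = 44100
y = pinna_block(np.random.randn(fs), fs)   # filter one second of test noise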

Algazi et al. [93] suggested using a number of one-dimensional anthropometric measurements for HRTF fitting through regression methods or other machine learning techniques. This approach was recently pursued in a number of studies [126–129] investigating the correspondence between anthropometric parameters and HRTF shape. When suitable processing is performed on HRTFs, clear relations with anthropometry emerge. For instance, Middlebrooks [130] reported a correlation between pinna size and the center frequencies of HRTF peaks and notches and argued that similarly shaped ears that differ in size just by a scale factor produce similarly shaped HRTFs that are scaled in frequency. Further evidence of the correspondence between pinna shape and HRTF peaks [123, 131, 132] and notches [125, 133, 134] is provided in a number of subsequent works. The use of such knowledge leads to the effective parametrization of structural pinna models based on anthropometric parameters, which suggests an improvement in median plane localization with respect to generic HRTFs [135, 136].

3.4. Numerical HRTF Simulations

Numerical methods typically require as input a 3D mesh of the subject, in particular the head and torso, and include approaches such as finite-difference time domain (FDTD) methods [108], the finite element method (FEM) [137], and the boundary element method (BEM) [138].

Recent literature has focused on the BEM. It is known that high-resolution meshes are needed in order to effectively simulate HRTFs with the BEM, especially for the pinna area. Low mesh resolution indeed results in simulated HRTFs that differ greatly from acoustically measured HRTFs at high frequencies, thus destroying elevation cues [139]. However, as the number of mesh elements grows, memory requirements and computational load grow even faster [140]. Recent works introduced the fast multipole method (FMM) and the reciprocity principle (i.e., interchanging sources and receivers) in order to address BEM efficiency issues [140, 141]. Ultimately, the localization performance of HRTFs simulated through the BEM was found to be similar to that observed with acoustically measured HRTFs [142], and databases of simulated HRTFs [143] as well as open-source tools for calculating HRTFs through the BEM given a head mesh as input [144] are available online.

On the other hand, image-based 3D modeling, based on the reconstruction of 3D geometry from a set of user pictures, is a fast and cost-effective alternative to obtaining mesh models [145]. Furthermore, the advent of consumer level depth cameras and the availability of huge computational power on consumer computers open new perspectives towards very cheap and yet very accurate calculation of individualized HRTFs.

4. Headphone Technologies

One of the crucial variables for generating HRTF-based binaural audio is the headphone itself. Headphones are of different types (e.g., circumaural, supra-aural, extra-aural, and in-ear) and can have transfer functions that are far from linear. The main issue with classic headphones is that the transfer function between headphone and eardrum heavily varies from person to person and with small displacements of the headphone itself [146, 147]. Such variation is particularly marked in the high-frequency range where important elevation cues generally lie. As a consequence, headphone playback introduces significant localization errors, such as in-the-head localization, front-back confusion, and elevation shift [148].

In order to preserve the relevant localization cues provided by HRTF filtering during headphone listening, various headphone equalization techniques, usually based on prefiltering with the inverse of the average headphone transfer function, are used [149]. However, previous research suggests that these techniques are of little to no effectiveness when nonindividual (even selected) HRTFs are used [149, 150]. On the other hand, several authors support the use of individual headphone compensation in order to preserve localization cues in the high-frequency range [146, 147].
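
A sketch of a simple frequency-domain headphone equalizer: a regularized inverse of a (measured or averaged) headphone transfer function. The regularization constant, which prevents excessive boost at deep spectral notches, and the toy response used in the example are illustrative choices.

import numpy as np

def regularized_inverse_filter(hp_response, beta=0.01):
    """Regularized inverse of a headphone transfer function H(f):
    H_inv = conj(H) / (|H|^2 + beta).  beta limits the boost applied
    at deep spectral notches of the headphone response."""
    hp_response = np.asarray(hp_response, dtype=complex)
    return np.conj(hp_response) / (np.abs(hp_response) ** 2 + beta)

# Example: invert a toy, nearly flat headphone response of length N
N = 512
H = np.fft.rfft(np.r_[1.0, np.zeros(N - 1)] + 0.01 * np.random.randn(N))
H_inv = regularized_inverse_filter(H)
eq_ir = np.fft.irfft(H_inv, n=N)       # equalization impulse response
print(eq_ir.shape)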

In the case of travel aids for the visually impaired, additional factors need to be considered in the design and choice of the headphone type. Most importantly, the ears are essential for providing information about the environment, and visually impaired persons refuse to use headphones during navigation if these partially or fully cover the ears, thereby blocking environmental sounds. The results of a survey of the preferences of visually impaired subjects for a possible personal navigation device [151] indeed showed that the majority of participants rated headphones worn over the ears as the least acceptable output device, compared to other technologies such as bone-conduction and small tube-like headphones, or even a single headphone worn over one ear. Furthermore, fully blind participants had much stronger negative feelings about headphones that blocked ambient sounds than those who were partially sighted.

This important consideration shifts our focus to alternative state-of-the-art solutions for spatial audio delivery such as unconventional headphone configurations, bone-conduction headsets, or active transparent headsets.

4.1. Unconventional Headphone Configurations

The problem of ear occlusion can be tackled by decentralizing the point of sound delivery from the entrance of the ear canal to positions around the ear, with one or more transducers per ear. In this case, issues arise regarding the proper direction and distance of each transducer with respect to the ear canal, as well as their types and dimensions. Furthermore, the spatial rendering technique itself is a challenge, in that no research results support the application of traditional loudspeaker-based spatial audio techniques (such as Vector Base Amplitude Panning [152] or Ambisonics [153]) to multispeaker headsets, and traditional HRTF measurements do not match the decentralized speaker positions.
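
For reference, the sketch below computes pairwise 2D VBAP gains as described by Pulkki [152]: the gain vector is obtained by inverting the matrix of the two loudspeaker unit vectors and normalizing to constant power. Whether and how this transfers to transducers placed around the pinna is precisely the open question raised above.

import numpy as np

def vbap_2d_gains(source_az_deg, spk_az_deg_pair):
    """Pairwise 2D VBAP gains: g = p * L^-1, normalized to unit power,
    where the rows of L are the two loudspeaker unit vectors and p is the
    source direction unit vector (all angles in the horizontal plane)."""
    p = np.array([np.cos(np.radians(source_az_deg)),
                  np.sin(np.radians(source_az_deg))])
    L = np.array([[np.cos(np.radians(a)), np.sin(np.radians(a))]
                  for a in spk_az_deg_pair])          # rows: speaker unit vectors
    g = p @ np.linalg.inv(L)
    return g / np.linalg.norm(g)

print(vbap_2d_gains(15.0, (-30.0, 30.0)))   # standard stereo pair, source at 15 deg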

The first attempts at delivering spatial audio through multispeaker headphones were performed by König, who implemented a decentralized 4-channel arrangement placed on a pair of circumaural earcups for frontal surround sound reproduction [154] (an alternative small supra-aural configuration was also proposed [155]). Results showed that this speaker arrangement induces individual direction-dependent pinna cues as they appear under real frontal sound irradiation in the free field for frequencies above 1 kHz [156]. Psychoacoustic evaluations of the headphone showed that frontal auditory events are achieved, as well as effective distance perception [154].

The availability of individual pinna cues at the eardrum is imperative for accurate frontal localization [157]. Accordingly, Sunder et al. [158] later proposed the use of a 2-channel frontal projection headphone which customizes nonindividual HRTFs by introducing idiosyncratic pinna cues. Perceptual experiments validated the effectiveness of frontal headphone playback over conventional headphones with reduced front-back confusions and improved frontal localization. It was also observed that the individual spectral cues created by the frontal projection are self-sufficient for front-back discrimination even with the high-frequency pinna cues removed from the nonindividual HRTF. However, additional transducers are needed if virtual sounds behind the head have to be delivered, and timbre differences with respect to the frontal transducers need to be solved.

Greff and Katz [159] extended the above solutions to a multiple-transducer array placed around each ear (8 speakers per ear) recreating the pinna-related component of the HRTF. Simulations and subjective evaluations showed that it is possible to excite the correct localization cues provided by the diffraction of the reconstructed wave front on the listener’s own pinnae, using transducer driving filters derived from a simple spherical head model. Furthermore, different speaker configurations were investigated in a preliminary localization test, with the configuration placing transducers at grazing incidence all around the pinna showing the best results in terms of vertical localization accuracy and front/back confusion rate.

Recently, Bujacz et al. [160] proposed a custom headphone solution for a prospective ETA with four proximaural speakers positioned above and below the ears, all slightly to the front. Amplitude panning was then used as spatial audio technique to shift the power of the output sound between pairs of speakers, both horizontally and vertically. Results of a preliminary localization test showed a localization accuracy comparable to HRTF-based rendering through high-quality circumaural headphones, both in azimuth and in elevation.

4.2. Bone-Conduction Headsets

The use of a binaural bone-conduction headset (also known as bonephones) is an extremely attractive solution for devices intended for the blind as the technology does not significantly interfere with sounds received through the ear canal, allowing for natural perception of environmental sounds. The typical solution is to place vibrational actuators, also referred to as bone-conduction transducers, on each mastoid (the raised portion of the temporal bone located directly behind the ear) or alternatively on the cheek bones just in front of the ears [161]. Pressure waves are sent through the bones in the skull to the cochlea, with some amount of natural sound leakage through air into the ear canals still occurring.

There are some difficulties in using bone conduction for delivering spatial audio. The first is the risk of crosstalk impeding an effective binaural separation: because of the high propagation speed and low attenuation of sound in the human skull, both the ITD and ILD cues are significantly softened. Walker et al. [162] still observed some degree of spatial separation with interaural cues provided through bone conduction, with ear canals either free or occluded, especially relative to ILD. Perceived lateralization is even comparable between air conduction and bone conduction with unoccluded ear canals [163]. However, the degradation relative to standard headphones suggests that it is difficult to produce large enough interaural differences to simulate sound sources at extreme lateral locations [162].

The second problem is the need to introduce additional transfer functions for correct equalization of HRTF-based spatial audio: the frequency response of the transducer [164] and the transfer function to the bones themselves, referred to as bone-conduction adjustment function (BAF) [165], which takes into account high-frequency attenuation by the skin [166] and differs between individuals, similar to HRTFs. Walker et al. [167, 168] proposed the use of appropriate bone-related transfer functions (BRTFs) in replacement of HRTFs. Stanley [165] derived individual BAFs from equal-loudness judgements on pure tones, showing that individual BAF adjustments to HRTF-based spatial sound delivery were effective in restoring the spectral cues altered by the bone-conduction pathway. This allowed for effective localization in the median plane by reducing up/down reversals with respect to the BAF-uncompensated stimuli. However, there is no way to measure BAFs empirically, and it is unclear whether the use of a generic, average BAF could lead to the same conclusions.

MacDonald et al. [164] reported similar localization results in the horizontal plane between bone conduction and air conduction, using individual HRTFs as the virtual auditory display and headphone frequency response compensation. Lindeman et al. [169, 170] compared localization accuracy between bone conduction with unoccluded ear canals and an array of speakers located around the listener. The results showed that although the best accuracy was achieved with the speaker array in the case of stationary sounds, there was no difference in accuracy between the speaker array and the bone-conduction device for sounds that were moving, and that both devices outperformed standard headphones for moving sounds.

Finally, Barde et al. [171] recently investigated the minimum discernable angle difference in the horizontal plane with nonindividual HRTFs over a bone-conduction headset, resulting in an average value of 10°. Interestingly, almost all participants reported actual sound externalization.

4.3. Active Transparent Headsets

An active headset is able to detect and process environmental sounds through analog circuits or digital signal processing. One of the most important fields of application of active headsets is noise reduction, where the headset uses active noise control [172, 173] to reduce unwanted sound by the addition of an antiphase signal to the output sound. In the case of ETAs, the environmental signal should not be canceled but provided back to the listener (hear-through signal) mixed with the virtual auditory display signal in order for the subject to be aware of the surroundings. Binaural hear-through headsets (in-ear headphones with integrated microphones) are typically used in augmented reality audio (ARA) applications [174], where a combination of real and virtual auditory objects in a real environment is needed [175].

The hear-through signal is a processed version of the environmental sound and should produce similar auditory perception to natural perception with unoccluded ears. Thus, equalization is needed to make the headset acoustically transparent, since it affects the acoustic properties of the outer ear [176]. The most important problem here is poor fit on the head causing leaks and attenuation problems. The fit of the headphone affects isolation and frequency response as well. Using internal microphones inside the headset in addition to the external ones, a controlled adaptive equalization can be realized [177].

The second basic requirement for a hear-through system is that processing of the recorded sound should have minimal latency [175]. As a matter of fact, when the real signal (leaked to the eardrum) is summed with the hear-through signal, the delayed version can cause audible comb-filtering effects, especially at lower frequencies where leakage is higher. The audibility of comb-filtering effects depends on both the time and the amplitude difference between the hear-through signal and the leaked signal [178]. Using digital realizations, which are preferable to analog circuits in the case of an ETA in terms of both cost and size, suitable latencies of less than 1.4 ms, for which the comb-filtering effect was found to be inaudible when the attenuation of the headset is 20 dB or more, can be achieved with a DSP board [179].
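
As a quick check of why such low latencies matter, the sketch below evaluates the magnitude of the comb filter that results from summing the leaked signal with a delayed hear-through signal; the 20 dB leakage attenuation and the latency values are taken as illustrative parameters.

import numpy as np

def comb_magnitude_db(f_hz, latency_s, hear_through_gain_db=0.0, leak_gain_db=-20.0):
    """Magnitude (dB) of the sum of a leaked signal and a delayed
    hear-through signal: |g_leak + g_ht * exp(-j*2*pi*f*tau)|."""
    g_leak = 10.0 ** (leak_gain_db / 20.0)
    g_ht = 10.0 ** (hear_through_gain_db / 20.0)
    h = g_leak + g_ht * np.exp(-1j * 2.0 * np.pi * np.asarray(f_hz) * latency_s)
    return 20.0 * np.log10(np.abs(h))

freqs = np.array([100.0, 357.0, 1000.0])       # 357 Hz: first notch for tau = 1.4 ms
for tau_ms in (0.5, 1.4, 5.0):
    print(tau_ms, "ms ->", comb_magnitude_db(freqs, tau_ms * 1e-3))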

Finally, the hear-through signal should preserve localization cues at the ear canal entrance. Since sound transmission from the microphone to the eardrum is independent of direction whether the microphone is inside or at most 6 mm outside the ear canal [180], having binaural microphones just outside the ear canal entrance is sufficient for obtaining the correct listener-dependent spatial information.

5. Spatial Audio in ETAs

Among the multitude of ETAs, two main trends in selecting sound cues can be observed: one is to provide very limited yet easily interpretable data, typically from a range sensor; the other is to provide an overabundance of auditory data and let the user learn to extract useful information from it (e.g., the vOICe [181]). A third approach, taken for instance by the authors in the Sound of Vision project [15], is to limit the data from a full-scene representation to just the most useful information, for example, by segmenting the environment and identifying the nearest obstacles or detecting special dangerous scene elements such as stairs. Surveys show that individual preferences among the blind can vary greatly, and all three approaches have users that prefer them [182].

In a recent literature review, Bujacz and Strumiłło [6] classified the auditory display solutions implemented in the most widely known ETAs, either commercially available or in various stages of research and development. Of the 22 ETAs considered, 12 use a spatial representation of the environment. However, when the list of ETAs is broken down into obstacle detectors (mostly hand-held) and environmental imagers (mostly head-mounted), the ETAs that use a spatial representation almost all belong to the second category. Some of them, such as the vOICe [181], Navbelt [183], SVETA [184], and AudioGuider [185], use stereo panning to represent directions, whereas elevation information is either ignored or coded into sound pitch. ETAs that use HRTFs as the spatial rendering method (including works not covered in the above cited review) are summarized in the next subsection; all of them are laboratory prototypes.
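
To illustrate the simpler end of this design space, the sketch below maps an obstacle's direction and distance to a stereo pan position, a loudness, and a pitch coding of elevation, in the generic spirit of the stereo-panning ETAs listed above; the mapping constants are arbitrary illustrative choices and do not reproduce any cited system.

import numpy as np

def sonify_obstacle(azimuth_deg, elevation_deg, distance_m,
                    max_range_m=5.0, base_freq_hz=440.0):
    """Generic obstacle-to-sound mapping (illustrative, not any specific ETA):
    azimuth -> constant-power stereo pan, distance -> loudness,
    elevation -> pitch shift of a short sine beep."""
    pan = np.clip(azimuth_deg / 90.0, -1.0, 1.0)            # -1 left ... +1 right
    angle = (pan + 1.0) * np.pi / 4.0
    gains = np.array([np.cos(angle), np.sin(angle)])         # left, right
    loudness = np.clip(1.0 - distance_m / max_range_m, 0.0, 1.0)
    freq = base_freq_hz * 2.0 ** (elevation_deg / 90.0)      # one octave per 90 deg

    fs, dur = 44100, 0.2
    t = np.arange(int(fs * dur)) / fs
    beep = np.sin(2.0 * np.pi * freq * t) * loudness
    return gains[:, None] * beep                              # shape (2, samples)

stereo = sonify_obstacle(azimuth_deg=30.0, elevation_deg=10.0, distance_m=2.0)
print(stereo.shape)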

5.1. Available ETAs Using HRTFs

The EAV (Espacio Acustico Virtual) system [186] uses stereoscopic cameras to create a low resolution (16 × 16 × 16) 3D stereopixel map of the environment in front of the user. Each occupied stereopixel becomes a virtual sound source filtered with the user’s individual HRTFs, measured in a reverberating environment. The sonification technique employs spatial audio cues (synthesized with HRTFs) and a distance-to-loudness encoding. Sounds were presented through a pair of individually equalized Sennheiser HD-580 circumaural headphones. Classic localization tests with the above virtual auditory display and tests with multiple sources were performed on 6 blind and 6 normally sighted subjects. Subjects were accurate in identifying the objects’ position and recognizing shapes and dimensions within the limits imposed by the system’s resolution.

The cross-modal ETA device [187] is a wearable prototype that consists of low-cost hardware: earphones (no further information provided), sunglasses fitted with two CMOS micro cameras, and a palm-top computer. The system is able to detect the light spot produced by a laser pointer, compute its angular position and depth, and generate a sound corresponding to the position and distance of the pointed surface. The sonification encoding provides directional auditory cues through Brown and Duda’s structural HRTF model [37] and distance cues through loudness control and reverberation effects. The subjective effectiveness of the sonification technique was evaluated by several volunteers who were asked to use the system and report their opinions. The overall result was satisfactory, with some problems related to the lack of elevation perception: targets that were very high or very low were perceived correctly, whereas those lying in the middle were associated with wrong elevations.

The Personal Guidance System [12] receives information from a GPS receiver and was evaluated in five different configurations involving different types of auditory displays, spatial sound delivery methods (either via classic headphones or through a speaker worn on the shoulder), and tracker locations. No details about the binaural spatialization engine or the headphones used were provided. Fifteen visually impaired subjects traveled the same pathway with each of the configurations. Results showed that the configuration using binaurally spatialized virtual speech led to the shortest travel times and the highest subjective ratings. However, there were many negative comments about the headphones blocking environmental sounds.

The SWAN system [8, 188] aids navigation and guidance through a set of navigation beacons (earcon-like sounds), object-related sounds (provided through spatial auditory icons), location information, and brief prerecorded speech samples. Sounds are updated in real-time by tracking the subject’s orientation and accordingly spatialized through nonindividual HRTFs. Sounds were played either through a pair of Sony MDR-7506 closed-ear headphones or an equalized bone-conduction headset (see [165]). In an experimental procedure, 108 sighted subjects were required to navigate three different maps. Results showed good navigation skills for almost all the participants in both time and path efficiency.

The main idea of the Virtual Reality Simulator for visually impaired people [189] consists in calculating the distance between the user and nearby objects (a depth map) and converting it into sound. The depth map is transformed into a spatial auditory map by using 3D sound cues synthesized with individually measured HRTFs from 1003 positions in the frontal field. Sounds were provided through a standard pair of stereophonic headphones (no further information provided). The Virtual Reality Simulator proved helpful for visually impaired people in different research experiments performed indoors and outdoors, in virtual and real-life situations. Among the main limitations of the simulator are tracking accuracy and the lack of a real-time HRTF convolver.

The Real-Time Assistance Prototype [190], an evolution of the CASBliP prototype [191], encodes objects’ position in space based on their distance (inversely proportional to sound frequency), direction (3D binaural sounds synthesized with nonindividual HRTFs), and speed (proportional to pitch variation). Nonindividual HRTFs of a KEMAR mannequin were measured for different spatial points within a 64° azimuth range, a 30° elevation range, and a 15 m distance range. Sounds were provided through a pair of SONY MDR-EX75SL in-ear headphones. Two experiments were performed with four totally blind subjects, one requiring subjects to identify the sound direction and the other to detect the position of a moving source and follow it. Despite providing encouraging results in static conditions and for objects moving in the detected area, the main limitations reside in the inability to detect objects at ground level and in the reduced 64° field of view.

The NAVITON system [192, 193] processes stereo images to segment out key elements for auditory presentation. For each segmented element, the sonification approach uses discrete pitched sounds whose pitch, loudness, and temporal delay (depth scanning) depend on object distance, and whose duration is proportional to the depth of the object. Sounds are spatialized with individual HRTFs, custom-measured over the full azimuth range and in the vertical plane from −54° to 90° in 5° steps, and were delivered through high-quality open-air reference headphones without headphone compensation. In a virtual reality trial, ten blindfolded participants reported their auditory perception of the sonified virtual 3D scenes, proving capable of grasping the general spatial structure of the environment and of accurately estimating scene layouts. A real-world navigation scenario was also tested with 5 blind and 5 blindfolded volunteers, who could accurately estimate the spatial position of single obstacles or pairs of obstacles and walk through simple obstacle courses.
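The depth-scanning idea can be sketched as follows (our illustration with assumed constants; the published system derives these mappings from its stereo segmentation): each segmented element becomes a tone whose onset delay, pitch, and loudness depend on its distance, and whose duration reflects its extent in depth.

def scan_schedule(segments, scan_speed_mps=5.0, f_near_hz=1500.0, f_slope_hz_per_m=-100.0):
    # segments: iterable of (distance_m, depth_extent_m) pairs.
    # Returns (onset_s, frequency_hz, gain, duration_s) tuples ordered by onset:
    # nearer elements sound earlier, higher pitched, and louder.
    events = []
    for distance_m, depth_extent_m in segments:
        onset_s = distance_m / scan_speed_mps
        frequency_hz = max(200.0, f_near_hz + f_slope_hz_per_m * distance_m)
        gain = 1.0 / max(distance_m, 1.0)
        duration_s = depth_extent_m / scan_speed_mps
        events.append((onset_s, frequency_hz, gain, duration_s))
    return sorted(events)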

The NAVIG (Navigation Assisted by Artificial VIsion and GNSS) system [194, 195] aims to enhance mobility and orientation, navigation, object localization, and grasping, both indoors and outdoors. It uses a Global Navigation Satellite System (GNSS) and a rapid visual recognition algorithm. Navigation is ensured by real-time nonindividual HRTF-based rendering, text-to-speech, and semantic sonification metaphors that provide information about the trajectory, position, and the important landmarks in the environment. The 3D audio scenes are conveyed through a bone-conduction headset whose complex frequency response is equalized in order to properly render all the spectral cues of the HRTF. Preliminary experiments have shown that it is possible to design a wearable device that can provide fully analyzed information to the user. However, thorough evaluations of the NAVIG prototype have not been published yet.
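The kind of headset equalization mentioned above can be approximated by a regularized inverse filter; the sketch below (ours, with an arbitrary regularization constant) inverts a measured transducer impulse response in the frequency domain so that the spectral cues of the HRTF are not masked by the headset's own response.

import numpy as np

def inverse_filter(impulse_response, n_fft=1024, beta=0.01):
    # Regularized (Tikhonov-style) inversion of a measured headphone or
    # bone-conduction response; beta limits the gain where the response has notches.
    H = np.fft.rfft(impulse_response, n_fft)
    H_inv = np.conj(H) / (np.abs(H) ** 2 + beta)
    return np.fft.irfft(H_inv, n_fft)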

5.2. Discussion and Conclusions

The use of HRTFs to code directional information in the ETAs summarized above suggests the importance of a high-fidelity spatial auditory representation of the environment for blind users. However, most of these works fail to address the hardware- and/or software-related aspects discussed in Sections 3 and 4, presenting performance and usability results based on binaural rendering setups that are either ideal yet unrealistic (e.g., [186]) or that underestimate the potential of spatial sound itself (e.g., [190]).

As a matter of fact, the preferred choice for the virtual auditory display within the listed ETAs is either individually measured HRTFs or nonindividual, generic HRTFs; only the cross-modal ETA [187] proposes structural HRTF modeling as a trade-off between localization accuracy and measurement cost. As a result, the evaluation of these systems (often performed through localization performance tests) is based either on the best-scoring yet unfeasible solution (individually measured HRTFs) or on a costless yet inaccurate one (generic HRTFs), overlooking important aspects of the fidelity of the virtual auditory display such as elevation accuracy and the avoidance of front/back confusion. Furthermore, the aforementioned monaural localization ability of visually impaired persons (especially the early blind) suggests that they rely on individual pinna cues even for azimuth perception, which would make a visually impaired person more vulnerable than a sighted person to the localization degradation introduced by nonindividual HRTFs.

Even more unfortunately, the headphones chosen for these tests were in most cases classic circumaural or in-ear headphones that block environmental sounds and are therefore, as discussed before, not acceptable to the visually impaired community. The use of a bone-conduction headset is reported only for the SWAN and NAVIG systems [188, 194], where the importance of headphone equalization, although necessarily nonindividual, is also stressed. None of the remaining works, except one [186], even mentions headphone equalization. Effective externalization of the virtual sounds provided to the users is therefore questionable.

It is difficult to rank the importance of the various factors influencing a satisfactory virtual acoustic experience (e.g., externalization, localization accuracy, and front/back confusion rate). Most studies check only one or two factors and can confirm their influence on one or more spatial sound perception parameters. Besides the choice of HRTF set, headphone type and equalization, and type of sound source (frequency content, familiar/unfamiliar sound, and temporal aspects) [16, 44, 196], other important factors have to be considered. For instance, as explained in Section 2.4, rendering environmental reflections increases externalization, as does a proper head-tracking method, which also helps in resolving front/back confusion [95]. This may be why most of the above-cited studies chose high-quality headphones with generic or individual HRTFs and did not apply headphone equalization, as long as head tracking or real-time obstacle tracking was implemented. It is also worth noting that systems using head-mounted cameras to render sounds at locations relative to the current head orientation do not even strictly require head tracking to work dynamically [197].
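For completeness, the head-tracking compensation discussed here amounts to little more than keeping virtual sources anchored to the world by removing the tracked head yaw before the HRTF lookup, as in the following sketch (ours; a full implementation would also handle pitch and roll).

def head_relative_azimuth(source_azimuth_deg, head_yaw_deg):
    # Convert a world-referenced source azimuth to head-centric coordinates,
    # wrapped to (-180, 180]; small head rotations then disambiguate front/back.
    rel = (source_azimuth_deg - head_yaw_deg) % 360.0
    return rel - 360.0 if rel > 180.0 else rel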

We believe there is ample space for applying the technologies presented in this review paper to the case of ETAs for the blind. Basic research in HRTF customization techniques is currently in a prolific stage: advances in computational power and the widespread availability of technologies such as 3D scanning and printing allow researchers to investigate in detail the relation between individual anthropometry and HRTFs. Although a full and thorough understanding of the mechanisms involved in spatial sound perception has yet to be reached, techniques such as HRTF selection, structural HRTF modeling, and HRTF simulation are expected to progressively bridge the gap between accessibility and accuracy of individual binaural audio.

Still, it has to be noted that many experiments have shown that training listeners with nonindividual HRTFs, especially through cross-modal and game-based training methods, can significantly reduce localization errors in both free-field and virtual listening conditions [198]. Feedback can be provided through visual stimuli [199, 200], proprioceptive cues [201, 202], or haptic information [203]. Substantial reductions in front/back confusion rates were reported, as well as improvements in sound localization accuracy in the horizontal and vertical planes regardless of head movement.

On the other hand, the headphone technologies discussed in Section 4 are expected to reach widespread popularity in the blind community. Bone-conduction and active headsets are gaining ground in the consumer market thanks to their affordable price. External multispeaker headsets are still at the prototype stage, but from a research point of view they open the attractive possibility of individualized binaural playback without the need for fully individual HRTFs. Efforts in the design of such headphones have been produced within the Sound of Vision project [160].

A final comment regards the cosmetic acceptability of the playback device. While bone-conduction and binaural headsets are relatively discreet and portable, external multispeaker headsets may require a bulky and unconventional design. There is considerable variation within the blind community in how the cosmetic acceptability of a wearable electronic device is judged, even when the device works well. Nevertheless, the visually impaired participants in the survey by Golledge et al. [151] showed overwhelming support for the idea of traveling more often with such a device, independently of its appearance.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement no. 643636.