Abstract

We introduce and describe a new conceptual framework for the design and analysis of audio for immersive first-person shooter games, and discuss its potential implications for the development of the audio component of game engines. The framework was created in order to illustrate and acknowledge the direct role of in-game audio in shaping player-player interactions and in creating a sense of immersion in the game world. Furthermore, it is argued that the relationship between player and sound is best conceptualized theoretically as an acoustic ecology. Current game engines can convey game world spatiality through acoustic shading, but the ideas presented here provide a framework for exploring further immersive possibilities for game audio through real-time synthesis.

1. Introduction

Members of your platoon cluster around you and over the radio comes the message “follow me” with others responding “affirmative.” You follow the sound of organ music, discovering that it emanates from a Gothic church with buttressed superstructure. In the distance, the sharp crack of gunfire and the dull thud of explosions catch your attention. Eager to join the fray, you break into a run; the metronomic rhythm of your boots on the hard surface of the path is soon matched by the sound of your panting breath which, quickly outpacing your slowing footsteps, forces you to a walk. Soon the organ music is left behind and the cacophony of battle intensifies. Suddenly, a siren indicates that a platoon member has managed to steal the enemy's flag and, amidst the sharp staccato of machine gun fire, you see and hear him weaving and running towards you, flag in hand, hotly pursued by a posse of enemy soldiers.

An account of player experiences when playing first-person shooter (FPS) games is useful in several respects when establishing the groundwork and rationale for the development of the conceptual framework presented in this paper. To begin with, it may intimate a level of engagement with games signaling, among other things, the pleasure and satisfaction of being involved in an individual or team situation. Such pleasures result from, amongst other things, a demonstration of individual skill and mastery (in a multiplayer situation, a public demonstration of such skill) combined with the pursuit of collaborative objectives in an environment that, in many respects, simulates real-world problem-solving scenarios. The account may also intimate that the perspective presented and the nature of the engagement with the game are both subjective and individually constructed. Indeed, the very plot of the story may differ markedly when narrated by the other players involved in the same game scenario. Importantly, it outlines how FPS games typically place the player in a hostile environment (the hunter and the hunted) that demands attentiveness to all available cues (especially, and crucially for the aims of this paper, sound cues) for team success, character survival and individual glory. What is described is not fiction or imagination, but a lived account of experiences within the immersive spaces of a particular genre of computer games.

Many of the points raised above apply, with varying degrees of success and prioritization, to other computer games (not to mention other forms of gaming). Some games provide different cues for the solution of a puzzle, some have less of a team aspect or none at all, others are less combative while others offer different perspectives. (For a more complex and extensive attempt at taxonomizing computer game types, see A Multi-Dimensional Typology of Games [1].) Broadly speaking, it is the types of cues offered, the hostile environment, the mix of team and individual skills, the immersive, first-person perspective, and, of course, the combat that signal the FPS genre. It is our contention that sound cues in FPS games afford more possibilities than in other genres to live the type of game experience signaled in the account. A first-person perspective game uses sound to immerse the player within the game environment in a way that a 2-dimensional platform game such as Donkey Kong [2] or a variety of role playing games (RPGs) do not typically attempt. This paper presents a conceptual framework for FPS game audio and a model of game worlds as an acoustic ecology in order to increase the ability of game scholars and developers alike to analyze both the relevance of current applications of audio and its effect upon player immersion and player-player interaction.

The conceptual framework (Figure 1) and examples given here account for multiplayer run and gun FPS games, that is, networked games in which there is more than one human player. Bots (computer-generated characters) do not (yet) respond to sound but, in tracking down or evading player-controlled characters, make use of game code variables that change according to that character's position, actions and status. The assumption is made that the conceptual framework for single-player games constitutes a subset of that for multiplayer games, hence it is the latter that is investigated here.

A large proportion of the terminology that resides in the conceptual framework constructed and outlined here derives from a broad range of disparate work on the nature and function of sound across a variety of media. Areas utilized in the framework include kinaesthetics, affordance theories, modes of listening, auditory icons, diegetic sound, sonification, causality, indexicality, soundscapes and immersion theories, to name but a few. All of these treatments of sound had to be adapted to account for the medium through which the FPS genre is experienced, either by pointing out the significant differences between the medium to which the terminology was originally applied and that of the FPS genre and adjusting accordingly, or by extending the theoretical model with new terminology where existing terminology proved insufficient. To further illustrate the framework, the FPS game Urban Terror [3] is used (Figure 2).

This paper extends a paper [4] presented by the authors at the Cybergames 2007 conference by exploring the implications of the conceptual framework described there for the future development of the audio component of game engines.

2. The Conceptual Framework

Understanding of sound in the FPS game world is a matter of experience. This experience, and the resultant comprehension, is the result of the training and conditioning which occurs either external to the FPS game (for example, through exposure to popular commercial cinema sound conventions) or which takes place during initial exposure to the sonic conventions of the FPS genre as a whole or to the specific FPS game being played. These conditions apply equally to both the sound designer and the player who, ideally, should have broadly similar socio-cultural experiences and understandings of sound. FPS game sounds may therefore be described as a set of sonic signs or auditory icons which may be analyzed through semiotic terminology, such as indexicality, iconicism, symbolism or metaphor, in an attempt to explain how the intended meaning is (ideally correctly) translated to the received meaning. Thus, the FPS game engine may be understood as a sonification system in which sounds are (re)encodings of non-audio data. This game world data may derive either directly from the game engine, as in the case of game status sounds for instance, or, in the majority of cases, is an expression of player activity, such as the sounds of footsteps or the firing of weapons.

2.1. Audio Sample Categorization

Our first approach to constructing a taxonomy of FPS game sounds was to examine the classification of audio samples as found either on the distribution medium or on the installation drive of the product itself. This initial approach provided useful insights into the sound designer's classification system, which may itself be extrapolated to the meaning intended for particular sounds. While this approach has already been used to account for game sound within the games studies community through reference to character, interactable or environment sounds, for example [5–7], none of the literature explicitly examines the distribution or installation media for further clues as to the sound designers' intentions. At the very least, our approach revealed a division between diegetic and nondiegetic sounds, as there is typically a separate directory for music or menu interface audio samples as opposed to other audio samples which may themselves be subclassified into character, interactable, environment or feedback audio samples. Thus, of the 607 base audio samples in Urban Terror (game-specific audio samples as opposed to level-specific audio samples), fully 601 are available to be used during gameplay, the remaining six being the menu music (one) and menu interface sounds (five). The 601 audio samples are, therefore, diegetic whilst the remaining six are nondiegetic.
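Such an audit is straightforward to carry out programmatically. The sketch below walks an installed game's sound directory and tallies audio samples by their top-level folder, taking the designer's own directory layout as a first-pass taxonomy; the root path, folder names and file extensions are illustrative assumptions rather than Urban Terror's actual layout.

```python
import os
from collections import Counter

def tally_samples(sound_root, extensions=(".wav", ".ogg")):
    """Count audio samples under each top-level folder of sound_root."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(sound_root):
        relative = os.path.relpath(dirpath, sound_root)
        top_level = "(root)" if relative == "." else relative.split(os.sep)[0]
        counts[top_level] += sum(1 for f in filenames if f.lower().endswith(extensions))
    return counts

if __name__ == "__main__":
    # Hypothetical install path; the printed tally exposes the designer's
    # own classification (e.g., player/, weapons/, world/, menu/).
    for folder, n in sorted(tally_samples("UrbanTerror/sound").items()):
        print(f"{folder}: {n} samples")
```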

The game designer-constructed organization of audio samples in Urban Terror is illuminating in several respects. Firstly, it is an indication of how the game code deals with sound and its relationship to a variety of characters, objects, and locations within the game. Sounds that players’ characters create as they move, fire, or taunt are separated from environment sounds which are part of a location; sounds of interactable objects are separated from the sounds non-interactable objects make, and diegetic sounds are separated from nondiegetic sounds. Secondly, the sheer number of sounds is an indication of the importance of sound to the game experience. Thirdly, this organization of sound indicates some of the technical limitations of the game, namely in the areas of media storage and computer memory. As an example, some audio samples of footsteps are shared between the characters and this decreases the number of sounds which must be stored on the game distribution medium (a compact disc or Internet download in this case) and which must be loaded in the computer memory while playing.

Alone, this mode of categorizing sound offers little insight into the function and meaning of sound, how sound is used in the game by the player or how sound functions to form an acoustic ecology. In order to explore such issues, it was necessary to employ and devise other taxonomic approaches. However, before proceeding to these other possible forms of classification, consideration of the means of sound creation and production at the game design stage is useful as it sheds some light on the degree of interaction made possible in the FPS game, which directly relates to player immersion within, and participation in, a game-related acoustic ecology. In any computer game, sounds heard during gameplay and from within the game environment are synthesized or digitally recorded and stored as discrete audio samples. In all modern run and gun FPS games, most, if not all, such sounds consist of audio samples and this is certainly the case for Urban Terror. Most of these audio samples are sounded in response to player input, game status (which, in most cases, is an indication of player activity) or bot activity in games where bots are employed. A smaller number of environment audio samples are under the control of the game engine although their audification may be responsive to player location (by pan and intensity) or where the game engine is capable of acoustic shading.
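A minimal sketch of such location-responsive audification follows: an environment sound's intensity falls off with distance from the listener and it is panned by its bearing relative to the listener's facing. The inverse-distance rolloff and constant-power pan law used here are common conventions, not necessarily those of any particular engine.

```python
import math

def stereo_gains(listener_pos, listener_yaw, source_pos,
                 reference_distance=1.0, max_distance=50.0):
    """Return (left, right) gains for a sound source heard by a listener.

    Illustrative rolloff and pan conventions; not any specific engine's code.
    """
    dx = source_pos[0] - listener_pos[0]
    dy = source_pos[1] - listener_pos[1]
    distance = math.hypot(dx, dy)
    if distance > max_distance:
        return 0.0, 0.0                              # attenuated to silence
    gain = reference_distance / max(distance, reference_distance)
    bearing = math.atan2(dy, dx) - listener_yaw      # source angle from facing
    pan = math.sin(bearing)                          # -1 hard left, +1 hard right
    theta = (pan + 1.0) * math.pi / 4.0              # constant-power pan law
    return gain * math.cos(theta), gain * math.sin(theta)
```

Moving away from a sound source thus attenuates it towards silence, the kinaesthetic control over keynote sounds noted in Section 2.3 below.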

Such audio samples (as described above) are labeled nomic auditory icons by Gaver [8]. However, in the context of games, nomic is an ill-advised term to use, and so we prefer to call them causal auditory icons. They bear a strong degree of causality and indexicality to the actions they represent because they are usually recordings of real-world analogues represented in the game, hence the virtual causality of the sounds. Conversely, a symbolic auditory icon has a more arbitrary mapping between the sound and the event it represents and within games aspiring to a degree of realism (such as Urban Terror [9]) there are few such auditory icons.

The abundance of recorded audio samples (as opposed to synthesized audio samples), which may be described as causal auditory icons, combined with their appropriate in-game use (in other words, they are causal sounds with a high degree of virtual indexicality, for example, a recording of a shot-gun is sounded each time a shot-gun is fired), is a good indication of the level of realism the FPS game aspires to. Urban Terror, which is usually described as a realism mod, is a prime example; of the 601 diegetic audio samples available, the only synthetic audio samples (the only symbolic auditory icons) are those related to game status events, such as when a flag has been captured. This may be compared to Quake III Arena [10] or Quake 4 [11] which, set as they are in a more fantastical gamescape, have a greater proportion of symbolic auditory icons representing not just game status events but also various audio samples sounded by player input (such as those to do with power-ups and teleporters).

2.2. Diegetic Audio

It is possible to classify all audio samples as either diegetic or nondiegetic following film sound theory. However, differences in the creation and resultant nature of the FPS game soundscape compared to the film soundscape require refinements of the term diegetic. As already noted, sounds in an FPS game consist of discrete audio samples and, unlike film, there is no complete game soundtrack that is stored on the distribution medium and played during gameplay. The FPS game soundscape, which forms a part of the acoustic ecology, is created in real-time through the agency of game engine actions (the sounding of game status feedback or ambient audio samples, for example) or through the agency of player input acting upon the discrete audio samples which form the soundscape's palette. Furthermore, with any playing of the game (even of the same level), the resultant soundscape will be substantially different for the same player and, in a multiplayer game, the soundscape experienced by one player will also be substantially different to that experienced simultaneously by other players. It is for this latter reason that we define the terms ideodiegetic (those sounds that any one player hears) and telediegetic (those heard and responded to by a player—they are ideodiegetic for that player—but which have consequence for another player; they are telediegetic for the second player). Furthermore, ideodiegetic sounds may be classified as kinediegetic (sounds initiated directly by that player's actions) and exodiegetic (all other ideodiegetic sounds).

Of the class of diegetic audio samples, and in the context of a multiplayer game, all global feedback sounds (such as game status messages) may be classed as exodiegetic sounds. They are ideodiegetic in that they are heard by all players (simultaneously) but initiated by the game engine in response to significant events. All other audio samples may be ideodiegetic or telediegetic depending upon context. These include environment sounds (which are usually level-specific audio samples rather than game-specific audio samples). Ideodiegetic sounds may be classed as either kinediegetic or exodiegetic. For the player who triggers them, the sounds are kinediegetic; they are exodiegetic for other players. If such sounds have consequences for other players who do not hear them (for example, the blast of the shotgun which kills an enemy may draw other members of the victim's team to that location, which itself may provide opportunities for the opposing team), they may also be classed as telediegetic for these other players.
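This classification is relative to each listener and can be made mechanical. The following illustrative (not engine-accurate) encoding classifies a single sound event from the perspective of one player; the notion of "consequence" stands for whatever game-state effect the event has for a player who never hears it.

```python
from dataclasses import dataclass

# Illustrative encoding of the paper's diegetic terms, not actual game code.
@dataclass
class SoundEvent:
    triggered_by: str       # player id, or "engine" for game status sounds
    heard_by: set           # player ids within audible range
    consequences_for: set   # player ids affected even if out of earshot

def classify(event, player):
    if player in event.heard_by:
        if event.triggered_by == player:
            return "ideodiegetic/kinediegetic"
        return "ideodiegetic/exodiegetic"
    if player in event.consequences_for:
        return "telediegetic"
    return "outside this player's diegesis"

# A shotgun blast heard by the firer and a nearby teammate, with consequences
# for a distant opponent whose position is now compromised.
blast = SoundEvent("alice", {"alice", "bob"}, {"carol"})
assert classify(blast, "alice") == "ideodiegetic/kinediegetic"
assert classify(blast, "bob") == "ideodiegetic/exodiegetic"
assert classify(blast, "carol") == "telediegetic"
```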

As has been noted by several writers [6, 7], sound in an FPS game may be attended to in one of three modes: reduced listening, semantic listening, and causal listening. Reduced listening, as noted by Stockburger [7], is little used by experienced players. What these writers do not suggest is that the mode of listening may change depending upon context and experience. Furthermore, we identify a fourth mode, navigational listening. This is required because of the unique (compared to electro-acoustic music and film sound theory where the original three modes were first described) abilities of the FPS player to move her character around the 3-dimensional game world. In this mode, certain sounds may be used as audio beacons helping to guide players, especially those new to the particular game level, around the game world structures.

2.3. The FPS Game Soundscape

Schafer's [12] keynote sounds are, in this context, audio samples which form part of the sonic ambience and are not directly triggered by the player but are, instead, sounded by the game engine. (They may be triggered by other players but are judged by the one player to be distant and of little interest and so form part of the general ambience of battle.) An example in Urban Terror is the Bach organ fugue in the Abbey level or, in the same level but in a different location, the twittering of birds. There is some ambiguity here that is not captured by the brief descriptions of such environment or ambient sounds in existing games studies literature. For example, the player does typically have some kinaesthetic control over the sounding of these sounds; by simply moving away from a location, the sound may be attenuated to silence (and vice versa). Furthermore, if a keynote sound is a sound which is not intended to be consciously listened to, merely forming the background for more perceptually important sounds, the decision to consciously attend to a sound or not is often a matter of player choice; indeed, musicians may respond to the organ fugue differently to non-musicians.

A signal sound is a foregrounded sound which is designed to be consciously attended to because it potentially contains important information encoded within it. Most of the game-specific audio samples in Urban Terror may be classified as signal sounds when they are sounded in a context which foregrounds them. Thus, the loud, and therefore proximate, sounding of gunshot samples is worth paying attention to (particularly in the individual deathmatch game mode). However, if the sounds of battle are distant, they may be classed as keynote sounds, particularly if they are relatively constant and the player's attention is directed elsewhere. All game status indicators and team radio messages are signal sounds because, although they are as pervasive as keynote sounds, they are usually louder and therefore more proximate and, in the case of radio messages, have no reverberation, thereby foregrounding them through the lack of depth cues.

Soundmarks are identifying aural features of the acoustic ecology and may be either signal sounds or keynote sounds which are consciously attended to. Symbolic auditory icons, such as flag status signals, are more likely to be uniquely identifying of an FPS game than causal auditory icons because the latter are derived from recordings of existing real-world, external sounds whereas the former reference the internal game world.

2.4. Immersion through Sound

FPS game sounds may be categorized according to a variety of immersive principles. Following Ermi and Mäyrä's ideas [13], all FPS sounds can contribute to sensory immersion where the sounds of the game world override those in the player environment. It should be noted, though, that the degree of sensory immersion is dependent upon a range of factors beyond the control of the game designers, including the relative loudness of the two sets of sounds and the audio hardware used; one of the factors influencing the decision of most FPS players to use headphones [14] is likely to be the desire for greater sensory immersion. Many sounds offer challenge-based immersion by requiring a response which includes the use of both mental and kinetic skills. Typically, these sounds are produced by other players and usually relate to actions involving weapons. However, whilst most level-specific environment sounds in Urban Terror, for example, generally do not offer challenge-based immersion possibilities, audio beacons require the navigational listening mode and, therefore, mental skills.

Sounds offering imaginative immersion possibilities are those which help the player identify with her character and the game environment and action. In the first case, FPS games offer a range of character sounds, some of which may be classed as proprioceptive sounds (such as the character's breathing, whose rate may vary according to the character's exertions) and which, with a high level of immersion, may be seen as aural prostheses similar to the prosthetic limbs seen receding into the screen. Exteroceptive sounds affording imaginative immersion through identification include a range of sounds which aid in contextualizing the player character within the environment.

McMahan categorises computer game elements as perceptual sureties, surprises, or shocks [15]. The latter militate against immersion in the game world by being external stimuli (or errors in the game) that remind the player that this is just a game taking place within the player’s real-world environment. Sureties are mundane cues, expected details providing an experience which is consistent with the rules and conventions of the game world. The creaking of a door as it opens and closes or the footsteps of a player moving around are aural examples of this. Surprises, according to McMahan, consist of three types: attractors (inviting the player to do something); connectors (helping player orientation) and retainers (causing the player to linger in game world locations).

A variety of sounds in FPS games fulfil these requirements. Indeed, any sound inviting an active response may be said to be an attractor. Thus, the sound of gunfire in the distance may tempt the player to investigate, and team radio messages detailing enemy actions invite a response on the part of the team player. Many sounds, particularly environment sounds, function as connectors and they are often attended to in the navigational listening mode. Locational and depth properties are important parameters of sounds functioning as connectors. Although, on first playing the game, the player may derive enjoyment from certain sounds and so may linger in a particular location in order to hear more, the nature of the FPS game is such that more-or-less continual movement is usually required of the player to seek out or avoid the enemy or to attack the enemy base; for the experienced player, then, no sounds in the FPS game may be said to be retainers.

We propose four terms to describe the spatializing and temporal affordances offered by FPS game sounds. The perception of a variety of spaces is one of the main contributions of FPS sound to the perception of, and immersion within, the game world. In terms of our phrase resonating space, there is a real resonating space, which is the acoustic volume enveloping and morphing around the player, and a virtual resonating space which matches, through a process of synchresis, the illusory visual space depicted on the screen (other virtual spaces may be identified as separate volumes within the game world); the perception of the latter is created by parameters of sound such as localization, depth cues and reverberation. Such cues may be processed in real time (acoustic shading) by more sophisticated game audio engines or they may be encoded into the audio samples on the distribution medium. Sounds providing this affordance are choraplasts. Sounds may also function as topoplasts where they create the perception of paraspaces such as locations in the game. Additionally, sound may afford the perception of time passing or of a particular temporal period in the past, present or future. The former sounds are chronoplasts and, because sound is vectored through time (it takes time to hear a sound), all sounds have a basic chronoplastic function in addition to any explicit function they may have in this area. The latter are aionoplasts; weapon sounds typically have this function, setting the game, for example, in the modern era rather than the mediæval age.

3. The Implications for Game Engine Design

As previously stated, modern FPS games make use of audio samples that may be treated as causal auditory icons (the most common form and typically recordings of real-world events) or symbolic auditory icons (more common in games with a less realist scenario). Whilst the use of causal auditory icons provides quite accurate sonic representations of real-world artifacts in the game world, that use comes at a price calculated in memory, both distribution medium storage and game system random access memory (RAM). Assuming a 16-bit, monophonic, 44.1 kHz game audio system, 100 one-second audio samples require a total of almost 9 MB of storage.
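The figure is easily verified: at two bytes per sample frame, one second of 44.1 kHz mono audio occupies 88,200 bytes.

```python
# Checking the storage claim for uncompressed 16-bit mono PCM at 44.1 kHz.
bytes_per_second = 44_100 * 2          # 88,200 bytes per one-second sample
total = 100 * bytes_per_second         # 8,820,000 bytes for 100 samples
print(f"{total / 1_000_000:.1f} MB")   # -> 8.8 MB, i.e., almost 9 MB
```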

This may not seem expensive with the technology available in 2007. However, there are several factors conspiring to push this cost up. Firstly, many game audio samples are longer than one second particularly if they are vocal audio samples. Secondly, and perhaps more importantly, the game designers’ desire to provide the player with an increasingly rich sonic environment requires, as an initial solution, the provision of yet more samples. Unlike sound design for a film, game sound designers work to a non-linear script and “it is not possible to make every gunshot sound unique if you do not know how many gunshot sounds are needed!” [16]. Games such as Urban Terror (based on the Quake III Arena game engine) must therefore strike a balance between a wish to provide an audio sample for every sonic possibility (an infinite number) and paying regard to storage and memory requirements (Urban Terror is typically provided as an Internet download).

Later game engines, such as that used by Half-Life 2 [17], use acoustic shading techniques. This provides a partial solution to the audio sample memory problem by real-time processing of audio samples with reverberation that approximates the virtual acoustic properties of the character's environment. It is also a step towards solving the non-linearity enigma as expressed by Boyd because any one audio sample does multiple duty by having different reverberant characteristics in different locations of the game. However, acoustic shading of audio samples still requires a high use of memory and it provides solely a reverberation solution without taking account of other encodings possible in sound (its emotive aspects or the direct sound source, for example).
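In outline, acoustic shading amounts to filtering a single dry sample with location-dependent reverberation at playback time. The sketch below uses full convolution with synthetic impulse responses for clarity; production engines more typically apply cheaper parametric per-zone reverb presets, and the zone names and parameter values here are invented for illustration.

```python
import numpy as np

def make_ir(sample_rate=44_100, decay_seconds=1.0, density=0.02):
    """Build a crude synthetic impulse response: sparse, decaying reflections."""
    n = int(sample_rate * decay_seconds)
    taps = np.random.randn(n) * (np.random.rand(n) < density)
    return taps * np.exp(-5.0 * np.arange(n) / n)

# Invented zones; a real engine would derive these from level geometry.
IMPULSE_RESPONSES = {
    "courtyard": make_ir(decay_seconds=0.3),   # short, dry outdoor reflections
    "abbey":     make_ir(decay_seconds=2.5),   # long, cathedral-like tail
}

def shade(dry_sample, zone, wet_mix=0.4):
    """One dry sample does multiple duty: reverberate it for the current zone."""
    wet = np.convolve(dry_sample, IMPULSE_RESPONSES[zone])
    out = wet * wet_mix
    out[:len(dry_sample)] += dry_sample * (1.0 - wet_mix)
    return out
```

The same gunshot sample, shaded for "courtyard" and for "abbey", yields two audibly distinct events without doubling the storage cost.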

Gaver notes that, using what he calls everyday listening, sounds are usually described by one or two of their salient characteristics (object and action); a metallic clang, a wooden thump, a glass-like shattering, for example [18]. Indeed, he provides algorithms to show it is possible to conceptualize and synthesize sound according to these characteristics rather than directly by the use of properties such as frequency and intensity; a top-down model as opposed to a bottom-up model. The resultant caricature sounds should then provide the minimum information required to enable at least an approximate identification of both source object and action.
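In the spirit of Gaver's top-down approach, a caricature impact sound can be specified by object and action properties (material, damping) rather than by raw signal parameters. A minimal sketch follows, assuming a simple modal model in which metal rings with several slowly decaying inharmonic partials and wood thuds with a few heavily damped low ones; the specific frequencies and decay rates are illustrative guesses, not Gaver's published values.

```python
import numpy as np

def impact(material, duration=0.6, sample_rate=44_100):
    """Synthesize a caricature impact: decaying sum of material-typical partials."""
    t = np.linspace(0.0, duration, int(sample_rate * duration), endpoint=False)
    if material == "metal":
        # Illustrative partials/damping; not Gaver's published parameters.
        partials = [(520.0, 1.0), (1367.0, 0.7), (2940.0, 0.5), (5173.0, 0.3)]
        damping = 4.0                      # long, ringing decay
    elif material == "wood":
        partials = [(180.0, 1.0), (410.0, 0.4)]
        damping = 35.0                     # fast, thuddy decay
    else:
        raise ValueError(f"unknown material: {material}")
    out = sum(amp * np.sin(2 * np.pi * freq * t) for freq, amp in partials)
    out *= np.exp(-damping * t)            # exponential amplitude decay
    return out / np.max(np.abs(out))

clang = impact("metal")                    # a caricature "metallic clang"
thump = impact("wood")                     # a caricature "wooden thump"
```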

Populating an FPS game with such caricature, synthesized sounds would seem to militate against realism and the requirements of a rich, immersive player experience. However, there is evidence to suggest that, where sound is concerned, a reduced realism may be all that is required to achieve the desired immersive effect [19–21]; a simulation, rather than emulation, that is based upon convention, expectation, and caricature. Certainly, in film, this is how many sound FX work; recordings are made (and enhanced in post-production) of sources and actions that are not necessarily the same as those depicted on the screen. However, by matching the main characteristics of the recorded sound to the expected sound of the screen depiction (such expectation often being the result of cinematic convention) and synchronizing sound and visual action, the audience is persuaded that that screen event really has produced that sound.

The suggestion, then, is that real-time synthesis of sound in digital games may prove to be of benefit in dealing with the twin challenges of memory and non-linear practice without imperiling the immersive experience and, indeed, perhaps enriching it. The ideal scenario might be to have a combination of synthesis and audio samples because (despite advancements in synthesis) some sounds (such as the human voice) are still best represented by audio samples. Synthesis may be of use for the more symbolic auditory icons, fast-repeating sounds (such as gunfire), and background, keynote sounds and may be enhanced through any acoustic shading the game engine offers.
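Such a hybrid architecture might be expressed as a simple dispatch policy at the point where the engine requests a sound. The category names, and the sample-bank and synthesizer back ends, are hypothetical stand-ins used only to make the division of labour concrete.

```python
# Hypothetical categories favouring synthesis: symbolic auditory icons,
# fast-repeating sounds, and background keynote sounds.
SYNTHESIZED_CATEGORIES = {"game_status", "gunfire", "ambience"}

def request_sound(name, category, sample_bank, synth):
    """Route a sound request to real-time synthesis or to a stored sample."""
    if category in SYNTHESIZED_CATEGORIES:
        return synth.render(name)          # parametric, generated on demand
    return sample_bank.fetch(name)         # e.g., voice remains sample-based
```

Either path may then be passed through whatever acoustic shading the engine offers.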

4. Conclusion

All sounds in the FPS game, or the use made of some sounds, contribute in some way to player immersion in the acoustic ecology, and it is this immersion within, and creative participation in, the game's acoustic ecology that, in large part, affords the perception of immersion in the FPS game world. Thus, the player is physically immersed in the real resonating space and, through kinaesthetic techniques and the ability to trigger a range of sounds through various input methods, is drawn into the virtual resonating space that is then synchretically mapped to the visual game world and activity that are represented either on- or off-screen.

The model in Figure 1 exhibits all the elements of the conceptual framework described thus far. Because it is a model of an acoustic ecology, it importantly shows the relationships between the player (the listener) and the soundscape. As it is a model of the acoustic ecology of the FPS run and gun game, though, it includes a variety of components and relationships which are unique to digital games (some of which may be unique to the subgenre) such as the game engine, image, a range of spatial and immersive elements and perceptual factors. Furthermore, because this is a model of a multiplayer game acoustic ecology, it also includes the game server and other players and their soundscapes. (For clarity, only one other player and soundscape are shown here.)

Although the conceptual framework and the model take FPS games as their paradigm, they (or aspects of them) may well also prove applicable to the analysis of digital game sound more widely, with the caveat that much further research and testing (using different digital game genres) is required. Furthermore, it is suggested that the conceptual framework and model may prove to be of use in the design of the audio component of game engines by supporting the notion that real-time synthesis of sound in the game is a valid means of providing an immersive acoustic ecology.