Abstract

This paper discusses the results of a pilot experimental study of speech recognition and perception of the semantic content of utterances in a noisy environment. The experiment included perceptual-auditory analysis of words and phrases in Russian and German (in comparison) in the same noisy environment: two types of noise (pink and white) at various signal-to-noise ratios. The statistical analysis showed that intelligibility and perception of speech in a noisy environment are influenced not only by the noise type and the signal-to-noise ratio, but also by linguistic and extralinguistic factors, such as the redundancy of a particular language at various levels of linguistic structure, changes in the acoustic characteristics of the speaker when switching from one language to another, the speaker’s and listener’s proficiency in a specific language, and the acoustic characteristics of the speaker’s voice.

1. Introduction

Speech intelligibility and speech recognition are important and actively developing research topics in various fields of science: Linguistics, Medicine, Electrical Engineering, and Information Technology. The speech recognition process is investigated from different angles, as only an integrated approach can lead to a better understanding of it. One research direction is the study of the biological and neurological mechanisms of speech perception [1–5]. Another research area is the intelligibility of synthesized speech in noise [6, 7]. Much attention is paid to the development of algorithms that improve speech intelligibility in noise [8–10].

Studies of how listener characteristics affect speech recognition showed that music training has a positive effect on speech-in-noise perception [11, 12]. Another important and rather controversial topic is the perception of accented speech in noise by native and nonnative listeners. The study [13] revealed that native listeners were able to perceive the test material at the same level regardless of the speaker’s accent, while previous studies [14, 15] showed that native listeners generally found the speech of native speakers more intelligible than that of nonnative ones. Nonnative listeners tended to perceive better the speech produced by speakers from the same language environment as themselves, that is, those with a familiar accent [16, 17].

In Russia, studies of speech characteristics and of speech intelligibility and recognition in noise started in the middle of the 20th century ([18–20], etc.). Experiments [19, 20] on word intelligibility for Russian speech against white and pink noise at signal-to-noise ratios of 0 dB and lower showed different results: according to [19] the results for these two types of noise were very similar, while according to [20] word intelligibility for speech against white noise was higher. At the end of the 20th century a major study of speech perception was carried out in white noise at various signal-to-noise ratios [21]. It investigated the influence of different factors on Russian speech recognition: natural versus synthesized speech, different parts of speech, number of syllables in the word, place of stress, different types of phonemes, and so forth. The result was a set of factors that actually influence speech perception and recognition in noise, together with the level of influence of each factor. Another study [22] investigated changes in speaker characteristics (such as voice pitch, speech tempo, and voice strength) when switching from a native language (Russian) to a foreign one (English). The most recent studies focused on the cognitive mechanisms of decoding the semantic content of Russian speech in noise [23, 24]. They demonstrated that dialogs are more intelligible than monologs and reading, and that factors such as listeners’ background knowledge of the conversation topic and their general interest, as well as the emotional level of the speaker, also influence speech perception.

The current research studies the perception of native (Russian) and nonnative (German) speech in a noisy environment (pink and white noise were chosen for the experiment) and focuses on the following aims:
(1) To identify the effect of the tested types of noise at various signal-to-noise ratios (in comparison) on speech perception;
(2) To identify the effects of linguistic and extralinguistic factors on speech perception in a noisy environment.

2. Method and Experiment

Our pilot research included perceptual-auditory analysis, at various levels of linguistic structure, of speech utterances in Russian and German (in comparison) in the same noisy environment: two types of noise (pink and white) at various signal-to-noise ratios. It also examined the effects of linguistic and extralinguistic factors on speech perception in a noisy environment.

The research material was an ad hoc corpus of words and phrases in Russian and German, specially composed according to the method of Potapova [25, 26] and produced by Russian and German native speakers. The recordings were mixed with two types of noise (pink and white) at various signal-to-noise ratios (0 dB, −3 dB, −6 dB, −9 dB, and −12 dB). The material made it possible to analyze the degrees of protection and intelligibility at the acoustic, phonetic, syntactic, and lexical levels of linguistic structure.

The following requirements for the development of the ad hoc research material were stated:
(1) Test phrases should be grammatically and semantically linked and consist of words which exist in both languages;
(2) Various types of consonants and vowels should be represented in the test phrases;
(3) The acoustic realization of the chosen types of consonants and vowels should be similar and comparable in both languages;
(4) Comparable (by place and manner of articulation) consonants and vowels should be in identical positions in a syllable (for all test words and phrases in Russian and German);
(5) The rhythmic scheme of test words and phrases should be identical for both languages;
(6) Combinations with various types of vowels in stressed position in the first syllable (with regard to unilateral distribution) should be tested for each type of consonant.

According to these requirements the following test phrases were formulated (see Table 1: the analyzed syllables are in bold in the table; hereafter IPA is used for transcriptions).

Each phrase consists of 3 words, having in the first stressed syllable a combination of the tested type of consonant with one of the tested types of vowels. Since pronunciation norms of the German language require voicing the voiceless fricative consonant “s” preceding a vowel, speakers were instructed to pronounce this sound as a voiceless one in the word “Sascha.”

The speakers who took part in the study were native speakers of the literary Russian language without prominent dialectal features of pronunciation who also spoke German (50%) and native speakers of the literary German language without prominent dialectal features of pronunciation who also spoke Russian (50%). All speakers had the same level of foreign-language proficiency, B2-C1, which was tested according to the system developed by the Council of Europe [27]. 50% of the speakers were women and 50% were men (in both the native Russian and the native German groups).

Each speaker read aloud the test phrases and the isolated words from the phrases three times each. Thus, the number of obtained realizations of the test words and phrases totaled 480; the number of realizations for a single speaker was 120 (60 in Russian and 60 in German).

All test words and phrases were combined into 2 tables (in Russian and in German). The order of words and phrases was random and differed for different speakers. All test words and phrases were read with the intonation of the completed narrative, followed by a pause.

Audio recording of the research material was conducted in a specially equipped room, preventing foreign interference and noise: an anechoic chamber of the Institute of Applied and Mathematical Linguistics of Moscow State Linguistic University.

Two samples of noise (white and pink) were generated using the program Cool Edit Pro 2.0 for mixing with the audio recordings of the test words and phrases.

Each speech segment was mixed with white and pink noise with various levels of signal-to-noise ratio: 0 dB, −3 dB, −6 dB, −9 dB, and −12 dB. Thus, for each spoken realization of the test material 10 variants of mixed signals were obtained, 4800 samples in total, plus 150 phonograms containing only noise (75 with white noise and 75 with pink noise). The total number of phonograms for the experiment was 4950.
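The mixing step described above can be illustrated with a minimal sketch. It is not the authors' procedure (they used Cool Edit Pro 2.0); it only shows one common way to scale a noise sample so that the broadband power ratio between speech and noise matches a target signal-to-noise ratio. The function names, the 200 Hz tone standing in for recorded speech, and the 16 kHz sample rate are assumptions made for illustration, and only white noise is generated here.

```python
import math
import random


def mean_power(samples):
    """Average power of a signal given as a sequence of float samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals `snr_db`, then add."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    scaled_noise = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled_noise)]
    return mixed, scaled_noise


def measured_snr_db(speech, noise):
    """Signal-to-noise ratio in dB computed from the two separate components."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


# Stand-in signals (assumptions): a 200 Hz tone instead of recorded speech,
# and Gaussian white noise from a seeded generator.
rng = random.Random(0)
sample_rate = 16000
speech = [math.sin(2 * math.pi * 200 * t / sample_rate) for t in range(sample_rate)]
white = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]

# The five signal-to-noise ratios used in the experiment.
for snr in (0, -3, -6, -9, -12):
    mixed, scaled = mix_at_snr(speech, white, snr)
    print(snr, round(measured_snr_db(speech, scaled), 6))
```

At −12 dB the scaled noise carries roughly 16 times the power of the speech signal, which gives a sense of how demanding the lowest tested condition is.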

The number of listeners was 21: 6 males and 15 females, aged 19–21, native speakers of the Russian language without prominent dialectal features of pronunciation, with normal hearing and a command of English at level B2-C1 (tested according to the system developed by the Council of Europe [27]). Some of them (12 listeners) were proficient in German at level B1-B2 and some (9 listeners) did not know German at all.

All phonograms were numbered randomly for presentation to each listener: the total number of rotation variants was 11.

Listeners had to listen to the phonograms in the order given by their sequence numbers in the assigned rotation variant and to write down their answer for each phonogram in a table (see the example answer table in Table 2).

They could listen to each phonogram as many times as they wanted. There were short breaks every half hour. Work time per day did not exceed 4 hours. The perception test ran for 2 days: the total work time for each listener was 8 hours.

The total number of played phonograms was 23085. The total number of played phonograms, which contained only noise, was 727.

The total number of played phonograms, containing speech signal (test words and phrases mixed with noise), was 22358 (the share of phonograms with white and pink noise types made up 50% each).

These calculations indicate that the size of the base of played phonograms is sufficient to ensure reliable and stable quality of the data.

A summary table with quantitative description of the experiment is presented in Table 3.
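The counts reported in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only figures stated in the text; the number of speakers and the mean number of phonograms per listener are derived values, not figures given by the authors, and the variable names are illustrative.

```python
# Figures stated in the text.
realizations = 480                 # recorded words and phrases, all speakers
per_speaker = 120                  # 60 in Russian + 60 in German
noise_types = ("white", "pink")
snr_levels_db = (0, -3, -6, -9, -12)
noise_only = 75 + 75               # phonograms containing only noise
played_total = 23085
played_noise_only = 727
listeners = 21

# Derived values.
speakers = realizations // per_speaker
variants_per_realization = len(noise_types) * len(snr_levels_db)
mixed_phonograms = realizations * variants_per_realization
corpus_size = mixed_phonograms + noise_only
played_speech = played_total - played_noise_only
mean_per_listener = played_total / listeners

print(speakers, variants_per_realization, mixed_phonograms, corpus_size)
print(played_speech, round(mean_per_listener, 1))
```

The derived totals (10 variants per realization, 4800 mixed phonograms, 4950 phonograms overall, and 22358 played phonograms containing speech) agree with the figures given above.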

We calculated the statistical sampling error for the findings to support the observed tendencies statistically. The statistical sampling error was calculated using the following formula [28]:
$$\Delta = t \sqrt{\frac{\sigma^{2}}{n}},$$
where $\Delta$ is the statistical sampling error; $t$ is the $t$-value for a given confidence probability (in our case $t = 1.96$, corresponding to a confidence level of 95%); $\sigma^{2}$ is the dispersion of an alternative characteristic (the dispersion of the sample share), for which, as the sample share is unknown, the maximum value $\sigma^{2} = 0.25$ is taken [29]; and $n$ is the sample size.
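As a rough illustration of the magnitude involved, the sketch below evaluates the standard sampling error of a share, that is, the confidence coefficient multiplied by the square root of the dispersion divided by the sample size, using the values named in the text (coefficient 1.96 for 95% confidence and the maximum dispersion 0.25). The function name and the choice of the full set of 22358 speech phonograms as the sample are assumptions for the example; the authors' per-figure sample sizes are not given here.

```python
import math


def sampling_error(n, variance=0.25, t=1.96):
    """Sampling error of a share: t * sqrt(variance / n)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return t * math.sqrt(variance / n)


# Example sample (assumption): all played phonograms that contained speech.
error_all = sampling_error(22358)
print(f"{100 * error_all:.2f} percentage points")
```

For the whole set of played speech phonograms the error is well under one percentage point; it grows as the data are split into smaller subgroups, because the error is inversely proportional to the square root of the sample size.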

3. Discussion

The working hypothesis of the study was as follows: speech recognition (detection of speech in noise and identification of the utterance language) and perception of the semantic content of the utterance in a variable noisy environment are influenced by the type of noise and the signal-to-noise ratio, as well as by some linguistic and extralinguistic factors. These factors are the redundancy of a particular language at various levels of linguistic structure, changes in the acoustic characteristics of the speaker when switching from one language to another, the speaker’s and listener’s level of proficiency in a specific language, and the acoustic characteristics of the speaker’s voice.

The experiment showed that, within the corpus of the research material, pink noise provides better protection of the utterance than white noise at an equal integral signal-to-noise ratio (for all tested levels) in terms of the following indicators: detection of the speech signal in noise (see Figure 1), correct identification of the utterance language (see Figure 2), and correct perception of the semantic content of the utterance. In Figures 1 and 2, values that are statistically significantly higher (at the level of 95%) than those at the previous, higher signal-to-noise ratio are marked with a frame.

The lower the signal-to-noise ratio (the higher the level of noise relative to the desired signal), the greater the difference in masking efficiency between pink and white noise, reaching its maximum at the lowest tested signal-to-noise ratio (−12 dB): in terms of detection of the speech signal in noise, the efficiency of white noise masking is ~4.7 times lower than that of pink noise. This result corresponds to the findings reported in [20].

Detection of the utterance in noise also depends on the speaker’s and listener’s level of proficiency in a specific language, as well as on the utterance language. Thus, for listeners who are native Russian speakers, a higher detection score in noise was shown for utterances in German pronounced by native German speakers than for utterances in Russian pronounced by native Russian speakers (see Figure 3: a statistically significantly higher (at the level of 95%) value of utterance miss (false negative error) for German (depending on the characteristics of the speaker) is marked with a frame). This demonstrates the effect of the utterance language (and its acoustic characteristics) on detection and recognition of the utterance in noise.

Listeners who are native speakers of the utterance language are able to detect native speech in a noisy environment regardless of the speaker’s level of proficiency in this language (see Figure 3). However, a foreign accent of the speaker significantly reduces the score of recognition of the utterance language (see Figure 4: a statistically significantly higher (at the level of 95%) value of recognition (depending on the characteristics of the speaker) of the word Поле is marked with a frame) and of recognition of the semantic content of the utterance, which confirms the findings of experiments [14, 15].

In addition, for a number of phonograms with low recognition scores, the lists of words that listeners substituted for the target word also differed between the speech of native and nonnative speakers: Tables 4 and 5 show the most frequent substitutes for the word била ([bilə], Eng.: beat) for phonograms of native and nonnative speakers.

Tables 4 and 5 also show that a very wide range of substituting words with low answer shares (1%–3%) was observed for these phonograms: 26 substitutes for phonograms of Russian native speakers and 28 substitutes for phonograms of German native speakers, and these lists were very different. This variety of words with low answer shares is visible in the tag clouds for these two phonograms (see Figures 5 and 6). The tag cloud (word cloud) technique was used to analyze the frequencies of the substituting words: the font size of a particular word depends on the frequency of its occurrence, so that the higher the frequency, the larger the font size. All the “word clouds” for this experiment were drawn using the online program Wordle™ [30]. Thus, Figures 5 and 6 show many small words, which differ between the two phonograms.
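The principle behind a tag cloud, where more frequent answers are drawn in a larger font, can be sketched as follows. This is not the algorithm used by Wordle, whose scaling rule is not described here; the linear mapping, the point-size limits, and the function name are assumptions, and the answer list consists of invented placeholders rather than data from the experiment.

```python
from collections import Counter


def font_sizes(answers, min_pt=10, max_pt=48):
    """Map each distinct answer to a font size that grows linearly with its count."""
    counts = Counter(answers)
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = high - low
    sizes = {}
    for word, count in counts.items():
        # With a single distinct count, every word gets the maximum size.
        scale = (count - low) / span if span else 1.0
        sizes[word] = round(min_pt + scale * (max_pt - min_pt))
    return sizes


# Invented placeholder answers, NOT data from the experiment.
answers = ["alpha"] * 6 + ["beta"] * 3 + ["gamma"] + ["delta"]
sizes = font_sizes(answers)
print(sizes)
```

With such a mapping, a long tail of rare substitutes appears as many small words around a few large ones, which is the pattern described for the two phonograms.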

At the acoustic level the degree of utterance protection also depends on the fundamental frequency of the speaker’s voice: within the corpus of research words and phrases in Russian and German, utterances produced by males are more concealed from detection against noise (i.e., they showed a higher score of utterance miss (false negative error)) than utterances produced by females, against both tested types of noise at all tested signal-to-noise ratios (see Figures 7 and 8).

At the phonetic level various sounds have various degrees of intelligibility depending on their acoustic nature. Thus, among the sounds tested in the stressed syllable within the research material, the most resistant to masking (i.e., the best recognized) are the consonants [s] and [m] and the vowel [a], while the most masked are the consonants [b] and [p] and the vowel [i] (see Figures 9–12).

At the syntactic level correct recognition of words depends on the context: the scores of correct recognition of words, which functioned as a part of a phrase, were higher than the scores for the same words in an isolated position against both tested (pink and white) types of noise at all tested levels of signal-to-noise ratios (see Figures 13 and 14: the data in Figure 13 is sorted in descending order and formatted in MS Excel from the dark colour to the light one).

In the phrases the best recognized parts were subjects (which always came first in the phonogram corpus), with predicates showing the second highest score (see Figures 15 and 16).

At the lexical level recognition of utterances in Russian is influenced by the frequency of occurrence of words in the language: within the research corpus of Russian words the highest recognition scores were shown by the words мама, Саша, and папа, which have the highest frequency of occurrence in the Russian language among all the tested words, while the lowest recognition scores were shown by the words Поле, Зинин, Борю, and Милу, which are proper names and less common than Саша (see Table 6; the ipm index represents the number of occurrences of a lemma per one million words in the corpus; here the score is given according to [31]).
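The ipm index mentioned above is a simple normalization of raw counts by corpus size. A minimal sketch follows; the function name and the numbers in the example are hypothetical and are not taken from [31] or from Table 6.

```python
def ipm(occurrences, corpus_size):
    """Instances per million: lemma occurrences normalized to a corpus of 1,000,000 words."""
    if corpus_size <= 0:
        raise ValueError("corpus size must be positive")
    return occurrences * 1_000_000 / corpus_size


# Hypothetical example: 250 occurrences of a lemma in a corpus of 5,000,000 words.
example_ipm = ipm(250, 5_000_000)
print(example_ipm)
```

Normalizing in this way makes word frequencies comparable across corpora of different sizes.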

4. Conclusion

According to the results of the experiment, the following factors influence the intelligibility and perception of speech in a noisy environment:
(1) Type of noise and signal-to-noise ratio (pink noise provides better protection of the utterance than white noise at an equal integral signal-to-noise ratio (for all tested levels) in terms of the following indicators: detection of the speech signal in noise, correct identification of the utterance language, and correct perception of the semantic content of the utterance);
(2) Utterance language and the speaker’s and listener’s proficiency in a specific language;
(3) Fundamental frequency of the speaker’s voice (within the corpus of the research material in Russian and German, utterances read by males were detected by listeners statistically less often than utterances read by females against both tested types of noise at all signal-to-noise ratios);
(4) Context: a word in isolation or as part of a phrase (within the corpus of the research material in Russian, the intelligibility of words within a phrase was better than that of isolated words against both tested types of noise at all signal-to-noise ratios);
(5) Frequency of word occurrence in the language (according to the results of the experiment, words with a higher frequency of occurrence in the Russian language showed better intelligibility);
(6) Phonetic composition of the word (within the corpus of the research material in Russian, the voiceless alveolar sibilant fricative [s] and the bilabial sonorant [m] among consonants and the open central [a] among vowels showed the best intelligibility (i.e., they were the hardest to mask with noise) within the tested set of sounds, while the bilabial stops (voiced [b] and voiceless [p]) among consonants and the close front [i] among vowels showed the worst intelligibility, that is, they were the easiest to mask with noise).

Among the possible directions of further analysis the following can be mentioned:
(1) Increasing the volume of bilingual research material;
(2) Expanding the inventory of acoustic parameters for the analysis of speech sound recognition;
(3) Increasing the number of speakers and listeners, taking into account such factors as age, gender, degree of listening experience, and proficiency in the utterance language, in relation to the studied languages;
(4) Studying the influence of linguistic and extralinguistic factors on recognition in a noisy environment for long connected texts;
(5) Organizing a database that includes units of the sound composition and intonation system of various languages.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The research was supported by the Ministry of Education and Science of the Russian Federation (Project no. 34.1254.2014K, head of the project R. K. Potapova).