Abstract

This paper describes the application of a full Bayesian significance test (FBST) to compute evidence intervals in forensic speaker comparison (FSC). In the FBST approach, the challenge is to apply the test to a large number of observations and to formulate an equation that solves the test quickly. The contribution of the present work is twofold: it proposes an application of the FBST to FSC, and it develops a method to calculate the FBST for the distribution of the expected value (mean) with unknown variance without using Markov chain Monte Carlo (MCMC). Comparisons with other interval inference methodologies indicate that the evidence interval size is 49% greater than that computed with the Gosset approach. For a signal-to-noise ratio (SNR) of 17 dB, the evidence interval presented 71% fewer classification errors than the punctual inference did.

1. Introduction

The main task in forensic speaker comparison (FSC) is to analyze two or more voice records to infer whether they come from the same speaker. FSC differs from biometric voice recognition in the hypothesis test approach and in the nature of the voice samples. In the FSC scenario, a questioned-voice is compared to a known-voice, whereas in biometric recognition, the comparison is made among multiple speakers [1, 2].

The questioned-voice (or voice evidence) is an audio recording accepted as a vestige or evidence in a criminal investigation. The questioned-voice may be recorded in different situations, such as lawful phone interception (wiretapping), recordings of face-to-face conversation, or audio broadcasting.

In FSC, one hypothesis considers that the questioned- and known-voices come from different speakers, whereas the competing hypothesis assumes that the questioned- and known-voices come from the same speaker.

However, the “individualization” that the hypotheses above propose has been considered a fallacy. This individualization assumes that the result of the confrontation between the questioned and standard voice is unique, without a priori probability and without repeating the test for the entire population [3, 4]. According to Saks and Koehler [3], the most reasonable hypotheses would be

Punctual inference in FSC is based on a score of (dis)similarity [5–7]. Interval inference is a tradeoff between precision and confidence: it sacrifices some precision of the estimate by moving from a point to a range, but it results in greater confidence that the statement is correct (that is, that the parameter lies inside the interval) [8] (p. 418).

Reports on interval inference in automatic speaker recognition (ASR) began with Bisani and Ney [9], who used bootstrap [10] to compute confidence intervals. Subsequently, Campbell et al. [11] computed confidence intervals using a multilayer perceptron (MLP) based on statistical entropy. Later, Koval and Lokhanova [12] used a sigmoid function to approximate the a posteriori probability $P(H_0 \mid X)$, where $X$ is the voice data and $H_0$ is the null hypothesis, using Platt scaling [13], and estimated credibility intervals. The credibility interval can also be computed by empirical methods (Morrison et al. [14]).

The present work proposes the application of the full Bayesian significance test (FBST) to compute evidence intervals of FSC. This proposal aims to obtain the same confidence of capturing the parameter of interest in FSC and to reduce type I errors, reinforcing the legal aphorism of Absolvere nocentem satius est, quam condemnare innocentem. One of the motivations of this work, among others, is to establish a confidence limit of the automatic speaker comparison techniques, primarily when used as a support to quantify an FSC [15].

No applications of the FBST to FSC were found in the bibliographic survey carried out for this research. Thus, the main contribution of this work is that it proposes an application of the FBST to FSC and develops a method to calculate the FBST for the distribution of the expected value (mean) with unknown variance without using Markov chain Monte Carlo (MCMC).

The results indicate that the application of the FBST to FSC can improve the evaluation of results by the LR framework, reducing the occurrence of type I errors. The FBST also supports decisions on multispeaker comparisons.

The paper is organized as follows. Section 2 presents the FBST and our proposed improvements and proposes adaptations for FSC. The GMM-UBM method was chosen because, in previous experiments, it presented more satisfactory results than the i-vector- and x-vector-based methods with deep neural networks (DNN). These experiments were performed with the database in Portuguese cited in this article and with voices provided by the forensic sector of the Civil Police of Minas Gerais (Brazil); their results are in the process of being published. Section 3 compares the evidence interval to other methods. Section 4 presents the conclusion and future research directions.

2. Evidence FSC Interval with the FBST

2.1. Interval Inference in FSC

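In classical FSC, the comparison is performed between features of the questioned-voice, $X_q$, and features of the known-voice, $X_k$. The features of the universal background model (UBM), $X_{UBM}$, represent the average speaker [5].

The LR can be computed using a GMM-UBM. In this case, the LR equivalent score for a feature frame $x_t$ of the questioned-voice is computed as follows:

$$\mathrm{LLR}(x_t) = \log p(x_t \mid \lambda_k) - \log p(x_t \mid \lambda_{UBM}). \tag{2}$$

Furthermore, $p(x_t \mid \lambda_k)$ and $p(x_t \mid \lambda_{UBM})$ are, respectively, the evaluation of the data on the GMM of the known-voice, $\lambda_k$, and on the GMM of the UBM, $\lambda_{UBM}$.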

The GMM-UBM is a methodology applied to voice comparison [7, 16, 17]. In the first studies [5, 18], the GMM-UBM methodology was applied using Mel-frequency cepstrum coefficients (MFCC).

The first step in the GMM-UBM procedure is to compute the GMM of the known-voice, $\lambda_k$, and of the UBM, $\lambda_{UBM}$, which can be computed using the expectation-maximization (EM) algorithm [5]. In the second step, the score of the comparison is obtained as a ratio between two likelihoods: the questioned-voice evaluated on the known-voice model and the questioned-voice evaluated on the UBM.

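The score proposed by Reynolds et al. [5] is the sample mean of the log-likelihood ratio (LLR) over the $T$ speech frames $x_1, \dots, x_T$ of the questioned-voice:

$$\mathrm{Score} = \frac{1}{T} \sum_{t=1}^{T} \left[ \log p(x_t \mid \lambda_k) - \log p(x_t \mid \lambda_{UBM}) \right], \tag{3}$$

where $\lambda_k$ and $\lambda_{UBM}$ are the GMMs of the known-voice and of the UBM, respectively.

Because the features are not independent and identically distributed (i.i.d.), the resulting values are not, technically, a likelihood ratio. Normalization by the number of frames, $T$, also removes the duration effects from the log-likelihood value. However, because the score of equation (3) is a sample mean, it allows us to include an interval-based inference.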
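As a minimal sketch of the score described above, the following standard-library Python code evaluates per-frame LLR values on two diagonal-covariance GMMs and averages them over the frames. The model layout `(weights, means, variances)`, the function names, and the toy numbers are assumptions made for illustration; they are not taken from the original experiments.

```python
import math

def gmm_logpdf(frame, weights, means, variances):
    """Log-density of one feature frame under a diagonal-covariance GMM."""
    comps = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for x, m, v in zip(frame, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        comps.append(ll)
    top = max(comps)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(c - top) for c in comps))

def frame_llrs(frames, known, ubm):
    """Per-frame log-likelihood ratios: known-voice GMM versus UBM."""
    return [gmm_logpdf(f, *known) - gmm_logpdf(f, *ubm) for f in frames]

def score(frames, known, ubm):
    """Sample mean of the frame LLR values over all frames (in nepers)."""
    llrs = frame_llrs(frames, known, ubm)
    return sum(llrs) / len(llrs)

# Hypothetical 2-dimensional models: (weights, means, variances).
ubm = ([0.5, 0.5], [[0.0, 0.0], [2.0, 2.0]], [[1.0, 1.0], [1.0, 1.0]])
known = ([0.5, 0.5], [[1.0, 1.0], [2.0, 2.0]], [[0.5, 0.5], [1.0, 1.0]])
frames = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8]]
print(score(frames, known, ubm))
```

The list returned by `frame_llrs` is the time series on which an interval inference can later be computed, whereas `score` is the point (punctual) value.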

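Calculating the interval inference is possible empirically or analytically over the sample space. The widespread empirical approaches include bootstrap [10], jackknife [19], and the method proposed by Morrison et al. [14]. One possible analytical method uses the t-Student distribution of Gosset [20, 21]:

$$\bar{x} - t_{\alpha/2,\,T-1} \frac{s}{\sqrt{T}} \le \mu \le \bar{x} + t_{\alpha/2,\,T-1} \frac{s}{\sqrt{T}}, \tag{4}$$

where $\bar{x}$ is the sample mean of the LLR values, $s$ is the sample standard deviation, $\mu$ is the expected value of the LLR, and $t_{\alpha/2,\,T-1}$ is the critical value of a t-Student distribution with significance $\alpha$ and $T-1$ degrees of freedom.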
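The analytical t-Student interval can be sketched with the Python standard library alone. Because the standard library has no t quantile function, this sketch obtains it by numerically integrating the t density and bisecting on the resulting CDF; that numerical route and all function names are assumptions of this illustration, not part of the original method.

```python
import math

def t_pdf(t, nu):
    """Density of the Student t distribution with nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(t * t / nu))

def t_cdf(t, nu, steps=2000):
    """CDF for t >= 0 via composite Simpson integration of the density."""
    h = t / steps
    total = t_pdf(0.0, nu) + t_pdf(t, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 0.5 + total * h / 3

def t_quantile(p, nu):
    """Quantile for p > 0.5, found by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, nu) < p:
        hi *= 2
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if t_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gosset_interval(values, alpha=0.05):
    """Two-sided (1 - alpha) interval for the mean with unknown variance."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    half = t_quantile(1 - alpha / 2, n - 1) * s / math.sqrt(n)
    return mean - half, mean + half

print(gosset_interval([1.0, 2.0, 3.0, 4.0, 5.0]))
```

In practice a statistics library would normally supply the quantile; the hand-rolled version is used here only to keep the sketch self-contained.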

In Section 3, we compare our evidence interval computed using the FBST to Morrison’s credibility intervals and to the analytical method in equation (4).

Morrison’s approach [14, 22] uses two samples of voice per speaker and measures the LLR from the vowel formants. In these works, the credibility intervals were computed from raw data rather than from a statistic such as the mean. We propose a small modification to Morrison’s approach such that the computation is based on the sample mean instead of the raw data.

2.2. Full Bayesian Significance Test

The FBST can be used to compute evidence against a precise hypothesis $H: \mu = \eta$, where $\eta$ is a value in the parametric space of the LLR of equation (2).

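The FBST [23, 24] is a coherent Bayesian significance test for sharp hypotheses. The test is based on an evidence value (e-value), whose original definition was motivated by practical, juridical, and epistemological requirements. Consider the parametric space $\Theta \subseteq \mathbb{R}^n$, a subset $\Theta_H \subset \Theta$, and a precise (null) hypothesis $H$ that the parameter lies in the null set, defined by the inequality and equality constraints given by the vector functions $g$ and $h$ in the parameter space:

$$H: \theta \in \Theta_H = \{\theta \in \Theta \mid g(\theta) \le 0 \wedge h(\theta) = 0\}. \tag{5}$$

For the experimental data $x$, the a posteriori density is proportional to the product of the likelihood and the a priori density [25]:

$$p(\theta \mid x) \propto L(\theta \mid x)\, p(\theta), \tag{6}$$

where $p(\theta)$ is an a priori density and $L(\theta \mid x)$ is the likelihood. The points of the parameter space with the highest “surprise” in the null set are

$$\theta^{*} = \arg\max_{\theta \in \Theta_H} p(\theta \mid x), \tag{7}$$

while the highest relative surprise set (HRSS), $\mathcal{T}$, is

$$\mathcal{T} = \{\theta \in \Theta \mid p(\theta \mid x) > p(\theta^{*} \mid x)\}. \tag{8}$$

The Bayesian evidence value against $H$ is the a posteriori probability of the “tangent” set $\mathcal{T}$; that is,

$$\overline{ev}(H) = \Pr(\theta \in \mathcal{T} \mid x) = \int_{\mathcal{T}} p(\theta \mid x)\, d\theta, \tag{9}$$

where $\Pr(\theta \in \mathcal{T} \mid x)$ is the probability that the parameter $\theta$ is inside $\mathcal{T}$. The e-value associated with the FBST is

$$ev(H) = 1 - \overline{ev}(H). \tag{10}$$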
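To make the definition concrete, the sketch below evaluates the evidence for a one-dimensional parameter on a grid: it finds the posterior density at the null point, collects the grid points whose density exceeds it (the “tangent” set), and sums their posterior mass. The one-dimensional setting, the grid approximation, the use of the posterior density itself as the surprise function, and the names `fbst_grid` and `log_post` are assumptions made for illustration.

```python
import math

def fbst_grid(log_post, theta0, lo, hi, steps=20001):
    """Return (ev_bar, ev) for H: theta = theta0.

    log_post is an unnormalised log-posterior evaluated on [lo, hi].
    """
    h = (hi - lo) / (steps - 1)
    logs = [log_post(lo + i * h) for i in range(steps)]
    top = max(logs)
    dens = [math.exp(v - top) for v in logs]
    s_star = math.exp(log_post(theta0) - top)     # density at the null point
    total = sum(dens)                             # normalising sum
    tangent = sum(d for d in dens if d > s_star)  # mass of the tangent set
    ev_bar = tangent / total
    return ev_bar, 1.0 - ev_bar

# Hypothetical posterior: normal with mean 0.40 and standard deviation 0.10.
log_post = lambda t: -0.5 * ((t - 0.40) / 0.10) ** 2
ev_bar, ev = fbst_grid(log_post, theta0=0.25, lo=-0.6, hi=1.4)
print(ev_bar, ev)
```

For a symmetric unimodal posterior such as this one, the tangent set is the interval of points closer to the mode than the null point, so the grid result can be checked against the normal probability of that interval.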

The e-value is a probability in the parameter space ($\mu$ and $\rho$), whereas the p-value is a probability in the sample space [26]. In Section 2.3, we use the e-value and $\overline{ev}$ (the Bayesian evidence value against $H$) to compute the evidence interval in FSC using the hypothesis $H: \mu = \eta$.

2.2.1. Improvement of the FBST over the Mean with an Unknown Variance

This section describes a method to compute the FBST for a distribution of the mean (expected value) with an unknown variance. To lower the computational cost, we focus on a mostly analytical development. This is important in order to limit the computation time of the e-value over the η-space.

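Consider a normally distributed sample with $n$ i.i.d. observations, $x_i \sim N(\mu, 1/\rho)$, where $\mu$ is the expected value and $\rho$ is the precision. The minimal sufficient statistic could be the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and the total sum of squares $S = \sum_{i=1}^{n} (x_i - \bar{x})^2$. The likelihood function for $\mu$ and $\rho$ [26] is

$$L(\mu, \rho \mid x) \propto \rho^{n/2} \exp\left\{-\frac{\rho}{2}\left[S + n(\mu - \bar{x})^2\right]\right\}. \tag{11}$$

Taking the a priori noninformative distribution $p(\mu, \rho) \propto 1/\rho$ [27], the a posteriori probability density function (PDF) is [26]

$$f(\mu, \rho) = c\, \rho^{a} \exp\left\{-\frac{\rho}{2}\left[S + n(\mu - \bar{x})^2\right]\right\}, \tag{12}$$

where

$$a = \frac{n}{2} - 1, \tag{13}$$

and $c$ is calculated such that the integral of equation (12) over the parameter space is 1. Henceforth, we write the PDF as $f(\mu, \rho)$. Setting its partial derivatives (the gradient) to zero leads to the maximum $(\hat{\mu}, \hat{\rho})$:

$$\hat{\mu} = \bar{x}, \qquad \hat{\rho} = \frac{2a}{S} = \frac{n-2}{S}. \tag{14}$$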

Figure 1 shows an example of the FBST evaluation over the posterior $f(\mu, \rho)$. The bell-shaped surface is $f(\mu, \rho)$ and the solid black line is the restriction of the null hypothesis $H: \mu = \eta$. The maximum value of the black line delimits the “tangent” set, represented as a dash-dot line. The dotted line marks an additional restriction in the parameter space.

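The evidence against the null hypothesis is evaluated by equation (9). The main works on the FBST over the distribution of a mean with an unknown variance [26, 28, 29] use MCMC to solve the integral in equation (9). However, the posterior of equation (12) has the form $f(\mu, \rho) = c\,\rho^{a}\exp\{-\frac{\rho}{2}[S + n(\mu - \bar{x})^2]\}$, with $a = n/2 - 1$, sample mean $\bar{x}$, and sum of squares $S$. On the restriction $\mu = \eta$, it attains its maximum $f^{*} = f(\eta, \rho^{*})$ at $\rho^{*} = 2a / [S + n(\eta - \bar{x})^2]$. For this posterior, it can be shown that the “tangent” set has the extreme points $\rho_{min}$, $\rho_{max}$, $\mu_{min}$, and $\mu_{max}$ (as in Figure 2), where

$$\mu_{min} = \bar{x} - |\eta - \bar{x}|, \qquad \mu_{max} = \bar{x} + |\eta - \bar{x}|. \tag{15}$$

Making $\mu = \bar{x}$ in equation (12) and equating the result to $f^{*}$ results in

$$c\, \rho^{a} \exp\left(-\frac{S}{2}\rho\right) = f^{*}, \tag{16}$$

and grouping variables and taking the natural logarithm on both sides yields

$$a \ln \rho - \frac{S}{2}\rho = \ln\frac{f^{*}}{c} = a\,(\ln \rho^{*} - 1), \tag{17}$$

with the roots being

$$\rho_{min} = -\frac{2a}{S} W_{0}(z), \qquad \rho_{max} = -\frac{2a}{S} W_{-1}(z), \qquad z = -\frac{1}{e}\,\frac{S}{S + n(\eta - \bar{x})^2}, \tag{18}$$

where $W$ is the Lambert-W function [30], with $W_{0}$ and $W_{-1}$ denoting its two real branches. By the symmetry of $f(\mu, \rho)$ over the $\mu$-axis, we can compute the evidence by

$$\overline{ev} = 2 \int_{\rho_{min}}^{\rho_{max}} \int_{\bar{x}}^{\mu_r(\rho)} f(\mu, \rho)\, d\mu\, d\rho, \tag{19}$$

where $\mu_r(\rho)$ is the contour function (from any boundary) on the $\mu$-axis of the “tangent” set (Figure 2). The contour of the “tangent” set can be defined as

$$f(\mu, \rho) = f^{*}, \tag{20}$$

where $f^{*} = f(\eta, \rho^{*})$. The roots of equation (20) in $\mu$ define the left and right sides of the contour (see Figure 2):

$$\mu_{l,r}(\rho) = \bar{x} \mp \sqrt{\frac{2}{n\rho}\left[a \ln \rho - \ln\frac{f^{*}}{c}\right] - \frac{S}{n}}. \tag{21}$$

Note that $\mu_r(\rho)$ and $\mu_l(\rho)$ are the contours for values greater and less than $\bar{x}$, respectively. By symmetry, we compute the inner integral of equation (19) as

$$2 \int_{\bar{x}}^{\mu_r(\rho)} f(\mu, \rho)\, d\mu = c \sqrt{\frac{2\pi}{n}}\, \rho^{a - 1/2} \exp\left(-\frac{S}{2}\rho\right) \mathrm{erf}\left(\sqrt{\frac{n\rho}{2}}\,\left[\mu_r(\rho) - \bar{x}\right]\right), \tag{22}$$

where $\mathrm{erf}$ is the error function. We can simplify the argument of this function as

$$\sqrt{\frac{n\rho}{2}}\,\left[\mu_r(\rho) - \bar{x}\right] = \sqrt{a \ln\frac{\rho}{\rho_{min}} - \frac{S}{2}\,(\rho - \rho_{min})}, \tag{23}$$

where $\rho_{min}$ is the inferior limit of $\rho$, which depends through equation (18) on the value $\eta$ of the hypothesis test $H: \mu = \eta$. Thus, we can rewrite equation (19) as the one-dimensional integral:

$$\overline{ev} = c \sqrt{\frac{2\pi}{n}} \int_{\rho_{min}}^{\rho_{max}} \rho^{a - 1/2} \exp\left(-\frac{S}{2}\rho\right) \mathrm{erf}\left(\sqrt{a \ln\frac{\rho}{\rho_{min}} - \frac{S}{2}\,(\rho - \rho_{min})}\right) d\rho. \tag{24}$$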

The integral in equation (24) does not need MCMC techniques, thus demanding less computational effort than equation (9) does.
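A minimal numerical sketch of this one-dimensional evaluation follows, under stated assumptions: a normal model with the noninformative prior $1/\rho$, the sample summarized by its mean `xbar` and sum of squares `S`, and $n > 2$. The bisection-based Lambert-W routine, the Simpson quadrature, and all function names are choices made for this illustration rather than the authors' implementation.

```python
import math

def lambert_w(z, branch=0):
    """Real Lambert W for -1/e <= z < 0, by bisection (branch 0 or -1)."""
    g = lambda w: w * math.exp(w) - z
    if branch == 0:
        lo, hi = -1.0, 0.0            # w*exp(w) increases on [-1, 0]
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if g(mid) > 0.0:
                hi = mid
            else:
                lo = mid
    else:
        lo, hi = -2.0, -1.0           # w*exp(w) decreases on (-inf, -1]
        while g(lo) <= 0.0:
            lo *= 2.0
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if g(mid) > 0.0:
                lo = mid
            else:
                hi = mid
    return 0.5 * (lo + hi)

def ev_bar_mean(xbar, S, n, eta, steps=2000):
    """Evidence against H: mu = eta from the one-dimensional integral."""
    a = n / 2.0 - 1.0
    z = -S / (math.e * (S + n * (eta - xbar) ** 2))
    rho_min = -(2.0 * a / S) * lambert_w(z, 0)
    rho_max = -(2.0 * a / S) * lambert_w(z, -1)
    # log of c*sqrt(2*pi/n), where c normalises the posterior density
    log_k = 0.5 * (n - 1) * math.log(S / 2.0) - math.lgamma(0.5 * (n - 1))

    def integrand(rho):
        arg2 = a * math.log(rho / rho_min) - 0.5 * S * (rho - rho_min)
        weight = math.exp(log_k + (a - 0.5) * math.log(rho) - 0.5 * S * rho)
        return weight * math.erf(math.sqrt(max(arg2, 0.0)))

    h = (rho_max - rho_min) / steps   # composite Simpson rule, steps even
    total = integrand(rho_min) + integrand(rho_max)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(rho_min + i * h)
    return total * h / 3.0

# Hypothetical sample summary: n = 10, mean 0.0, sum of squares 9.0.
print(ev_bar_mean(0.0, 9.0, 10, 0.5))
```

The evidence is zero when `eta` equals the sample mean and approaches one as `eta` moves away from it, so the e-value `1 - ev_bar_mean(...)` can be traced along the `eta` axis at the cost of one quadrature per point.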

2.3. Proposed Method

This section proposes a method to compute the evidence interval with a Bayesian evidence level $\alpha$, which can be computed using equation (24). The result in the GMM-UBM scenario is the sample mean of the time series $\mathrm{LLR}(x_t)$, as equation (3) shows, on the parametric space $\eta$.

Consider the time series $\mathrm{LLR}(x_t)$ with a parametric mean (expected value) of $\mu$, precision $\rho$, and sample mean $\bar{x}$. From this, it is possible to define the evidence interval of $\mu$ as the subspace $[\eta_{inf}, \eta_{sup}]$, where $\eta_{sup}$ and $\eta_{inf}$ are values above and below $\bar{x}$, respectively. At both limits, that is, for the precise hypotheses $H: \mu = \eta_{inf}$ and $H: \mu = \eta_{sup}$, the e-value equals $\alpha$, so the Bayesian evidence against each hypothesis is $\overline{ev} = 1 - \alpha$ (see equation (24)).

Outside this range of the LLR, $[\eta_{inf}, \eta_{sup}]$, the evidence (e-value computed by the FBST) that the parametric mean ($\mu$) is higher than $\eta_{sup}$ or lower than $\eta_{inf}$ is less than $\alpha$.

We are aware that the definition above does not fit the traditional confidence (or credibility) interval as defined in [31]. However, it is an analytical method based on the parameter space, and it represents the limits within which the sample can provide Bayesian evidence (“significance”) of $\alpha$.

For example, consider that the comparison between a questioned-voice and a known-voice generates a time series $\mathrm{LLR}(x_t)$, where the values of the frames in equation (2) are used. Figure 3 shows the statistical distribution of these LLR values on the normalized histogram (Norm. Hist.) in the left panel. In this panel, the solid light gray line is the empirical PDF (emp. PDF) and the small circle over this curve indicates the sample mean $\bar{x}$. The dash-dotted rectangle on the left graph is the region shown on the right graph. The sample mean of the series is expressed in nepers (Np); the neper is the natural-logarithm unit of ratios, named after John Napier.

The evaluation of the hypothesis $H: \mu = \eta$ along the variable $\eta$ in the LLR space with the FBST (equation (24)) yields the e-value curve (ev-curve, solid dark gray) shown in the right graph of Figure 3. This curve is computed by sampling the $\eta$ space and solving equation (24) for each sample. On this graph, the horizontal dash-dotted line (ev = 0.05) indicates the Bayesian evidence level (significance), that is, an e-value of 0.05. The horizontal solid black error bar indicates the evidence interval and the sample mean.
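The sampling procedure described above can be sketched as follows: the e-value is evaluated on a grid of $\eta$ values, and the interval limits are the smallest and largest sampled values whose e-value reaches $\alpha$. The function `toy_e_value` is a hypothetical placeholder curve, used only to keep the sketch self-contained; in the proposed method the e-value would come from the FBST integral instead.

```python
import math

def evidence_interval(e_value, eta_lo, eta_hi, alpha=0.05, samples=2001):
    """Smallest and largest sampled eta whose e-value is at least alpha."""
    step = (eta_hi - eta_lo) / (samples - 1)
    grid = [eta_lo + i * step for i in range(samples)]
    kept = [eta for eta in grid if e_value(eta) >= alpha]
    if not kept:
        raise ValueError("no sampled eta reaches the requested e-value")
    return kept[0], kept[-1]

def toy_e_value(eta, centre=0.40, scale=0.05):
    """Placeholder e-value curve: 1 at the centre, decaying on both sides."""
    return 1.0 - math.erf(abs(eta - centre) / (scale * math.sqrt(2.0)))

eta_inf, eta_sup = evidence_interval(toy_e_value, 0.0, 1.0, alpha=0.05)
print(eta_inf, eta_sup)
```

A grid search is the simplest choice and mirrors the sampling of the $\eta$ space; because the e-value decreases monotonically on each side of the sample mean, a bisection on each side would reach the same limits with fewer evaluations.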

3. Comparison with Other Methods

This section presents an experiment and a case study involving the range of evidence. We conducted the training and testing stages with the voice data set CEFALA-1 [32], which contains 104 speakers (55 men and 49 women) recorded with five microphones (generating 520 records). The validation step used 50 recordings that do not belong to the CEFALA-1 corpus. This validation emulates an open-set database in speaker comparison.

We designed an experiment to compare the proposed interval inference method with other methods used in FSC. The experiment used the 104 voices, narrowband filtered (fourth-order Butterworth) in the 300–3500 Hz range and resampled to 8 kHz, which is compatible with the Brazilian mobile phone system.

In order to compare the various interval inference methods, we need to use the speech database to define the known-voice and questioned-voice sets. We do this as follows. For each subject 50% of voice content was used as known-voice and 50% as questioned-voice, both in the CEFALA-1 corpus and in the validation recordings.

In order to emulate forensic conditions, both the known-voice and questioned-voice data are subject to 3 types of degradation. First, the data are contaminated with pink noise at the following SNR levels: 25 dB, 23 dB, 20 dB, 17 dB, 15 dB, and 12 dB. Next, the data are encoded and then decoded by a GSM 06.60 codec [33]. Finally, the data are run through a narrowband filter (300–3500 Hz).
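The first degradation step, contamination at a prescribed SNR, can be sketched as scaling the noise so that the ratio of signal power to noise power matches the target. The sketch below is an assumption-laden illustration: it uses white Gaussian noise as a stand-in for pink noise, a synthetic sine wave as the signal, and hypothetical function names; the codec and filtering steps are not covered.

```python
import math
import random

def power(samples):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)

def mix_at_snr(signal, noise, snr_db):
    """Add noise to signal after scaling it to the requested SNR in dB."""
    gain = math.sqrt(power(signal) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(signal, noise)]

random.seed(0)
signal = [math.sin(0.05 * i) for i in range(4000)]       # synthetic signal
noise = [random.gauss(0.0, 1.0) for _ in range(4000)]    # white, not pink
for snr_db in (25, 23, 20, 17, 15, 12):
    noisy = mix_at_snr(signal, noise, snr_db)
    residual = [y - s for y, s in zip(noisy, signal)]
    print(snr_db, 10.0 * math.log10(power(signal) / power(residual)))
```

The printed values recover the requested SNR levels, which confirms that the gain is computed from the power ratio rather than from the amplitude ratio.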

The features were extracted as MFCCs using 13 critical bands (filters) over short frames with a fixed frame length and frame step. The features include delta ($\Delta$) and delta-delta ($\Delta\Delta$) coefficients. We used Sohn’s [34] method for voice activity detection (VAD) to identify the voiced frames.

The methods used to compute interval inference (significance $\alpha$) were

Gosset: confidence interval computed by equation (4)

Morrison: empirical credibility interval computed by combining k-nearest neighbors (KNN) with linear regression, as described by Morrison [14]

FBST: the proposed method, which computes the evidence interval as a subspace of the parametric space where the e-value is $\alpha$

Figure 4 presents examples of the interval inference. In the figure, we show the LLR values along the horizontal axis. The inference intervals are shown as horizontal lines, with the circles indicating the mean values and dot-dashed vertical line indicating the decision threshold.

The horizontal light gray line indicates a same-speaker comparison, and dark gray indicates a different-speaker comparison. The scenarios are (a) correct comparison, (b) an intermediate region where the comparison threshold is within the inference interval, and (c) comparison error (Type I or Type II).
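The three scenarios can be expressed as a small decision rule on the interval limits and the decision threshold. The sketch below is illustrative; the function name, the argument layout, and the convention that an interval lying entirely above the threshold indicates the same speaker are assumptions consistent with the description above.

```python
def scenario(interval, threshold, same_speaker):
    """Classify one comparison as 'a' (correct), 'b' (inconclusive), or 'c' (error).

    interval: (lower, upper) limits of the inference interval, in LLR units.
    threshold: decision threshold in the same units.
    same_speaker: ground truth of the comparison.
    """
    lower, upper = interval
    if lower <= threshold <= upper:
        return "b"                      # threshold falls inside the interval
    decided_same = lower > threshold    # whole interval above the threshold
    return "a" if decided_same == same_speaker else "c"

print(scenario((0.40, 0.60), 0.25, same_speaker=True))
print(scenario((0.20, 0.30), 0.25, same_speaker=True))
print(scenario((0.40, 0.60), 0.25, same_speaker=False))
```

A punctual inference corresponds to the degenerate case in which both limits equal the score, so scenario (b) arises only when the score coincides exactly with the threshold.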

The method proposed by Morrison et al. [14] computes the credibility interval over the data themselves, not over the mean (expected value) of the data. We therefore adapted Morrison’s method to operate on the means of 50 subsamples drawn with replacement (similar to bootstrap [10]).
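Only the resampling step of this adaptation is sketched below: drawing 50 subsamples with replacement and taking their means, followed by a plain percentile interval over those means. The percentile interval is a stand-in chosen for illustration and is not Morrison’s procedure, which combines KNN with linear regression; the function names and the toy data are hypothetical.

```python
import math
import random

def subsample_means(values, draws=50, seed=0):
    """Means of `draws` subsamples drawn with replacement from values."""
    rng = random.Random(seed)
    n = len(values)
    return [sum(rng.choices(values, k=n)) / n for _ in range(draws)]

def percentile_interval(means, alpha=0.05):
    """Simple percentile interval over the subsample means."""
    ordered = sorted(means)
    last = len(ordered) - 1
    lower = ordered[int(math.floor(alpha / 2.0 * last))]
    upper = ordered[int(math.ceil((1.0 - alpha / 2.0) * last))]
    return lower, upper

values = [0.1 * i for i in range(100)]   # toy LLR series
means = subsample_means(values, draws=50, seed=1)
print(percentile_interval(means))
```

Working on subsample means rather than raw values makes the resulting interval comparable with intervals for the expected value, which is the quantity the other two methods address.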

We evaluated the performance of each interval inference method based on results presented in Figure 4. We expected that a comparison between the GMM model of a given speaker and a set of features coming from that speaker (same-speaker comparison hereafter) results in a higher LLR value than a comparison between that same GMM model and a set of features coming from a different speaker (different-speaker comparison hereafter). The training and testing stage, using only samples from the CEFALA-1 corpus with contaminations between 12 and 25 dB, presented an equal error rate (EER) of 8.1% with threshold at LLR = 0.25 Np. The results presented below cover the test and validation steps.
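The equal error rate and its threshold can be estimated from two lists of scores by sweeping candidate thresholds and picking the one where the false negative and false positive rates are closest. The sketch below is a simple illustration with hypothetical toy scores; it is not the calibration code used in the experiments.

```python
def equal_error_rate(same_scores, diff_scores):
    """Return (eer, threshold) from same-speaker and different-speaker scores."""
    best = None
    for thr in sorted(set(same_scores) | set(diff_scores)):
        fnr = sum(s < thr for s in same_scores) / len(same_scores)
        fpr = sum(d >= thr for d in diff_scores) / len(diff_scores)
        gap = abs(fnr - fpr)
        if best is None or gap < best[0]:
            best = (gap, 0.5 * (fnr + fpr), thr)
    return best[1], best[2]

same = [0.9, 0.8, 0.7, 0.1]      # toy same-speaker scores
diff = [0.0, -0.2, 0.2, -0.5]    # toy different-speaker scores
print(equal_error_rate(same, diff))
```

With few comparisons the two error rates may never coincide exactly, which is why the sketch reports the average of both rates at the closest threshold.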

Figure 5 shows the number of correct classifications in scenario (a). The number of correct classifications for the evidence interval (vertical light gray bar) is smaller than that of the other methods (interval and punctual). The comparisons with the best interval method yield values of 84.0% against 84.4% for an SNR of 12 dB, 84.4% against 88.6% for 15 dB, and a difference of less than 1% for the other SNR values. These values represent a loss of accuracy of less than 0.5% compared to interval inference. Compared to the punctual inference, the loss in accuracy is less than 1.6% for the other SNR values.

The intermediate results, in which the intervals overlap, are exemplified in Figure 4 by comparisons (b). These scenarios are deemed inconclusive and represent an In dubio pro reo condition, meaning that a defendant should not be convicted when doubts remain about his or her guilt (association between questioned- and known-voices).

In the punctual inference, scenario (b) does not occur, and there is no transition region. Thus, in the interval inference, scenarios (a) and (c) are decisive, and the intermediate scenario, (b), indicates that the results have some equivalence; that is, there is a chance that the comparison between different speakers will be larger (or smaller) than the comparison between the same speakers.

Figure 6 shows the comparison results for various interval inference methods (Gosset, Morrison’s method, and FBST). The results are grouped by the SNR level. The panel indicates the percentage of inconclusive interval inferences (b), wrong interval inferences (c), and punctual error inferences (dashed vertical line).

Compared to the punctual inference (dashed vertical line), the evidence interval computed by the FBST (horizontal light gray bar) reduces the number of wrong inferences by 1.6%, 1.1%, 0.9%, 0.7%, 0.6%, and 0.4%, respectively, for SNRs from 12 dB to 25 dB (see Figure 6). Compared to the other methods of interval inference (horizontal bars), the evidence interval (horizontal light gray bar) presents a number of incorrect inferences (c) that is less than or equal to theirs.

These results can be explained by checking the size of the intervals for each method in Figure 7. In this figure, points represent the raw data (jittered horizontally), the horizontal line shows the sample mean, and the lateral lines represent a smoothed density. Table 1 summarizes the values contained in Figures 5, 6, and 7.

On average, the length of the evidence interval (computed by the FBST) is 24% larger than the interval calculated by the Gosset method and 15% larger than the interval calculated by Morrison’s method (see Table 1). The evidence intervals also present a higher dispersion than those of the other methods.

Another way to measure the influence of interval inference is to exclude from the confusion matrix the comparisons that result in scenario (b) of Figure 4. In this way, a fifth category, “In dubio pro reo,” may be included. Table 2 shows how the inclusion of the interval inference, with the “In dubio pro reo” category, changes the percentages of true positives, true negatives, false positives, and false negatives. The table shows the EER calibration of 8.1%. However, for open-set validation, the GMM-UBM methodology presents a false positive rate of 9.4%, which reduces to 8.4% using the range of evidence calculated from the FBST.

4. Conclusion and Future Work

This paper presented an improvement to the FBST calculation for the distribution of a mean with an unknown variance. This improvement obviates the need for MCMC techniques to calculate the FBST integral. Compared with other methods, the evidence interval was more conservative, incrementally reducing Type I and Type II errors in low-SNR scenarios.

Although the results do not present a significant improvement in the reduction of the false positive rate, for open sets, the present work helps to understand the limits of the GMM-UBM methodology applied to FSC. The contribution of the range of evidence may seem insignificant. However, in the case of sex crimes, especially against children, understanding the limits of each tool in the FSC helps the forensic expert to make more informed decisions.

Possible developments of the present work include improving the FBST for the Behrens–Fisher problem, combining the evidence interval with background database calibration, and running tests with different features, such as Power Normalized Cepstral Coefficients (PNCC) and Perceptual Linear Predictive (PLP) coefficients, and with different noise conditions. The application of the interval inference to speaker verification techniques such as i-vector and x-vector is under development and should be discussed in future work.

Data Availability

The audio files (corpus) used in the experiments can be found at http://www.cefala.org. It is the intention of the authors to make available the processed data and the algorithms as soon as the work is published. Basically the data are acoustic features (Mel-frequency cepstrum) and Gaussian mixture models.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors would like to thank N. Phillips for always sharing codes. The authors thank the SPAV team of the Institute of Criminalistics of Minas Gerais and all colleagues and teachers of CEFALA for their practical contributions. This work was carried out with the financial support of the Centro Universitário Newton Paiva.