#### Abstract

For speaker tracking, integrating multimodal information from audio and video provides an effective and promising solution. The main challenge at present is the construction of a stable observation model. To this end, we propose a 3D audio-visual speaker tracker assisted by deep metric learning within a two-layer particle filter framework. First, an audio-guided motion model generates candidate samples in a hierarchical structure consisting of an audio layer and a visual layer. Then, a stable observation model is built on a designed Siamese network, which provides a similarity-based likelihood for calculating particle weights. The speaker position is estimated using an optimal particle set, which integrates the decisions of audio particles and visual particles. Finally, a template update strategy based on a long short-term mechanism is adopted to prevent drift during tracking. Experimental results demonstrate that the proposed method outperforms single-modal trackers and the compared audio-visual methods. Efficient and robust tracking is achieved both in 3D space and on the image plane.

#### 1. Introduction

Audio-visual speaker tracking is a key technology of human-machine interaction, driven by applications such as intelligent surveillance, smart space, and multimedia systems. By analyzing the audio-visual data captured by multimodal sensor arrays, the positions of the speakers in the scene are continuously tracked, providing the underlying basis for subsequent action recognition and interaction. Compared with the conventional single-modal tracking, complementary information from audio and video streams is utilised to improve the tracking accuracy and robustness [1].

Current methods for speaker tracking are built on probabilistic generative models because of their ability to process multimodal information. The Kalman filter (KF) [2], the extended KF (EKF), and the particle filter (PF) [3] are representative state-space approaches based on the Bayesian framework and are commonly used. Among them, PF recursively approximates the filtering distribution of the tracked targets by using dynamic models and random sampling. However, traditional PF assumes that the number of targets is known a priori, which is not suitable for natural scenes containing multiple speakers with random motion. The probability hypothesis density (PHD) filter [4], another stochastic method based on finite set statistics (FISST) theory, was introduced to solve this problem. Unlike the above Bayesian methods, the PHD filter estimates the number of speakers during tracking and is therefore considered promising for multispeaker tracking. However, the PHD filter restricts the propagation of the multitarget posterior distribution to the first-order moment, which loses higher-order cardinality information and leads to errors in the estimated number of speakers in low signal-to-noise ratio situations [5]. PF is selected as the tracking framework in this paper since it readily approaches the Bayesian optimal estimate without being constrained by linear systems and Gaussian assumptions [6].

Many works try to improve the architecture of PF to integrate data streams from different modalities into a unified tracking framework. The direction of arrival (DOA) derived from the audio source is used to reshape the typical Gaussian noise distribution of the particles in the propagation step of PF, and the weights of particles are recalculated according to their distance to the DOA results [7]. Tracking efficiency and accuracy usually depend on the number of particles and noise variance used in the state model and the propagation equation. Moreover, as an enhanced version of the PF, an adaptive algorithm is proposed to dynamically adjust the number of particles and noise variance by using audio-visual information [5]. The audio information obtained from the generalized cross correlation (GCC) algorithm and the video information extracted by the continuous adaptive mean shift (CAMshift) method are combined using a particle swarm optimization- (PSO-) based fusion technology [8]. The PSO algorithm can also be utilised to optimize the particle sampling in PF and improve the particle convergence to the active speaker region by incorporating an interaction mechanism [9].

To analyze and infer the dynamic system used for speaker tracking, Bayesian theory provides an effective framework, which includes a state model and an observation model. The state model describes the evolution of the state over time, and the observation model associates the observed information with the state of the speaker [10]. The prevailing fusion strategies in PF-based frameworks modify the observation model to fuse the observations collected from multisensor devices [5]. Specifically, audio and visual likelihoods are constructed separately in the observation model to update the particle weights. A joint observation model is proposed in [11], which fuses audio, shape, and structure observations derived from audio and video in a multiplicative likelihood. The visual observation model in [12] is derived from a face detector and reverts to a color-based generative model during misdetection. Furthermore, the visual observation is used to calculate a video-assisted global coherence field- (GCF-) based audio likelihood by limiting the acoustic map to the horizontal plane determined by the predicted height of the speaker [13]. Probabilities of the visual and acoustic observations are combined using an adaptive weighting factor, which is adjusted dynamically according to an acoustic confidence measure based on a generalized cross correlation with phase transform (GCC-PHAT) approach [14]. PF has also been combined with a pretrained convolutional neural network (CNN), which provides a generic target representation. Conventional color histogram-based appearance models cannot deal with sudden changes effectively, whereas a more stable observation model is obtained by fusing deep features and handcrafted features [15].

The purpose of this work is to adopt deep metric learning to optimize the observation model in the PF tracker. The method can effectively describe the similarity between samples by learning a distance metric. By designing the network structure and constructing a distance-based cost function, the similarity between the particle diffusion area and the matching template can be obtained. The likelihood function based on the network output can better define the particle weights and reflect the confidence of observations from different modalities. This work is based on a two-layer PF framework, which achieves audio-visual fusion through a hierarchical structure including an audio layer and a visual layer [16]. In the propagation step, two groups of particles from the audio and video streams are diffused through the audio-guided motion model. In the update step, the similarity between the particle diffusion area and the template is obtained through a pretrained Siamese network to calculate particle weight. In the estimation step, an optimal particle set is constructed to determine the speaker position. Finally, the target template is updated by a long short-term mechanism.

The proposed 3D audio-visual speaker tracker is discussed in the following sections: Section 2 describes the methodology of the tracker in detail, including the motion model, the observation model, the ensemble method, and the template update method; Section 3 presents the experimental results and detailed analysis; finally, Section 4 concludes this work.

#### 2. Methodology

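For the state-space model in the tracking task, recursive filtering is commonly used to estimate the dynamic system based on Bayesian theory. In PF, the state of the speaker is estimated from the posterior probability distribution $p(\mathbf{s}_t \mid \mathbf{z}_{1:t})$, which is approximated by a set of random particles with associated weights, where $\mathbf{s}_t$ is the state vector at time $t$, $\mathbf{z}_t$ denotes the current observations from the audio signals and the video frames, and $\mathbf{z}_{1:t}$ collects all observations up to time $t$. Assuming that the target state transition is a first-order Markov process, the required posterior probability distribution is formulated in terms of the likelihood function $p(\mathbf{z}_t \mid \mathbf{s}_t)$ and the state transition model $p(\mathbf{s}_t \mid \mathbf{s}_{t-1})$:

$$p(\mathbf{s}_t \mid \mathbf{z}_{1:t}) \propto p(\mathbf{z}_t \mid \mathbf{s}_t) \int p(\mathbf{s}_t \mid \mathbf{s}_{t-1})\, p(\mathbf{s}_{t-1} \mid \mathbf{z}_{1:t-1})\, d\mathbf{s}_{t-1}.$$

The sampling importance resampling (SIR) algorithm is applied to implement the recursive Bayesian filter by Monte Carlo (MC) simulations [10]. The importance density is chosen to be the prior density $p(\mathbf{s}_t \mid \mathbf{s}_{t-1}^{i})$. As the particles are drawn from this proposal importance density, $\mathbf{s}_t^{i} \sim p(\mathbf{s}_t \mid \mathbf{s}_{t-1}^{i})$, the weights are given by $w_t^{i} \propto p(\mathbf{z}_t \mid \mathbf{s}_t^{i})$.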
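The SIR recursion can be illustrated with a minimal one-dimensional sketch. It is not the tracker itself: the scalar state, the random-walk transition, the Gaussian likelihood, and all names (`sir_step`, `transition`, `likelihood`) are assumptions made only for illustration.

```python
import math
import random


def sir_step(particles, transition, likelihood, rng):
    """One sampling-importance-resampling step with the prior as proposal."""
    # Propagate: draw each particle from the transition model p(s_t | s_{t-1}).
    predicted = [transition(p, rng) for p in particles]
    # Weight: with the prior as importance density, w is proportional to p(z_t | s_t).
    raw = [likelihood(p) for p in predicted]
    total = sum(raw)
    if total <= 0.0:
        # Degenerate case: fall back to uniform weights.
        weights = [1.0 / len(predicted)] * len(predicted)
    else:
        weights = [w / total for w in raw]
    # Estimate: weighted mean of the propagated particles.
    estimate = sum(w * p for w, p in zip(weights, predicted))
    # Resample: draw N particles with probability proportional to their weights.
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, weights, estimate


# Toy example (all values are illustrative assumptions).
rng = random.Random(0)
particles = [rng.uniform(-1.0, 1.0) for _ in range(500)]
observation = 0.3


def transition(state, r):
    return state + r.gauss(0.0, 0.05)  # random-walk motion model


def likelihood(state):
    return math.exp(-((state - observation) ** 2) / (2.0 * 0.1 ** 2))


particles, weights, estimate = sir_step(particles, transition, likelihood, rng)
```

After one step, `estimate` lies close to the toy observation, and the resampled particles concentrate around it.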

Figure 1 depicts the framework of the proposed audio-visual speaker tracker. Audio particles and visual particles are propagated in the audio layer and the visual layer, respectively, driven by the audio-guided motion model (Section 2.1). Through the pretrained Siamese network, the likelihoods are obtained to weight the particles (Section 2.2), and the speaker position is estimated using a set of optimal particles (Section 2.3). Finally, a long short-term mechanism is applied to update the templates of the target (Section 2.4).

##### 2.1. Audio-Guided Motion Model

In audio processing, the DOAs of the input audio signals are adopted as position observations to assist particle diffusion. To obtain the DOA observations, the two-step sam-sparse-mean (SSM) approach is employed for sound source localization [17]. The first step performs sector-based detection and localization. The space around the microphone array is divided into multiple sectors, each with a corresponding SSM activity value. Whether an active source exists in a sector is determined by comparing the activity value with an adaptive threshold. The second step conducts a point-based search in each active sector. A parametric approach is used for localization, which optimizes the spatial position parameters using SRP-PHAT as the cost function.

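The proposed audio-visual speaker tracker is equipped with a set of audio particles $\{\mathbf{s}_t^{a,i}, w_t^{a,i}\}_{i=1}^{N}$ and a set of visual particles $\{\mathbf{s}_t^{v,i}, w_t^{v,i}\}_{i=1}^{N}$, where $\mathbf{s}_t^{a,i}$ and $\mathbf{s}_t^{v,i}$ are support points with associated weights $w_t^{a,i}$ and $w_t^{v,i}$, and $N$ is the number of particles. The audio particle, which propagates in the 3D world, is modeled in spherical coordinates and represented by the state vector

$$\mathbf{s}_t^{a} = [\theta_t, \phi_t, r_t, \dot{\theta}_t, \dot{\phi}_t, \dot{r}_t]^{T},$$

where $(\theta_t, \phi_t, r_t)$ indicates the azimuth, elevation, and radius of the audio particle at time $t$ and $(\dot{\theta}_t, \dot{\phi}_t, \dot{r}_t)$ indicates the corresponding velocities. In contrast, the visual particle is propagated on the image plane and modeled in rectangular coordinates. The state vector of the visual particle is defined as

$$\mathbf{s}_t^{v} = [x_t, y_t, \dot{x}_t, \dot{y}_t]^{T},$$

where $x_t$ and $y_t$ are the horizontal and vertical coordinates in the image frame and $(\dot{x}_t, \dot{y}_t)$ represents the velocities in the corresponding directions.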

In the propagation step, because the elevation and radius estimates are inaccurate, the relatively accurate azimuth estimates are used to optimize the motion model based on the Langevin process [18]. In the azimuth direction of audio particles, the Langevin motion model is expressed as

$$\dot{\theta}_t = a\,\dot{\theta}_{t-1} + b\,F_{\theta}, \qquad \theta_t = \theta_{t-1} + \Delta T\,\dot{\theta}_t,$$

where $\theta_t$ and $\dot{\theta}_t$ are the azimuth of the particle and its velocity at time $t$, $a$ and $b$ are designed parameters, $F_{\theta}$ is zero-mean Gaussian-distributed noise, and $\Delta T$ is the time interval between two consecutive frames. In addition, the azimuth of each particle is further corrected according to the DOA estimation result: a correction factor sets the strength of the adjustment toward the DOA (azimuth) estimated from the microphone measurements. In the elevation and radius directions, the particles are diffused only through the Langevin model without additional adjustment.
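A minimal sketch of the azimuth propagation follows. The Langevin update uses the standard discrete form. The DOA correction is written as a linear pull of strength `lam` toward the DOA, which is an assumed form, since only the existence of a correction factor is stated in the text. The names `wrap_angle`, `propagate_azimuth`, and `lam` are hypothetical.

```python
import math
import random


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def propagate_azimuth(theta, theta_dot, doa, a, b, dt, lam, rng):
    """Propagate one audio particle in azimuth and pull it toward the DOA."""
    # Langevin update: the velocity decays by a and is excited by Gaussian noise scaled by b.
    theta_dot = a * theta_dot + b * rng.gauss(0.0, 1.0)
    theta = wrap_angle(theta + dt * theta_dot)
    # ASSUMED correction form: move a fraction lam of the angular gap toward the DOA.
    theta = wrap_angle(theta + lam * wrap_angle(doa - theta))
    return theta, theta_dot


# Usage with illustrative parameter values (not the values used in the experiments).
rng = random.Random(1)
theta, theta_dot = propagate_azimuth(
    theta=0.0, theta_dot=0.0, doa=0.5, a=0.9, b=0.1, dt=0.04, lam=0.3, rng=rng
)
```

Wrapping the angular gap before scaling it keeps the correction on the shorter arc when the particle and the DOA lie on opposite sides of the ±π boundary.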

For the motion model of visual particles, additional operations related to coordinate transformation are required. A pinhole camera model is employed to project a point located in 3D world coordinates onto the image plane, using a normalization coefficient and the projection matrix of the camera. The elevation and radius of this 3D point are taken from the tracking result of the previous frame. From the projected point on the image plane, its direction relative to the microphone array center is calculated using the coordinates of the array center. Each visual particle is first propagated through the Langevin model, and its coordinates are then modified by the audio-guided motion model and the coordinate transformation. In this way, the DOA information is projected onto the image plane through the pinhole camera model, and the particles converge to the sound source direction by moving toward the projection point.
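The projection step can be sketched as follows. The spherical-to-Cartesian convention (elevation measured from the horizontal plane), the toy 3 × 4 projection matrix, and the function names are assumptions for illustration; in practice the matrix comes from the camera calibration, and the axis convention must match it.

```python
import math


def spherical_to_cartesian(azimuth, elevation, radius):
    """Convert (azimuth, elevation, radius) to Cartesian (X, Y, Z).

    Assumed convention: elevation is measured from the horizontal plane.
    """
    x = radius * math.cos(elevation) * math.cos(azimuth)
    y = radius * math.cos(elevation) * math.sin(azimuth)
    z = radius * math.sin(elevation)
    return (x, y, z)


def project_pinhole(point, projection):
    """Project a 3D point with a 3x4 projection matrix; returns (u, v) or None."""
    homogeneous = (point[0], point[1], point[2], 1.0)
    u_h, v_h, scale = [
        sum(row[k] * homogeneous[k] for k in range(4)) for row in projection
    ]
    if scale <= 0.0:
        return None  # the point lies behind the camera
    # Dividing by the normalization coefficient yields pixel coordinates.
    return (u_h / scale, v_h / scale)


# Toy camera: focal length 500 px, principal point (180, 144), looking along +Z.
TOY_PROJECTION = [
    [500.0, 0.0, 180.0, 0.0],
    [0.0, 500.0, 144.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
pixel = project_pinhole((1.0, 0.0, 2.0), TOY_PROJECTION)
```

With the toy matrix, the point (1, 0, 2) projects to the pixel (430, 144).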

##### 2.2. Deep Metric Learning-Assisted Observation Model

The observation model is constructed to measure the candidate samples determined by the particles. Deep metric learning with a Siamese network provides a solution for similarity metric tasks, converting the tracking problem into a similarity problem in the feature space of the known target and the search area [19]. The framework of the adopted Siamese network is shown in Figure 2. It consists of two subnetworks with the same structure and shared parameters. Each branch consists of three convolutional layers with 15 kernels, followed by a 2 × 2 MaxPooling layer. The filter sizes are 5 × 5, 3 × 3, and 3 × 3, respectively. The input of the network is an image pair $(X_1, X_2)$ with a label $Y$, where $Y$ indicates whether the image pair represents the same speaker. Each image is fed to one CNN branch. Through training, the feature mapping function $G_W$, with parameter $W$, learns to map the input image pair to the target feature space. In this space, a distance-based metric function, $E_W(X_1, X_2)$, is used to measure the similarity of the two images. The loss function proposed in [19] is adopted: it sums, over the training sample pairs, a per-pair loss that applies a positive-pair loss to pairs of the same speaker and a negative-pair loss to pairs of different speakers. A constant in the loss is set to the upper bound of $E_W$.
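As an illustration of a distance-based pair loss of this kind, the sketch below uses a commonly cited contrastive form: a quadratic penalty for same-speaker pairs and an exponentially decaying penalty for different-speaker pairs, with a constant `q` bounding the distance. The exact expression, the L1 distance, the constant 2.77, and all names are assumptions and are not confirmed by the text above.

```python
import math


def l1_distance(features_a, features_b):
    """L1 distance between two feature vectors (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(features_a, features_b))


def pair_loss(distance, same_speaker, q):
    """Loss of one image pair given its feature distance and the upper bound q."""
    if same_speaker:
        # Positive pair: large distances are penalized quadratically.
        return (2.0 / q) * distance ** 2
    # Negative pair: small distances are penalized; the penalty decays with distance.
    return 2.0 * q * math.exp(-2.77 * distance / q)


def total_loss(distances, same_speaker_flags, q):
    """Sum of the per-pair losses over all training pairs."""
    return sum(pair_loss(d, s, q) for d, s in zip(distances, same_speaker_flags))


# Toy usage: one matching pair at distance 0.1, one non-matching pair at 0.8.
example = total_loss([0.1, 0.8], [True, False], q=1.0)
```

Minimizing such a loss pulls same-speaker pairs together in the feature space and pushes different-speaker pairs apart, which is what the similarity-based likelihood relies on.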

The rectangular boxes around the particles are cropped as candidate samples. The audio particles in 3D world coordinates are projected onto the image plane to obtain their rectangular boxes. Candidate samples are fed into the network, and the output indicates the similarity of each sample to the template. The visual likelihood of a particle is calculated from the normalized network output for the bounding box of the particle, together with a designed parameter; the observation noise is assumed to follow a Gaussian distribution. To prevent tracking failures due to occlusion, deformation, and the speaker walking out of the camera view, an audio likelihood is added to modify the particle weights. The audio likelihood is defined using the current DOA estimation result and a second designed parameter. For a visual particle, the inverse of the pinhole camera model is used to reconstruct its 3D coordinates and obtain its azimuth. This process requires a prior parameter, which is derived from the radius of the audio particle closest to the visual particle. The particle weight is then calculated conditionally on a user-defined threshold: when the likelihoods of all particles are below the threshold, the visual observations are considered unreliable, and the audio observations are added to improve the particle weights.
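The conditional weighting can be sketched as follows. The Gaussian-shaped likelihoods and the additive combination in the fallback branch are assumed forms chosen for illustration; only the threshold logic (audio is used when every visual likelihood falls below the threshold) follows the description above. All function and parameter names are hypothetical.

```python
import math


def visual_likelihood(distance, sigma_v):
    """Assumed Gaussian-shaped likelihood of a normalized network distance."""
    return math.exp(-distance ** 2 / (2.0 * sigma_v ** 2))


def audio_likelihood(azimuth, doa, sigma_a):
    """Assumed Gaussian-shaped likelihood of the azimuth gap to the DOA."""
    gap = (azimuth - doa + math.pi) % (2.0 * math.pi) - math.pi
    return math.exp(-gap ** 2 / (2.0 * sigma_a ** 2))


def particle_weights(distances, azimuths, doa, sigma_v, sigma_a, threshold):
    """Normalized weights; audio is added only if all visual likelihoods are low."""
    visual = [visual_likelihood(d, sigma_v) for d in distances]
    if max(visual) < threshold:
        # Visual observations are judged unreliable: add the audio likelihood
        # (ASSUMED additive combination).
        raw = [
            v + audio_likelihood(az, doa, sigma_a)
            for v, az in zip(visual, azimuths)
        ]
    else:
        raw = visual
    total = sum(raw)
    if total <= 0.0:
        return [1.0 / len(raw)] * len(raw)
    return [w / total for w in raw]


# Toy usage: both particles look unlike the template, so audio decides.
weights_unreliable = particle_weights(
    [0.9, 0.9], [0.0, 1.0], doa=0.0, sigma_v=0.3, sigma_a=0.2, threshold=0.5
)
```

In the toy call, both visual likelihoods are below the threshold, so the particle whose azimuth matches the DOA receives almost all of the weight.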

##### 2.3. Ensemble Method with Optimal Particle Set

An ensemble method is used to integrate the decisions of audio and visual particles to estimate the position. By comparing the weights of all particles, the particles with the largest weights are defined as the optimal particles and used to form the optimal particle set, which consists of the set of optimal audio particles and the set of optimal visual particles. The speaker position is estimated from the optimal particle set as the weighted sum

$$\hat{\mathbf{s}}_t = \sum_{k} \tilde{w}_t^{k}\, \mathbf{s}_t^{k},$$

where $\mathbf{s}_t^{k}$ is the $k$th optimal particle and $\tilde{w}_t^{k}$ denotes its normalized weight. Finally, the optimal particle set is used to reset the audio and visual particles at the next frame, which ensures the effectiveness and diversity of the particles.
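A minimal sketch of the ensemble step is given below, assuming that audio and visual particles have already been expressed in a common coordinate frame (for example, audio particles projected onto the image plane). The pooling, the tuple layout, and the function name are illustrative assumptions.

```python
def optimal_particle_estimate(audio, visual, k):
    """Estimate the position from the k highest-weighted particles.

    audio, visual: lists of (position, weight), with positions given as tuples
    in a shared coordinate frame (an assumption of this sketch).
    """
    pooled = [(pos, w, "audio") for pos, w in audio]
    pooled += [(pos, w, "visual") for pos, w in visual]
    # Keep the k particles with the largest weights, regardless of modality.
    best = sorted(pooled, key=lambda item: item[1], reverse=True)[:k]
    total = sum(w for _, w, _ in best)
    dim = len(best[0][0])
    # Weighted sum with weights normalized over the optimal set.
    estimate = tuple(
        sum(w * pos[d] for pos, w, _ in best) / total for d in range(dim)
    )
    return estimate, best


audio_particles = [((0.0, 0.0), 0.1), ((1.0, 1.0), 0.4)]
visual_particles = [((2.0, 2.0), 0.4), ((9.0, 9.0), 0.01)]
estimate, optimal_set = optimal_particle_estimate(audio_particles, visual_particles, k=2)
```

Here the optimal set contains one audio and one visual particle with equal weights, so the estimate is their midpoint. The returned set can also seed both particle groups at the next frame.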

##### 2.4. Template Update with Long Short-Term Mechanism

The traditional method performs speaker tracking according to the template provided in the first frame, which is not updated in subsequent frames in order to avoid contamination of the target features. However, in real scenes, nonrigid targets such as speakers undergo various deformations and thus show large differences in appearance. Therefore, a template update method is used to adapt to changes in speaker appearance and prevent drift during tracking.

An indicator is proposed to measure the tracking confidence and selectively update the template. The confidence of the tracking result is defined in terms of the smallest likelihood among all particles and the likelihood of the estimated speaker position, which is calculated by equation (12). A threshold is set on this confidence. When the confidence exceeds the threshold, the tracking success rate is considered high enough to update the template, and the area around the estimated position is cropped as a short-term template. In addition, the past appearance of the speaker is essential for tracking, and noise is inevitably merged into the template through successive updating. Therefore, the target image defined by the user in the first frame is continuously adopted as a long-term template. The samples are matched with the long-term template and the short-term template separately, and the similarities are measured by the Siamese network. The modified likelihood combines the long-term and short-term template likelihoods through a designed weighting factor. Equation (17) is substituted into equation (12) to calculate the particle weight.
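The update logic can be sketched as a small helper class. The convex combination with weighting factor `gamma` is an assumed form of the combined likelihood, and the confidence value is taken as an input because its exact definition is not reproduced here. Class and method names are hypothetical.

```python
class TemplateBank:
    """Keeps a fixed long-term template and an updatable short-term template."""

    def __init__(self, first_frame_template, gamma, threshold):
        self.long_term = first_frame_template   # defined in the first frame, never changed
        self.short_term = first_frame_template  # starts identical to the long-term template
        self.gamma = gamma                      # weighting factor (assumed to lie in [0, 1])
        self.threshold = threshold              # confidence threshold

    def combined_likelihood(self, like_long, like_short):
        # ASSUMED form: convex combination of the two template likelihoods.
        return self.gamma * like_long + (1.0 - self.gamma) * like_short

    def maybe_update(self, confidence, crop):
        """Replace the short-term template only when tracking is confident."""
        if confidence > self.threshold:
            self.short_term = crop
            return True
        return False


# Usage with placeholder strings standing in for image crops.
bank = TemplateBank("frame0_crop", gamma=0.6, threshold=0.8)
updated_low = bank.maybe_update(confidence=0.5, crop="frame10_crop")
updated_high = bank.maybe_update(confidence=0.9, crop="frame11_crop")
```

Keeping the first-frame template untouched limits the accumulation of noise, while the short-term template follows appearance changes.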

#### 3. Experiments and Discussion

##### 3.1. Experimental Setting

The proposed tracker is evaluated on the AV16.3 corpus [20], a commonly used dataset for audio-visual speaker tracking captured by spatially distributed audio-visual sensors. The corpus was collected in a conference room with three cameras on the wall and two microphone arrays on the table. Audio signals are recorded at 16 kHz using two 8-microphone uniform circular arrays of 10 cm radius. Video sequences are recorded at 25 Hz by three monocular cameras, and each frame is a color image. The camera calibration information provided with the dataset is used for coordinate conversion in the pinhole camera model. The trackers are evaluated in the single-speaker case with three camera views on seq08, seq11, and seq12, where the speaker moves and speaks at the same time with some challenging poses, such as being outside the camera view, not facing the cameras, or moving fast. Each experiment uses the data streams from one camera and one microphone array.

The numbers of audio particles, visual particles, and optimal particles are each set to 50. The parameters of the audio-guided motion model are set separately for each coordinate, depending on the particle velocity in that coordinate. The parameters of the likelihood functions, the thresholds, and the weighting factors are set to fixed values. The Siamese network is trained with image pairs extracted from videos with continuous speaker location annotations. Images within a fixed radius of the annotation center are considered positive samples, and all others are negative samples. Two positive samples from frames within a fixed interval are randomly matched into positive image pairs, and random matches of positive and negative samples form negative image pairs; each pair is labeled accordingly. The training set consists of 2000 image pairs. The mean absolute error (MAE) in 3D (m) and on the image plane (pixels), averaged over 10 runs, is used to evaluate the accuracy of the tracker.
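The evaluation metric can be sketched as follows, assuming that the per-frame error is the Euclidean distance between the estimated and the ground-truth position (in metres for 3D, in pixels on the image plane). The function names are hypothetical.

```python
import math


def mean_absolute_error(estimates, ground_truth):
    """Mean Euclidean distance between paired estimated and true positions."""
    if not estimates or len(estimates) != len(ground_truth):
        raise ValueError("estimates and ground_truth must be non-empty and equally long")
    errors = [math.dist(e, g) for e, g in zip(estimates, ground_truth)]
    return sum(errors) / len(errors)


def average_over_runs(runs, ground_truth):
    """Average the MAE over several runs of the (stochastic) tracker."""
    return sum(mean_absolute_error(run, ground_truth) for run in runs) / len(runs)


truth = [(0.0, 0.0), (0.0, 0.0)]
run_a = [(0.0, 0.0), (3.0, 4.0)]   # errors 0 and 5 -> MAE 2.5
run_b = [(0.0, 0.0), (0.0, 0.0)]   # perfect run   -> MAE 0
score = average_over_runs([run_a, run_b], truth)
```

Averaging over runs reduces the influence of the random sampling in the particle filter on the reported error.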

##### 3.2. Experimental Results

First, the proposed 3D audio-visual (AV) speaker tracker is compared with PF-based single-modal approaches, referred to as the audio-only (AO) tracker and the video-only (VO) tracker. Figures 3(a)–3(c) display the 3D tracking results in the three coordinates of azimuth (rad), elevation (rad), and radius (m) on seq11c2. The AO method uses the sound source localization result as the observation in the PF algorithm. It performs well in azimuth localization but has difficulty tracking accurately in the other two directions. The VO tracker achieves effective tracking on the image plane. By assuming a speaker height of 1.7 m, the tracking results on the image plane can be projected into 3D space. The results show that, with a monocular camera, insufficient depth information makes 3D tracking difficult. With the AV method, tracking in the azimuth direction is superior to that in the other two directions, owing to the guidance of the azimuth estimate in the proposed motion model. The errors in elevation and radius are mainly caused by the speaker walking out of the camera field of view and by large movements. Figure 3(d) shows the 3D MAE of the three methods, reflecting the superior tracking performance of audio-visual fusion.

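**Figure 3.** (a)–(c) 3D tracking results in azimuth (rad), elevation (rad), and radius (m) on seq11c2; (d) 3D MAE of the AO, VO, and AV methods; (e), (f) partial 3D trajectories on seq08c1 and seq11c2.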

The effectiveness of the proposed deep metric learning-assisted observation model is evaluated by comparison with a two-layer particle filter (2-LPF), which uses color histogram matching to measure the similarity between the rectangle around the particle and the reference template. HSV color histograms are extracted to calculate the Bhattacharyya distance, which defines the likelihood and particle weights in the same form as equation (10). Partial 3D trajectories of the two trackers on seq08c1 and seq11c2 are shown in Figures 3(e) and 3(f), where the tracking results of the proposed method (yellow) are closer to the ground truth trajectories (green). Compared with traditional color features, the features extracted by the network are more discriminative. The Siamese network handles the similarity measurement of input image pairs better, which yields more accurate particle weights. The trajectory errors shown in the figures are mainly due to the large error in the 3D reconstruction from the image when the speaker, microphone, and camera are in the same vertical plane.
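For reference, the color-histogram baseline can be sketched as below. The distance uses the form sqrt(1 − BC), with BC the Bhattacharyya coefficient, which is common in color-based particle filters; this form, the Gaussian-shaped likelihood, and the function names are assumptions. The extraction of HSV histograms from image patches is omitted.

```python
import math


def normalize(hist):
    """Scale a histogram so that its bins sum to one."""
    total = float(sum(hist))
    if total <= 0.0:
        raise ValueError("histogram must have positive mass")
    return [h / total for h in hist]


def bhattacharyya_distance(hist_p, hist_q):
    """Distance sqrt(1 - BC), where BC is the Bhattacharyya coefficient."""
    p, q = normalize(hist_p), normalize(hist_q)
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, 1.0 - coefficient))


def histogram_likelihood(distance, sigma):
    """Assumed Gaussian-shaped likelihood of a histogram distance."""
    return math.exp(-distance ** 2 / (2.0 * sigma ** 2))


same = bhattacharyya_distance([1, 2, 3], [2, 4, 6])   # identical after normalization
disjoint = bhattacharyya_distance([1, 0], [0, 1])     # no overlapping bins
```

Identical normalized histograms give a distance of about zero and non-overlapping ones give one, so the likelihood decreases as the candidate patch drifts away from the template colors.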

To investigate the effect of the proposed template update method on tracking performance, another comparative experiment is conducted. In the comparison method, the fixed template is the target image defined in the first frame and is not updated during tracking. Figure 4(a) shows the MAE on the image plane (pixels) of the two methods on seq11c1. The first peak of the error curve is caused by the speaker leaving the camera view for a while and then reentering the scene. After frame 460, the target scale and appearance change noticeably because the speaker moves close to the camera. As shown in the frame samples in Figure 4(b), the tracker equipped with the template update mechanism achieves stable tracking (red), while the rectangular box produced by the comparison method (green) deviates from the target.

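**Figure 4.** (a) MAE on the image plane (pixels) on seq11c1 with and without template update; (b) frame samples comparing the tracker with template update (red) and the fixed-template tracker (green).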

Finally, the tracking accuracy of the proposed tracker and that of the existing audio-visual trackers [5, 12, 21] are tested on nine single-person sequences captured by three cameras. Table 1 lists the MAE on the image plane (pixels) and in 3D (m). The proposed tracker achieves outstanding performance both on the image plane and in 3D space.

#### 4. Conclusions

This paper presents a deep metric learning-assisted 3D audio-visual speaker tracker, which integrates a designed Siamese network into the two-layer PF framework. In the proposed observation model, the similarity between the template and the particle diffusion areas is calculated to update the weights of audio particles and visual particles. A template that adapts to changes in speaker appearance is obtained through a long short-term update mechanism, which prevents drift during tracking. Audio and video information are fused through the audio-guided motion model, the conditional weighting formula, and the optimal particle set. The proposed algorithm is evaluated on single-speaker sequences and achieves substantial performance improvement over trackers using individual modalities and over the compared audio-visual methods. Future work will focus on acoustic feature extraction and multimodal confidence evaluation methods.

#### Data Availability

The dataset AV16.3 corpus used in this study is an open source dataset provided by the Idiap Research Institute. The dataset is publicly available on the website https://www.idiap.ch/dataset/av16-3/. We cited this dataset in Section 3.1 of the article with a corresponding reference in the reference list as [20].

#### Conflicts of Interest

The authors declare that there are no conflicts of interest.

#### Authors’ Contributions

Hong Liu and Yidi Li conceived and designed the study. Yidi Li and Bing Yang developed the theory and performed the experiments. Runwei Ding and Yang Chen verified the analytical methods and supervised the findings of this work. Yidi Li wrote the manuscript with support from Bing Yang and Yang Chen. Hong Liu and Runwei Ding reviewed and edited the manuscript. All authors discussed the results and contributed to the final manuscript.

#### Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 61673030 and U1613209) and National Natural Science Foundation of Shenzhen (No. JCYJ20190808182209321).