Abstract

Capturing the body motion of fish has been gaining considerable attention from scientists in various fields. In this paper, we propose a method that tracks the full-body motion of multiple fish with frequent interactions. We first model the midline subspace of a fish body, which gives a compact low-dimensional representation of the complex shape and motion. We then propose a particle swarm-based optimization framework whose objective function takes into account multiple sources of information. The proposed multicue objective function describes the details of fish appearance and remains effective under mutual occlusions. Extensive experimental results demonstrate the effectiveness and robustness of the proposed method.

1. Introduction

The most effective way to quantitatively study the behavior patterns and underlying rules of fish schools is to track each fish. Tracking is also helpful in many related applications such as robotics and virtual reality. For example, based on the body motion of fish, people can build fish-like swimming robots and create vivid virtual fish in computers [19]. Top-view video and 2D tracking yield sufficiently informative motion data for behavior investigation, because many fish behavior experiments use a shallow water tank and the fish typically swim around the same horizontal plane. However, tracking the full-body motion of fish, especially of multiple interacting fish, is still a challenging task for three reasons. (1) The full-body motion of fish is highly complex and difficult to model with a few parameters. (2) Fish may move abruptly, so the motion continuity assumption no longer holds, which causes conventional tracking approaches to fail. (3) Multiple fish cause frequent mutual occlusions, which corrupt the appearance models of the tracking approaches.

Visual tracking has been a hot research topic during the past two decades, and significant improvements have been made in all aspects of visual tracking, such as appearance models [10, 11, 12] and estimation methods [13, 14]. Multiple trackers can be utilized for tracking multiple targets [15–17]. Nevertheless, conventional visual trackers, which are designed for tracking the positions of generic objects, are not applicable to the full-body motion tracking problem considered here. Another branch of multitarget tracking methods follows the detection and association framework [18, 19]; i.e., the outputs of detectors are associated with trackers across time. In this problem, however, the detectors may fail to give correct output during mutual occlusions, which happen frequently and last for a long time, causing difficulties for the subsequent data association.

To be truly helpful for biological research, a number of automatic software tools have been developed for multiobject tracking, such as ANY-maze and EthoVision [2, 20–22]. However, only a few targets can be tracked and a professional experimental setup is needed. Qian et al. [23] developed a multiple-fish tracking system based on a scale-space determinant of Hessian (DoH) fish head detector and a Kalman filter. Delcourt et al.’s system can track as many as 100 fish simultaneously but is not suitable for long-period tracking [2]. These approaches depend heavily on detection results and motion continuity for data association, and the discriminative information of the head is not fully exploited. The body modeling and motion tracking of multiple fish has been preliminarily attempted [24]; however, its performance declines as the number of fish increases.

Tracking becomes more difficult when severe occlusions occur: individuals may be assigned wrong identities, and these errors propagate throughout the rest of the video. Several existing tracking methods combine the detection and tracking stages to correct detection errors in a timely manner. Andriluka et al. use prior knowledge of possible articulations and temporal coherency to associate the detections of each individual across frames, which depends on a motion model and object-specific codebooks constructed by clustering local features. Kalal et al. [10] proposed the tracking-learning-detection (TLD) framework to track a single object, which divides the task into three subtasks: tracking, learning, and detection. The tracker follows the target across frames; the detector localizes the object in each frame and corrects the tracker if necessary; and the detector is updated online by P-N learning. Wang et al. [25] proposed an effective tracking method that uses a convolutional neural network (CNN) for head identification. They first detect fish heads using a scale-space method; data association across frames is then achieved by identifying the head image pattern of each individual fish in each frame with a CNN specially tailored to this task. Finally, they combine the prediction of the motion state with the CNN recognition result to associate detections across frames. However, samples must be collected to train the CNN, and the method cannot handle new targets that appear in the video.

We present in this paper a method that is capable of tracking the full-body deformation of multiple fish with frequent mutual interactions. Since fish motion and deformation are horizontal most of the time, we capture the videos from a top view. To model the complex fish body deformation, a midline subspace model is first learned from a large number of training samples; it gives a compact representation of the fish body and thus facilitates subsequent motion estimation. We then propose a multicue cost function that characterizes the subtle appearance details of the fish body during swimming. Moreover, this cost function works under partial occlusions, making the system fully automatic. Extensive experimental results demonstrate the effectiveness and robustness of the method.

The contributions of the paper can be summarized as follows:

(i) We introduce a midline subspace model, learned from a large amount of data, to model the complex shape and deformation of a fish body. This subspace model is compact and low-dimensional, thus greatly facilitating parameter optimization.

(ii) We propose a highly discriminative and robust multicue objective function which models different aspects of the image structures of the fish region.

(iii) We have conducted systematic experiments to demonstrate the effectiveness of the proposed method.

2. Shape and Kinematic Model

2.1. Midline Representation

Since the videos are captured from a top view, the shape of a fish in the image is approximately symmetrical about its midline, as shown in Figure 1. The deformation of the body can therefore be viewed as driven by the midline. We will show later that once the midline is determined, the contour of the whole body shape can be recovered easily from a reference shape.

As shown in Figure 1, a midline can be approximated by a chain of articulated line segments of equal length. A midline is thus made up of n joints: a head point, a tail point, and n − 2 middle joints. During one time step of deformation, the length of each segment is kept fixed. The parameters of a midline consist of the position of the head point, the orientation of the first line segment, and, for each subsequent segment, its rotation angle relative to the first one (i.e., the absolute rotation angle of a segment is the orientation of the first segment plus its relative angle). The first three parameters determine a rigid body transform, and the rest of the parameters account for the nonrigid deformation of the body shape. Given the parameter vector, the midline points can be recovered by starting from the head point and successively adding each fixed-length segment along its absolute rotation angle.
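The recovery of midline points from the parameters can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `midline_points`, the argument names, and the convention that the chain is built outward from the head point are assumptions made for this example.

```python
import math


def midline_points(head, theta1, rel_angles, seg_len):
    """Recover midline joints from a head position, the orientation of the
    first segment, and the rotation angles of the later segments relative
    to the first one. All segments share the same fixed length."""
    # The first segment has a relative angle of zero by definition.
    angles = [0.0] + list(rel_angles)
    x, y = head
    points = [(x, y)]
    for rel in angles:
        absolute = theta1 + rel  # absolute angle = first orientation + relative angle
        x += seg_len * math.cos(absolute)
        y += seg_len * math.sin(absolute)
        points.append((x, y))
    return points


# A straight midline with three unit-length segments (four joints).
straight = midline_points((0.0, 0.0), 0.0, [0.0, 0.0], 1.0)
```

With n joints there are n − 1 segments, so the head position (2 values), the first orientation (1 value), and n − 2 relative angles give n + 1 parameters in total, which motivates the dimensionality reduction discussed in Section 2.2.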

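The body width of the shape at each midline point is kept fixed within one time step, so a pair of contour points can be recovered for each midline point by offsetting it to both sides of the midline, perpendicular to the midline, according to the body width at that point.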
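One way to recover such contour point pairs is sketched below. The function name `contour_pairs`, the use of a per-joint offset distance (`half_widths`), and the estimate of the local normal from neighbouring joints are all assumptions for illustration; the original formulation may define the offset differently.

```python
import math


def contour_pairs(midline, half_widths):
    """Return one (left, right) contour point pair per midline joint.

    Assumes at least two distinct joints so that a tangent exists.
    """
    pairs = []
    last = len(midline) - 1
    for i, (x, y) in enumerate(midline):
        # Tangent from the neighbouring joints (one-sided at the two ends).
        x0, y0 = midline[max(i - 1, 0)]
        x1, y1 = midline[min(i + 1, last)]
        tx, ty = x1 - x0, y1 - y0
        norm = math.hypot(tx, ty)
        # Unit normal: the tangent rotated by 90 degrees.
        nx, ny = -ty / norm, tx / norm
        w = half_widths[i]
        pairs.append(((x + w * nx, y + w * ny), (x - w * nx, y - w * ny)))
    return pairs


# A straight horizontal midline that is widest in the middle.
example_pairs = contour_pairs([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)], [0.5, 1.0, 0.5])
```

By construction, each midline joint is the midpoint of its contour pair, which matches the symmetry assumption stated in Section 2.1.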

Thus, with a reference contour and a midline represented by n joints, a set of contour points can be recovered. Given the image at time t, the tracking problem can be formulated as finding the midline parameters that maximize the probability of the parameters given the image.

2.2. Learning Subspace of Midlines

The major defect of the above representation is its high dimensionality, as a sufficient number of line segments is essential to guarantee an accurate approximation of the fish body shape. In fact, such a representation is redundant, as the deformation of the fish is governed by fewer factors. We therefore seek to embed the deformation parameters into a lower-dimensional linear subspace, which can be learned from a large number of training samples.

We collect midlines of various postures and perform principal component analysis (PCA) on their nonrigid deformation parameters. We choose a set of basis vectors from the PCA results, so that each vector of deformation parameters is expressed as the sample mean plus a linear combination of the basis vectors, weighted by a vector of coefficients. We find that 6 bases are sufficient to approximate a midline to a satisfactory accuracy, as they account for most of the variance. Figure 2 shows the training samples of PCA and the first four principal components. The nonrigid deformation parameters are now replaced with 6 coefficients, so the parameters to be estimated are the head position, the orientation of the first segment, and the 6 subspace coefficients. The tracking problem becomes finding the values of these parameters that maximize the probability given the image.
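The subspace representation can be illustrated with a small, dependency-free sketch. It learns principal directions by power iteration with deflation (a simple stand-in for a full PCA routine), then encodes a deformation vector as coefficients and decodes it back as the mean plus a linear combination of the basis. The function names and the toy 3-dimensional data are hypothetical; real midline data would have one dimension per relative angle and would use 6 bases.

```python
import math
import random


def top_components(samples, k, iters=200, seed=0):
    """Sample mean and top-k principal directions (power iteration + deflation).

    Assumes the data has at least k directions with nonzero variance.
    """
    rng = random.Random(seed)
    n = len(samples)
    mu = [sum(col) / n for col in zip(*samples)]
    centered = [[v - m for v, m in zip(s, mu)] for s in samples]
    dim = len(mu)
    basis = []
    for _ in range(k):
        b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(iters):
            # Multiply by the unnormalised covariance: C b = X^T (X b).
            proj = [sum(x * y for x, y in zip(row, b)) for row in centered]
            nb = [sum(p * row[j] for p, row in zip(proj, centered)) for j in range(dim)]
            # Deflation: remove the directions that were already found.
            for prev in basis:
                d = sum(x * y for x, y in zip(nb, prev))
                nb = [x - d * y for x, y in zip(nb, prev)]
            norm = math.sqrt(sum(x * x for x in nb))
            b = [x / norm for x in nb]
        basis.append(b)
    return mu, basis


def encode(theta, mu, basis):
    """Project a deformation vector onto the subspace (coefficients)."""
    diff = [t - m for t, m in zip(theta, mu)]
    return [sum(d * b for d, b in zip(diff, vec)) for vec in basis]


def decode(coeffs, mu, basis):
    """Reconstruct a deformation vector: mean + sum of coefficient * basis."""
    out = list(mu)
    for c, vec in zip(coeffs, basis):
        out = [o + c * v for o, v in zip(out, vec)]
    return out


# Toy data that varies in only two of its three dimensions.
toy = [[3.0 * a, float(b), 0.0] for a in (-1, 0, 1) for b in (-1, 0, 1)]
toy_mean, toy_basis = top_components(toy, 2)
toy_coeffs = encode([1.5, -0.5, 0.0], toy_mean, toy_basis)
```

During tracking, only the low-dimensional coefficients need to be searched; `decode` maps each candidate back to the full set of relative angles.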

3. Multicue Objective Function

To take into consideration the image cues from the whole fish body area as well as some surrounding context, the midlines are extended as straight lines at the head to guarantee sufficient coverage. Each line segment is associated with a rectangular region that covers the segment along the midline and whose length across the midline is two times the body width at that segment. Each rectangle moves rigidly with its line segment. Considering that the image likelihoods of different parts of the fish body should play different roles in tracking, we divide the points into four parts, as shown in Figure 3; sample points are uniformly picked in each rectangle. We compute the image likelihood function for each of the four parts, and the weighted sum of these functions is taken as the final objective function value. Three kinds of image likelihood, which characterize three kinds of information, are considered: temporal appearance coherence, segmentation compatibility, and shape self-symmetry.

3.1. Temporal Appearance Coherence

Appearance coherence is the basic assumption in visual tracking: it enforces the appearance of the estimated target state to be consistent with a reference appearance model. We compute the similarity between the pixel values in each part at time t and their correspondences in a reference frame. The normalized cross-correlation (NCC) is adopted as the similarity metric. For example, the similarity score of the first part is the NCC between the pixel values sampled in the first part at time t and the corresponding pixel values in the reference frame.

The scores of the other three parts are computed likewise.
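For reference, a minimal sketch of the NCC between two pixel vectors is given below. It uses the common zero-mean form, which is an assumption here since the exact variant is not specified; the function name `ncc` and the handling of flat patches are likewise choices made for this example.

```python
import math


def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-length vectors.

    The result lies in [-1, 1]; 1 means identical up to gain and offset.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        # A flat patch has no structure to correlate; treat it as no evidence.
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / denom


# Pixel values of one part at time t and in the reference frame (toy values).
current_part = [10, 20, 30, 40]
reference_part = [110, 120, 130, 140]  # same pattern, brighter by a constant
score = ncc(current_part, reference_part)
```

Because the mean is subtracted and the result is normalized, this form is insensitive to uniform brightness and contrast changes between the current and the reference frame.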

3.2. Segmentation Compatibility

Segmentation compatibility is introduced to enforce the estimated shape to be compatible with the segmentation result. Since segmentation performance is stable across time, enforcing segmentation compatibility prevents the tracker from drifting. As we select a larger region which contains some context pixels, both the foreground and the background should be compatible with the reference. Given a segmented binary image of the current frame and the sample points of the ith part, the segmentation compatibility score of that part is computed from two terms: the first term forces the estimated shape to cover more foreground pixels, and the second term encourages the shape to leave fewer background pixels uncovered. The segmentation compatibility scores of the other three parts are computed likewise.
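A two-term score of this kind could look like the sketch below. The exact formula is not recoverable from the text, so this is only one plausible reading: sample points inside the body width are expected to fall on foreground, and sample points in the surrounding context are expected to fall on background. The function name, the split into `body_points` and `context_points`, and the equal weighting of the two terms are all assumptions.

```python
def segmentation_score(mask, body_points, context_points):
    """Two-term compatibility between a part's sample points and a binary mask.

    mask[r][c] is 1 for foreground (fish) and 0 for background.
    Points are (row, col) integer pixel coordinates.
    """
    def value(point):
        r, c = point
        inside = 0 <= r < len(mask) and 0 <= c < len(mask[0])
        return mask[r][c] if inside else 0  # outside the image counts as background

    fg_hits = sum(value(p) for p in body_points)
    bg_hits = sum(1 - value(p) for p in context_points)
    # Each term is a fraction in [0, 1], so the score lies in [0, 2].
    fg_term = fg_hits / len(body_points) if body_points else 0.0
    bg_term = bg_hits / len(context_points) if context_points else 0.0
    return fg_term + bg_term


# A 3x3 mask whose middle column is fish.
toy_mask = [[0, 1, 0],
            [0, 1, 0],
            [0, 1, 0]]
aligned = segmentation_score(toy_mask, [(0, 1), (1, 1), (2, 1)], [(0, 0), (0, 2)])
```

A hypothesis whose body points sit on the fish and whose context points sit on the water receives the highest score, while a shifted hypothesis is penalized by both terms.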

3.3. Shape Self-Symmetry

So far, we have exploited the appearance coherence and segmentation cues; however, the internal structure of the shape has been ignored. As discussed previously, the shape is symmetrical about the midline. If a midline is correctly estimated, the image structures on the two sides of the midline should therefore be symmetrical. This prior knowledge offers stable and drift-free guidance for tracking. As with the two previous cues, we compute a self-symmetry score for each of the four parts. Taking the third part as an example, the self-symmetry score is computed as the NCC between two pixel value vectors, which are formed by the pixel values of the sample points on the two sides of the midline (as shown in Figure 4). The order of the pixel values should be adjusted so that two points that are symmetrical about the midline occupy the same position in the two vectors. Formally, the score is the NCC taken over all point pairs that are symmetrical about the midline. The shape self-symmetry scores for the other parts are computed likewise.
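The self-symmetry score can be sketched as follows, assuming the mirrored sample point pairs have already been computed. The function names and the (row, col) pixel convention are hypothetical, and the zero-mean NCC form is an assumption.

```python
import math


def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return 0.0 if denom == 0.0 else sum(x * y for x, y in zip(da, db)) / denom


def symmetry_score(image, point_pairs):
    """NCC between pixel values on the two sides of the midline.

    point_pairs holds ((r_left, c_left), (r_right, c_right)) tuples, so the
    k-th entries of both vectors are mirror images of each other.
    """
    left = [image[r][c] for (r, c), _ in point_pairs]
    right = [image[r][c] for _, (r, c) in point_pairs]
    return ncc(left, right)


# Toy image that is symmetric about its middle column (the midline).
toy_image = [[1, 5, 9, 5, 1],
             [2, 6, 8, 6, 2]]
toy_pairs = [((0, 0), (0, 4)), ((0, 1), (0, 3)),
             ((1, 0), (1, 4)), ((1, 1), (1, 3))]
toy_score = symmetry_score(toy_image, toy_pairs)
```

Keeping the pairs aligned by index is what implements the reordering requirement: a correctly placed midline yields two nearly identical vectors and a score close to 1.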

3.4. Combination of Multiple Cues

The final objective function is the weighted sum of the scores of all four parts, where the weight of part i is set empirically in the experiments. Different parts should play different roles in tracking, and this is confirmed by our experiments. We find that the head (part 1) and tail (part 4) should be associated with larger weights than parts 2 and 3, possibly because the image regions of the head and tail contain more discriminative features than the other two parts.
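As a rough illustration of the combination step, the sketch below sums the three cue scores of each part and weights the parts. How the three cues are combined within a part is not stated in the text, so the plain sum is an assumption, and the weight values are placeholders chosen only to reflect the stated preference for the head and tail.

```python
def combined_objective(part_scores, weights):
    """Weighted sum over parts of the per-part cue scores.

    part_scores: one dict per part with the keys 'appearance',
    'segmentation', and 'symmetry' (hypothetical names).
    """
    total = 0.0
    for scores, weight in zip(part_scores, weights):
        part_total = scores["appearance"] + scores["segmentation"] + scores["symmetry"]
        total += weight * part_total
    return total


# Placeholder weights: head (part 1) and tail (part 4) count more.
example_weights = [0.3, 0.2, 0.2, 0.3]
example_scores = [
    {"appearance": 0.9, "segmentation": 1.8, "symmetry": 0.8},  # head
    {"appearance": 0.7, "segmentation": 1.9, "symmetry": 0.6},
    {"appearance": 0.6, "segmentation": 1.7, "symmetry": 0.5},
    {"appearance": 0.8, "segmentation": 1.5, "symmetry": 0.7},  # tail
]
objective_value = combined_objective(example_scores, example_weights)
```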

4. Sequential Particle Swarm Optimization

With the objective function defined, the tracking problem becomes maximizing the objective function with respect to the model parameters. However, this objective function is highly complex and nondifferentiable. Particle swarm optimization (PSO) is a stochastic optimization technique which has received much attention due to its ability to find the optimum of complex problems.

We adopt a standard particle swarm optimization procedure [26]. A set of candidate solutions is maintained as particles, which move in the parameter space under the influence of a global best solution and each particle’s local best solution. In each time step, the tracker generates initial solutions using a second-order dynamic model with additive Gaussian noise.
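A generic PSO maximizer with this kind of initialization is sketched below. It is not the authors' implementation: the constant-velocity form of the second-order prediction (twice the previous state minus the one before it), the inertia and acceleration coefficients, and all names are assumptions chosen for illustration.

```python
import random


def pso_maximize(objective, prev, prev2, n_particles=30, iters=40,
                 sigma=0.1, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize `objective` with a standard PSO seeded from a motion prediction."""
    rng = random.Random(seed)
    dim = len(prev)
    # Assumed second-order model: constant-velocity prediction plus Gaussian noise.
    predicted = [2.0 * a - b for a, b in zip(prev, prev2)]
    pos = [[p + rng.gauss(0.0, sigma) for p in predicted] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [list(p) for p in pos]            # each particle's local best
    best_val = [objective(p) for p in pos]
    g = max(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = list(best_pos[g]), best_val[g]  # global best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val > best_val[i]:
                best_val[i], best_pos[i] = val, list(pos[i])
                if val > g_val:
                    g_val, g_pos = val, list(pos[i])
    return g_pos, g_val


# Toy objective with its maximum at (1, -2); the two previous states
# extrapolate to that point, so the swarm starts close to the optimum.
toy_target = [1.0, -2.0]
toy_objective = lambda p: -sum((a - b) ** 2 for a, b in zip(p, toy_target))
toy_best, toy_value = pso_maximize(toy_objective, [0.8, -1.8], [0.6, -1.6])
```

Because PSO only evaluates the objective and never differentiates it, it suits the nondifferentiable multicue objective; in tracking, `prev` and `prev2` would be the estimated parameter vectors of the two preceding frames.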

5. Experiments

We captured the data for evaluation: 20 zebra fish were placed in a water tank and a video camera was placed on top of the tank which recorded the movements of the fish. The resolution of the camera is 1024 by 1024, and the frame rate is 100 fps.

We first evaluate the method in the manner of conventional multiple target tracking. The tracking of a target is considered to be completed if no ID switch or target loss occurs during the entire sequence. We manually labeled a 650-frame video for automatic evaluation. A tracker is considered to have failed if both the head and tail positions of the estimated shape are more than 20 pixels away from the labeled positions, which are taken as ground truth. We also captured an 800-frame video of 10 zebra fish, shot under front lighting. The resolution of this video is 2048 (W) by 2040 (H), and the frame rate is 100 fps. On this data set, we evaluate the adaptability of our method.
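The failure criterion stated above can be written as a small check. The function name and argument layout are hypothetical; the 20-pixel threshold and the requirement that both the head and the tail exceed it follow the text.

```python
import math


def tracker_failed(est_head, est_tail, gt_head, gt_tail, threshold=20.0):
    """True if both the head and the tail error exceed the pixel threshold."""
    head_err = math.hypot(est_head[0] - gt_head[0], est_head[1] - gt_head[1])
    tail_err = math.hypot(est_tail[0] - gt_tail[0], est_tail[1] - gt_tail[1])
    return head_err > threshold and tail_err > threshold


# Head is 25 pixels off but the tail is only 5 pixels off: not a failure.
partial_miss = tracker_failed((125.0, 100.0), (200.0, 105.0),
                              (100.0, 100.0), (200.0, 100.0))
```

Under this criterion, a frame in which only one of the two end points drifts beyond the threshold is still counted as tracked.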

In order to evaluate the role of different cues, we evaluate different combinations of cues on the labeled data. The evaluation results are listed in Tables 1 and 2.

We evaluate different methods on the labeled data and the 800-frame video. The evaluation results of the different methods are listed in Tables 3 and 4. Qian’s method becomes more prone to ID switches (IDS) and lost targets as the number of frames with mutual occlusions increases. For Wang’s method, the lower image resolution (1024 × 1024 versus 2048 × 2040) possibly degrades the performance of the CNN. The proposed method utilizes more body shape information and thus performs better.

To evaluate the accuracy of the estimated shape, we plot the errors of the head and tail points of one fish in Figure 5. The figure shows that the error does not accumulate over time. The estimated tail point fluctuates more strongly than the head point, possibly because the appearance of the tail is not as stable as that of the head.

We also give some qualitative results of the tracked shapes under complex mutual interactions in Figure 6. With the designed multicue objective function and midline-based subspace model, our method is able to handle a moderate degree of partial occlusion. However, we find that it may fail if the occluded area is too large. Finally, we plot the trajectories of the estimated head positions of the 20 fish in Figure 7. For better visualization, we add time as the third dimension of the plot.

6. Conclusion

We present in this paper a method that is capable of tracking the full-body deformation of multiple fish with frequent occlusions and interactions. We propose a midline subspace-based model to represent the complex shape and deformation of each fish, and a PSO-based multicue optimization method to estimate the parameters of the model. Experimental results demonstrate the effectiveness of the proposed method.

Data Availability

The captured video data used to support the findings of this study are available online at https://pan.baidu.com/s/15Ijt5bpud_aqA1Smro1CdQ and the code is jagt.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

Xiang Liu, Pei Ru Zhu, and Ye Liu developed the methodology; Jin Wen Zhao collected the resources; Xiang Liu and Pei Ru Zhu managed the software; Ye Liu and Jin Wen Zhao validated the study; Jin Wen Zhao visualized the work; Xiang Liu and Ye Liu wrote the original draft; Xiang Liu, Pei Ru Zhu, and Ye Liu reviewed and edited the original draft; and Xiang Liu and Pei Ru Zhu contributed equally to this work.

Acknowledgments

This research was funded by the Science and Technology Commission of Shanghai Municipality (no. 19ZR1421500), “Research on ultrasound-assisted diagnosis of liver fibrosis cirrhosis based on visual feature extraction and machine learning,” and National Natural Science Foundation of China (no. 61602255), “Research on target tracking method based on explicit occlusion model under RGBD camera.”