Research Article | Open Access
A Motion Detection Algorithm Using Local Phase Information
Previous research demonstrated that global phase alone can be used to faithfully represent visual scenes. Here we provide a reconstruction algorithm by using only local phase information. We also demonstrate that local phase alone can be effectively used to detect local motion. The local phase-based motion detector is akin to models employed to detect motion in biological vision, for example, the Reichardt detector. The local phase-based motion detection algorithm introduced here consists of two building blocks. The first building block measures/evaluates the temporal change of the local phase. The temporal derivative of the local phase is shown to exhibit the structure of a second order Volterra kernel with two normalized inputs. We provide an efficient, FFT-based algorithm for implementing the change of the local phase. The second processing building block implements the detector; it compares the maximum of the Radon transform of the local phase derivative with a chosen threshold. We demonstrate examples of applying the local phase-based motion detection algorithm on several video sequences. We also show how the locally detected motion can be used for segmenting moving objects in video scenes and compare our local phase-based algorithm to segmentation achieved with a widely used optic flow algorithm.
Following Marr, the design of an information processing system can be approached on multiple levels. Figure 1 illustrates two levels of abstraction, namely, the algorithmic level and the physical circuit level. On the algorithmic level, one studies procedurally how the information is processed, independently of the physical realization. The circuit level concerns the actual realization of the algorithm in physical hardware, for example, a biological neural circuit or silicon circuits in a digital signal processor. In this paper, we put forth a simple motion detection algorithm that is inspired by motion detection models of biological visual systems (in vivo neural circuit) and provide an efficient realization that can easily be implemented on commodity (in silico) DSP chips.
Visual motion detection is critical to the survival of animals. Many biological visual systems have evolved highly efficient and effective neural circuits to detect visual motion. Motion detection is performed in parallel with other visual coding circuits and begins in the early stages of visual processing. In the retina of vertebrates, at least three types of Direction-Selective Ganglion Cells (DSGC) are known to be responsible for signaling visual motion at this early stage. In flies, direction-selective neurons are found in the optic lobe, 3 synapses away from the photoreceptors.
The small number of synapses between photoreceptors and direction-selective neurons suggests that the processing involved in motion detection is not highly complex but still very effective. In addition, the biological motion detection circuits are organized in a highly parallel way to enable fast, concurrent computation of motion. It is also interesting to note that the early stages of motion detection are carried out largely in the absence of spiking neurons, indicating that initial stages of motion detection are preferably performed in the “analog” domain. Taking advantage of continuous time processing may be critical for quickly processing motion since motion intrinsically elicits fast and large changes in the intensity levels, that is, large amounts of data under stringent time constraints.
Modern, computer-based motion detection algorithms often employ optic flow techniques to estimate spatial changes in consecutive image frames [4, 5]. Although optic flow estimation algorithms often produce accurate results, the computational demand of many of these algorithms is too high for real-time implementation.
Several models for biological motion detection are available, and their architecture is quite simple. The Reichardt motion detector was thought to be the underlying model for motion detection in insects. The model is based on a correlation method to extract motion induced by spatiotemporal information patterns of light intensity. Therefore, it relies on a correlation/multiplication operation. A second model is the motion energy detector. It uses spatiotemporally separable filters and a squaring nonlinearity to compute motion energy, and it was shown to be equivalent to the Reichardt motion detector. Earlier work in the rabbit retina was the foundation of the Barlow-Levick model of motion detection. The model relies on inhibition to compensate motion in the null direction.
In this paper, we provide an alternative motion detection algorithm based on local phase information of the visual scene. Similar to mechanisms in other biological models, it operates in continuous time and in parallel. Moreover, the motion detection algorithm we propose can be efficiently implemented on parallel hardware. This is, again, similar to the properties of biological motion detection systems. Rather than focusing on velocity of motion, we focus on localization, that is, where the motion occurs in the visual field as well as the direction of motion.
It has been shown that images can be represented by their global phase alone. Here we provide a reconstruction algorithm of visual scenes by only using local phase information, thereby demonstrating the spectrum of the representational capability of phase information.
The Fourier shift property clearly suggests the relationship between the global shift of an image and the global phase shift in the frequency domain. We extend this relationship by computing the change of local phase to indicate motion that appears locally in the visual scene. The local phases are computed using window functions that tile the visual field with overlapping segments, making the computation amenable to a highly parallel implementation. In addition, we propose a Radon-transform-based motion detection index on the change of local phases for the readout of the relation between local phases and motion.
Interestingly, phase information has been largely ignored in the field of linear signal processing, and for good reason: phase-based processing is intrinsically nonlinear. Recent research, however, has shown that phase information can be smartly employed in speech processing and visual processing. For example, spatial phase in an image is indicative of local features such as edges when considering phase congruency. The role of spatial phase in computational and biological vision, emergence of visual illusions, and pattern recognition has been discussed in the literature. Together with our result for motion detection, these studies suggest that phase information has great potential for achieving efficient visual signal processing.
This paper is organized as follows. In Section 2, we show that local phase information can be used to faithfully represent visual information in the absence of amplitude information. In Section 3, we develop a simple local phase-based algorithm to detect motion in visual scenes and provide an efficient way for its implementation. We then provide examples and applications of the proposed motion detection algorithm to motion segmentation. We also compare our results to those obtained using motion detection algorithms based on optic flow techniques in Section 4. Finally, we summarize our results in Section 5.
2. Representation of Visual Scenes Using Phase Information
The use of complex valued transforms is widespread, for both representing and processing images. When represented in polar coordinates, the output of a complex valued transform of a signal can be split into amplitude and phase. In this section, we define two types of phases of an image: global phase and local phase. We then argue that both types of phases can faithfully represent an image and we provide a reconstruction algorithm that recovers the image from local phase information. This indicates that phase information alone can largely represent an image or video signal.
It will become clear in the sections that follow that the use of phase for representing image and video information leads to efficient ways to implement certain types of image/video processing algorithms, for example, motion detection algorithms.
2.1. The Global Phase of Images
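Recall that the Fourier transform of a real valued image $u = u(x, y)$, $(x, y) \in \mathbb{R}^2$, is given by
$$\hat{U}(\omega_x, \omega_y) = \int_{\mathbb{R}^2} u(x, y)\, e^{-j(\omega_x x + \omega_y y)}\, dx\, dy \tag{1}$$
with $\hat{U}(\omega_x, \omega_y) \in \mathbb{C}$ and $(\omega_x, \omega_y) \in \mathbb{R}^2$.
In polar coordinates, the Fourier transform of $u$ can be expressed as
$$\hat{U}(\omega_x, \omega_y) = \hat{A}(\omega_x, \omega_y)\, e^{j\hat{\phi}(\omega_x, \omega_y)}, \tag{2}$$
where $\hat{A}(\omega_x, \omega_y) \in \mathbb{R}$ is the amplitude and $\hat{\phi}(\omega_x, \omega_y)$ is the phase of the Fourier transform of $u$.
Definition 1. The amplitude of the Fourier transform of an image $u = u(x, y)$, $(x, y) \in \mathbb{R}^2$, is called the global amplitude of $u$. The phase of the Fourier transform of an image $u = u(x, y)$, $(x, y) \in \mathbb{R}^2$, is called the global phase of $u$.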
It is known that the global phase of an image plays an important role in the representation of natural images. A classic example is to take two images and to exchange their global phases before their reconstruction using the inverse Fourier transform. The resulting images are slightly smeared but largely reflect the information contained in the global phase.
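The phase-exchange experiment described above can be sketched with NumPy. This is a minimal illustration, not the authors' code: the two synthetic images (a square and a disc on a noisy background) and the function name `swap_global_phase` are assumptions made here for the example.

```python
import numpy as np

def swap_global_phase(img_a, img_b):
    """Combine the global amplitude of img_a with the global phase of img_b."""
    fa = np.fft.fft2(img_a)
    fb = np.fft.fft2(img_b)
    hybrid_spectrum = np.abs(fa) * np.exp(1j * np.angle(fb))
    return np.real(np.fft.ifft2(hybrid_spectrum))

# Hypothetical test images: a bright square and a bright disc, plus mild noise.
rng = np.random.default_rng(0)
y, x = np.mgrid[0:64, 0:64]
square = ((abs(x - 20) < 8) & (abs(y - 40) < 8)).astype(float)
square += 0.05 * rng.standard_normal((64, 64))
disc = (((x - 45) ** 2 + (y - 18) ** 2) < 100).astype(float)
disc += 0.05 * rng.standard_normal((64, 64))

hybrid = swap_global_phase(square, disc)

# Correlation of the hybrid with the image that donated the phase (disc)
# and with the image that donated the amplitude (square).
corr_phase_donor = np.corrcoef(hybrid.ravel(), disc.ravel())[0, 1]
corr_amp_donor = np.corrcoef(hybrid.ravel(), square.ravel())[0, 1]
```

In this sketch the hybrid image correlates more strongly with the phase donor than with the amplitude donor, which is the qualitative effect the classic example refers to.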
2.2. The Local Phase of Images
The global phase indicates the offset of sinusoids of different frequencies contained in the entire image. However, it is not intuitive to relate the global phase to local image features such as edges and their position in an image. To study these local features, it is necessary to modify the Fourier transform such that it reflects properties of a restricted region of an image. It is natural to consider the Short-Time (-Space) Fourier Transform (STFT):where , , is a real valued window function centered at . Typical choices of window functions include the Hann window and the Gaussian window. The (effectively) finite support of the window restricts the Fourier transform to local image analysis.
Similarly to the Fourier transform, the STFT can be expressed in polar coordinates aswhere is the amplitude and is the phase of the STFT.
Definition 2. The amplitude of the STFT of an image , , is called the local amplitude of . The phase of the STFT of an image , , is called the local phase of .
Note that when is a Gaussian window, the STFT of evaluated at can be equivalently viewed as the response of a complex-valued Gabor receptive field to .
In this case, the window of the STFT clearly is given byTherefore, the STFT can be realized by an ensemble of Gabor receptive fields that are common in modeling simple and complex cells (neurons) in the primary visual cortex .
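As a rough illustration of Definition 2, the local amplitude and phase at one window position and one frequency can be computed as the response of a complex Gabor receptive field. The sketch below assumes a Gaussian window with a standard deviation of 4 pixels and a carrier referenced to the window center; this phase convention, the function names, and the test image are assumptions made here and may differ from the paper's equations.

```python
import numpy as np

def gabor_response(img, x0, y0, wx, wy, sigma=4.0):
    """Response of a complex Gabor receptive field centered at (x0, y0)."""
    y, x = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    window = np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))
    carrier = np.exp(-1j * (wx * (x - x0) + wy * (y - y0)))
    return np.sum(img * window * carrier)

def local_amplitude_phase(img, x0, y0, wx, wy, sigma=4.0):
    """Local amplitude and local phase, i.e. the polar form of the response."""
    r = gabor_response(img, x0, y0, wx, wy, sigma)
    return np.abs(r), np.angle(r)

# Hypothetical input: a vertical grating with spatial frequency pi/4 rad/pixel.
_, xg = np.mgrid[0:64, 0:64]
grating = np.cos(np.pi / 4 * xg + 0.3)
amp, phase = local_amplitude_phase(grating, 20, 30, np.pi / 4, 0.0)
amp_scaled, phase_scaled = local_amplitude_phase(3.0 * grating, 20, 30, np.pi / 4, 0.0)
```

For this grating the local phase at the window center equals the phase of the grating at that position (modulo 2π), and scaling the intensity changes the local amplitude but not the local phase.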
2.3. Reconstruction of Images from Local Phase
Amplitude and phase can be interpreted as measurements/projections of images that are indicative of their information content. Classically, when both the global amplitude and phase are known, it is straightforward to reconstruct the image. The reconstruction calls for computing the inverse Fourier transform given the global amplitude and phase. Similarly, when using local amplitude and phase, if the sampling functions form a basis or frame in a space of images, the reconstruction is provided by the formalism of wavelet theory, for example, using Gabor wavelets .
Amplitude or phase represents partial information extracted from visual scenes. They are obtained via nonlinear sampling, that is, a nonlinear operation for extracting the amplitude and phase information from images. The nonlinear operation makes reconstruction from either amplitude or phase alone difficult. Earlier studies and recent development in solving quadratic constraints, however, suggest that it is possible to reconstruct images from global or local amplitude information [18, 19].
While computing the amplitude requires a second order (quadratic) nonlinearity, computing the phase calls for higher order nonlinear operators (Volterra kernels). It is possible, however, to reconstruct up to a constant scale an image from its global phase information alone without explicitly using the amplitude information. An algorithm was provided for solving this problem in the discrete signal processing domain in . The algorithm smartly avoids using the inverse tangent function by reformulating the phase measurements as a set of linear equations that are easy to solve.
Using a similar argument, we demonstrate in the following that, up to a constant scale, a bandlimited signal , , can be reconstructed from its local phase alone. We first formulate the encoding of an image by local phase, using the Gabor receptive fields as a special case. It is straightforward then to formulate the problem with local phase computed from other types of STFTs.
Formally, we consider an image, , on the domain , to be an element of a space of trigonometric polynomials of the formwhereare the set of basis functions of , , are the bandwidth and the order, respectively, of the space in the dimension, and , are the bandwidth and the order, respectively, of the space in the dimension.
Consider a bank of Gabor receptive fieldswhere is the translation operator with , , , , and , and , , , , , and . The responses of the Gabor receptive fields to the input are given bywhere , , is the local amplitude and is the local phase.
Dividing both sides of (11) by , we haveSince , we haveNote that is a real-valued Gabor receptive field with a preferred phase at .
Remark 3. Assuming that the local phase information is obtained via measurements, that is, filtering the image with pairs of Gabor receptive fields (10), the set of linear equations (14) has a simple interpretation: the image is orthogonal to the space spanned by the functionswhere with , .
We are now in the position to provide a reconstruction algorithm of the image from phase , .
Lemma 4. can be reconstructed from , aswithwhere is a matrix whose th row and th column entry areHere traverses the set , and . is a vector of the formthat belongs to the null space of . A necessary condition for perfect reconstruction of , up to a constant scale, is that , where is the number of phase measurements.
Proof. Substituting (7) into (14), we obtain for all . Therefore, we have inferring that is in the null space of .
If , it follows from the rank-nullity theorem that leading to multiple linearly independent solutions to .
Example 5. In Figure 2 an example of reconstruction of an image is shown using only local phase information. The reconstructed signal was scaled to match the original signal. The SNR of the reconstruction is 44.48 [dB]. An alternative way to obtain a unique reconstruction is to include an additional measurement, for example, the mean value of the signal to the system of linear equations (14).
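The null-space idea behind Lemma 4 can be illustrated on a simplified one-dimensional, discrete, periodic analogue. The sketch below is an assumption-laden toy version, not the algorithm used for Figure 2: the signal length, the order of the space, the filter bank parameters, and the real cosine/sine basis are all chosen here for illustration. Each measured phase yields one linear constraint on the coefficients, and the coefficient vector is recovered, up to scale, as the right singular vector associated with the smallest singular value.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 128, 8                      # samples and order of the space (assumed values)
n = np.arange(N)

# Real basis of the space: constant, cosines, and sines up to order L (2L + 1 columns).
B = np.column_stack(
    [np.ones(N)]
    + [np.cos(2 * np.pi * l * n / N) for l in range(1, L + 1)]
    + [np.sin(2 * np.pi * l * n / N) for l in range(1, L + 1)]
)
c_true = rng.standard_normal(2 * L + 1)
u = B @ c_true

# Bank of periodic complex Gabor filters (hypothetical parameters).
sigma = 6.0
centers = np.arange(0, N, 8)
freqs = 2 * np.pi * np.arange(1, L + 1) / N
filters = []
for x0 in centers:
    d = (n - x0 + N // 2) % N - N // 2      # circular distance to the window center
    w = np.exp(-d ** 2 / (2 * sigma ** 2))
    for om in freqs:
        filters.append(w * np.exp(-1j * om * d))
G = np.array(filters)                       # one filter per row

phase = np.angle(G @ u)                     # the only information kept about u

# Each phase gives one linear constraint: Im(exp(-j * phi_k) * <g_k, B c>) = 0.
Phi = np.imag(np.exp(-1j * phase)[:, None] * (G @ B))

# The coefficients lie in the null space of Phi: take the right singular
# vector with the smallest singular value.
_, s, Vt = np.linalg.svd(Phi)
c_rec = Vt[-1]
u_rec = B @ c_rec
u_rec *= (u_rec @ u) / (u_rec @ u_rec)      # fix the unknown scale and sign for comparison

snr_db = 10 * np.log10(np.sum(u ** 2) / np.sum((u - u_rec) ** 2))
```

The number of phase measurements (128 here) exceeds the dimension of the space (17), in line with the necessary condition stated in Lemma 4. The rescaling step mirrors the scaling applied in Example 5.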
3. Visual Motion Detection from Phase Information
In this section we consider visual fields that change as a function of time. For notational simplicity, $u$ will denote here the space-time intensity of the visual field.
3.1. The Global Phase Equation for Translational Motion
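Let $u = u(x, y, t)$, $(x, y) \in \mathbb{R}^2$, $t \in \mathbb{R}$, be a visual stimulus. If the visual stimulus is a pure translation of the signal at $t = 0$, that is,
$$u(x, y, t) = u(x - s_x(t),\, y - s_y(t),\, 0), \tag{22}$$
where
$$s_x(t) = \int_0^t v_x(s)\, ds, \qquad s_y(t) = \int_0^t v_y(s)\, ds \tag{23}$$
are the total length of translation at time $t$ in each dimension and $v_x(t)$ and $v_y(t)$ are the corresponding instantaneous velocity components, then the only difference between $u(x, y, t)$ and $u(x, y, 0)$ in the Fourier domain is captured by their global phase. More formally, consider the following.
Lemma 6. The change (derivative) of the global phase is given by
$$\frac{d\phi}{dt}(\omega_x, \omega_y, t) = -\omega_x v_x(t) - \omega_y v_y(t), \tag{24}$$
where, by abuse of notation, $\phi(\omega_x, \omega_y, t)$ denotes the global phase of $u(x, y, t)$ and $\phi(\omega_x, \omega_y, 0)$ is the initial condition.
Proof. If $\hat{U}(\omega_x, \omega_y, t)$ is the 2D (spatial) Fourier transform of $u = u(x, y, t)$, $(x, y) \in \mathbb{R}^2$, by the Fourier shift theorem, we have
$$\hat{U}(\omega_x, \omega_y, t) = \hat{U}(\omega_x, \omega_y, 0)\, e^{-j(\omega_x s_x(t) + \omega_y s_y(t))}. \tag{25}$$
For a certain frequency component $(\omega_x, \omega_y)$, the change of its global phase over time amounts to
$$\frac{d\phi}{dt}(\omega_x, \omega_y, t) = -\begin{bmatrix} \omega_x & \omega_y \end{bmatrix} \begin{bmatrix} v_x(t) \\ v_y(t) \end{bmatrix}. \tag{26}$$
Therefore, in the simple case where the entire visual field is shifting, the derivative of the phase of Fourier components indicates motion, and it can be obtained by the inner product between the component frequency and the velocity vector.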
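The statement of Lemma 6 can be checked numerically in a discrete setting. The sketch below is an illustration under assumptions chosen here: a random 32-by-32 frame, a circular (periodic) shift standing in for pure translation, and a shift of a few pixels occurring between two frames. The phase change of every Fourier component is compared, modulo 2π, with the negative inner product of frequency and displacement.

```python
import numpy as np

rng = np.random.default_rng(2)
frame0 = rng.standard_normal((32, 32))
dx, dy = 3, -2                                   # hypothetical shift in pixels along x and y
# Circular shift: frame1[y, x] = frame0[y - dy, x - dx], a pure translation.
frame1 = np.roll(frame0, shift=(dy, dx), axis=(0, 1))

wy = 2 * np.pi * np.fft.fftfreq(32)[:, None]     # rad/pixel along rows (y)
wx = 2 * np.pi * np.fft.fftfreq(32)[None, :]     # rad/pixel along columns (x)

# Phase difference between the two frames for every frequency component.
phase_change = np.angle(np.fft.fft2(frame1) * np.conj(np.fft.fft2(frame0)))
predicted = -(wx * dx + wy * dy)

# Compare modulo 2*pi by mapping both quantities onto the unit circle.
max_err = np.max(np.abs(np.exp(1j * phase_change) - np.exp(1j * predicted)))
```

If the shift takes place over one frame interval, dividing the displacement by that interval gives the velocity, and the phase change per unit time matches the inner product in the lemma.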
3.2. The Change of Local Phase
3.2.1. The Local Phase Equation for Translational Motion
The analysis in Section 3.1 applies to global motion. This type of motion occurs most frequently when the imaging device, either an eye or a camera, moves. Visual motion in the natural environment, however, is more diverse across the screen since it is often produced by multiple moving objects. The objects can be small and the motion more localized in the visual field.
Taking the global phase of will not simply reveal where motion of independent objects takes place or their direction/velocity of motion. The ease of interpretation of motion by using the Fourier transform, however, motivates us to reuse the same concept in detecting local motion. This can be achieved by restricting the domain of the visual field where the Fourier transform is applied.
To be able to detect local motion, we consider the local phase of by taking the STFT with window function . Note that, the STFT and its ubiquitous implementation in DSP chips can be extensively used in any dimension. For simplicity and without loss of generality, we consider the window to be centered at . The STFT is given bywhere, by abuse of notation, is the amplitude and the local phase.
Before we move on to the mathematical analysis, we can intuitively explain the relation between the change in local phase and visual motion taking place across the window support. First, if the stimulus undergoes a uniform change of intensity or it changes proportionally over time due to lighting conditions, for example, the local phase does not change since the phase is invariant with respect to intensity scaling. Therefore, the local phase does not change for such nonmotion stimuli. Second, a rigid edge moving across the window support will induce a phase change.
For a strictly translational signal within the window support (footprint), for example, where and are as defined in (23), we have the following result.
Lemma 7. Consider the following:where, by abuse of notation, is the local phase of and is the initial condition. The functional form of the term is provided in the Appendix.
Proof. The derivation of (29) and the functional form of are given in the Appendix.
Remark 8. We notice that the derivative of the local phase has similar structure to that of the global phase, but for the added term . Through simulations, we observed that the first two terms in (29) dominate over the last term for an ON or OFF moving edge . For example, Figure 3 shows the derivative of the local phase given in (29) for an ON edge moving with velocity pixels/sec.
Remark 9. Note that may not be differentiable even if is differentiable, particularly when . For example, the spatial phase can jump from a positive value to zero when diminishes. This also suggests that the instantaneous local spatial phase is less informative about a region of a visual scene whenever is close to zero. Nevertheless, the time derivative of the local phase can be approximated by applying a high-pass filter to .
3.2.2. The Block Structure for Computing the Local Phase
We construct Gaussian windows along the , dimensions. The Gaussian windows are defined aswhere , , in which is the distance between two neighboring windows and , , where are the number of pixels of the screen in and directions, respectively.
We then take the 2D Fourier transform of the windowed video signal and write in polar formThe above integral can be very efficiently evaluated using the 2D FFT in discrete domain defined on blocks approximating the footprint of the Gaussian windows. For example, the standard deviation of the Gaussian windows we use in the examples in Section 4 is 4 pixels. A block of pixels () is sufficient to cover the effective support (or footprint) of the Gaussian window. At the same time, the size of the block is a power of 2, which is most suitable for FFT-based computation. The processing of each block is independent of all other blocks; thereby, parallelism is readily achieved.
Note that the size of the window is informed by the size of the objects one is interested in locating. Measurements of the local phase using smaller window functions are less robust to noise. Larger windows would enhance object motion detection if the object size is comparable to the window size. However, there would be an increased likelihood of independent movement of multiple objects within the same window, which is not modeled here and thereby may not be robustly detected.
Therefore, for each block , we obtain measurements of the phase at every time instant , with , where with . We then compute the temporal derivative of the phase, that is, for .
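The block structure can be sketched as follows. This is a plain NumPy illustration rather than the PyCUDA implementation: the block size of 32 pixels is an assumption (the text only states that it is a power of 2 covering the footprint of a window with a standard deviation of 4 pixels), the stride of 6 pixels follows the spacing quoted for Figure 4, and the function names and the finite-difference approximation of the temporal derivative are choices made here.

```python
import numpy as np

def block_local_phase(frame, sigma=4.0, stride=6, block=32):
    """Local amplitude and phase on overlapping Gaussian-windowed blocks.

    Returns two arrays of shape (rows, cols, block, block), where (rows, cols)
    indexes the window centers and the last two axes index the FFT frequencies.
    """
    half = block // 2
    ax = np.arange(block) - half
    win = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    # Zero padding: windows near the border are cut off (pixels outside are zero).
    padded = np.pad(frame, half)
    ys = np.arange(0, frame.shape[0], stride)
    xs = np.arange(0, frame.shape[1], stride)
    spec = np.empty((len(ys), len(xs), block, block), dtype=complex)
    for i, y0 in enumerate(ys):
        for j, x0 in enumerate(xs):
            patch = padded[y0:y0 + block, x0:x0 + block] * win
            # ifftshift puts the window center at index 0, so that phases are
            # measured relative to the center of the window.
            spec[i, j] = np.fft.fft2(np.fft.ifftshift(patch))
    return np.abs(spec), np.angle(spec)

def phase_change(phase_prev, phase_next, dt):
    """Finite-difference estimate of the phase derivative, wrapped to (-pi, pi]."""
    return np.angle(np.exp(1j * (phase_next - phase_prev))) / dt
```

Each block is processed independently in the double loop, which is what makes a parallel implementation straightforward. A uniform rescaling of the frame leaves the output of `phase_change` at zero, consistent with the invariance of phase to intensity scaling noted in Section 3.2.1.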
We further illustrate an example of the block structure in Figure 4. Figure 4(a) shows an example of an image of pixels. Four Gaussian windows are shown each with a standard deviation of 4 pixels. The distance between the centers of two neighboring Gaussian windows is 6 pixels. The red solid square shows a -pixel block with , , which encloses effective support of the Gaussian window on top-left (, is the block with Gaussian window centered at pixel ). The green dashed square shows another -pixel block with , . The two Gaussian windows on the right are associated with the blocks , and , , respectively. Cross section of all Gaussian windows with , , that is, those centered on the magenta line, are shown in Figure 4(b). The red and blue curve in Figure 4(b) correspond to the two Gaussian windows shown in Figure 4(a). Figure 4(b) also suggests that some of the Gaussian windows are cut off on the boundaries. This is, however, equivalent to assuming that the pixel values outside the boundary are always zero, and it will not significantly affect motion detection based on the change of local phase.
Since the phase, and thereby the phase change, is noisier when the local amplitude is low, an additional denoising step can be employed to discount the measurements of for low amplitude values . The denoising is given bywhere is a constant, and .
3.3. The Phase-Based Detector
We propose here a block FFT based algorithm to detect motion using phase information. Such an algorithm is, due to its simplicity and parallelism, highly suitable for an in silico implementation.
3.3.1. Radon Transform on the Change of Phases
We exploit the approximately linear structure of the phase derivative for blocks exhibiting motion by computing the Radon transform of over a circular bounded domain .
The Radon transform of the change of phase in the domain is given by whereThe Radon transform evaluated at a particular point is essentially an integral of along a line oriented at angle with the axis and at distance along the direction from .
If, for a particular and , we have , we havewhere is a correction term due to different length of line integrals for different values of in the bounded domain .
After computing the Radon transform of for every block at time , we compute the Phase Motion Indicator (PMI), defined asIf the is larger than a chosen threshold, motion is deemed to occur in block at time .
Using the Radon transform makes it easier to separate rigid motion from noise. Since the phase is quite sensitive to noise, particularly when the amplitude is very small, the change of phase under noise may have comparable magnitude to that due to motion as mentioned earlier. The change of phase under noise, however, does not possess the structure suggested by (29) in the domain. Instead, it appears to be more randomly distributed. Consequently, the PMI value is comparatively small for these blocks (see also Section 3.3.3).
Moreover, the direction of motion, for block where motion is detected, can be easily computed aswhereThis follows from (36).
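A possible discrete realization of the Radon-based readout is sketched below. Because the defining equations are not reproduced here, the sketch makes explicit assumptions: the indicator is taken as the maximum over angles of the summed absolute line integrals, the direction is read from the maximizing angle with the sign resolved by the slope of the line sums, and the Radon transform is approximated by binning frequency samples by their signed distance. The input is assumed to be a phase-derivative block with the zero frequency at index n // 2 (for example, after `np.fft.fftshift`). None of the function names come from the source.

```python
import numpy as np

def radon_disc(field, n_angles=36):
    """Discrete Radon transform of a square array over a centered disc.

    Row r holds the line sums for angle thetas[r]; the column index is the
    signed distance rho, rounded to an integer, plus the disc radius.
    """
    n = field.shape[0]
    yy, xx = np.mgrid[0:n, 0:n] - n // 2
    radius = n // 2 - 1
    inside = xx ** 2 + yy ** 2 <= radius ** 2
    thetas = np.arange(n_angles) * np.pi / n_angles
    sino = np.zeros((n_angles, 2 * radius + 1))
    for r, th in enumerate(thetas):
        rho = xx[inside] * np.cos(th) + yy[inside] * np.sin(th)
        bins = np.rint(rho).astype(int) + radius
        sino[r] = np.bincount(bins, weights=field[inside], minlength=2 * radius + 1)
    return thetas, sino

def phase_motion_indicator(dphi, n_angles=36):
    """Return (indicator value, estimated direction of motion in radians)."""
    thetas, sino = radon_disc(dphi, n_angles)
    strength = np.sum(np.abs(sino), axis=1)
    k = int(np.argmax(strength))
    rho = np.arange(sino.shape[1]) - (sino.shape[1] - 1) / 2
    # For dphi = -(wx * vx + wy * vy), the line sums decrease with rho when the
    # velocity points along (cos theta, sin theta); otherwise flip by pi.
    direction = thetas[k] if np.sum(rho * sino[k]) < 0 else thetas[k] + np.pi
    return strength[k], direction
```

For a block whose phase derivative is close to a plane through the origin, the line sums are largest when the projection direction is aligned with the velocity, so the indicator is large and the maximizing angle estimates the direction. For unstructured noise of comparable magnitude, positive and negative values partially cancel along each line and the indicator stays smaller.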
3.3.2. The Phase-Based Motion Detection Algorithm
The algorithm is subdivided into two parts. The first part computes local phase changes and the second part is the phase-based motion detector.
In the first part, the screen is divided into overlapping blocks. For example, the red, green, and blue blocks in the plane “divide into overlapping blocks” of Figure 5 correspond to the squares of the same color covering the video stream. A Gaussian window is then applied on each block, followed by a 2D FFT operation that is used to extract the local phase. A temporal high-pass filter is then employed to extract phase changes.
In the second part, the PMI is evaluated for each block based on the Radon transform of the local phase changes in each block. Motion is detected for blocks with PMI larger than a preset threshold, and the direction of motion is computed as in (39).
The algorithm can be highly parallelized, since each block is processed independently of all other blocks.
We provide an illustrative example in Figure 6 showing how motion is detected using Algorithm 1. The full video of this example can be found in Supplementary Video S1; see Supplementary Material available online at http://dx.doi.org/10.1155/2016/7915245. Figure 6(a) depicts a still from the “highway video” in the Change Detection 2014 dataset  evaluated at a particular time . As suggested by the algorithm, the screen in Figure 6(a) is divided into overlapping blocks and the window functions are applied to each block. Local phases can then be extracted from the 2D FFT of each windowed block, and the local phase changes are obtained by temporal high-pass filtering. The phase change is shown in Figure 6(b) for all blocks, with block enlarged in Figure 6(c) and block enlarged in Figure 6(d) (see also the plane “2D FFT and extract phase change” in Figure 5). Note that at the time of the video frame, block covers a part of the vehicle in motion in the front, and block corresponds to an area of the highway pavement where no motion occurs.
Figure 6(f) depicts, for each block , the maximum phase change over all ; that is, We observe from the figure that, for regions with low amplitude, such as the region depicting the road, when the normalization constant is absent, the derivative of the phase can be noisy. For these blocks the maximum of over all is comparable to the maximum obtained for blocks that cover the vehicles in motion.
However, (29) suggests that the local phase change from multiple filter pairs centered at the same spatial position can provide a constraint to robustly estimate motion and its direction. Given the block structure employed in the computation of the local phase, it is natural to utilize phase change information from multiple sources.
Indeed, if, for a particular block , , then it is easy to see that will be zero on the line and have opposite sign on either side of this line. For example, in Figures 6(b) and 6(c), clearly exhibits this property for blocks that cover a vehicle in motion. The PMI is a tool to evaluate this property.
Finally, the PMIs for all blocks are shown compactly in a heat map in Figure 6(e). The figure shows clearly that the blocks corresponding to the two moving vehicles have a high PMI value while the stationary background areas have a low PMI value, allowing one to easily detect motion by employing simple thresholding (see also the plane “Radon transform and extract strength orientation of plane” in Figure 5). In addition, the orientation of motion in each block is readily observable even by inspection in Figure 6(b) by a line separating the yellow part and blue part in each block. Further results about the direction of motion are presented in Section 4.
3.4. Relationship to Biological Motion Detectors
A straightforward way to implement local motion detectors is to apply a complex-valued Gabor receptive field (5) to the video signal , and then take the derivative of the phase with respect to time or apply a high-pass filter on the phase to approximate the derivative.
We present here an alternate implementation without explicitly computing the phase. This will elucidate the relation between the phase-based motion detector presented in the previous section and some elementary motion detection models used in biology, such as the Reichardt motion detector and the motion energy detector.
Assuming that the local phase is differentiable, we have (see also the Appendix)where and are, respectively, the real and imaginary parts of .
We notice that the denominator of (42) is the square of the local amplitude of , and the numerator is of the form of a second order Volterra kernel. This suggests that the time derivative of the local phase can be viewed as a second order Volterra kernel that processes two normalized spatially filtered inputs and .
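The structure described above follows from differentiating the arctangent. Writing $a(t)$ and $b(t)$ for the real and imaginary parts of a filter output (symbols assumed here for illustration), the phase is $\phi = \operatorname{atan2}(b, a)$ and its derivative is $\frac{d\phi}{dt} = \frac{a\,\dot{b} - b\,\dot{a}}{a^2 + b^2}$. The sketch below checks this identity numerically on two hypothetical smooth signals whose amplitude stays away from zero.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 2001)
dt = t[1] - t[0]

# Hypothetical filter outputs: real part a(t) and imaginary part b(t).
a = 1.5 + 0.4 * np.cos(2 * np.pi * 3 * t)
b = 0.8 * np.sin(2 * np.pi * 2 * t)

# Phase derivative without computing the phase: products of the inputs and
# their derivatives, normalized by the squared amplitude.
da = np.gradient(a, dt)
db = np.gradient(b, dt)
dphi_volterra = (a * db - b * da) / (a ** 2 + b ** 2)

# Phase derivative computed directly from the unwrapped phase.
phi = np.unwrap(np.angle(a + 1j * b))
dphi_direct = np.gradient(phi, dt)

# Ignore the end points, where np.gradient uses one-sided differences.
max_diff = np.max(np.abs(dphi_volterra - dphi_direct)[1:-1])
```

The two estimates agree up to the discretization error of the finite differences, which illustrates why the phase derivative can be evaluated without an explicit inverse tangent.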
We consider an elaborated Reichardt motion detector as shown in Figure 7. It is equipped with a quadrature pair of Gabor filters whose outputs are and , respectively, for a particular value of . The pair of Gabor filters that provide these outputs are the real and imaginary parts of . It also consists of a temporal high-pass and temporal low-pass filter . The output of the elaborated Reichardt detector follows the diagram in Figure 7 and can be expressed asThe response can also be characterized by a second order Volterra kernel. We notice the striking similarity between (43) and the numerator of (42). In fact, the phase-based motion detector shares some properties with the Reichardt motion detector. For example, it is straightforward to see that a single phase-based motion detector is tuned to the temporal frequency of a moving sinusoidal grating.
Since the motion energy detector is formally equivalent to an elaborated Reichardt motion detector, the structure of the motion energy detector with divisive normalization is also similar to the phase-based motion detector.
4. Exploratory Results
In this section, we apply the phase-based motion detection algorithm on several video sequences and demonstrate its efficiency and effectiveness in detecting local motion. The motion detection algorithm is compared to two well-known biological motion detectors, namely, the Reichardt motion detector and the Barlow-Levick motion detector. We then show that the detected local motion can be used in motion segmentation tasks. The effectiveness of the segmentation is compared to segmentation using motion information obtained from a widely used optic flow based algorithm available in the literature.
4.1. Efficient Parallel Implementation
The algorithm was implemented in PyCUDA and tested on an NVIDIA GeForce GTX TITAN GPU. All computations use single-precision floating-point arithmetic. The processing speeds of the algorithm for several screen sizes are listed in Table 1. Clearly, the proposed phase-based motion detection algorithm is capable of processing video in real time, even at full High Definition screen size.
For comparison, we implemented the Reichardt motion detector and the Barlow-Levick motion detector. Their respective diagrams are shown in Figure 8. Note that we moved the high-pass filter to the front of the low-pass filter . This configuration provides a superior performance to the one in Figure 7. For both the Reichardt and the Barlow-Levick detectors, the videos are first blurred by a Gaussian filter (with the same variance as the Gaussian window in (30) used for the phase-based motion detector) and subsampled at the center of each overlapping block in the phase-based motion detector. The subsampled video then provides inputs to two 2D arrays of the circuits shown in Figure 8, one for the horizontal direction and one for the vertical direction. The outputs of the horizontal and vertical motion circuits form an array of motion vectors that indicate the strength and direction of motion. The three tested motion detectors have the same number of outputs as a result. The processing speeds of the Reichardt motion detector and the Barlow-Levick motion detector, both implemented in PyCUDA and tested on the same GPU, are shown in Table 1.
Note that the Reichardt motion detector and the Barlow-Levick motion detector are highly efficient due to the simplicity of their algorithms. The phase-based motion detection algorithm, however, is a much more sophisticated algorithm, and yet it can be implemented in real-time using parallel computing devices.
The fast GPU implementation is based on the FFT and matrix-matrix multiplication. These operations are expected to map efficiently onto hardware as well, for example, onto an FPGA.
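As a rough illustration of the FFT stage only, the following sketch computes Gaussian-windowed 2D FFTs over overlapping blocks with a single batched call. It uses NumPy on the CPU rather than PyCUDA, and it is not the implementation benchmarked in Table 1. The function name `local_phase_blocks`, the 32-pixel block size, and the 16-pixel stride are assumptions made for the example; only the 4-pixel standard deviation of the window is taken from Section 4.2.

```python
import numpy as np

def local_phase_blocks(frame, block=32, stride=16, sigma=4.0):
    """Amplitude and phase of Gaussian-windowed 2D FFTs over overlapping blocks.

    block and stride are assumed values chosen for illustration.
    """
    h, w = frame.shape
    # Separable Gaussian window centered on the block (sigma in pixels).
    r = np.arange(block) - (block - 1) / 2.0
    g = np.exp(-r**2 / (2.0 * sigma**2))
    window = np.outer(g, g).astype(np.float32)
    # Top-left corners of the overlapping blocks.
    ys = range(0, h - block + 1, stride)
    xs = range(0, w - block + 1, stride)
    blocks = np.stack([frame[y:y + block, x:x + block] for y in ys for x in xs])
    # One batched FFT over the last two axes processes every block at once.
    spectrum = np.fft.fft2(blocks * window, axes=(-2, -1))
    shape = (len(ys), len(xs), block, block)
    return np.abs(spectrum).reshape(shape), np.angle(spectrum).reshape(shape)

rng = np.random.default_rng(0)
frame = rng.random((64, 96), dtype=np.float32)   # stand-in for one video frame
amplitude, phase = local_phase_blocks(frame)
print(amplitude.shape)  # (3, 5, 32, 32): a 3 x 5 grid of 32 x 32 local spectra
```

Because every block is processed independently, the same layout carries over to a batched GPU FFT with one spectrum per block.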
4.2. Examples of Phase-Based Motion Detection
We applied our motion detection algorithm to video sequences of the Change Detection 2014 dataset that did not exhibit camera egomotion. For these video sequences, the standard deviation of the Gaussian window functions was set to 4 pixels and the block size was chosen to be pixels. Threshold and normalization parameters were kept the same with the exception of the “thermal video” (in order to deal with larger background noise levels, see below). We also tested the same video sequences using the Reichardt motion detector and the Barlow-Levick motion detector. For the Reichardt detector, the high-pass filters were chosen as first order filters with a time constant of 200 milliseconds, and the low-pass filters were chosen as first order filters with a time constant of 300 milliseconds (assuming that the frame rate is 50 frames per second). Threshold was set to 2. For the Barlow-Levick motion detector, the time constant of the first high-pass filters was set to 250 milliseconds. The low-pass filters were the same as in the Reichardt detector. Threshold was set to 2.
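To make the Reichardt configuration concrete, here is a minimal sketch of a single correlator pair using the time constants quoted above (200 ms high-pass, 300 ms low-pass, 50 frames per second). The backward-Euler discretization, the wiring of the two arms, and the names `low_pass`, `high_pass`, and `reichardt` are assumptions for illustration; the exact circuit is the one in Figure 8, and the threshold of 2 is not applied here because it depends on the input scaling.

```python
import numpy as np

def low_pass(x, tau, dt):
    """First-order low-pass filter (backward-Euler step, zero initial state)."""
    alpha = dt / (tau + dt)
    y = np.zeros(len(x), dtype=float)
    for n in range(1, len(x)):
        y[n] = y[n - 1] + alpha * (x[n] - y[n - 1])
    return y

def high_pass(x, tau, dt):
    """First-order high-pass filter, formed as the input minus its low-pass."""
    return x - low_pass(x, tau, dt)

def reichardt(left, right, tau_hp=0.2, tau_lp=0.3, dt=1 / 50):
    """Opponent correlator: positive output for motion from left to right."""
    hl, hr = high_pass(left, tau_hp, dt), high_pass(right, tau_hp, dt)
    # Each arm multiplies the delayed (low-passed) signal of one input
    # with the undelayed signal of the other; the arms are then subtracted.
    return low_pass(hl, tau_lp, dt) * hr - hl * low_pass(hr, tau_lp, dt)

dt = 1 / 50
t = np.arange(500) * dt                        # 10 seconds at 50 frames per second
left = np.sin(2 * np.pi * 2.0 * t)             # 2 Hz drifting grating, left input
right = np.sin(2 * np.pi * 2.0 * t - np.pi / 4)  # right input sees it later
rightward = reichardt(left, right)[100:].mean()  # skip the initial transient
leftward = reichardt(right, left)[100:].mean()
print(rightward > 0, leftward < 0)
```

The sign of the time-averaged output gives the direction; swapping the two inputs flips the sign exactly, which is the opponency property of the detector.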
The first video was taken from a highway surveillance camera (“highway video”) under good illumination conditions and high contrast. The video had moderate noise, particularly on the road surface. The detected motion is shown in the top left panel of Figure 9 (see Supplementary Video S2 for full video). The phase-based motion detection algorithm captured both the moving cars and the tree leaves moving in the wind. In the time interval between the 9th and 10th second, the camera moved slightly leftward and rightward within 5 frames, again possibly due to the wind. Movement due to this shift was captured by the motion detection algorithm, and the algorithm was fast enough to correctly determine the direction of this movement. This video also shows that the algorithm suffers from the aperture problem. For example, in front of the van, where a long, horizontal edge is present, the detected motion mostly points downwards, although the edge is also moving to the left. This is expected since the algorithm only detects motion locally and does not take into account the overall shape of any object.
For comparison, the motion detection results for the Reichardt motion detector and the Barlow-Levick motion detector are shown in the top middle and top right panel of Figure 9 (see Supplementary Video S2 for full video). The Reichardt detector performed relatively well when vehicles were moving faster, but the direction it predicted for slower vehicles, for example, at the back of the image, was not accurate. In addition, motion was still detected in some parts of the screen where the vehicles had just passed by. The detection result for the Barlow-Levick motion detector was poorer. In particular, the response to OFF edge movement was always opposite to the actual movement direction.
We then squeezed the range of the screen intensity from to , resulting in a video with a lower mean luminance and lower contrast. The motion detection results on the low-contrast video are shown in the bottom 3 panels of Figure 9 (see Supplementary Video S2 for full video). For reference, the motion detection results for the original video are shown in red arrows. Motion detected in the low-contrast video is shown in blue arrows if no motion is detected for the block in the original video. If motion is detected in both the original and the low-contrast video, the arrows are shown in magenta. Figure 9 shows that, while the motion detection performance is degraded for all three detectors, the phase-based motion detector still performed quite well in the low-contrast video, detecting most of the moving vehicles. The other two detectors missed many of the blocks where motion was detected in the original video.
To quantify how well each detector works under different contrast conditions, we computed the ratio of unthresholded output values of each motion detector between lower contrast and full contrast video. For example, in the case of phase-based motion detector, we computed, for each block, the ratio between the PMI index for the lower contrast video and that for the full contrast video. The ratios are then averaged across all blocks where motion is detected in the full contrast video. This average for different contrasts is shown in Figure 10 by the blue curve. In the ideal case when the normalization constant is 0, the phase detector should produce invariant responses to videos with different contrast. The curve shown here is mainly due to a nonzero . The ratios for the Reichardt motion detector and the Barlow-Levick motion detector are computed similarly and are shown in red and yellow, respectively. It is clear that the phase-based motion detector has a superior performance across a range of contrast values. At contrast, the phase-based detector still has of the PMI index value for full contrast video. As expected, the response of Barlow-Levick motion detector is linear with respect to contrast, and the Reichardt motion detector has a quadratic relation to contrast. For a fixed threshold value a larger ratio equates to a more consistent performance at lower contrast.
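The scaling argument behind Figure 10 can be checked on a toy example. The sketch below assumes that the normalization constant enters the phase derivative additively in the denominator, which is an assumed form rather than the paper's exact expression, and uses made-up filter outputs `a`, `b` and their time derivatives `da`, `db`. Scaling the input by a contrast factor `c` scales all four quantities by `c`.

```python
import numpy as np

def phase_change(a, b, da, db, eps):
    """Phase derivative of a + jb with normalization constant eps (assumed form)."""
    return (a * db - b * da) / (a**2 + b**2 + eps)

# Illustrative full-contrast values (hypothetical, not measured data).
a, b, da, db = 0.8, 0.6, -0.3, 0.4
contrasts = np.array([1.0, 0.5, 0.2, 0.1])

def ratio(eps):
    """Response at each contrast divided by the full-contrast response."""
    full = phase_change(a, b, da, db, eps)
    low = np.array([phase_change(c * a, c * b, c * da, c * db, eps)
                    for c in contrasts])
    return low / full

phase_ideal = ratio(0.0)       # stays at 1: contrast invariant when eps is 0
phase_eps = ratio(0.01)        # falls below 1 as the contrast decreases
linear = contrasts             # scaling of a detector that is linear in contrast
quadratic = contrasts**2       # scaling of a multiplicative correlator
print(phase_ideal, phase_eps, linear, quadratic, sep="\n")
```

With a nonzero constant the ratio decays, but in this toy setting it stays above both the linear and the quadratic curves, which matches the ordering described for Figure 10.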
The second video was captured by a surveillance camera in a train station (“train station video”). The video was under moderate room light with a low noise level. The front side of the video had high contrast; illumination on the back side was quite low. The detected motion is shown in the video of Figure 11 (see Supplementary Video S3 for full video). Movements of people were successfully captured by the motion detection algorithm.
The third video (“thermal video”) had a large amount of background noise. The threshold for detecting motion was raised by 60% in order to mitigate the increased level of noise. The detected motion is shown in the video of Figure 12 (see Supplementary Video S4 for full video).
The last example we show here was taken from a highway surveillance camera at night (“winterstreet video”). The overall illumination on the lower-left side was low whereas illumination was moderate on the upper-right side where the road was covered by snow. The detected motions are shown in the video in Figure 13 (see Supplementary Video S5 for full video). We note that, overall, car movements were successfully detected. Car movements on the lower-left side, however, suffered from low illumination and some parts of the car were not detected well due to the trade-off employed for noise suppression.
With a higher threshold, the phase-based motion detection algorithm is able to detect motion under noisy conditions. We added to the original “highway video” and “train station video” Gaussian white noise with standard deviation of the maximum luminance range. The results are shown, respectively, in Figures 14 and 15 (see Supplementary Videos S6 and S7 for full videos).
4.3. Examples of Motion Segmentation
We asked whether the detected motion signals in the video sequences can be useful for segmenting moving objects from the background. We applied a larger threshold to only signal motion for salient objects. The blocks, however, introduce large boundaries around the moving objects. To reduce the boundary and to segment the moving object more closely to the actual object boundary, we applied the motion detection algorithm around the detected boundary with blocks. If blocks did not indicate motion, then the corresponding area was removed from the segmented object area.
For comparison, we performed motion segmentation using local motion detected by an optic flow algorithm. The segmentation in this case was implemented by comparing the length of the optic flow vectors with an appropriate threshold.
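The thresholding step for the optic flow baseline can be sketched as follows. The flow field is a synthetic array standing in for the output of an optic flow algorithm, and `segment_by_flow` and the threshold value are hypothetical names and numbers chosen for the example.

```python
import numpy as np

def segment_by_flow(flow, threshold):
    """Boolean mask of pixels whose flow-vector length exceeds the threshold."""
    return np.linalg.norm(flow, axis=-1) > threshold

# Synthetic flow field of shape (height, width, 2), zero everywhere at first.
flow = np.zeros((4, 6, 2))
flow[1:3, 2:5] = (3.0, 4.0)    # fast object: vector length 5
flow[0, 0] = (0.3, 0.4)        # slow object: vector length 0.5
mask = segment_by_flow(flow, threshold=1.0)
print(mask.sum())  # 6 pixels belong to the fast object; the slow one is missed
```

Since the mask depends on vector length alone, a slowly moving object can fall below the threshold even when its motion is estimated correctly.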
We employed a simple thresholding for both phase-based motion detection algorithms and optic flow based motion detection. More sophisticated algorithms may produce better segmentation results. Thus, the segmentation was purely based on motion cues and no postprocessing at pixel level was performed. The state-of-the-art results for the Change Detection 2014 dataset utilize multiple cues such as motion, color, and background extraction to segment objects and thereby achieve better results. We are exploring here, however, only the case where motion is the only cue for segmentation. Therefore, the ground truth information from the dataset was not applicable to our test. We will, therefore, only show the effectiveness of motion segmentation visually.
We first applied motion based segmentation to a 2-second segment of the “highway video.” The result using the local phase-based motion detection algorithm is shown in the video of Figure 16(a) (see Supplementary Video S8a for full video) and that using optic flow based motion detection algorithm is shown in the video of Figure 16(b) (see Supplementary Video S8b for full video). Both videos are played back at of speed. With a higher threshold, the movement of the leaves was no longer picked up by the phase-based motion detector. Therefore, only the cars were identified as moving objects and they are indicated in red. Although the moving objects were not perfectly segmented on their boundary, they were mostly captured. For the optic flow based segmentation, since the regions of interest are set by thresholding the length of the velocity vector, objects moving at lower speed, for example, the cars on the top, were not always picked up by the segmentation algorithm.
We then applied the motion segmentation to a 2-second segment of the “train station video,” the “thermal video,” and the “winterstreet video.” The results are shown, respectively, in the videos of Figures 17, 18, and 19 (see Supplementary Videos S9a and S9b, S10a and S10b, and S11a and S11b, resp., for full videos).
These results show that, for the purpose of detecting local motion and using it as a motion segmentation cue, the local phase-based motion detector works as well as, if not better than, a simple thresholding segmentation using an optic flow based algorithm.
Previous research demonstrated that global phase information alone can be used to faithfully represent visual scenes. Here we provided a reconstruction algorithm of visual scenes by only using local phase information. More importantly, local phase information can be effectively used to detect local motion. Through a simple temporal derivative of the phase, we obtained a second order Volterra kernel that is applied on two normalized inputs. The structure of the second order Volterra kernel in the phase-based motion detector is akin to models employed to detect motion in biological vision, for example, the Reichardt detector and the motion energy detector.
We then proposed an efficient, FFT-based algorithm employing the change in local phase for detecting motion. In order to exploit the special structure of the change in phase in the frequency domain that is due to rigid motion, the phase-based motion detection algorithm also incorporates the Radon transform, a transform closely related to the Fourier transform. Based on the Radon transform, a motion indicator was proposed to robustly detect whether the phase change is caused by motion. Therefore, the algorithm can be efficiently implemented whenever the FFT is available/supported. We showed examples of applying the phase-based motion detection algorithm on several video sequences. We also showed that the locally detected motion can be used for segmenting moving objects in video scenes. We compared the segmentation of moving objects using our local phase-based algorithm to segmentation achieved using a widely used optic flow based algorithm. Our results suggest that spatial phase information may provide an efficient alternative to perform many visual tasks in silico as well as in modeling in vivo biological vision systems. This is consistent with other recent findings.
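The structure that rigid motion imposes on the phase change can be seen in its simplest form with a global FFT and a circular shift, which is a simplification of the windowed, local setting used by the algorithm. In this sketch the image, the shift `(dy, dx)`, and all variable names are illustrative assumptions. By the Fourier shift theorem, the phase change at frequency (wy, wx) equals -(wy*dy + wx*dx), that is, a plane through the origin of the frequency domain whose slope encodes the displacement.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16
img = rng.random((n, n))                       # stand-in for an image block
dy, dx = 1, 2                                  # rigid shift in pixels
shifted = np.roll(img, shift=(dy, dx), axis=(0, 1))

F0, F1 = np.fft.fft2(img), np.fft.fft2(shifted)
# Phase change at every frequency, wrapped to (-pi, pi].
dphi = np.angle(F1 * np.conj(F0))

# Angular frequencies along each axis, in radians per pixel.
wy = 2 * np.pi * np.fft.fftfreq(n)[:, None]
wx = 2 * np.pi * np.fft.fftfreq(n)[None, :]
plane = -(wy * dy + wx * dx)

# Compare on the unit circle so that 2*pi wrap-arounds do not matter.
err = np.abs(np.exp(1j * dphi) - np.exp(1j * plane)).max()
print(err < 1e-6)
```

Phase changes that are not caused by a rigid shift, such as noise, do not generally line up on such a plane, which is what makes the planar structure a useful cue for a motion indicator.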
Note that phase information has been used for solving various visual tasks in the past. In fact, phase has been successfully employed in optic flow algorithms and image registration for translation, both applied to motion related tasks. The phase-based optic flow algorithm applies the optic flow equation on the local phase of images rather than the intensity itself to achieve better resolution and robustness in estimating motion velocity. The phase correlation method computes the normalized cross-power spectrum of two images to extract the phase difference. It provides better accuracy and robustness as compared to the classical cross correlation approach applied to two consecutive images in a sequence. However, it has limitations when dealing with images with a repetitive structure.
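For readers unfamiliar with phase correlation, a minimal sketch of the standard method is given below; it is the prior technique discussed above, not the algorithm proposed in this paper. The function name `phase_correlation`, the test image, and the use of circular shifts are assumptions made to keep the example short.

```python
import numpy as np

def phase_correlation(f, g):
    """Estimate the circular shift (dy, dx) such that g = np.roll(f, (dy, dx))."""
    F, G = np.fft.fft2(f), np.fft.fft2(g)
    cross = G * np.conj(F)
    # Normalized cross-power spectrum: discard amplitude, keep only the phase.
    cross /= np.maximum(np.abs(cross), 1e-12)
    # Its inverse transform peaks at the displacement between the two images.
    corr = np.fft.ifft2(cross).real
    return np.unravel_index(np.argmax(corr), corr.shape)

rng = np.random.default_rng(2)
image = rng.random((32, 32))
moved = np.roll(image, shift=(5, 29), axis=(0, 1))
estimate = phase_correlation(image, moved)
print(estimate)  # peak location, here row 5 and column 29
```

For an image with a repetitive structure, several displacements explain the data equally well and the correlation surface has multiple peaks of similar height, which is the limitation noted above.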
Our method differs from the above two cases in the following ways. First, it employs a simple temporal derivative/high-pass filtering on the phase to extract local phase changes. The structure of the phase change is exploited for better detection of motion. On the contrary, the phase correlation method considers the structure of the phase itself. This also allowed us to make the motion detection a more local, continuous process rather than purely operating globally on discrete frames. Second, it explores the structure of the change of phase due to motion in the frequency domain rather than in the spatial domain, in which the key constraints of optic flow equations are based. Third, instead of focusing on estimating the exact velocity or shift, our method is centered on the detection of motion with a coarse local estimate of direction. This is the case in the first steps of biological motion detection in the retina or optic lobe of insects; we discussed the resemblance of the method presented here to those of biological models of motion detection.
The proposed motion detection algorithm, however, shares several advantages with the other phase-based methods. For example, it is sensitive to motion that only induces a subpixel shift between frames and for very small differences in intensity. In addition, when compared to the amplitude, the local phase is robust under different contrast and illumination conditions. Consequently, the algorithm presented here can operate in a wide range of contrast/illumination conditions.
Furthermore, once the local phase is extracted from each block, motion detection becomes a localized, temporal operation on the local phase of each block. This forms the basis for the highly parallel structure in the phase-based motion detection algorithm. By contrast, traditional optic flow techniques often rely on explicit comparisons across spatial locations, which increase, for example, memory complexity since all states must be made available to their neighbors.
We also notice that, for each block, a large number of measurements of phase changes are obtained. From Figure 6, we see that this number is much higher than that of the original pixel space. In other words, in order to detect motion in a robust way, the local phase-based motion detection algorithm undergoes an expansion of measurements before settling down onto a single motion indicator value. This number of measurements, however, does not incur additional computational demand thanks to the highly efficient FFT algorithm. Biological visual systems often have a similar structure. For example, in the vertebrate retina, computations are carried out by an extraordinarily large number of neurons until measurements of the visual scene are projected by a fraction of the neurons onto the cortex. Similar expansion takes place in the primary visual cortex as well.
We also highlight the ease of implementation of the intrinsically parallel algorithm proposed here. The algorithm introduced in Section 3.3 is based on the FFT algorithm and does not require solving an optimization problem. It can be efficiently implemented in hardware, for example, FPGAs, or in software, for example, using GPU accelerators. We note that extending the FFT to higher dimensions is straightforward and the implementations of higher dimensional FFTs are also highly efficient. Clearly, the methodology can be applied to motion detection of data in 3D or higher dimensional space, where the Radon transform operates over planes or hyperplanes.
Finally, we argue that the change of phase, although highly nonlinear, can be obtained through a normalization (gain control) followed by a second order Volterra kernel. This separation of a higher order nonlinearity into gain control block and a lower order nonlinear filter can be used for modeling motion detection circuits in biological systems.
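This separation can be verified numerically on a toy signal. Writing the complex filter output as a + jb with amplitude A, the normalized inputs a/A and b/A equal the cosine and sine of the phase, so the phase derivative equals (a/A) d(b/A)/dt - (b/A) d(a/A)/dt, a quadratic form in the two normalized inputs. The signals `a` and `b` below are hypothetical smooth functions, not filter outputs from the paper, and finite differences stand in for the temporal derivative.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 2001)
dt = t[1] - t[0]
# Hypothetical real and imaginary parts of a local filter output (a stays positive).
a = 1.5 + 0.5 * np.cos(2 * np.pi * t)
b = 0.8 * np.sin(2 * np.pi * t) + 0.3 * t

# Direct route: differentiate the unwrapped phase.
phi = np.unwrap(np.angle(a + 1j * b))
dphi_direct = np.gradient(phi, dt)

# Step 1, gain control: divide both inputs by the amplitude.
amp = np.hypot(a, b)
an, bn = a / amp, b / amp
# Step 2, second-order term: products of one normalized input
# with the temporal derivative of the other.
dphi_volterra = an * np.gradient(bn, dt) - bn * np.gradient(an, dt)

# Ignore the two end points, where np.gradient uses one-sided differences.
err = np.max(np.abs(dphi_direct - dphi_volterra)[1:-1])
print(err < 1e-3)
```

The two routes agree up to discretization error, so the nonlinearity of the phase is carried entirely by the normalization step and the remaining operation is of second order.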
The Derivative of the Local Phase
Letand, therefore,The local phase amounts to and therefore Now,