#### Abstract

Moving objects of interest (MOOIs) in surveillance videos are detected and encapsulated by bounding boxes. Since moving objects are defined by temporal activity across consecutive video frames, it is necessary to examine a group of frames (GoF) to detect them. To this end, the traces of moving objects in the GoF are quantified by forming a spatiotemporal gradient map (STGM) over the GoF. Each pixel value in the STGM corresponds to the maximum temporal gradient of the spatial gradients at the same pixel location over all frames in the GoF. The STGM therefore highlights the boundaries of the MOOI in the GoF, and the optimal bounding box encapsulating the MOOI can be determined as the local area with the peak average STGM energy. Once an MOOI and its bounding box are identified, the inside and outside of the box can be treated differently for object-aware size reduction. Our optimal encapsulation method for MOOIs in surveillance videos makes it possible to recognize the moving objects even after low-bitrate video compression.

#### 1. Introduction

Surveillance cameras are ubiquitous and play an important role in our daily life. The video data recorded by surveillance cameras provide rich information to many applications ranging from human-machine interaction [1–3] to content indexing and retrieval [4, 5]. In such applications of digital video surveillance and digital video recording (DVR) systems [6, 7], it is often necessary to examine moving objects over long sequences of frames in recorded videos. This naturally demands highly efficient compression of a huge amount of video data. The competing requirement is to maintain high visual quality, especially for the important information in the video such as the moving objects, in low-bitrate compression.

A wide range of advanced techniques has been proposed to improve the conventional video compression framework. For example, an efficient block mode determination algorithm [8] was applied to scalable video compression, where video data can change their resolution to use the limited bandwidth efficiently. The scalable compression scheme is particularly useful for surveillance videos. Note that surveillance videos usually consist of alternating sequences of frames with static background and frames with moving objects. Clearly, the moving objects are the important data to be preserved in compression. This requires the compression technique to distinguish the important moving objects of interest (MOOI) from the unimportant static background (non-MOOI) in the video and to treat them differently in the compression process. As a result, the natural user interface (NUI) via face detection [9, 10] in surveillance videos can be a feasible technique even for highly compressed videos. To differentiate the MOOI from the non-MOOI, object segmentation and tracking processes can be applied [6, 7]. However, these methods need to identify accurate object boundaries, which often requires expensive computation. Weng et al. [11] used a Kalman filter for object detection and tracking. This method can detect and track the object trajectory accurately frame by frame. However, the object boundary that separates the MOOI from the non-MOOI cannot be identified clearly. Goswami et al. [12] used a mesh-based technique to track moving objects in a video sequence. Mesh-based motion estimation techniques are more accurate than the block-based method, but they are relatively slow due to their high computational complexity.

In this paper, we differentiate the MOOI from the non-MOOI by detecting the bounding boxes surrounding the MOOIs for each group of frames (GoF). The detected bounding boxes encapsulating the MOOIs are then fixed throughout the GoF. To detect a bounding box, we need to identify the pixels with spatiotemporal saliency. For this, we construct the spatiotemporal gradient map (STGM) of a GoF [13], where each pixel in the STGM represents the level of temporal and spatial saliency. Then, the optimal size of the bounding box is determined so as to include the local pixels with the highest energy density of the STGM. Once the pixels belonging to the MOOI are determined by the bounding boxes, we can apply linear transformations with different slopes to the inside and outside of the bounding boxes such that the MOOI remains intact while the non-MOOI pixels are the main target of the size reduction. After this initial data reduction, the standard H.264/AVC compression is applied to the size-reduced frames for further compression. At the receiver, the reverse processes, namely H.264/AVC decompression followed by size expansion with the inverse linear transformations, are applied to restore the video data to the original size. The overall block diagram of our MOOI-based compression is shown in Figure 1.

As far as image size reduction is concerned, various methods have been proposed in the context of content-aware image and video retargeting, such as the seam carving methods [13–17]. These methods reduce the image size by removing the unimportant seam lines that have low saliency. The output video has a reduced spatial resolution, where the richly textured areas are maintained but the homogeneous areas are removed. These video retargeting methods are mainly intended for display purposes, and the original image size cannot be reconstructed from the retargeted videos unless the decoder knows the exact locations of the discarded seam pixels. Therefore, the conventional video retargeting methods are not appropriate as the initial data reduction for video compression. Note that an image pruning scheme with image downsampling as a preprocessing step of video compression has also been used in Vo et al. [18], where one of two consecutive image lines (i.e., even or odd lines) is discarded for image size reduction. Since the line dropping is limited to one of two consecutive lines and the criterion for line dropping is based on the least mean square errors (LMSE) of the interpolated image data, it is hard to differentiate the MOOI from the non-MOOI. A reversible nonuniform size reduction method was also proposed in Won and Shirani [19] without the bounding box. Finally, we note that this paper extends our previous single-MOOI method [20] to multiple MOOIs.

The contributions of this paper can be summarized as follows: (i) we introduce a spatiotemporal gradient map (STGM) to trace the boundary of the MOOI within a GoF; (ii) based on the STGM, we formulate a cost function for determining the center and size of the bounding box encapsulating the MOOI; (iii) we introduce an optimization process that updates the center and size of the bounding box alternately; (iv) we enhance the subjective visual quality, especially for the MOOI, by nonuniformly reducing the size of the video frames as a preprocessing step for H.264 video compression.

This paper is organized as follows. In Section 2, the algorithm for detecting multiple moving objects in video is presented. Then, different linear transforms are applied to the MOOI and the non-MOOI for size reduction in Section 3. Section 4 shows the experimental results of the proposed method. Section 5 concludes this paper.

#### 2. Detection of Multiple Moving Objects of Interest

Moving objects in video can be detected by motion estimation, which is a computationally expensive process. Instead, in this paper, all MOOIs are detected by using spatial and temporal gradients in a GoF of H.264/AVC structure. Specifically, a spatial gradient map at pixel of a frame with a size of can be defined as an average of the magnitude of spatial gradients within a window as follows:
where is the magnitude of the spatial gradient. Using , we define the temporal saliency cost by computing the temporal gradient of the spatial gradients between the two consecutive even numbered frames (even numbered frames give more temporal deviations with less computational load):
As a result of the spatiotemporal gradients in (1) and (2), the boundary pixels of the moving objects are highlighted in the temporal saliency cost . Then, we can construct an STGM by mosaicking the maximum saliency cost at each pixel location throughout the GoF with even numbered frames starting from the first even numbered frame to for frames in the *k*th GoF of the H.264/AVC as follows:
where Note that, according to the definition of in (3), the value of corresponds to the spatial or temporal boundary of the MOOI in the GoF. Therefore, we can define a bounding box (or window) with to encompass the trace of each MOOI, where the weighted sum of within the optimal window at the center yields the peak value. Therefore, our problem boils down to the determination of the optimal window size and the center of the bounding box . In this paper, we propose a novel method to find bounding boxes and their centers for multiple MOOIs in a GoF. Our approach to determine and is to take an alternate optimization process between (4) and (5). That is, starting from an initial value of we apply (4) to have . Then is used to update by (5). This alternate process continues until there is no more change from to . Consider
where
and is an indicator function such that
and are predetermined thresholds. Note that, given the current window size , we find the center of the bounding box for all pixels in the image by (4). Then, given the current center of the bounding box, we examine all possible window sizes to find the maximum number of strong edges within the window (i.e., in (5)) under the condition that the number of weak edges is less than a threshold (i.e., in (5)). Figure 2 shows the convergence of our alternate optimization process of (4) and (5) for and , where our method converges after about five iterations. Note that the center and the size of the bounding box improve after every iterative step. This indicates that the center and size of the bounding box are updated cooperatively, which eventually leads to the final convergence. Because our method is based on a binary decision by the threshold, the computation for each iteration is very simple and the convergence is very fast.

**Figure 2:** panels (a) and (b).
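The construction of the STGM described in this section can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `half_win` parameter, the use of |gx| + |gy| as the gradient magnitude, the replicated image borders, and the choice of frames 0, 2, 4, ... as the even numbered frames are all assumptions made for the example.

```python
def spatial_gradient_map(frame, half_win=1):
    """Window-averaged spatial gradient magnitude of one grayscale frame.

    `frame` is a list of rows of pixel intensities (assumed layout).
    """
    h, w = len(frame), len(frame[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Central differences with replicated borders (assumption).
            gx = frame[y][min(x + 1, w - 1)] - frame[y][max(x - 1, 0)]
            gy = frame[min(y + 1, h - 1)][x] - frame[max(y - 1, 0)][x]
            mag[y][x] = abs(gx) + abs(gy)
    avg = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(y - half_win, 0), min(y + half_win + 1, h))
            xs = range(max(x - half_win, 0), min(x + half_win + 1, w))
            total = sum(mag[j][i] for j in ys for i in xs)
            avg[y][x] = total / (len(ys) * len(xs))
    return avg


def stgm(frames, half_win=1):
    """Per-pixel maximum temporal change of the spatial gradient map,
    taken between consecutive even numbered frames of one GoF."""
    even = frames[::2]  # frames 0, 2, 4, ... (assumed indexing)
    maps = [spatial_gradient_map(f, half_win) for f in even]
    h, w = len(frames[0]), len(frames[0][0])
    out = [[0.0] * w for _ in range(h)]
    for prev, cur in zip(maps, maps[1:]):
        for y in range(h):
            for x in range(w):
                out[y][x] = max(out[y][x], abs(cur[y][x] - prev[y][x]))
    return out
```

For a completely static GoF the map is zero everywhere, while pixels near a moving edge receive positive values, which is the property the bounding box search relies on.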

For the case of multiple MOOIs in the GoF, after the first MOOI (denoted as MOOI-1) and its bounding box are defined, the bounding box for the next MOOI (i.e., MOOI-2) can be found by repeating the alternating optimization process of (4) and (5). This time, to avoid detecting an already-found bounding box again, we set the pixel values of the STGM under the previously determined bounding boxes to zero before we search for the next MOOI (i.e., MOOI-2). This search for the next MOOIs and their optimal bounding boxes continues until the sum of all pixel values of the STGM under the bounding box is less than a threshold . After all moving objects and their bounding boxes in the GoF are found, we have multiple MOOIs from MOOI-1 to MOOI-.
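The find, suppress, and repeat loop for multiple MOOIs can be illustrated with the following sketch. It is deliberately simplified: instead of the alternating optimization of (4) and (5), it uses an exhaustive search for a box of fixed size (`bh` by `bw`), so only the zeroing step and the stopping rule correspond to the text. All function and parameter names are hypothetical.

```python
def box_sum(m, top, left, bh, bw):
    """Sum of map values inside the box with the given top-left corner."""
    return sum(m[y][x]
               for y in range(top, top + bh)
               for x in range(left, left + bw))


def best_box(m, bh, bw):
    """Exhaustive search for the fixed-size box with the largest sum.

    Stand-in for the alternating center/size optimization (assumption).
    """
    h, w = len(m), len(m[0])
    best = None
    for top in range(h - bh + 1):
        for left in range(w - bw + 1):
            s = box_sum(m, top, left, bh, bw)
            if best is None or s > best[0]:
                best = (s, top, left)
    return best


def detect_mooi_boxes(stgm_map, bh, bw, stop_energy):
    """Return boxes (top, left, height, width) in order of detection."""
    if stop_energy <= 0:
        raise ValueError("stop_energy must be positive")
    m = [row[:] for row in stgm_map]  # work on a copy
    boxes = []
    while True:
        s, top, left = best_box(m, bh, bw)
        if s < stop_energy:  # remaining energy too small: stop searching
            break
        boxes.append((top, left, bh, bw))
        # Zero the STGM under the found box so it is not detected again.
        for y in range(top, top + bh):
            for x in range(left, left + bw):
                m[y][x] = 0.0
    return boxes
```

The positive stopping threshold guarantees termination for a nonnegative map, since every accepted box removes some energy from the working copy.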

#### 3. Linear Transformations for Nonuniform Size Reduction

After all MOOIs and their bounding boxes in each GoF are determined, we can reduce the sizes of the MOOIs and non-MOOIs nonuniformly. In this paper, to treat the inside and outside regions of the bounding boxes differently and to speed up the size reduction process, linear transformations with different slopes for MOOIs and non-MOOIs are applied. To squeeze the original frame of to ( and ) we first apply 1D linear transformations to reduce the number of rows from to . Subsequently, the number of columns is reduced from to to have the squeezed frame of . Since these sequential 1D reductions for the rows and columns are similar, we describe only the row reduction in this section.

For each row we need a linear mapping function to convert the original row index to . Depending on the existence of the MOOI at or not, the slope of the linear function takes either for the MOOI or for the non-MOOI (see Figure 3). So, the slopes control the amount of the size reduction between the MOOI and the non-MOOI and we have . Specifically, for the th MOOI (i.e., MOOI-), , we denote as the absolute distance from the center of MOOI-, , to the row index in the original frame and represents one-half of the vertical size of the MOOI-. Also, denotes the row index of the center of the MOOI- at the reduced frame. Note that the index of the MOOI- is assigned sequentially from the left to the right of the image space and we start the linear mapping with MOOI-1 by the following linear transformation for each row in : Then, for the next rows in with we have the following mapping function : Finally, for the rows in the last MOOI- we have the mapping function as follows: Given the reduction rate for the MOOIs, the reduction rate for the non-MOOI should be determined by considering the overall reduction ratio from to as well as . So, for a single MOOI, is given by For multiple MOOIs is calculated as follows: The index of the center of the MOOI is also changed from to after the reduction. Specifically, the indices for the first MOOI and the next MOOIs are given as the following equations, respectively:

**Figure 3:** panels (a) and (b).
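As an illustration of the two-slope row mapping, the sketch below derives the non-MOOI slope from the overall reduction and evaluates a piecewise-linear index map. It assumes that each MOOI occupies a row interval `(start, end)` (center minus and plus the half size), that the intervals are sorted and do not overlap, and that the reduced height equals the inside slope times the MOOI rows plus the outside slope times the remaining rows. The names `outside_slope` and `forward_map` are hypothetical, and the formulas are one reading of the description rather than the paper's exact equations.

```python
def outside_slope(n_rows, n_reduced, intervals, inside_slope=1.0):
    """Slope for non-MOOI rows so that n_rows maps onto n_reduced rows."""
    inside = sum(end - start for start, end in intervals)
    if inside >= n_rows:
        raise ValueError("MOOI rows must not cover the whole frame")
    slope = (n_reduced - inside_slope * inside) / (n_rows - inside)
    if slope <= 0:
        raise ValueError("target size is too small for the MOOI slope")
    return slope


def forward_map(y, intervals, inside_slope, out_slope):
    """Map an original row coordinate y to the reduced row coordinate."""
    mapped, pos = 0.0, 0.0
    for start, end in intervals:
        if y <= start:
            break
        mapped += out_slope * (start - pos)             # non-MOOI rows
        mapped += inside_slope * (min(y, end) - start)  # MOOI rows
        pos = end
        if y <= end:
            return mapped
    return mapped + out_slope * (y - pos)
```

For example, with 100 rows reduced to 70, a single MOOI on rows 40 to 60, and an inside slope of 1, the outside slope is 0.625, and rows 40, 60, and 100 map to 25, 45, and 70, so the 20 MOOI rows keep their full height.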

In practice, bounding boxes of MOOIs can overlap horizontally and/or vertically. Specifically, we regard bounding boxes as overlapping when their 1D projections overlap. This is because 1D linear transformations are applied to rows and columns separately. Figure 4 shows an example in which the bounding boxes overlap horizontally. To deal with the overlapping bounding boxes, the boundaries of MOOI-2 and MOOI-3 are merged; that is, new left and right boundaries and center are determined as , , and , respectively. The row reduction is then performed using (8)–(14) for MOOI-1 and the merged MOOI. The column reduction is performed in a similar manner.
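A minimal sketch of the merging step for overlapping 1D projections is given below. It assumes that the merged boundaries are the smallest start and the largest end of the overlapping projections and that the new center is their midpoint; the function names are hypothetical.

```python
def merge_overlapping(intervals):
    """Merge 1D projections (start, end) that overlap each other."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous projection: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def center_and_half_size(interval):
    """Center and half size of a (start, end) projection."""
    start, end = interval
    return (start + end) / 2, (end - start) / 2
```

With projections (10, 20), (40, 60), and (55, 80), the last two are merged into (40, 80), whose center is 60 and half size is 20, while the first is left unchanged.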

The goal of the linear transformations is to preserve the original image data in the MOOIs as much as possible after the size reduction, while achieving most of the size reduction in the non-MOOI. Therefore, we first set and adjust () to meet the requirement of the size reduction. After the transformations of (8)–(10), the values at the integer-valued indices of the reduced rows are determined by interpolation from the actual mapped indices. After the row reduction, the same transformation-interpolation process is applied to the columns to complete the size reduction.

After the size reduction, the conventional H.264/AVC is used to further compress the size-reduced frames, and the compressed bit stream is sent to the receiver. At the receiver, after the compressed bit stream is decoded, the decompressed frames are expanded to the original size by the inverse transformation-interpolation for the columns and the rows sequentially. The sizes of the bounding boxes, their centers, and the size reduction rate of the MOOIs are sent to the decoder as side information for the size expansion. Note that we can use the same GoF boundaries as those of the H.264/AVC.
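The reduction and expansion round trip can be sketched on a 1D signal, which stands in for one column of pixel values. The sketch assumes linear interpolation, an inside slope of 1, sorted non-overlapping MOOI intervals, and a simple coordinate convention in which positions past the last sample are clamped; the H.264/AVC stage between the two steps is omitted. All names (`knots`, `interp`, `sample`, `shrink`, `expand`) are hypothetical.

```python
def knots(n, n_reduced, intervals, inside_slope=1.0):
    """Breakpoints of the piecewise-linear map: original vs. reduced."""
    inside = sum(end - start for start, end in intervals)
    out_slope = (n_reduced - inside_slope * inside) / (n - inside)
    xs, ys = [0.0], [0.0]
    for start, end in intervals:
        ys.append(ys[-1] + out_slope * (start - xs[-1]))
        xs.append(float(start))
        ys.append(ys[-1] + inside_slope * (end - start))
        xs.append(float(end))
    ys.append(ys[-1] + out_slope * (n - xs[-1]))
    xs.append(float(n))
    return xs, ys


def interp(x, xp, fp):
    """Piecewise-linear interpolation through the points (xp, fp)."""
    for i in range(1, len(xp)):
        if x <= xp[i] and xp[i] > xp[i - 1]:
            return fp[i - 1] + (x - xp[i - 1]) * (fp[i] - fp[i - 1]) / (xp[i] - xp[i - 1])
    return fp[-1]


def sample(signal, pos):
    """Linearly interpolated value of signal at a fractional position."""
    pos = min(max(pos, 0.0), len(signal) - 1.0)
    i = min(int(pos), len(signal) - 2)
    t = pos - i
    return (1 - t) * signal[i] + t * signal[i + 1]


def shrink(signal, n_reduced, intervals):
    """Size reduction: sample the original at inverse-mapped positions."""
    xs, ys = knots(len(signal), n_reduced, intervals)
    return [sample(signal, interp(r, ys, xs)) for r in range(n_reduced)]


def expand(reduced, n, intervals):
    """Size expansion at the receiver, using the same side information."""
    xs, ys = knots(n, len(reduced), intervals)
    return [sample(reduced, interp(y, xs, ys)) for y in range(n)]
```

In this idealized setting, without compression loss in between, the samples inside the MOOI interval are recovered unchanged after the round trip, whereas the non-MOOI samples are only approximated by interpolation.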

#### 4. Experimental Results

Our experiments were conducted to demonstrate two aspects: (i) the accuracy of the proposed MOOI detection with the bounding box and (ii) the usefulness of the proposed MOOI detection. The accuracy of the proposed MOOI detection with the bounding box is judged by visual comparison with the previous inter-frame-based Kalman filtering approach [11]. The usefulness of the proposed MOOI detection is demonstrated by applying the detected MOOI to content-aware image resizing, in comparison with the LMSE method [18], and to image size reduction as a preprocessing step for H.264/AVC compression.

The surveillance video sequences [21, 22] were used to evaluate the performance of the proposed method. In all our experiments, the parameters are predetermined and fixed as follows: , ,, , , . The threshold parameters affect the accuracy of the bounding box detection. The users can interact with the system by adjusting these parameter values. For comparisons, the proposed method and the LMSE method in Vo et al. [18] were applied to reduce the size of the video frames before we apply H.264/AVC. Then, the visual qualities after the H.264/AVC decompression and the size expansion are compared.

The proposed bounding box detection method is also compared with the moving object detection and tracking method [11]. As shown in Figure 5, the Kalman filtering method [11] tends to detect only the part that moves between consecutive frames, not the whole body of the moving object. This demonstrates the benefit of formulating our STGM-based cost function over a GoF. For the case of multiple MOOIs, Figure 6 shows the order of MOOI detection from the first MOOI (i.e., MOOI-1 in Figure 6(a)) at the leftmost side of the image to the last MOOI (i.e., MOOI-3 in Figure 6(c)) at the rightmost side of the image in a GoF. This demonstrates the extension of our previous work [20] to the problem of multiple MOOIs. Since our bounding box is determined on the basis of the GoF, the first bounding box of the walking person includes all pixels along the motion trajectories from the first frame (Figure 6(e)) to the last one (Figure 6(f)) of the GoF.

**Figure 5:** panels (a) and (b).

**Figure 6:** panels (a)–(f).

Once the MOOIs are detected with the bounding boxes, we can differentiate the image regions of the moving objects from the remaining image regions of nonmoving objects. This allows us to treat MOOIs and non-MOOIs separately for image size reduction. That is, we can nonuniformly reduce the size of the image frames in the video before compression such that the non-MOOIs are the major target of the size reduction. Frame size reductions of 30% and bitrates of 50–200 kbps were tested for visual comparison of the MOOIs after decompression and size expansion. Figure 7 and Figure 8 show the results of the size reduction by the LMSE method in Vo et al. [18] and by our method for a single MOOI and three MOOIs, respectively. As shown in the figures, the moving objects are almost intact after the size reduction by the proposed method. Figure 9, for the Stair sequence in the database [21], demonstrates the differences more clearly. As one can see, the proposed method outperforms the LMSE method inside the MOOI regions in terms of PSNR and visual quality. Figure 10 compares the numerical results by rate-distortion graphs. Although our method yields a slightly lower PSNR than the LMSE method over the whole image, inside the MOOI regions it achieves PSNRs 3–4 dB higher than both the LMSE method and H.264/AVC compression without size reduction at bitrates below the critical bitrate of 150 kbps.
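The distinction between whole-image PSNR and PSNR inside an MOOI region can be made concrete with the following sketch. It assumes 8-bit grayscale frames stored as lists of rows and a box given as (top, left, height, width); the function name and its parameters are hypothetical and do not describe the evaluation code used in the experiments.

```python
import math


def psnr(ref, test, box=None, peak=255.0):
    """PSNR in dB over the whole frame, or only inside `box` if given."""
    h, w = len(ref), len(ref[0])
    top, left, bh, bw = box if box is not None else (0, 0, h, w)
    sq_err = 0.0
    for y in range(top, top + bh):
        for x in range(left, left + bw):
            d = ref[y][x] - test[y][x]
            sq_err += d * d
    mse = sq_err / (bh * bw)
    # Identical regions have zero error and therefore infinite PSNR.
    return math.inf if mse == 0 else 10 * math.log10(peak * peak / mse)
```

Restricting the error sum to the bounding box isolates the quality of the moving object, so a distortion that falls only in the background lowers the whole-image value but leaves the MOOI value untouched.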

**Figure 7:** panels (a)–(c).

**Figure 8:** panels (a)–(c).

**Figure 9:** panels (a)–(d).

**Figure 10:** panels (a)–(d).

#### 5. Conclusions

An optimal bounding box detection method for the moving object of interest (MOOI) has been proposed. Multiple MOOIs as well as a single one can be detected automatically by the proposed method. Once the bounding boxes are identified, one can treat the MOOI and the non-MOOI differently to preserve the visual quality of the important MOOIs. Linear transformations with different slopes are used to nonuniformly reduce the sizes of the MOOIs and the non-MOOI. Our size reduction method can be applied as an initial compression step for the H.264/AVC video compression standard. Experimental results show that the H.264/AVC-decompressed videos obtained with the proposed method yield PSNRs for the MOOI about 3 dB higher than those of the LMSE method.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgment

This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Republic of Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2014-H0301-14-4007) supervised by the NIPA (National IT Industry Promotion Agency).