Mathematical Problems in Engineering

Volume 2017, Article ID 5046727, 11 pages

https://doi.org/10.1155/2017/5046727

## Low-Rank and Sparse Based Deep-Fusion Convolutional Neural Network for Crowd Counting

PLA University of Science and Technology, Nanjing, Jiangsu, China

Correspondence should be addressed to Zhisong Pan; hotpzs@hotmail.com

Received 14 March 2017; Revised 10 July 2017; Accepted 25 July 2017; Published 25 September 2017

Academic Editor: Suzanne M. Shontz

Copyright © 2017 Siqi Tang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

This paper proposes an accurate crowd counting method based on a convolutional neural network (CNN) and low-rank and sparse structure. To this end, we first propose an effective deep-fusion convolutional neural network to improve the accuracy of density map regression. Furthermore, we observe that most existing CNN-based crowd counting methods obtain the overall count by direct integral of the estimated density map, which limits counting accuracy. Instead of direct integral, we adopt a regression method based on a low-rank and sparse penalty to improve the accuracy of the projection from density map to global counting. Experiments demonstrate the importance of this regression process for crowd counting performance. The proposed low-rank and sparse based deep-fusion convolutional neural network (*LFCNN*) outperforms existing crowd counting methods and achieves state-of-the-art performance.

#### 1. Introduction

Recent years have witnessed many extensively crowded scenes, such as concerts, political speeches, ceremonies, marathons, and tourist spots. The crowd counting problem, as a machine learning and computer vision problem, takes a single image or a surveillance frame as input and aims to estimate how many people are in it. It is of significant importance to public security and automatic surveillance [1]. Though tremendous strides have been made in crowd counting, it remains a challenge due to severe occlusion, varying perspective distortion, and diverse crowd densities.

To address these problems and improve the accuracy of crowd counting, many methods have been proposed in the literature. This paper is not the first to leverage a convolutional neural network (CNN) model for crowd counting. However, most CNN-based crowd counting methods adopt a two-stage pipeline: crowd density estimation with an end-to-end deep network, followed by direct integral to obtain the global counting. This pipeline accumulates errors and limits counting accuracy. To solve this problem, we propose a low-rank and sparse based deep-fusion convolutional neural network (*LFCNN*), which adopts a regression process based on a low-rank and sparse penalty instead of the direct integral.

##### 1.1. Contributions

In this paper, we aim to promote the accuracy of crowd counting methods from a single image. Motivated by the density map regression architecture and feature-based global regression architecture, we propose the low-rank and sparse based deep-fusion convolutional neural network for crowd counting, which contains two key components: deep-fusion network for density map regression and a subsequent regression method to map the estimated density map to global counting. As the spatial information and the global counting of crowds are, respectively, used by the two steps, the images are projected step by step, from surveillance frame to gray-scale density map and ultimately to global counting. The contributions of the proposed method can be summarized as follows.

To improve the accuracy of density map regression, inspired by the inception structure of GoogLeNet [2], we propose a deep-fusion network structure to capture multiscale targets in crowded images. In each inception unit of our deep-fusion network, conv layers with filters of various sizes and numbers are used as base networks to achieve robustness to variation in people's sizes. At the end of each inception unit, the feature maps are concatenated so that the intermediate representations of the base networks are combined. As a result, the deep-fusion network contains more models with more receptive fields and thus obtains more accurate density map regression results for people of various sizes in surveillance frames. Compared with GoogLeNet, the structure of our fusion network is shallower and simpler, which makes it more suitable for the density map regression task.

To improve the accuracy of global counting regression, which is the ultimate objective of crowd counting methods, we adopt least squares regression with a low-rank and sparse penalty to project the estimated density map to the global counting, instead of the direct integral process adopted by most existing CNN-based crowd counting methods. The intuition is simple: the estimated density maps are coarse, with abundant errors and ambiguity, so a subsequent regression model is needed to eliminate the errors and obtain the overall count. Inspired by low-rank and sparse learning, which builds upon the theory that signals should contain a low-rank part and a sparse part, we impose a low-rank and sparse penalty on the estimated density map. We then solve the problem by transforming the penalty on the density map into a penalty on the regression parameters. Compared with other regression methods, such as Ridge Regression and LSSVM, our proposed method can also be viewed as fine-tuning the estimated density map based on the assumption that an accurate density map should contain a low-rank structure and a sparse structure.

Experiments on large-scale crowd counting datasets demonstrate that, to our knowledge, *LFCNN* can outperform other methods and achieve state-of-the-art performance in crowd counting applications.
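The difference between the direct integral and a learned projection from density map to count can be illustrated with a deliberately simplified sketch. It is not the low-rank and sparse regression of this paper: the learned projection here is a single least-squares scale factor, and the function names, the toy density maps, and the 20% underestimation are all hypothetical values chosen for illustration.

```python
# Toy comparison of direct integral versus a learned projection.
# Assumption: a one-parameter least-squares scale stands in for the
# paper's low-rank and sparse regression, which is not reproduced here.

def direct_integral(density_map):
    """Count estimate obtained by summing every pixel of the density map."""
    return sum(sum(row) for row in density_map)

def fit_scale(estimated_maps, true_counts):
    """Closed-form least squares for one scale a that minimises
    sum_i (a * integral_i - count_i)^2."""
    integrals = [direct_integral(m) for m in estimated_maps]
    num = sum(s * c for s, c in zip(integrals, true_counts))
    den = sum(s * s for s in integrals)
    return num / den

def toy_map(true_count):
    """Hypothetical 2x2 estimated map that underestimates density by 20%."""
    return [[0.8 * true_count / 4.0] * 2 for _ in range(2)]

true_counts = [10.0, 20.0, 40.0]
estimated_maps = [toy_map(c) for c in true_counts]
scale = fit_scale(estimated_maps, true_counts)

held_out_map = toy_map(30.0)  # the true count of this map is 30
print(direct_integral(held_out_map))          # inherits the systematic error
print(scale * direct_integral(held_out_map))  # corrected by the learned scale
```

Because the direct integral has no trainable parameter, any systematic error in the estimated map passes straight into the count, whereas even a single learned parameter can use the global counting as supervision to correct it.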

#### 2. Related Works

##### 2.1. Crowd Counting

Existing crowd counting methods can be divided into location-based methods and regression-based methods.

The location-based methods rest on the premise that a crowd is composed of single targets which can be detected and then counted. These methods attempt to locate every single person by detector scanning [3, 4], tracking, and trajectory clustering [5] before obtaining the counting result. However, in extensively crowded scenes, people tend to overlap with one another and can hardly be detected precisely, which leads to relatively severe errors in the counting result.

Another popular crowd counting pipeline, the regression-based method, treats the whole crowd instead of a single person as the target and avoids the challenging task of detecting individual persons. These methods, which are more suitable for extensively crowded scenes, can be further divided into two categories: global counting regression methods [6–8] and density map regression methods [9–14].

The global feature-based regression pipeline usually contains three successive steps: (1) foreground segmentation, (2) feature extraction, and (3) crowd counting regression. Pixel features [6], texture features [7], and integrated features [8, 15] were utilized, and regression models such as Gaussian Process [8], Ridge Regression [16], and neural network and random forest [15] were adopted to achieve better performance. Despite the effectiveness of these methods, using only the global counting as the supervision signal, without the spatial information of the crowds, largely limited their accuracy.

Compared with global feature-based regression methods, density map regression methods, proposed by [9], further promoted the accuracy of crowd counting by utilizing the crowd’s spatial information contained in density map, which was calculated by the position of each person and denoted the crowd density of a local area. Following this pipeline, [10, 17] promoted the counting accuracy using modified random forest algorithm as regression model of density map.

In recent years, with the success of convolutional neural networks (CNN) in image classification [18, 19], detection [20, 21], segmentation [22], and pedestrian detection [23], the CNN model has also been leveraged by crowd counting methods. Zhang et al. [11] first proposed a CNN-based density map regression method, called *Patch-CNN*, and demonstrated significant improvement over methods based on hand-crafted features. Based on this pipeline, Zhang et al. [12] adopted three networks with various kernel sizes to construct *MCNN*, which was more adaptive to variations in person size. Inspired by combining high-level semantic information and low-level detailed features, Boominathan et al. [13] combined deep and shallow networks to construct *Long-short CNN* as the density map regression network. Aiming to solve the multiscale problem of person size, Onoro-Rubio and López-Sastre [14] proposed *Hydra-CNN*, which uses a pyramid of image patches at multiple scales to train multiple networks and benefits from the integration of multiple models. Another attempt to improve counting accuracy by model ensemble is *Boost-CNN* [24], which applied boosting to the density map regression CNN model. To sum up, the success of these methods can be attributed to two reasons: the automatic learning ability of end-to-end density map regression networks and the use of spatial information in density maps. These methods attempted to improve the accuracy of density map regression by adopting more and more complicated network structures. However, as the global counting rather than the density map is the objective of counting methods, these methods are not truly trained end-to-end for global counting regression, and there is always a gap between the output of the network and the objective of the counting problem. The direct integral, adopted by most existing methods to project from density maps to global counting, accumulates the errors in the estimated density maps and limits counting accuracy.

The CNN-based counting regression methods, such as *Patch-count CNN* [25], *Patch-multitask CNN* [26], and *TSCCM* [27], applied fully connected layers to directly regress the number of people in image patches. Though these methods constructed an end-to-end network for the counting task, a surveillance frame needs to be cut into a large number of patches, each of which is counted by the network, which is fairly time-consuming. Moreover, without the spatial information in density maps, the accuracy of these methods is severely limited.

Though many crowd counting methods have been proposed as shown above, most existing ones, including the CNN-based ones, either regress the global counting without the spatial information or apply direct integral to the estimated density maps, which ignores the global counting as a supervision signal. By adopting a CNN-based density map estimation architecture together with a learning process that projects the density map to the overall count, our model differs from the existing methods and benefits from both the spatial information and the global information of crowds.

##### 2.2. Low-Rank and Sparse Structure

Low-rank and sparse structures have been studied extensively in matrix completion, compressed sensing, and dimensionality reduction. Principal Component Analysis (PCA) [28] is based on the assumption that signals usually have low intrinsic complexity: they are low-rank or lie on some low-dimensional manifold. PCA applies a linear projection to seek such a low-rank representation by minimizing the error between the signal and the low-rank representation:

$$\min_{\mathbf{L}} \|\mathbf{X} - \mathbf{L}\|_F^2 \quad \text{s.t.} \quad \operatorname{rank}(\mathbf{L}) \le r, \quad \mathbf{X} = \mathbf{L} + \mathbf{N},$$

where $\mathbf{N}$ denotes the noise, $\mathbf{L}$ is the low-rank part of the signal $\mathbf{X}$, and $r$ is the target rank.

A variant of PCA [28], known as robust PCA (RPCA) [29, 30], is built upon the theory that the signal matrix has a low-rank structure and the noise is sparsely distributed, affecting only a fraction of the signal matrix entries.
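For reference, RPCA is commonly posed as the following convex program, usually called principal component pursuit. The symbols $\mathbf{X}$ (signal matrix), $\mathbf{L}$ (low-rank part), $\mathbf{S}$ (sparse noise), and the trade-off weight $\lambda > 0$ are notation assumed here for illustration:

$$\min_{\mathbf{L}, \mathbf{S}} \; \|\mathbf{L}\|_* + \lambda \|\mathbf{S}\|_1 \quad \text{s.t.} \quad \mathbf{X} = \mathbf{L} + \mathbf{S}.$$

The trace norm $\|\mathbf{L}\|_*$ acts as a convex surrogate for the rank of $\mathbf{L}$, and the entrywise $\ell_1$ norm $\|\mathbf{S}\|_1$ acts as a convex surrogate for the number of nonzero entries of $\mathbf{S}$.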

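Furthermore, Go Decomposition (GoDec) [31] proposed a low-rank + sparse decomposition of a signal, $\mathbf{X} = \mathbf{L} + \mathbf{S} + \mathbf{G}$, where $\mathbf{L}$ is a low-rank part with $\operatorname{rank}(\mathbf{L}) \le r$, $\mathbf{S}$ is a sparse part with at most $k$ nonzero entries, and $\mathbf{G}$ is the noise.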

Chalapathy et al. [32] applied a deep neural network to construct a robust nonlinear subspace that captures the majority of data points and detects anomalous instances, while allowing some data to have arbitrary corruption. As the deep network extends the robust PCA model to the nonlinear autoencoder setting, the nonlinearity helps discover potentially more subtle anomalies, which improves the robustness of the model.
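The low-rank + sparse decomposition idea discussed in this subsection can be sketched with a naive alternating scheme in the spirit of GoDec: fit a low-rank part to the signal minus the current sparse part, then keep the largest residual entries as the sparse part. The sketch below is a simplification under stated assumptions: the rank is fixed to 1, the rank-1 fit uses plain power iteration rather than the randomized projections of GoDec, and the matrix, the corruption, and all function names are hypothetical.

```python
# Naive rank-1 + sparse decomposition by alternating projections.
# Assumptions: rank fixed to 1, power iteration instead of randomized
# projections, and a toy matrix with a single corrupted entry.

def sub(A, B):
    """Entrywise difference A - B."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def frob(A):
    """Frobenius norm of A."""
    return sum(x * x for row in A for x in row) ** 0.5

def rank1_approx(A, iters=50):
    """Approximate best rank-1 fit of A via power iteration on A^T A."""
    m, n = len(A), len(A[0])
    v = [1.0] * n
    for _ in range(iters):
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]  # u = A v
        v = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]  # v = A^T u
        norm = sum(x * x for x in v) ** 0.5
        if norm == 0.0:
            return [[0.0] * n for _ in range(m)]
        v = [x / norm for x in v]
    u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    return [[u[i] * v[j] for j in range(n)] for i in range(m)]

def keep_largest(A, k):
    """Keep the k entries of A with largest magnitude and zero the rest."""
    m, n = len(A), len(A[0])
    ranked = sorted(((abs(A[i][j]), i, j) for i in range(m) for j in range(n)),
                    reverse=True)
    S = [[0.0] * n for _ in range(m)]
    for _, i, j in ranked[:k]:
        S[i][j] = A[i][j]
    return S

def low_rank_plus_sparse(X, k, rounds=20):
    """Alternate between the rank-1 part L and the k-sparse part S."""
    S = [[0.0] * len(X[0]) for _ in range(len(X))]
    for _ in range(rounds):
        L = rank1_approx(sub(X, S))
        S = keep_largest(sub(X, L), k)
    return L, S

# Rank-1 matrix u v^T with one large corruption added at position (0, 0).
u_vec, v_vec = [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 2.0, 2.0, 3.0]
X = [[a * b for b in v_vec] for a in u_vec]
X[0][0] += 10.0

L, S = low_rank_plus_sparse(X, k=1)
print(frob(sub(X, rank1_approx(X))))  # residual of a plain rank-1 fit
print(frob(sub(sub(X, L), S)))        # residual after separating S
```

Separating the sparse part lets the low-rank fit ignore the corrupted entry instead of being distorted by it, which is the property the low-rank and sparse penalty is meant to exploit.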

#### 3. Notation and Problem Definition

*Notation*. We use boldface lowercase letters like $\mathbf{x}$ to denote vectors. Boldface uppercase letters like $\mathbf{X}$ are used to denote matrices. $\|\cdot\|_1$ is used to denote the $\ell_1$ norm, and $\|\cdot\|_*$ is used to denote the trace norm. $\odot$ denotes the Hadamard product.

*Problem Definition*. Suppose that we have $N$ images, denoted with $\{(\mathbf{X}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{X}_i$ is the $i$th image and $y_i$ is the number of people in this image. Person $j$ in this image is denoted with $P_j$ and the location of person $P_j$ is $\mathbf{p}_j$. The density map is denoted by $\mathbf{D}$. More specifically, $\mathbf{D}^{\mathrm{gt}}$ denotes the density map calculated as the ground truth of the density map regression network, $\mathbf{D}^{\mathrm{est}}$ denotes the density map estimated by the CNN, and $\mathbf{D}^{\mathrm{lrs}}$ denotes the density map modified by the low-rank and sparse regression method.

#### 4. Deep-Fusion Density Map Regression Network

In this section, we illustrate the proposed deep-fusion network structure for crowd density map regression. The goal of the deep-fusion network is to learn a density map regression function $F: \mathbf{X} \mapsto \mathbf{D}$, where $\mathbf{X}$ is the surveillance image of an arbitrary scene and $\mathbf{D}$ is its crowd density map. We therefore first describe how the supervision signal, the ground-truth density map $\mathbf{D}^{\mathrm{gt}}$, is calculated, and then explain the deep-fusion network structure in some detail.

##### 4.1. Density Map Calculation

The first step is to calculate the density maps that serve as the training ground truth of the network. With the position of each pedestrian labeled, the true density map is determined by the pedestrians' locations, shapes, and perspective distortion. Due to severe occlusions, pedestrians' bodies overlap with each other, and the head is the main cue for judging whether a pedestrian is present in extensively crowded images. Our work therefore follows [9] and adopts a Gaussian kernel centered on the location of each pedestrian's head to represent that pedestrian in the calculated density map, as in

$$\mathbf{D}_{P_j}(\mathbf{p}) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{\|\mathbf{p} - \mathbf{p}_j\|_2^2}{2\sigma^2}\right),$$

where $P_j$ is the $j$th pedestrian, $\mathbf{p}_j$ is its location, and $\mathbf{p}$ is a pixel position. Ideally, the parameter $\sigma$ of the Gaussian kernel should correlate with the size of each head, which is influenced by the height and angle of the surveillance camera according to perspective distortion theory. Most scene-specific methods [8] obtain $\sigma$ by measuring the perspective distortion parameter of each scene as prior knowledge for the crowd counting model. However, for arbitrary scenes, measuring every single image to obtain its perspective distortion parameter is far too time-consuming and almost impossible. In our model, we define a global constant parameter $\sigma$ for all training images based on the average size of all the heads in the datasets.

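The density map of the whole image is calculated as a sum of the Gaussian kernels of all the pedestrians, as in

$$\mathbf{D}^{\mathrm{gt}}_i(\mathbf{p}) = \sum_{j=1}^{y_i} \delta(\mathbf{p} - \mathbf{p}_j) * G_\sigma(\mathbf{p}),$$

where $\mathbf{D}^{\mathrm{gt}}_i$ is the calculated density map of $\mathbf{X}_i$, $\delta(\cdot)$ stands for the impulse function, $\mathbf{p}_j$ is the head location of the $j$th of the $y_i$ pedestrians, $*$ denotes convolution, and $G_\sigma$ is the Gaussian kernel with parameter $\sigma$.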
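The density map calculation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the image size, the head coordinates, and the value of the global parameter `sigma` are hypothetical, and the kernel is evaluated directly on the pixel grid instead of by convolution.

```python
import math

def density_map(height, width, head_locations, sigma):
    """Sum of normalised 2-D Gaussian kernels, one per annotated head.

    head_locations holds (row, col) pixel coordinates; sigma is the single
    global kernel parameter shared by all heads.
    """
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    dmap = [[0.0] * width for _ in range(height)]
    for r0, c0 in head_locations:
        for r in range(height):
            for c in range(width):
                d2 = (r - r0) ** 2 + (c - c0) ** 2
                dmap[r][c] += norm * math.exp(-d2 / (2.0 * sigma ** 2))
    return dmap

# Hypothetical annotations: three heads in a 64 x 64 image.
heads = [(20, 20), (22, 24), (40, 45)]
dmap = density_map(64, 64, heads, sigma=3.0)

# Each kernel integrates to one, so the sum over all pixels stays close to
# the number of annotated heads when they lie well inside the image.
count = sum(sum(row) for row in dmap)
print(round(count, 3))
```

Because every kernel carries unit mass, the ground-truth map encodes where people are while preserving how many there are, which is why the direct integral of a perfect density map recovers the count.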

##### 4.2. Deep-Fusion Network

Scale variation is a significant problem in almost all current computer vision tasks, and especially in crowd counting owing to the perspective distortion of surveillance cameras [12]. Motivated by GoogLeNet [2], shown in Figure 1, which integrates several parallel conv layers with various receptive fields in an inception unit, we propose a deep-fusion network to handle the scale variation problem. The overall structure of our deep-fusion network is illustrated in Figure 2.
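The idea of an inception unit, parallel conv branches with different receptive fields whose feature maps are concatenated, can be sketched in a few lines. The sketch is only schematic: the kernel sizes 1, 3, and 5, the fixed averaging kernels that stand in for learned weights, and the single-channel input are assumptions made for illustration and do not reproduce the configuration in Figure 2.

```python
def conv2d_same(image, kernel):
    """Single-channel 2-D cross-correlation with zero padding ('same' size)."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[i][j] * image[rr][cc]
            out[r][c] = acc
    return out

def box_kernel(k):
    """Hypothetical fixed averaging kernel standing in for learned weights."""
    return [[1.0 / (k * k)] * k for _ in range(k)]

def inception_unit(image, kernel_sizes=(1, 3, 5)):
    """Run one branch per kernel size and concatenate along the channel axis."""
    return [conv2d_same(image, box_kernel(k)) for k in kernel_sizes]

# Toy 8 x 8 single-channel input.
image = [[float((r + c) % 4) for c in range(8)] for r in range(8)]
features = inception_unit(image)

# Channels, height, width: one channel per branch, spatial size preserved.
print(len(features), len(features[0]), len(features[0][0]))
```

Zero padding keeps every branch at the input resolution, which is what allows the feature maps of branches with different receptive fields to be concatenated channel-wise.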