Abstract

Height measurement for moving pedestrians is significant in many scenarios, such as pedestrian positioning, criminal suspect tracking, and virtual reality. Although some existing methods can measure the height of static people, it is hard to measure height accurately for moving pedestrians. Considering the height fluctuations in dynamic situations, this paper proposes a real-time height measurement method based on the Time-of-Flight (TOF) camera. Each depth image in a continuous sequence is processed to obtain the real-time height of the moving pedestrian. Firstly, a normalization equation is presented to convert the depth image into a grey image for a lower time cost and better performance. Secondly, a difference-particle swarm optimization (D-PSO) algorithm is proposed to remove the complex background and reduce the noise. Thirdly, a segmentation algorithm based on maximally stable extremal regions (MSERs) is introduced to extract the pedestrian head region. Then, a novel multilayer iterative average (MLIA) algorithm is developed to obtain the height of dynamic pedestrians. Finally, Kalman filtering is used to improve the measurement accuracy by combining the current measurement with the height at the previous moment. In addition, the VICON system is adopted as the ground truth to verify the proposed method, and the results show that our method can accurately measure the real-time height of moving pedestrians.

1. Introduction

Whether in reality or in a virtual scene, it is crucial to evaluate the height of moving pedestrians. Although there are many works related to dynamic pedestrians, such as detection and recognition [1–3], positioning [4–6], and tracking [7–9], it is still a serious challenge to measure human height accurately in the dynamic case. As a vital state attribute of pedestrians, height can not only help locate dynamic pedestrians or track criminal suspects in reality but also help people dispense with 3D glasses or helmets in virtual scenes [10]. For example, Nilsson et al. adopted the pedestrian height and the positions of the feet as constraint factors to design a Kalman filter model [11, 12], which can be used for pedestrian positioning and navigation.

In the past years, some height detection methods have been developed. Chen et al. developed a novel action-based pedestrian recognition method [13], which could obtain a rough height. Pneumatic sensor-based height measurement methods have also been developed to detect the pedestrian height [14, 15]. However, these methods did not take height measurement as the main research content, which leads to an overall low accuracy. Besides, a significant issue is often ignored: the pedestrian height changes while the pedestrian is walking, which may reduce the accuracy. A few state-of-the-art motion tracking systems (MTS), such as VICON, can precisely detect the real-time height of a dynamic pedestrian [16]. Sheng et al. used the VICON motion capture system, one of the popular spatial positioning systems in the world, to evaluate human behaviours by analysing pedestrian attributes including height [17]. However, the installation and maintenance costs of an MTS are extremely high. Therefore, it is necessary to develop a cheap system that accurately measures the real-time height of moving pedestrians.

With the rapid development of visual sensing technology, the TOF camera is widely used in many fields, such as robot research [18–20], object detection [21–23], 3D reconstruction, and gesture recognition [24, 25], due to its compact structure and stable characteristics. In this paper, the TOF camera is adopted as the input, and a real-time height measurement method is developed for moving pedestrians; the flowchart of the measurement is shown in Figure 1. Each depth image in a continuous sequence is processed by the proposed algorithms to obtain the real-time height of the moving pedestrian. The algorithm can be roughly divided into two steps: image processing and data processing.

Image processing is dedicated to extracting the region of interest (ROI), that is, the head region. When the TOF camera is adopted, the depth values in depth images may be large because of the huge conversion ratio between the actual distance in the world coordinate and the depth data in the image coordinate. Also, the gap between the maximum and minimum depth values is massive, which is not conducive to the subsequent extraction of the ROI. To reduce computation and improve efficiency, a normalization equation is developed in the paper. Then, a D-PSO evolutionary algorithm is developed to reduce the effects of the complicated background. In the D-PSO, the difference part is dedicated to removing the complex background in the surroundings, while the PSO part handles the noise that appears after applying the difference part. After that, an MSER-based segmentation algorithm is adopted to extract the head region.

Data processing is devoted to height calculation and correction. In this step, a novel multilayer iterative average algorithm based on the actual situation is proposed to remove the outliers and possible noise among the head data. Then, the pinhole model proposed in our previous work [26] is adopted to allow our method to work for pedestrians who are not vertically below the TOF camera. After that, considering the continuity of the height change of the moving pedestrian, Kalman filtering is adopted to combine the current measurement and the previous height to improve the accuracy. In addition, the VICON system, whose measurement accuracy reached 0.01 mm [27], is used as the ground truth to verify the proposed method.

To this end, our main contributions are listed as follows:
(1) A real-time height detection method is developed for dynamic pedestrians. It considers the fluctuation of height while the pedestrian is walking, which is scarcely mentioned in existing papers.
(2) A new D-PSO algorithm is proposed to reduce the effects of the complicated background.
(3) A novel multilayer iterative average algorithm is developed to remove the outliers and possible noise for better performance.
(4) Kalman filtering is used to improve the accuracy of the real-time height by combining the current measurement and the data at the previous moment.

The rest of this paper is organised as follows. Section 2 presents a real-time extraction for head region. Section 3 shows a real-time correction and estimation for pedestrian height. Some experiments are introduced in Section 4 to show the feasibility and good performance of the proposed method, while the conclusion and further work are shown in Section 5.

2. Real-Time Extraction for the Pedestrian Head Region

The pedestrian head data are used in this paper to calculate the real-time height of dynamic pedestrians. Each frame of depth image in a continuous sequence is addressed by the following algorithms to extract the pedestrian head region.

2.1. Normalization

According to the imaging principle of the TOF camera, the depth value in depth image is large, and the gap among depth data is massive. Figures 2(a) and 2(b) show the depth images captured by TOF camera; the depth images are shown in Hue, Saturation, and Value (HSV) format for clarity; different colours represent different distances. Figure 2(a) is the background depth image captured in advance, while Figure 2(b) shows the depth image with the pedestrian.

To reduce computation and improve the efficiency of subsequent algorithms, the depth image is converted into a grey image by (1). Figures 2(c) and 2(d) are the grey images (images with pixel values between 0 and 255) corresponding to Figures 2(a) and 2(b), respectively:

$$g = 255 \times \frac{d - d_{\min}}{d_{\max} - d_{\min}}, \tag{1}$$

where $g$ is the pixel value of a point in the grey image corresponding to the depth value $d$ in the depth image, and $d$ is only related to the characteristics of the camera and the distance from the head to the camera; $d_{\max}$ and $d_{\min}$ are the maximum and minimum depth values in the depth image. Different depth values correspond to different pixel values between 0 and 255. The larger the depth value, the larger the pixel value.
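As an illustration of this normalization step, the following minimal sketch maps a depth frame to an 8-bit grey image by min-max scaling. The function name `depth_to_grey`, the rounding rule, and the handling of a flat image are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def depth_to_grey(depth: np.ndarray) -> np.ndarray:
    """Min-max normalise a depth image to grey values in 0..255."""
    d = depth.astype(np.float64)
    d_min, d_max = d.min(), d.max()
    if d_max == d_min:
        # Assumed behaviour for a flat image: avoid dividing by zero.
        return np.zeros(depth.shape, dtype=np.uint8)
    grey = 255.0 * (d - d_min) / (d_max - d_min)
    return np.round(grey).astype(np.uint8)

# Hypothetical raw depth values; larger depth maps to a larger grey value.
depth = np.array([[1000, 2000], [3000, 5000]])
grey = depth_to_grey(depth)
```

The smallest depth maps to 0 and the largest to 255, so the ordering of depth values is preserved in the grey image.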

2.2. Difference-Particle Swarm Optimization (D-PSO) Denoising

It is hard to obtain accurate head data of the pedestrian because of the disturbance from the complicated background. The difference algorithm of equation (2) is used in this paper to mitigate the effects of the complex background. It takes the difference between the background grey image (such as Figure 2(c)) and the grey pedestrian image (such as Figure 2(d)) to give the difference result (such as Figure 2(e)).
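A minimal sketch of such a background difference is given below. It assumes an absolute per-pixel difference between two 8-bit grey images; whether the paper's equation (2) uses an absolute or a signed difference is not recoverable from the text, and the function name is hypothetical.

```python
import numpy as np

def background_difference(background: np.ndarray, frame: np.ndarray) -> np.ndarray:
    """Absolute per-pixel difference of two grey images (assumed form)."""
    # Widen to a signed type first so the subtraction cannot wrap around.
    diff = background.astype(np.int16) - frame.astype(np.int16)
    return np.abs(diff).astype(np.uint8)

background = np.full((2, 2), 200, dtype=np.uint8)
frame = np.array([[200, 120], [200, 210]], dtype=np.uint8)
result = background_difference(background, frame)
```

Pixels that match the background become 0, while pixels covered by the pedestrian (and noisy pixels) keep a nonzero value.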

The difference algorithm produces a large amount of noise while extracting the human body region successfully. For clarity, the 3D perspective view of Figure 2(e) is shown in Figure 2(f). To eliminate the effect of the noise, several common denoising algorithms have been applied with appropriate parameters. Figures 2(g)–2(j) show the results obtained by applying the common denoising algorithms to the grey image (Figure 2(e)). To compare these algorithms clearly, their results are also shown in 3D perspective view. In these figures, the strength of the noise can easily be seen from the value on the Pixel-axis, so this value can be used as a criterion for evaluating the denoising effect. Although these algorithms can reduce the influence of noise to some extent, they may also blur the target contour and damage the pixels in the head region, which is not conducive to the extraction of the head region. Figure 2(k) shows the result obtained by the two-stage PCA filtering algorithm proposed in [28]. It can be seen from Figures 2(f)–2(k) that, compared to the common filtering algorithms, PCA reduces noise better and has little influence on the target contour and head region. However, the average time consumed by the PCA algorithm is greater than 15 seconds, which is beyond our tolerance.

Particle swarm optimization (PSO), developed by Kennedy and Eberhart [29], is an evolutionary algorithm based on the study of bird or fish predation behaviour; it seeks a global optimal solution by following the best values found so far by the particles [30]. Because it is fast and does not require a manually set threshold, it has been widely used in the field of image processing [31–33] and has achieved excellent results. Thus, PSO is adopted here to remove the background noise. In the PSO algorithm, each particle travels in a multidimensional search space and adjusts its position based on its own experience and that of neighbouring particles [34]. The performance of each particle is evaluated by a predefined fitness function that encapsulates the core characteristics of the optimization problem.

In each iteration, every particle in the particle swarm gets its velocity and position by (3) and (4), respectively:

$$v_i^{k+1} = \omega^k v_i^k + c_1 r_1 \left(p_i - x_i^k\right) + c_2 r_2 \left(p_g - x_i^k\right), \tag{3}$$

$$x_i^{k+1} = x_i^k + v_i^{k+1}, \tag{4}$$

where $k$ is the current number of iterations, $x_i^k$ and $v_i^k$ are, respectively, the position and velocity of the $i$th particle in the particle swarm during the $k$th iteration, $r_1$ and $r_2$ are two random numbers in [0, 1], $\omega^k$ is the inertia weight in the $k$th iteration, $p_i$ is the optimal solution available for the $i$th particle, $p_g$ is the optimal solution currently available for all particles, and $c_1$ and $c_2$ are the individual and social learning factors, respectively, which are generally constant. As recommended by Kennedy and Eberhart [29], we define the learning factors $c_1 = c_2 = 2$. In this case, $r_1$ or $r_2$ multiplied by 2 has a mean of 1, so PSO can take into account both social learning and individual learning [35]. The scale of the particle swarm, called M, is directly related to the optimization result and the time consumption. A small scale may cause the PSO to fail to find the optimal solution, and a large scale will cause unnecessary time costs [36]. Considering these two points, the particle swarm scale is defined as M = 20.
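The velocity and position updates can be sketched for a one-dimensional swarm as follows. This is a minimal sketch of the standard PSO update rather than the authors' implementation; the function name `pso_step`, the per-particle random draws, and the example values are assumptions.

```python
import numpy as np

def pso_step(x, v, p_best, g_best, w, c1=2.0, c2=2.0, rng=None):
    """One standard PSO iteration: update all velocities, then all positions."""
    rng = np.random.default_rng() if rng is None else rng
    r1 = rng.random(x.shape)  # random numbers in [0, 1), one per particle
    r2 = rng.random(x.shape)
    v_new = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
    x_new = x + v_new
    return x_new, v_new

# Swarm of M = 20 particles, each holding one candidate grey threshold.
x = np.full(20, 100.0)
v = np.ones(20)
x_new, v_new = pso_step(x, v, p_best=x.copy(), g_best=x.copy(),
                        w=0.9, rng=np.random.default_rng(0))
```

In this example every particle already sits on both its personal best and the global best, so only the inertia term contributes and each velocity shrinks from 1 to 0.9.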

The larger the inertia weight $\omega$ is, the stronger the global optimization ability and the weaker the local optimization ability [37]. Conversely, a smaller inertia weight gives a stronger local optimization ability. In order to strike a balance between search speed and search accuracy, $\omega$ should not be a fixed constant. A nonlinear decreasing function for $\omega$ is adopted in the paper, as given in equation (5), where $\omega_{\max}$ and $\omega_{\min}$ are the predefined maximum and minimum inertia weights, respectively, $k$ and $k_{\max}$ are the current and maximum number of iterations, and $a$, $b$, and $m$ are adjustment factors of the polynomial. After trial and error, we define $\omega_{\max}$ = 0.9, $\omega_{\min}$ = 0.3, $k_{\max}$ = 100, a = 2, b = 0.6, and m = 10. The inertia weight curve corresponding to the above parameters is shown in Figure 3. It guarantees that PSO has a high global search ability in the early stage to get the appropriate seed and a higher local search ability in the later stage to improve the convergence accuracy.

Besides, we adopted the maximum interclass variance equation (6) as the fitness function in this paper. The larger the value of the fitness function is, the closer the particle is to the optimal solution:

$$f = w_0 w_1 \left(\mu_0 - \mu_1\right)^2, \tag{6}$$

where $w_0$ and $w_1$ are, respectively, the proportions of the foreground and background pixels in the image, and $\mu_0$ and $\mu_1$ represent, respectively, the average grayscale of the foreground and background.
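A minimal sketch of an interclass-variance fitness for one candidate threshold is shown below. The convention that pixels brighter than the threshold form the foreground, and the zero fitness returned when one class is empty, are assumptions of this example.

```python
import numpy as np

def interclass_variance(grey: np.ndarray, threshold: float) -> float:
    """Interclass variance w0 * w1 * (mu0 - mu1)^2 for one threshold."""
    pixels = grey.ravel().astype(np.float64)
    fg = pixels[pixels > threshold]   # assumed foreground: brighter pixels
    bg = pixels[pixels <= threshold]
    if fg.size == 0 or bg.size == 0:
        return 0.0                    # degenerate split carries no information
    w0 = fg.size / pixels.size
    w1 = bg.size / pixels.size
    return w0 * w1 * (fg.mean() - bg.mean()) ** 2

image = np.array([[10, 10, 10, 10], [200, 200, 200, 200]], dtype=np.uint8)
fitness = interclass_variance(image, 100)
```

A PSO particle whose position is a threshold can be scored with this function, and the swarm then moves towards thresholds with a larger value.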

Figures 2(l) and 2(m) show the denoising results of PSO algorithm in 3D and 2D perspective view, respectively. Compared with other denoising algorithms, this algorithm can achieve better denoising effect without blurring the target contour. In this section, a D-PSO is introduced to remove the complicated background. Compared with using the difference algorithm alone, D-PSO can not only remove the complex background in surroundings, but it can also reduce the noises that appear after applying the difference algorithm.

2.3. Head Segmentation Based on Maximally Stable Extremal Regions (MSER)

When the TOF camera is used, the depth value for different parts of the pedestrian body varies greatly. In order to extract the head region, the maximally stable extremal regions (MSER) algorithm is used in the paper. The MSER algorithm performs successive binarization operations on an image while the binarization threshold is continuously increased from 0 to 255 [38]. If a connected region in the image changes only a little, or does not change at all, over a wide range of the binarization threshold, this region is called a maximally stable extremal region. Figure 2(n) shows the result obtained by applying MSER to Figure 2(m). In the figure, different connected regions are marked with different colours for clarity. It is obvious that MSER can separate different levels of pedestrian body parts.

Fortunately, regardless of the height and position of pedestrians, the head shape of a pedestrian is a relatively stable ellipse, even for pedestrians without hair. Thus, the circularity is used as a constraint to get the head region. The circularity of each region is calculated by the following equation:

$$C = \frac{4\pi A}{l^2}, \tag{7}$$

where C represents the circularity of the connected region, l represents the number of pixels in the boundary of the connected region, and A represents the number of pixels within the connected region.

The circularity of a standard circle is 1, and the circularity of noncircular objects is less than 1. According to the experimental equipment and environment, we reached the empirical conclusion that the circularity of the head region lies between 0.6 and 1.0. If a connected region's circularity is outside this range, it is marked as a non-head region and deleted. Because of the size of the pedestrian head in practice, the number of pixels A is used as another constraint condition. After repeated tests, we conclude that the A of the head region should lie in the interval (300, 900). In other words, a connected region can be a head region only if its A is within this range. As stated above, the constraints can be summarized in the following equation:

$$0.6 \le C \le 1.0, \quad 300 < A < 900. \tag{8}$$
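The two constraints can be checked for each connected region as in the minimal sketch below. It assumes the usual definition of circularity, which equals 1 for an ideal circle, and takes the pixel counts as given; the function names are hypothetical.

```python
import math

def circularity(area: float, perimeter: float) -> float:
    """4 * pi * A / l^2, which is 1 for an ideal circle."""
    return 4.0 * math.pi * area / perimeter ** 2

def is_head_candidate(area, perimeter, c_range=(0.6, 1.0), a_range=(300, 900)):
    """Apply the circularity and pixel-count constraints to one region."""
    c = circularity(area, perimeter)
    return c_range[0] <= c <= c_range[1] and a_range[0] < area < a_range[1]

compact = is_head_candidate(area=500, perimeter=90)     # C is about 0.78
elongated = is_head_candidate(area=500, perimeter=200)  # C is about 0.16
too_large = is_head_candidate(area=1000, perimeter=120) # area out of range
```

With pixel-counted boundaries the computed circularity of a real region can deviate slightly from the continuous value, which is one reason an empirical range is used rather than a single target.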

By calculating and comparing the above two parameters of each connected region in Figure 2(n), the head region is extracted, as shown in the yellow part of Figure 2(o). Figure 4 is the pixel distribution map of the extracted head region, where the black dots represent pixel points, and the coordinates represent the positions of the pixels in the image. From this figure, we can discover another advantage of the proposed MSER-based segmentation algorithm, which can remove the notable noises in the head region, such as salt-and-pepper noise. Since the notable noise is very different from its neighbour pixels, it will not be incorporated into the head region when the MSER algorithm is used to obtain the stable region. Therefore, the MSER-based segmentation can effectively filter out notable noises in the head region, as shown in the red rectangles in Figure 4. Note that the red rectangles are the manual markers for easy viewing.

3. Real-Time Calculation for Pedestrian Height

3.1. Multilayer Iterative Average Algorithm for Pixel Value

Although the MSER algorithm can filter out the notable noise, some noise still remains in the head region, as shown in the 3D representation of the head region in Figure 2(f). The typical approach of measuring height from the top of the head alone is not accurate. Thus, a novel multilayer iterative average (MLIA) algorithm is proposed to obtain the pixel average used for computing the pedestrian height. The MLIA algorithm not only improves accuracy but also effectively removes some outliers that MSER cannot filter out. The MLIA can be broken down into the following steps:
(1) Calculating the average pixel value: the following equation gives the average pixel value in the head region:

$$\bar{p} = \frac{1}{n} \sum_{i=1}^{n} p_i, \tag{9}$$

where $\bar{p}$ is the pixel value average, n is the number of pixels in the current head region, and $p_i$ represents a pixel value in the current head region.
(2) Updating the head region: traverse all the pixels in the head region and delete the pixels that do not satisfy condition (10). The remaining pixels form the updated head region. Condition (10) involves a threshold function of the current average $\bar{p}$, which is defined in (11) in terms of the maximum and minimum pixel values in the head region.
(3) Repeat steps (1) and (2) until the stopping condition (12) is satisfied. The empirical constant in (12) is selected as 2.0 in this paper according to the actual situation.

The above steps can be summarized as the following pseudocode (Algorithm 1).

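Input: S, the initial head region extracted by the MSER-based segmentation.
Procedure:
(1) n = count(S);
(2) compute the pixel average of S by (9);
(3) find the maximum and minimum pixel values in S;
(4) compute the threshold by (11);
(5) while the stopping condition (12) is not satisfied do
(6)   delete from S the pixels that do not satisfy (10);
(7)   n = count(S);
(8)   recompute the pixel average of S by (9);
(9)   recompute the maximum and minimum pixel values in S;
(10)  recompute the threshold by (11);
(11) end while
Output: the average of the pixels in the head region.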
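The iterative idea behind the MLIA can be sketched as follows. The threshold rule used here (half of the larger distance from the average to the extremes) and the stopping rule (change in the average below 2.0) are assumptions chosen only to make the sketch runnable; the paper's own threshold function (11) and stopping condition (12) are not reproduced.

```python
import numpy as np

def mlia(pixels, eps=2.0, max_iter=50):
    """Iteratively trim outliers around the average (assumed threshold rule)."""
    region = np.asarray(pixels, dtype=np.float64)
    avg = region.mean()
    for _ in range(max_iter):
        # ASSUMPTION: threshold is half the larger distance to an extreme value.
        t = 0.5 * max(region.max() - avg, avg - region.min())
        kept = region[np.abs(region - avg) <= t]
        if kept.size == 0:
            break
        new_avg = kept.mean()
        converged = abs(new_avg - avg) < eps  # ASSUMPTION: stop on a small change
        region, avg = kept, new_avg
        if converged:
            break
    return avg

# Hypothetical head pixels with two outliers (30 and 180).
head = [100] * 50 + [101] * 30 + [99] * 20 + [30, 180]
head_average = mlia(head)
```

In this example the first pass discards the two outliers and the average settles at 100.1, so the loop stops after one update.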

By the way, the MLIA algorithm can also be applied to the multipedestrian situation. When the image contains more than one pedestrian, the MSER-based segmentation can get more than one head region. Meanwhile, the pixel value average of each head region needs to be calculated by the MLIA algorithm.

3.2. Height Calculation

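Once the pixel average $\bar{p}$ of the head region is obtained, the average of the head region in the original pedestrian grey image (such as Figure 2(d)), denoted $\bar{g}$, can be obtained by rearranging (2).

Then, substituting $\bar{g}$ into (1) in place of the grey value, we obtain the following equation:

$$\bar{d} = \frac{\bar{g}\,\left(d_{\max} - d_{\min}\right)}{255} + d_{\min}, \tag{13}$$

where $\bar{d}$ is the depth value corresponding to $\bar{g}$, and $d_{\max}$ and $d_{\min}$ are the maximum and minimum depth values in the pedestrian depth image.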

According to the physical properties of the TOF camera, the conversion equation (14) can be used to recover the physical distance from the depth data [39]. It gives the physical distance between the TOF camera and the pedestrian head (unit: mm) from the depth value using two constants: a deviation constant associated with the physical structure and placement height of the TOF camera, and a conversion coefficient associated only with the physical structure of the TOF camera.

To allow our method to work for pedestrians who are not vertically below the TOF camera, the pinhole model proposed in our previous work [26] is adopted to correct this physical distance in (15). The correction uses the focal length f and the distance between the centroid M of the head region in the grey image and the centre of the grey image. More detailed information about the pinhole model can be found in [26]. The coordinates of the centroid M are given by the following equation:

$$x_M = \frac{\sum_{i=1}^{n} m_i x_i}{\sum_{i=1}^{n} m_i}, \quad y_M = \frac{\sum_{i=1}^{n} m_i y_i}{\sum_{i=1}^{n} m_i}, \tag{16}$$

where n is the number of pixels in the current head region, $x_M$ and $y_M$ are the horizontal and vertical coordinates of the centroid M, $x_i$ and $y_i$ are the horizontal and vertical coordinates of the ith pixel, respectively, and $m_i$ is the mass assigned to the ith pixel.

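Finally, the pedestrian height H is calculated by the following equation:

$$H = D_0 - D', \tag{17}$$

where $D_0$ is the distance between the TOF camera and the ground and $D'$ is the corrected physical distance between the TOF camera and the pedestrian head.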
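The last steps of the height calculation can be sketched as below. This sketch assumes unit mass for every head pixel, a focal length expressed in pixels, and a simple pinhole correction in which the measured distance is projected onto the optical axis; the exact correction of [26] may differ, and all function names and numbers are hypothetical.

```python
import math
import numpy as np

def head_centroid(xs, ys):
    """Centroid of the head pixels, assuming unit mass per pixel."""
    return float(np.mean(xs)), float(np.mean(ys))

def corrected_distance(distance_mm, centroid, image_centre, focal_px):
    """Project an off-axis distance onto the optical axis (assumed pinhole form)."""
    r = math.hypot(centroid[0] - image_centre[0], centroid[1] - image_centre[1])
    return distance_mm * focal_px / math.hypot(focal_px, r)

def pedestrian_height(camera_height_mm, corrected_mm):
    """Height = camera-to-ground distance minus camera-to-head distance."""
    return camera_height_mm - corrected_mm

centroid = head_centroid([309, 311], [120, 120])          # (310.0, 120.0)
d_corr = corrected_distance(1000.0, centroid, (160, 120), focal_px=200.0)
height = pedestrian_height(2500.0, d_corr)
```

When the head centroid coincides with the image centre, the correction leaves the measured distance unchanged.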

3.3. Kalman Estimation of Real-Time Height

In the experiments, we found that the fluctuations of the pedestrian heights all approximately conform to a Gaussian distribution with variance 256 (unit: mm²), and the variance did not change with the state of the system. Therefore, Kalman filtering is further introduced to refine the pedestrian heights obtained by (17) and achieve more accurate real-time heights. Kalman filtering is a highly efficient recursive filter that can estimate the state of a dynamic system from a series of noisy measurements [40]. It can generate estimates of unknown variables that have proven to be more accurate than those based on a single measurement alone [4, 41]. The Kalman filter can be implemented in two stages: the time update stage and the measurement update stage [42].

The time update stage is dedicated to predicting the current a priori estimates from the past state and the error covariance. Equations (18) and (19) predict the a priori state estimate and the a priori error covariance estimate in the current (kth) frame, respectively:

$$\hat{x}_k^- = A \hat{x}_{k-1} + B u_{k-1}, \tag{18}$$

$$P_k^- = A P_{k-1} A^{T} + Q, \tag{19}$$

where $\hat{x}_{k-1}$ and $P_{k-1}$ are, respectively, the state and the error covariance of the previous step, $A$ is the transfer matrix that relates the state of the previous step to the state of the current step, B is the control matrix that relates the previous input $u_{k-1}$, and $Q$ is the variance of the Gaussian process noise. Based on the actual situation of pedestrians during the movement (no external input, Gaussian distribution of the height fluctuation, and continuity of the height change), the transfer matrix, the control term, and the process noise variance in the time update stage are set to constants; $\hat{x}_k^-$ is the a priori height estimate from the current depth image.

The measurement update stage is devoted to combining actual measurements with the a priori estimates to get improved a posteriori estimates [42]. It can be achieved by the following equations:

$$K_k = P_k^- H_k^{T} \left(H_k P_k^- H_k^{T} + R\right)^{-1}, \tag{20}$$

$$\hat{x}_k = \hat{x}_k^- + K_k \left(z_k - H_k \hat{x}_k^-\right), \tag{21}$$

$$P_k = \left(I - K_k H_k\right) P_k^-, \tag{22}$$

where $\hat{x}_k$ and $P_k$ are the a posteriori state estimate and the a posteriori error covariance estimate in the current (kth) step, $K_k$ is the Kalman gain in the current step, $H_k$ is the matrix that relates the state to the measurement $z_k$, I is a unit matrix, and R is the variance of the Gaussian measurement noise. Based on the actual situation of the measurements (camera accuracy and measurement process), the measurement matrix and the measurement noise variance are set to constants; $\hat{x}_k$ is the a posteriori height estimate from the current depth image, and $z_k$ is the pedestrian height obtained by (17). In addition, the initial state and the initial error covariance are set in advance.
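For a scalar height state, the two stages reduce to a few lines, as in the minimal sketch below. It is the standard scalar Kalman filter; the noise variances `q` and `r`, the initial values, and the measurements in the example are placeholders and not the paper's settings.

```python
def kalman_step(x_prev, p_prev, z, q, r, a=1.0, h=1.0):
    """One scalar Kalman iteration: time update, then measurement update."""
    # Time update (no control input is assumed for a walking pedestrian).
    x_prior = a * x_prev
    p_prior = a * p_prev * a + q
    # Measurement update with the new height measurement z.
    gain = p_prior * h / (h * p_prior * h + r)
    x_post = x_prior + gain * (z - h * x_prior)
    p_post = (1.0 - gain * h) * p_prior
    return x_post, p_post

# Hypothetical heights in mm, one per frame.
x, p = 1700.0, 1.0
for z in [1704.0, 1698.0, 1702.0]:
    x, p = kalman_step(x, p, z, q=1.0, r=2.0)
```

Each filtered height lies between the previous estimate and the new measurement, which is how the continuity of the height change is exploited.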

4. Experiments and Analysis

4.1. Experimental Setup

In this paper, an EPC660 is used as the TOF chip to offer a fully digital interface for the control circuitry, and the communication between computer and camera is realized through Gigabit network. In addition, the experiment is completed with the support of the computer with Windows 10 OS, Intel® Core™ i3-8100 3.60 GHz CPU and 8 GB RAM. The campus corridor is selected as the first test site, and the experimental scene is shown in Figure 5(a). Then, considering the fluctuation of pedestrian height in dynamic situations, the research room is chosen as the second test site, and the VICON system fixed in this site is adopted as the ground truth to confirm the feasibility of the proposed method. The experimental scene in research room is shown in Figure 5(b), where a portion of the VICON system, two of the 12 infrared cameras, is shown. While the VICON is running, four lightweight reflective balls are stuck to the pedestrian’s head; the placement layout of the balls is shown in Figure 5(c). And the average height of the four balls is adopted as the real-time height of the pedestrian.

4.2. Comparison with Other Popular Algorithms

Before the PSO algorithm is adopted to process the images with unwanted noise, other popular algorithms are applied to the same images for comparison. More specifically, three algorithms are implemented for comparison here:
(1) Maximum Connected Region (MCR). As the name implies, MCR refers to the method of extracting the largest connected region in an image. When only a single person appears in the field of view, such as in Figure 2(e), MCR is more likely to get desirable results than PSO. In the actual situation, however, we do not know in advance how many people will go through the test site. Take Figure 6(a) as an example; when two people go through the test site at the same time, MCR may get a wrong result, as shown in Figures 6(b) and 6(c).
(2) Edge Threshold Method (ETM). In ETM, an edge operator such as Canny is first used to obtain the possible target contours, and the number of pixels in each contour is then calculated. If the number is larger than a specific threshold, the region enclosed by the corresponding contour is considered useful and is retained; otherwise, the region is considered useless and is removed. In the paper, the boundary between the target person and the redundant noise is usually solid, which makes it possible to split the target from the background with the ETM. More importantly, the ETM can also get good results in multipedestrian images with appropriate parameters. However, it is very difficult for the ETM to select parameters adaptively. Once the test environment changes, the parameters of the ETM need to be reselected, which limits its application.
(3) Reaction Diffusion-Level Set Evolution (RD-LSE). The RD-LSE proposed by Zhang et al. [43] is an improved level set algorithm, which is widely used in the field of image segmentation. Figure 6(d) shows the search process of the RD-LSE algorithm for Figure 6(a), in which the yellow curves show the evolution process, the green curve represents the initial contour, and the red curve represents the final contour. This algorithm can achieve a better result than the PSO algorithm even in the case of multiple pedestrians, as shown in Figures 6(e) and 6(f).

In the paper, we take 4 different types of pictures as examples to compare the performance of RD-LSE and PSO in terms of converged iterations and CPU time. The experimental results are shown in Table 1, where images 1–4 represent Figures 2(e), 6(a), 7(a), and 7(g), respectively. The values in the table are the averages of 100 experiments. Table 1 shows that the computational efficiency of the PSO algorithm far exceeds that of RD-LSE, which is the main reason why we choose PSO.

4.3. Experimental Results

Apart from the multipedestrian cases such as in Figure 6, many other cases with the pedestrian in different states are studied to verify the effectiveness and robustness of the proposed method. In Figure 7(a), the pedestrian raised his left hand above his head. Figures 7(c), 7(e), and 7(f) show the experimental process and result of adopting the proposed method for Figure 7(a). For clarity, the 3D representations of Figures 7(a) and 7(c) are shown in Figures 7(b) and 7(d), respectively. Although the height of the head is lower than that of the left hand, the proposed method can still get the correct result. Figures 7(i), 7(k), and 7(l) show the experimental process and result of adopting the proposed method for Figure 7(g), in which a pedestrian is kneeling. Although the proposed D-PSO algorithm does not eliminate all redundant noises, as shown in Figure 7(j), it also yields ideal experimental results due to MSER’s insensitivity to a small amount of the sporadic noise. All the above experiments show that the performance of our method is very stable and reliable.

To further verify the accuracy of the proposed method, a lot of experiments are conducted based on 6 subjects: four men and two women, who are asked to walk through the test sites at the usual speed. Here we take a set of data obtained from the research room as an example to analyse the results. Figure 8 shows the height results obtained from the six subjects using the VICON alone in several continuous seconds; the sex and static height of the six subjects are presented in the legend. It explains that it is unrealistic to keep the height on the static level when the pedestrian is walking. Thus, it is essential to study the pedestrian height in the dynamic situation.

Because the VICON and TOF cameras capture images at a high rate and pedestrians move slowly (0.7–1.2 meters per second), we select only 5 height data per second to show a real-time height comparison between the VICON and the proposed method. Every fifth of a second, an image is collected with the TOF camera. The pedestrian height in the image is obtained by the proposed method and compared with the height collected with VICON at the same time. Figures 9 and 10 show the experimental results of four men and two women in six consecutive seconds. In the figures, the dotted line represents our algorithm without Kalman filtering, the solid line represents our algorithm with Kalman filtering, and the dotted line with the mark "+" indicates the VICON. The waveforms show the real-time height value in 6 consecutive seconds; the static heights of the men are 1760 mm, 1676 mm, 1761 mm, and 1728 mm, as shown in the legend of Figure 9, while the static heights of the women are 1648 mm and 1629 mm, as shown in Figure 10.

It can be seen from the curves that the height data measured by our algorithm is almost consistent with the data obtained by VICON. In order to analyse the error of our algorithm, we sort out the errors of all the data in the six consecutive seconds; the results are shown in Figures 11 and 12. The figures show that Kalman filtering can effectively improve the accuracy of height measurement, which indicates the pedestrian height at the preceding moment facilitates the estimate of the pedestrian height in the latter moment.

Also, the sums of errors per second of the algorithms with and without Kalman filtering are given in Table 2 where the subscript “” represents male and “#” represents female. Table 2 shows that our algorithm with Kalman filtering has a smaller cumulative error and can more accurately measure the real-time height of the moving pedestrians, which proves the feasibility and validity of the proposed method.

5. Conclusion and Future Work

In this paper, a real-time height measurement based on the TOF camera is proposed for moving pedestrians. To get the target region, a new D-PSO denoising algorithm and a segmentation algorithm based on MSER are developed in the paper. In addition, a novel multilayer iterative average algorithm is designed for calculating the pedestrian height. Also, the Kalman filtering is used to improve the measurement accuracy. The experimental results demonstrate the effectiveness and practicability of the proposed method. Our future work is going to further improve the measurement accuracy and focus on tracking pedestrians in real time by using the real-time height of moving pedestrians.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors are grateful to the financial support from the Natural Science Foundation of China (61877065), the National Key Research and Development Program of China (2019YFB1405500), the National Natural Science Foundation of Guangdong (2016A030313177), Guangdong Frontier and Key Technological Innovation (2017B090910013), and the Science and Technology Innovation Commission of Shenzhen (JCYJ20170818153048647 and JCYJ20180507182239617).