Abstract

Crowdsensing leverages human intelligence and experience from the general public and social interactions to create participatory sensor networks, in which context-aware and semantically complex information is gathered, processed, and shared to collaboratively solve specific problems. This paper proposes a real-time projector-camera finger interaction system based on crowdsensing, in which users can interact with a computer by touching arbitrary surfaces with a bare hand. The interaction process is carried out completely automatically, and the system can serve as an intelligent device in intelligent transport systems, where the driver can watch and interact with the displayed information while driving without visual distraction. A single camera is used to recover the 3D information of the fingertip for touch detection, and a linear-scanning method is used to determine touches, improving the system's usability and supporting user collaboration. Experiments demonstrate the feasibility of the proposed system and its robustness to different lighting conditions: the average rate of correct touch detection is 92.0%, and the average time for processing one video frame is 30 milliseconds.

1. Introduction

In 2017, the World Health Organization reported that approximately 1.25 million people die every year as a result of road traffic crashes [1]. Distracted driving is a major factor in these crashes, and the distraction caused by mobile phones, vehicle-mounted central control displays, and navigation displays is a growing road safety concern. Watching displays while driving slows braking reaction time and makes it difficult to stay in the correct lane and keep a safe following distance.

To reduce the distraction caused by watching a display while driving, a projector-camera interactive system can project the display image onto transparent glass in the vehicle, while users create and send messages by touching the projected content directly with a bare hand, without any physical clicking device. Drivers can then watch the displayed information while driving without visual distraction.

The growth of social networking services based on mobile devices is a remarkable trend in mobile computing [2, 3]. Social networking services can be fused with real-world sensing, and crowdsensing is one form of this fusion [4]. Crowdsensing, a special form of crowdsourcing [5, 6], leverages human intelligence and experience from the general public and social interactions to create participatory sensor networks that solve specific problems collaboratively [7].

Projector-camera interactive systems also have potential applications in projection-based augmented reality (AR) [8], which projects augmenting information directly onto objects in the real world. A vehicle-mounted projection-based AR system can project traffic information onto the glass screen and overlay it on reality. Through crowdsensing, drivers can report specific traffic information, such as outside buildings and store information, road conditions in remote areas, and accidents, via the AR interactive system. This information can then be automatically aggregated and delivered to other users in real time.

Many vision-based studies of projector-camera interactive systems using a single camera have been carried out. Shah et al. realized finger-click detection from the fingertip path using a Camshift tracker [9]; however, their method detects clicks with a delay-based scheme, which is unsuitable for applications requiring fast response. Dai and Chung proposed a touch detection method in which imperceptible structured light is embedded in the projection [10], and He and Cheng realized touch detection by encoding and embedding a self-adaptive structured light into the projection [11]; however, a specially synchronized, high-speed camera and projector must be used to lock the phase of the embedded structured light. Hu et al. realized bare-finger touch interaction with a novel approach based on button distortion [12], but the touch precision depends heavily on the button size. Cheng et al. proposed a projector-camera interactive system in which touch is detected from matched pairs of feature points on a white circle projected around the fingertip [13]; however, the additional projected circle and the finger's shadow on the projection screen reduce the accuracy of foreground extraction. Park et al. implemented and evaluated a touch interface for a projection surface in which a depth camera recognizes the hand and extracts the hand area from the surrounding scene [14], but depth cameras are expensive and their precision is affected by the surrounding illumination. Zhou et al. proposed a method that estimates finger depth using a perceptible black-and-white stripe pattern [15]; however, the perceptible stripes on the projection screen disturb the normal display image.

The information cues of the finger's shadow have been used in many projector-camera interactive systems. Song et al. proposed a handwriting recognition system in which the finger and its shadow are tracked by a camera, and the finger tracks indicate the direction and location within a document to simulate typical operations [16]. Xu et al. introduced an interactive system based on the shadow cast by the projector [17]; shadows can provide a simple interface between humans and computer systems. Huang et al. proposed an FSM grammar to recognize finger gestures together with a shadow-based fingertip detection method [18].

Cai et al. proposed a fingertip touching approach based on the geometric relationship between the fingertip and its shadow to estimate the distance from the fingertip to the projection surface [19]. However, these shadow-based methods require ideal illumination so that the shadows can be recognized reliably. Dung et al. provided a touch system that uses the distance between the finger and its shadow to detect the touch timing and location [20, 21]; however, an infrared camera and an infrared light source are required, which makes the system complicated and costly.

In this paper, a real-time projector-camera interactive system based on crowdsensing is proposed, in which a triangulation method and a linear-scanning method are used to determine touches, improving the system's usability and supporting user collaboration. These approaches enhance the robustness and accuracy of the system. The interaction process is carried out completely automatically, and the system can serve as an intelligent device in intelligent transport systems. The rest of the paper is organized as follows. Section 2 gives an overview of the system and presents our proposed methods, including the modified multiframes difference method, the linear-scanning method, and the fingertip detection method. Experimental results are given in Section 3, and concluding remarks in Section 4.

2. Crowdsensing-Based Interactive System

2.1. System Overview

The proposed crowdsensing-based real-time finger interaction system for intelligent transport systems consists of a computing device, a projector, a camera, and vehicle sensing devices. For each data collection task, such as gathering outside building and store information, road conditions in remote areas, or accident reports, a specific distributed crowdsourcing platform is established, in which an incentive mechanism with privacy protection is adopted to attract users to participate. A hybrid incentive approach is used for crowdsensing data collection: a monetary reward incentive for collecting outside building and store information, an entertainment and gamification incentive for collecting road conditions in remote areas, and a virtual credit incentive for collecting accident reports.

The collected data is preprocessed and uploaded to the cloud service platform, which analyzes and processes all of it. Other users download the information from the cloud service platform and project it onto the front screen of the car in projection-based AR mode. A registration method based on projective reconstruction from natural features is used in the projection-based AR system. First, four points are specified to build the world coordinate system in which virtual objects are superimposed. Next, natural features in the live video are tracked using the Kanade-Lucas-Tomasi feature tracker [22], and the corresponding projective matrix is estimated from the tracked natural features in the image sequence [23]. Then, the registration matrix for AR is computed by transforming the four specified points with the projective reconstruction technique.
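A minimal sketch of this tracking-and-registration step is given below, assuming OpenCV. It approximates the projective reconstruction with a frame-to-frame homography; all function names and parameter values are our illustrative choices, not the paper's implementation.

```python
import cv2
import numpy as np

def track_features(prev_gray, curr_gray, prev_pts):
    """Track natural features between frames with pyramidal Lucas-Kanade."""
    curr_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, prev_pts, None, winSize=(21, 21), maxLevel=3)
    good = status.ravel() == 1
    return prev_pts[good], curr_pts[good]

def register_frame(prev_gray, curr_gray, prev_pts, world_pts):
    """Estimate a frame-to-frame projective mapping from the tracked
    features and transfer the four user-specified registration points."""
    p0, p1 = track_features(prev_gray, curr_gray, prev_pts)
    H, _mask = cv2.findHomography(p0, p1, cv2.RANSAC, 3.0)
    moved = cv2.perspectiveTransform(
        world_pts.reshape(-1, 1, 2).astype(np.float32), H)
    return H, moved.reshape(-1, 2)
```

In practice the initial features would come from a detector such as cv2.goodFeaturesToTrack, and the registration matrix would be refreshed every frame from the transferred points.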

Because of the projector light source, the hand casts a shadow on the display screen, and the movements of the hand and its shadow are tracked by the camera. When the user interacts with the projection screen, the system extracts the area of the hand and its shadow from the projection image and estimates their degree of fusion. When the hand and its shadow are detected as completely fused, the location of the fingertip is detected and used as the touch location.

2.2. Foreground and Shadow Extraction

There are three main approaches for detecting a moving foreground target in image sequences: optical flow [24], background subtraction [25], and frame difference [26]. The optical flow method uses the brightness changes of the image to extract the motion information of the object, and the motion vector of each pixel forms a complete vector field. However, the projection images change randomly, which makes optical flow unsuitable for our projection interaction application; moreover, the method is computationally complex and has poor noise immunity, so it cannot be applied to real-time processing. The background subtraction method detects the foreground target by subtracting a reference image, called the background, from the current frame; the background is obtained through an image selection process. The frame difference approach follows the same principle but treats the previous frame as the reference image. Because of its small computational cost and high real-time performance, the frame difference method is commonly used for foreground detection; however, its results are prone to image smear and holes. In this paper, we combine it with background subtraction and propose a modified multiframes difference method to detect the foreground and its shadow.

First, the captured camera image is geometrically calibrated against the corresponding projection image in order to obtain the location relationship between the projection image and the camera view. This process is carried out automatically. A 3 × 3 homography matrix $H$ [27] can be used to describe the mapping between a point in the camera view and the corresponding point in the projection image:

$$s \begin{bmatrix} x_c \\ y_c \\ 1 \end{bmatrix} = H \begin{bmatrix} x_p \\ y_p \\ 1 \end{bmatrix},$$

where $(x_c, y_c)$ is a point in the camera view, $(x_p, y_p)$ is the corresponding point in the projection image, and $s$ is a scale factor.

The main steps of geometric calibration are as follows [28]. (1) Project a chessboard onto the display screen. (2) Capture the projected image with the camera. (3) Detect the chessboard corners in the projection image and the captured image, as shown in Figure 1. (4) Estimate the homography between the projection image and the captured image from the corresponding corner locations in the two chessboard images. Only the 4 corner points of one rectangle in the chessboard are necessary to estimate the homography; in this paper, as shown in Figure 1, 40 corner points are used to achieve higher accuracy.
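The following sketch illustrates steps (1)-(4), assuming OpenCV. The 8 × 5 inner-corner pattern (40 corners, matching Figure 1) and the RANSAC threshold are our illustrative assumptions.

```python
import cv2

PATTERN = (8, 5)  # inner corners per row and column -> 40 corner points

def calibrate_homography(projection_img, captured_img):
    """Estimate the 3x3 homography H mapping projection-image points
    to camera-view points from matched chessboard corners."""
    ok_p, corners_p = cv2.findChessboardCorners(projection_img, PATTERN)
    ok_c, corners_c = cv2.findChessboardCorners(captured_img, PATTERN)
    if not (ok_p and ok_c):
        raise RuntimeError("chessboard corners not found in both images")
    # 4 correspondences suffice in principle; 40 improve accuracy.
    H, _ = cv2.findHomography(corners_p, corners_c, cv2.RANSAC, 2.0)
    return H
```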

Next, the foreground and its shadow are segmented from the calibrated captured image. If the image is projected onto the screen directly, the captured current image should equal the previously captured reference image; if a moving foreground appears in front of the screen, the surface albedo changes. To increase the robustness of extracting the foreground and its shadow, block searching is used to obtain the albedo change of each pixel, which is computed by estimating a set of albedo ratios $r(x, y)$:

$$r(x, y) = \frac{I_t(x, y)}{I_{t-1}(x, y)},$$

where $I_t$ is the current frame image, $I_{t-1}$ is the previous frame image, and $I_t(x, y)$ and $I_{t-1}(x, y)$ are the gray values of pixel $(x, y)$ in the respective images.

A pixel $(x, y)$ belongs to the foreground region if any one of the albedo ratios in its search block satisfies

$$r(x, y) < T \quad \text{or} \quad r(x, y) > \frac{1}{T},$$

where $T$ is a constant between 0.5 and 0.8 that serves as a tolerant scale of the albedo change.

If the hand moves fast enough, the hand and its shadow can be extracted from the difference of two consecutive frames as described above. If the hand moves slowly, the difference of two consecutive frames cannot extract the hand and its shadow completely, so multiple frames are used in the differential operation. The extracted foreground-and-shadow image $F_k$ is obtained by

$$F_k(x, y) = \begin{cases} 255, & r_k(x, y) < T \ \text{or} \ r_k(x, y) > 1/T, \\ 0, & \text{otherwise}, \end{cases}$$

where $r_k(x, y)$ is the albedo ratio between the current frame image $I_t$ and the previous reference image $I_{t-k}$.

The background subtraction method is combined with the multiframes difference method in the proposed interactive projection system. It is basically similar to the frame difference method, but the reference image is the background image; in the modified multiframes difference method, the projection image itself serves as the background. From the difference between the background image $B$ and the geometrically calibrated captured image, the extracted foreground-and-shadow image $F_b$ is obtained by

$$F_b(x, y) = \begin{cases} 255, & r_b(x, y) < T \ \text{or} \ r_b(x, y) > 1/T, \\ 0, & \text{otherwise}, \end{cases}$$

where $r_b(x, y)$ is the albedo ratio between the current frame image $I_t$ and the background image $B$.

The multiframes difference results $F_{k_1}$ and $F_{k_2}$ are each combined with $F_b$ by a pixel-wise AND operation:

$$M_1 = F_{k_1} \wedge F_b, \qquad M_2 = F_{k_2} \wedge F_b.$$

Then, a pixel-wise OR operation is performed on $M_1$ and $M_2$ to obtain the final extracted image $F$:

$$F = M_1 \vee M_2.$$
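The sketch below assembles the whole modified multiframes difference pipeline following the reconstructed equations above. The frame offsets (10 and 15, suggested by Figure 3), the symbol names, and the choice T = 0.7 are illustrative assumptions.

```python
import numpy as np

T = 0.7  # tolerant scale of the albedo change, chosen within 0.5-0.8

def albedo_mask(curr, ref, t=T):
    """Binary foreground-and-shadow mask from albedo ratios r = curr / ref."""
    r = curr.astype(np.float32) / (ref.astype(np.float32) + 1e-6)
    return np.where((r < t) | (r > 1.0 / t), 255, 0).astype(np.uint8)

def extract_foreground(i_t, i_t10, i_t15, background):
    """AND each multiframe difference with the background difference,
    then OR the results to obtain the final mask F."""
    f_k1 = albedo_mask(i_t, i_t10)       # difference with previous 10th frame
    f_k2 = albedo_mask(i_t, i_t15)       # difference with previous 15th frame
    f_b = albedo_mask(i_t, background)   # background (projection image) difference
    m1, m2 = f_k1 & f_b, f_k2 & f_b      # pixel-wise AND
    return m1 | m2                       # pixel-wise OR -> F
```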

The extracted foreground-and-shadow image $F$ is processed by morphological erosion and dilation [29]. Because of noise interference, small regions of the reference image often have a color or gray level similar to the target's, so $F$ often contains many isolated points and holes that interfere with the detection of the foreground and its shadow. Erosion and dilation remove the isolated spots and fill the holes.
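For illustration, this post-processing could look as follows with OpenCV; the elliptical kernel and its size are our assumptions.

```python
import cv2

def clean_mask(F, ksize=5):
    """Opening (erosion then dilation) removes isolated spots;
    closing (dilation then erosion) fills holes in the mask F."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (ksize, ksize))
    opened = cv2.morphologyEx(F, cv2.MORPH_OPEN, kernel)
    return cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)
```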

The flow chart of the modified multiframes difference method described above for foreground and shadow extraction is shown in Figure 2. Figure 3(a) is the captured current image with the foreground and its shadow; Figures 3(b) and 3(c) are the captured previous 10th and 15th images, respectively; Figure 3(d) shows the foreground and shadow extracted with the modified multiframes difference method.

2.3. Touch Detection

Touch detection in our interactive projection system is based on triangulation [30]. A linear-scanning method is proposed to detect the triangulation of the finger and its shadow without separating the hand from its shadow, which increases the robustness of the system and reduces the computational cost. Before the fusion degree of the finger and its shadow is estimated, the hand in the extracted image is demarcated; this avoids disturbance from the rest of the captured image and improves the accuracy and processing speed of the fusion detection. If the user is on the right side of the projection screen, foreground-and-shadow pixels appear on the right edge of the extracted image but not on the left edge, as shown in the left parts of Figures 4(a) and 4(b); the section from the leftmost pixel column of the foreground and its shadow to the right edge, as shown in the right parts of Figures 4(a) and 4(b), is then cropped as the demarcated area for fusion detection. Symmetrically, if the user is on the left side of the screen, foreground-and-shadow pixels appear on the left edge but not on the right edge, as shown in the left parts of Figures 4(c) and 4(d), and the section from the rightmost pixel column of the foreground and its shadow to the left edge, as shown in the right parts of Figures 4(c) and 4(d), is cropped as the demarcated area.

A vertical scanning line is used to scan the demarcated area to estimate the fusion degree. If a vertical line is found on which a segment of 0-valued pixels is bounded by pixels of value 255 at both ends, the finger and its shadow have not yet completely fused, as shown in the right parts of Figures 4(a) and 4(c). If no such vertical line is found, the finger and its shadow have fused, as shown in the right parts of Figures 4(b) and 4(d).
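A minimal sketch of this fusion test on the demarcated binary mask (255 = foreground or shadow, 0 = background) could be written as follows; the gap test follows the text, while the column-wise implementation is ours.

```python
import numpy as np

def is_fused(demarcated):
    """Return False if any column contains a 0-valued segment bounded
    by 255 at both ends, i.e., a gap between finger and shadow."""
    for col in demarcated.T:                 # each column is one vertical scan line
        on = np.flatnonzero(col == 255)
        if on.size >= 2 and (col[on[0]:on[-1] + 1] == 0).any():
            return False                     # visible gap: not yet fused
    return True
```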

When the fingertip and its shadow are detected as completely fused by the linear-scanning method, the location of the fingertip is found in the captured camera image to obtain the touch location on the projection image. Based on the user's position, when the user is on the right side of the projection screen, the leftmost pixel of the foreground in the extracted image is taken as the fingertip, as shown in Figure 4(b); when the user is on the left side, the rightmost pixel of the foreground and its shadow is taken as the fingertip, as shown in Figure 4(d).
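This localization rule is simple enough to state directly in code; the following sketch assumes the user's side of the screen is known.

```python
import numpy as np

def fingertip(mask, user_on_right=True):
    """Leftmost foreground pixel when the user stands on the right of
    the screen; rightmost pixel otherwise. Returns (x, y) or None."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    i = np.argmin(xs) if user_on_right else np.argmax(xs)
    return int(xs[i]), int(ys[i])
```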

3. Result and Discussion

A simulation experiment is conducted to evaluate the performance of our proposed projector-camera interactive system. The experimental platform, shown in Figure 5, includes a Lenovo computer with a 3.6 GHz CPU and 4.0 GB of RAM, a 640 × 480 resolution camera, a SONY projector with a resolution of 1440 × 1050, and a projection screen. The projector-screen distance Lps is 1.5 m, and the camera-projector distance Lcp is 0.5 m.

Figure 6 shows the experimental results of foreground and shadow extraction under different lighting conditions. Figures 6(a) and 6(b) show the extraction results under natural lighting with bright and dark projections, respectively; Figures 6(c) and 6(d) show the results under LED (light-emitting diode) lighting with bright and dark projections, respectively. The results show that our extraction method is robust to lighting.

The touch depth accuracy is studied in the experiment. When a touch is detected, the maximum vertical distance from the fingertip to the screen plane is defined as the maximum effective touch depth (METD). Nine points on the screen, marked with red asterisks in Figure 7, are selected as METD detection points. Figure 8 illustrates the METD results at these nine points under natural lighting with a bright projection: most METDs are below 10 mm, with a single outlier of 12 mm, and the average METD is 8.3 mm. The results demonstrate that our proposed method is effective and yields accurate fingertip depth detection during interaction.

The randomness and accuracy of the touch position on the projection screen are also verified. As shown in Figure 9, 220 random points are measured as the finger traverses the screen. Most touch position deviations in the X-Y screen plane are below 8 mm, with a single outlier of 21 mm; the average deviation is 3.3 mm. The results imply that touch position detection is accurate regardless of whether the points lie in the center region or at the edge of the screen.

An experiment on touch detection under different lighting conditions is summarized in Table 1. In this experiment, a touch detection is considered correct for normal projector interaction when the METD is below 8 mm and the touch position deviation is below 8 mm. Table 1 compares the touch detection accuracy of our method with those of Song et al. [16] and He and Cheng [11] over 200 sampled frames, under bright and dark projections in both natural and LED lighting. Under natural lighting with a bright projection, our method and Song et al.'s both achieve 98.0%, higher than the 96.0% of He and Cheng's method; with a dark projection, both achieve 88.0%, higher than He and Cheng's 86.0%. Under LED lighting, with either bright or dark projection, our method is markedly more accurate than both other methods. The average percentage of correct detection of our method (92.0%) is higher than those of the other two methods (88.0% and 88.5%).

The execution times of the blocks of the projector-camera interactive system are shown in Table 2. The average time for processing one video frame is 30 milliseconds, with a variance of 0.6 and a range of 4 across frames, so the touch timing and location can be detected at 33 frames per second. In other words, our method achieves real-time performance on an ordinary computing system.

4. Conclusion

In this paper, we present a real-time projector-camera interactive system based on crowdsensing that enables users to transform any flat surface into a virtual touch panel and interact with a computer by finger touch. The interaction process is carried out completely automatically, so the system can serve as an intelligent device in intelligent transport systems, where the driver can watch and interact with the displayed information while driving without visual distraction. A single camera recovers the 3D information of the fingertip for touch detection through the triangulation of the finger and its shadow. The foreground and its shadow are extracted with a modified multiframes difference method, which combines the adaptability of the frame difference method to the dynamic projection environment with the ability of the background subtraction method to obtain complete target information. As a result, the interactive system is hardly affected by the lighting environment, which leads to accurate extraction of interactive gestures and makes the system more adaptable. The fusion degree of the hand and its shadow is detected with a linear-scanning method, which avoids the errors caused by inadequate separation of the finger and its shadow and increases the robustness of the system. Finally, a simple and effective method for detecting the fingertip position is proposed that exploits the user's habits when using the projection interactive system. The experimental results indicate that the projector-camera interactive system achieves robust and effective performance with a single camera.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant nos. 61403365 and 61402458), the China Postdoctoral Science Foundation (Grant no. 2016M602543), the Natural Science Foundation of Guangdong Province, China (Grant nos. 2015A030313744 and 2016A030313177), Shenzhen Technology Project (Grant nos. JSGG20160331185256983 and JSGG20160229115709109), Guangdong Technology Project (Grant nos. 2016B010108010 and 2016B010125003), Shenzhen Engineering Laboratory for 3D Content Generating Technologies (Grant no. []476), Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (Grant no. 2014DP173025), State Joint Engineering Laboratory for Robotics and Intelligent Manufacturing funded by National Development and Reform Commission (Grant no. 2015581), and CAS Key Technology Talent Program.