Abstract

Global aging is becoming increasingly serious, and the problem of caring for the elderly will become ever more pressing. Against this background, this article designs a control system with the ATmega128 as the main controller, based on the functions of a multifunctional nursing robot. The article uses a convolutional neural network structure to estimate the positions of 3D human joints. The joint coordinates of the colour map are mapped to the depth map based on the parameters of the two cameras. At the same time, 15 joint heat maps are constructed with the joint depth-map coordinates as their centres, and the joint heat maps bound with the depth map are fed into a second-level neural network. The prediction of the position of the user's armpits is then completed by image processing techniques. We compare this method with other pose prediction methods to verify its advantages.

1. Introduction

In recent years, the walking and mobility problems of the elderly and the disabled have received increasing attention from researchers, and many countries have carried out scientific research in this field. Most studies on human pose recognition focus on estimating 2D human joint coordinates from colour maps. Deep learning methods based on large datasets have shown excellent results in detecting human joint points in colour images [1]. However, these algorithms cannot directly provide the position of the human body in the global coordinate system for the transfer and transportation care robot.

To ensure the accuracy of human body posture recognition of the transfer and transportation nursing robot, this paper uses the colour map human joint detection model as the first-level neural network to calculate the colour map human joint pixel coordinates. The article maps the joint coordinates of the colour map to the depth map based on the two camera parameters. At the same time, 15 joint heat maps are constructed with the joint depth map coordinates as the centre, and the joint heat map and the depth map are bound to the second-level neural network. Finally, this paper obtains the global coordinates of 15 joints. Artificial intelligence is a technology developed to simulate, extend, and expand human intelligence.

2. Transfer and Transportation Nursing Robot

Figure 1 shows the transfer and transportation nursing robot “Baize” developed by the team. This human-like back-carrying robot uses sound source localization, visual recognition, and other means to locate the person being cared for and to recognize his or her posture. Autonomous obstacle avoidance and autonomous navigation are realized through perception of the surrounding environment [2]. Tactile sensors installed on the robot arms, chest rest, and seat sense the user's back-hug status in real time. The real-time safety guarantee module in the core control system adjusts the robot's movements so that it can safely and comfortably complete the actions of lifting, carrying, transferring, and placing. These designs provide intelligent transport services for people with lower-limb mobility impairments and illustrate the application of artificial intelligence assistive technology in professional hospital nursing.

To achieve the above functions, the team designed the transfer and transportation nursing robot shown in Figure 2. The robot first uses the microphone array and the sound processing module to recognize the user's voice and calculates the user's angle relative to the robot so that it can turn towards the user. Guided by the human body posture recognition module and the path planning module, the robot then moves to a position a short distance in front of the user. Finally, the robot drives to the destination and places the user according to the guidance of the path planner.

3. Level 1 Convolutional Neural Network

According to the needs of the transfer and transportation nursing robot, this article defines 15 human joint points (Figure 3). The 3D human joint position prediction is divided into a two-level network [3]. In the first-level network, we use PAF (part affinity fields) to predict the 2D pose of the human body in the colour map. After calculating the joint likelihood value of each pixel, we take the coordinate with the maximum likelihood as the joint coordinate.

3.1. 2D Joint Point Detection

The PAF method has high precision and high robustness and can effectively detect the 2D posture of the human body from the RGB image. This method uses a 2D human joint detection model to generate a multichannel scoring heat map $S = \{S_1, S_2, \ldots, S_{15}\}$. We let $(u_j, v_j) = \arg\max_{(u, v)} S_j(u, v)$, where $(u, v)$ represents a pixel coordinate of the heat map and $(u_j, v_j)$ represents the coordinate of the maximum score value in the heat map $S_j$, that is, the predicted pixel coordinate of joint $j$. We thus obtain the predicted coordinates $(u_j, v_j)$, $j = 1, \ldots, 15$, of the 15 joints. To further improve the adaptability of the PAF method to the working environment of nursing robots, we use the pretrained weights provided with the PAF model as the initial weights. At the same time, we use a home environment dataset to perform transfer learning on the network [4].
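As a concrete illustration of this argmax step, the following is a minimal sketch, assuming the first-level network returns the 15 score maps as a NumPy array of shape (15, H, W); the function name and array layout are illustrative, not taken from the original implementation:

```python
import numpy as np

def extract_joint_coordinates(score_maps: np.ndarray) -> np.ndarray:
    """Return the pixel coordinates (u_j, v_j) of the maximum score in
    each of the 15 joint score maps S_j (input shape (15, H, W)).

    Sketch of the argmax step described above; the real model output
    format may differ."""
    num_joints, height, width = score_maps.shape
    coords = np.zeros((num_joints, 2), dtype=np.int64)
    for j in range(num_joints):
        # Flatten the j-th map, find the index of its maximum score, and
        # unravel it back into (row, column) = (v_j, u_j).
        v, u = np.unravel_index(np.argmax(score_maps[j]), (height, width))
        coords[j] = (u, v)
    return coords
```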

3.2. Loss Function

To predict the joint pixel positions, we define a 15-channel real heat map $S^* = \{S^*_1, \ldots, S^*_{15}\}$ during model training, where $S^*_j$ is a Gaussian image whose extreme value lies at the real pixel position of joint $j$. We use the minimum mean square error between $S$ and $S^*$ as the loss function for model optimization:
$$L_1 = \sum_{j=1}^{15} \sum_{(u, v)} \left( S_j(u, v) - S^*_j(u, v) \right)^2 .$$
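To make this training target concrete, the following sketch builds one ground-truth Gaussian heat map and evaluates the mean square error against a prediction; the Gaussian width `sigma` is an assumed hyperparameter, not a value reported in the paper:

```python
import numpy as np

def gaussian_heat_map(shape, centre, sigma=7.0):
    """Ground-truth heat map S*_j: a Gaussian image whose extreme value
    sits at the real joint pixel position `centre` = (u_j, v_j).
    `sigma` is an illustrative choice, not taken from the paper."""
    height, width = shape
    v_grid, u_grid = np.mgrid[0:height, 0:width]
    u0, v0 = centre
    return np.exp(-((u_grid - u0) ** 2 + (v_grid - v0) ** 2) / (2.0 * sigma ** 2))

def heat_map_loss(predicted, target):
    """Mean square error between predicted and real 15-channel heat maps."""
    return float(np.mean((predicted - target) ** 2))
```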

4. Joint Heat Map

This paper constructs a multichannel joint heat map with the depth-map pixel coordinates of the 15 human joint points as centres. We first calculate the mapping relationship between the colour map and the depth map to obtain the joint depth-map coordinates. The article obtains the denoised depth map $I'_d$ by median filtering:
$$I'_d(u_d, v_d) = \operatorname{med}\{\, I_d(u_d + i, v_d + j) \mid (i, j) \in T \,\},$$
where $(u_d, v_d)$ represent the pixel coordinates of the depth map, $(i, j)$ index the values in the template $T$, and $\operatorname{med}$ stands for the median operation. Considering that the pixel coordinates of the depth map can be mapped into the camera coordinate system, this paper maps each depth pixel to the colour map to obtain the pixel-coordinate registration map $I_r$ between the two images. In this way, the joint depth-map coordinates can be calculated in reverse [5]. The process is as follows.

Step 1. The article obtains the colour image $I_c$ and the depth image $I_d$ at the same time. Two channels are added to the colour image $I_c$; these channels are used to store the corresponding depth-map pixel coordinates. The coordinate of a pixel in $I_d$ is defined as $(u_d, v_d)$.

Step 2. Construct a 3-dimensional vector $p = [u_d, v_d, d]^{\mathrm{T}}$ for the pixel, where $d$ is the pixel value of $I_d$ at $(u_d, v_d)$.

Step 3. Calculate the coordinate $P_d$ of $p$ in the depth camera coordinate system through the depth camera internal parameter matrix $K_d$ and $p$.

Step 4. The article uses the space coordinate transformation matrix to transform $P_d$ into the coordinate $P_c$ in the colour camera coordinate system:
$$P_c = R_{dc} P_d + t_{dc},$$
where $R_{dc}$ and $t_{dc}$ represent the rotation matrix and translation matrix between the two cameras, respectively.

Step 5. Calculate the corresponding coordinate $(u_c, v_c)$ of $P_c$ in the colour image coordinate system through the colour camera internal parameter matrix $K_c$ and the colour camera coordinate $P_c = (x_c, y_c, z_c)$:
$$[u_c, v_c, 1]^{\mathrm{T}} = \frac{1}{z_c} K_c P_c .$$

Step 6. Repeat the above calculation for each pixel in $I_d$ in turn to obtain the coordinates of the mapped colour map. In this way, the depth-map pixel coordinates corresponding to each colour-map pixel coordinate can be obtained in reverse. Finally, we get the five-channel registration image $I_r$.

Step 7. Find the joint depth-map coordinate $(u_{d,j}, v_{d,j})$ corresponding to the joint colour-map coordinate $(u_j, v_j)$ according to $I_r$. We define the multichannel joint heat map as $H = \{H_1, \ldots, H_{15}\}$, where $j$ is the joint index [6]. The dimensions of $H_j$ are the same as the dimensions of $I_d$, and $(u, v)$ is a pixel coordinate of the heat map. We use a Gaussian function (with standard deviation $\sigma$) to calculate $H_j$ with the joint depth-map coordinate as the centre:
$$H_j(u, v) = \exp\!\left( -\frac{(u - u_{d,j})^2 + (v - v_{d,j})^2}{2\sigma^2} \right).$$
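The per-pixel mapping in Steps 2-6 amounts to a standard back-projection, extrinsic transform, and re-projection. The following sketch applies it to a single depth pixel; the variable names, the unit of `d`, and the calibration inputs are assumptions, with the camera parameters taken to come from the Kinect 2 calibration:

```python
import numpy as np

def depth_pixel_to_colour_pixel(u_d, v_d, d, K_d, K_c, R_dc, t_dc):
    """Map one depth-map pixel (u_d, v_d) with depth value d to colour-map
    pixel coordinates, following Steps 2-6 above.

    K_d, K_c   : 3x3 intrinsic matrices of the depth and colour cameras
    R_dc, t_dc : rotation (3x3) and translation (3,) from the depth camera
                 to the colour camera (assumed calibration inputs)."""
    # Step 3: back-project the pixel into the depth camera coordinate system.
    P_d = d * np.linalg.inv(K_d) @ np.array([u_d, v_d, 1.0])
    # Step 4: transform into the colour camera coordinate system.
    P_c = R_dc @ P_d + t_dc
    # Step 5: project onto the colour image plane and normalise by depth.
    uvw = K_c @ P_c
    return uvw[0] / uvw[2], uvw[1] / uvw[2]
```

Repeating this call over all depth pixels (Step 6) and inverting the resulting look-up gives the registration image $I_r$, from which the joint depth-map coordinates are read off and the Gaussian joint heat maps $H_j$ of Step 7 are built in the same way as the ground-truth heat maps sketched in Section 3.2.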

5. Level 2 Convolutional Neural Network

In the second stage, we use the depth image bound with the multichannel joint heat map as input. We further optimize the 2D joint detection results through a convolutional neural network to obtain the 3D human joint pose. The algorithm flow is shown in Figure 4.

5.1. Network Structure

Real-time human posture recognition algorithms based on 3D convolutional networks rely on a high-performance GPU (graphics processing unit), which is unsuitable for the transfer and transportation care robot systems of ordinary households [7]. Therefore, to meet the accuracy requirements of joint prediction, especially near-range joint prediction, this paper uses 2D convolution as a compromise instead of 3D convolution. Based on the VGG16 network structure and the spatial pyramid pooling (SPP) method, we design a convolutional neural network that is not limited by the size of the input image (Figure 5). We replace the final 1000-unit fully connected output layer of VGG16 with a 45-unit fully connected layer, which estimates the coordinates of the 15 joints (15 joints × 3 coordinates = 45 outputs).
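The following PyTorch sketch shows one way such a VGG16 + SPP regressor with a 45-unit output could be organised. The 16-channel input (one depth channel bound with 15 joint heat maps), the pyramid levels, and the hidden layer size are assumptions for illustration, not values taken from the paper:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class SpatialPyramidPooling(nn.Module):
    """Pool the feature map at several fixed output sizes and concatenate,
    so the following fully connected layer no longer depends on the
    spatial size of the input image."""
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.pools = nn.ModuleList(nn.AdaptiveMaxPool2d(k) for k in levels)

    def forward(self, x):
        return torch.cat([p(x).flatten(1) for p in self.pools], dim=1)

class SecondLevelNet(nn.Module):
    """Sketch of a VGG16 + SPP regressor with a 45-unit output
    (15 joints x 3 coordinates). Input channels, pyramid levels, and the
    hidden layer size are illustrative assumptions."""
    def __init__(self, in_channels=16, levels=(1, 2, 4)):
        super().__init__()
        backbone = vgg16(weights=None).features
        # Replace the first convolution so the network accepts the depth
        # map bound with the 15 joint heat maps instead of an RGB image.
        backbone[0] = nn.Conv2d(in_channels, 64, kernel_size=3, padding=1)
        self.features = backbone
        self.spp = SpatialPyramidPooling(levels)
        feat_dim = 512 * sum(k * k for k in levels)
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 45),  # 15 joints x (x, y, z)
        )

    def forward(self, x):
        return self.head(self.spp(self.features(x)))
```

Because the SPP layer produces a fixed-length vector regardless of the spatial size of the VGG16 feature map, the fully connected head no longer constrains the input image size, which is the motivation for combining VGG16 with SPP described above.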

The global coordinate system is defined to coincide with the depth camera coordinate system. The depth-map coordinates are converted to global coordinates by formula (7), from which the global coordinates of each joint, and hence the 3D pose, are obtained:
$$x_g = \frac{(u_d - c_x)\, d}{f_x}, \qquad y_g = \frac{(v_d - c_y)\, d}{f_y}, \qquad z_g = d,$$
where $f_x$, $f_y$, $c_x$, and $c_y$ represent the internal camera parameters and $(x_g, y_g, z_g)$ represent the coordinate values in the global coordinate system.

In this paper, the network is trained on the ITOP (invariant-top view) dataset, supplemented with data from the application scenario of the transfer and transportation nursing robot.

5.2. Loss Function

We use the minimum mean square error between the predicted joint global coordinates and the actual joint global coordinates as the loss function of the second-level neural network:
$$L_2 = \sum_{j=1}^{15} \left\| P_j - P^*_j \right\|_2^2 ,$$
where $j$ is the joint index, $P_j$ is the predicted joint position, and $P^*_j$ is the actual joint position.
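A direct reading of this loss as code, assuming the predicted and actual joint positions are stored as (15, 3) arrays of global coordinates:

```python
import numpy as np

def second_level_loss(predicted: np.ndarray, actual: np.ndarray) -> float:
    """Sum over the 15 joints of the squared Euclidean distance between
    the predicted and actual global joint positions (both shape (15, 3))."""
    return float(np.sum((predicted - actual) ** 2))
```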

6. Estimation of Underarm Points

Placing the robot's two arms under the user's armpits is the key to successfully picking up the user. This paper delineates an ROI (region of interest) in the depth image based on the given depth-image joint coordinates [8]. Then, we perform threshold segmentation in the ROI to obtain the underarm area and determine the target point where the robot arm is placed.

6.1. Pretreatment

We use the left (right) shoulder joint, the left (right) elbow joint, and the midpoint of the left and right hip joints as vertices to delineate two triangular ROIs. To ensure that the armpit area obtained by image segmentation does not contain the human body, we map the grey pixel values in the human body area to 1/3 of the original range [9]. We calculate the maximum depth and minimum depth among the four joint positions (left (right) shoulder joint, left (right) elbow joint, left hip joint, and right hip joint) in each ROI in turn. According to the pixel value mapping function shown in formula (9), the ROI is converted to grayscale.

6.2. Estimation of Axillary Points

We first perform 5 × 5 median filter denoising on the depth map. Considering that the background and foreground depth values vary with the scene, this paper uses the Otsu segmentation algorithm, based on maximizing the between-class variance, to obtain the mask image $M$ of the armpit area [10]. We then calculate the pixel coordinate $(u_a, v_a)$ of the armpit centre point by the following formula:
$$u_a = \frac{1}{|M|} \sum_{(u, v) \in M} u, \qquad v_a = \frac{1}{|M|} \sum_{(u, v) \in M} v,$$
where $|M|$ is the number of pixels in the armpit mask.
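A minimal OpenCV sketch of this segmentation and centre computation follows, assuming the ROI has already been converted to an 8-bit grayscale image as in Section 6.1; whether the armpit region falls on the bright or dark side of the Otsu threshold depends on that grayscale mapping, so the polarity below is illustrative:

```python
import cv2
import numpy as np

def estimate_armpit_point(roi_gray: np.ndarray):
    """Segment the armpit area in an 8-bit grayscale ROI with Otsu's
    method (maximum between-class variance) and return the pixel
    coordinates of its centre, or None if nothing is segmented."""
    # 5 x 5 median filtering to suppress depth noise.
    roi = cv2.medianBlur(roi_gray, 5)
    # Otsu's threshold chooses the cut that maximises between-class variance.
    _, mask = cv2.threshold(roi, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    # Centre of the segmented region: mean of its pixel coordinates.
    return float(xs.mean()), float(ys.mean())
```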

7. Experiments and Results

The PC configuration used in this experiment is an Intel Core i7-6700HQ CPU, 8 GB of RAM, and a 4 GB NVIDIA GeForce GTX 950M GPU. The colour map and depth map are captured with a Microsoft Kinect 2 sensor. For the group with lower-limb mobility impairments, we selected 1,100 human body images in the home environment as the test dataset.

7.1. Real-Time Evaluation

The calculation speeds of commonly used pose recognition methods are shown in Table 1. The comparison shows that a single 3D joint position estimation with the 3D convolution method in [10] takes more than 2 s, which cannot meet the real-time requirements of the transfer and transportation nursing robot; it is therefore excluded from the follow-up experiments of this article. The Kinect SDK v2.0 method can complete about 30 joint predictions per second and has the best real-time performance. A single calculation with the method in [9] takes about the same time as the method in this paper, roughly 200 ms, which demonstrates the good real-time performance of the method in this paper.

7.2. Average Accuracy Evaluation

We use the 10 cm rule as the evaluation criterion for joint prediction: a joint prediction is counted as correct when its error does not exceed 10 cm. The article adopts the average accuracy of joint prediction (mAP) to measure the accuracy of each method. Under the condition that the distance between the user and the camera is 550 mm–3500 mm, we collect 600 frames of actual working-scene data of the transfer and transportation nursing robot. The results of our evaluation of Kinect SDK v2.0, the method of Fujisawa et al. [9], and this paper's method are shown in Figure 6.
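For reference, the mAP computation under the 10 cm rule can be read as follows, assuming predictions and ground truth are stored as (N, 15, 3) arrays of global joint coordinates in millimetres; the layout is an assumption for illustration:

```python
import numpy as np

def joint_map_10cm(predicted: np.ndarray, ground_truth: np.ndarray) -> float:
    """Average accuracy of joint prediction (mAP) under the 10 cm rule:
    a joint is counted as correct when its Euclidean error is at most
    100 mm. Inputs have shape (N, 15, 3), in millimetres."""
    errors = np.linalg.norm(predicted - ground_truth, axis=-1)   # (N, 15)
    per_joint_accuracy = (errors <= 100.0).mean(axis=0)          # (15,)
    return float(per_joint_accuracy.mean())
```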

It can be seen from Figure 6 that, for every human joint, the average estimation accuracy of the Kinect SDK v2.0 method is the lowest. The direct mapping method used in [9] has good environmental adaptability and high accuracy. However, because the joint estimation in the colour map itself has certain errors, this method easily fails on relatively slender parts such as the hands, elbows, and knees. In addition, the method is strongly affected by invalid values and noise in the depth map, which reduces its accuracy. Compared with the above methods, the method in this paper has the highest accuracy, with an average joint accuracy of 91.5%.

7.3. Accuracy Evaluation at Close Range

We use the transfer and transportation care robot developed by our team to collect 350 frames of working-environment data and evaluate the close-range robustness of the method in this paper under short-distance conditions with an operating range of 550 mm–800 mm. This article evaluates each method's near-distance joint estimation only on the coordinate predictions of the head, neck, left and right hips, left and right elbows, and left and right shoulders.

It can be seen from Figure 7 that all three methods reach their highest estimation accuracy on the head. The Kinect SDK v2.0 and [9] methods have the lowest estimation accuracy at the hip joints, while the method in this paper has the lowest estimation accuracy at the elbow joints. Overall, the estimation accuracy of this method is the highest of the three methods for every joint, and its average joint accuracy reaches 90.3%. When the distance is between 550 mm and 800 mm, the human body information collected by the depth camera is incomplete. Compared with the average accuracy evaluation experiment, the accuracy of Kinect SDK v2.0 drops sharply in the close-range experiment: the algorithm misjudges joint positions and even identifies joints that do not exist in the image, which is unacceptable for the transfer and transportation nursing robot.

On the other hand, the depth camera based on the ToF (time of flight) principle is more sensitive to the wrinkles and colours of clothes under close-range conditions. It therefore collects invalid values more easily, which increases the probability that the direct mapping method [9] maps invalid values and thus raises the recognition error rate. In comparison, the two-stage network method proposed in this paper has better short-range adaptability.

7.4. Evaluation of the Accuracy of the Axillary Point Prediction

This article defines the 3 cm non-background rule as the basis for evaluating the accuracy of the axillary points: a prediction is regarded as accurate when the predicted axillary point is not located on the human body and its distance from the real axillary point in the X-Y plane is not more than 3 cm. Before calculating the user's armpit position, the transfer and transportation care robot moves in front of the human body so that it faces the body directly. We collect 150 frontal human body images at a distance of 550 mm–800 mm. On this basis, the experimental results of evaluating the method in this paper are shown in Figure 7.
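As a sketch, the 3 cm non-background rule can be checked per prediction as follows; the `on_body` flag is assumed to come from the body segmentation mask and the coordinates are assumed to be in centimetres in the X-Y plane, both of which are illustrative choices:

```python
import numpy as np

def axilla_prediction_correct(pred_xy, true_xy, on_body: bool) -> bool:
    """3 cm non-background rule: the predicted axillary point must not lie
    on the human body and must be within 3 cm of the real axillary point
    in the X-Y plane (coordinates in centimetres)."""
    distance = float(np.hypot(pred_xy[0] - true_xy[0], pred_xy[1] - true_xy[1]))
    return (not on_body) and distance <= 3.0
```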

After calculation, the accuracy of the method in this paper for predicting the axillary points reaches 91.3%. This is because, in transfer and transportation care applications, most of the operating objects are elderly people with mobility impairments; the human body is mostly in a passive state without many complicated movements, and the body posture is relatively simple. In addition, the method in this paper obtains the ROI from human posture detection, so the prediction accuracy of the axillary points is directly related to the recognition of the left (right) shoulder joint, left (right) elbow joint, and left (right) hip joint coordinates. Therefore, when the camera is too close to capture complete joint information and the joint positions are recognized incorrectly, the armpit point estimation cannot be completed reliably.

8. Conclusion

This paper designs a 3D human pose estimation system for home care robots. The system uses Kinect 2 as the RGB-D data acquisition device and infers human joint positions starting from 2D human pose estimation on the colour map. Through image processing technology, the prediction of the position of the user's axillary points is further completed. We compare this method with other pose prediction methods to verify its advantages and to illustrate the applicability of artificial intelligence assistive technology in professional hospital nursing [11, 12].

Data Availability

The human body image data used to support the findings of this study are restricted by the Ethical Care Committee of Qiqihar Medical University in order to protect patient privacy. Data are available from Zhangbo Xiao (e-mail address: [email protected]) for researchers who meet the criteria for access to confidential data.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the 2021 Science and Technology Projects of Basic Scientific Research Business Expenses of Provincial Universities in Heilongjiang Province (Grant 2021-KYYWF-0031), “Research on Clinical Nursing of Elderly Disabled Patients Based on Artificial Intelligence Assisted Technology.” The authors would also like to thank Chuan Zuo, Sitong Ge, and Hezhang Sha for their assistance in this study.