Truck-lifting accidents are common in container-lifting operations. Previously, operation sites needed to assign workers for observation and guidance. However, with the development of automated equipment in container terminals, an automated accident detection method is required to replace manual observation. Considering the progress of vision-based detection and tracking algorithms, this study designed a vision-based truck-lifting prevention system. The system uses cameras to detect and track the movement of the truck wheel hubs during the operation to determine whether the truck chassis is being lifted. The hardware of the system is easy to install and is compatible with most container-lifting equipment. The accident detection algorithm combines convolutional neural network detection, traditional image processing, and a multitarget tracking algorithm to calculate the displacement and posture of the truck during the operation. Experiments show that the measurement accuracy of the system reaches 52 mm and that it can effectively distinguish the trajectories of different wheel hubs, meeting the requirements for detecting lifting accidents.

1. Introduction

Container terminals are facilities that provide storage and distribution services for container transportation. With the sustained growth of global maritime trade, the development focus of container terminals has moved to automation and unmanned operations. Terminals with a high level of automation are called automated container terminals (ACTs). The advantages of ACTs are obvious: they use automated equipment to replace on-site workers, which improves operation efficiency, reduces operating costs, and improves worker safety [1].

In the terminal operation process, containers need to be transferred between the various storage areas by transfer equipment. These transfer operations, called container-lifting operations, are performed by container-lifting equipment such as rail-mounted gantry cranes (RMGs) [2]. A truck-lifting accident occurs during a container-lifting operation when the container lock pins are not released and the truck is lifted together with the container, as shown in Figure 1; such accidents can damage the container and the truck and endanger on-site workers.

In traditional container terminals, the container-lifting operation requires on-site workers to confirm whether the lock pins are fully released. However, an ACT requires a reduction in the number of on-site workers, so an automated accident detection method is needed to prevent accidents.

Truck-lifting prevention can be considered a target detection and tracking problem: the system must detect and recognize features of the truck and then use them to calculate the displacement of the truck during the operation. Existing solutions for truck-lifting prevention are based on laser scanners, such as the laser-radar-based truck chassis positioning technology proposed by Chao-feng [3]. The laser scanner scans the contour of the target and restores it to a 3D model; by analyzing the geometry of the model, the system calculates the size and position of the targets [4]. This technology has high detection accuracy and is not affected by weather or light conditions. However, it relies on a high-precision laser scanner, which is expensive [5].

With the development of image sensors and computer vision algorithms, vision-based measurement (VBM) technology has become more widespread in recent years. This technology requires only a camera and an image processing device, which makes its hardware cost much lower than that of the laser scanner solution. VBM technology has been widely used in industrial measurement; one of its typical applications is automated inspection for product quality control [6]. In container terminals, vision-based detection technology is used in many applications [7], particularly for recognizing complex features such as container numbers [8] and container corner castings [9].

In addition to lower equipment cost, vision-based detection technology has two further advantages. First, it can achieve high measurement accuracy through noncontact measurement [10] because it uses high-resolution CMOS or CCD cameras to obtain image information. Second, it can recognize complex features, an ability that stems from convolutional neural network (CNN) technology [11].

CNNs can recognize and classify complex features in images, such as face features [12] and tumors [13]. Different recognition CNNs are also straightforward to train. Compared with earlier classifiers (such as SVMs), CNNs achieve higher detection rates, higher detection accuracy, and shorter calculation times [14, 15].

Nevertheless, the detection accuracy of CNNs is not perfect. The detection result of a CNN is the area most likely to contain the target, and there is generally some deviation between the detected area and the actual target position. Traditional image processing, by contrast, has pixel-level accuracy and can achieve higher precision provided the detection succeeds.

Vision-based target tracking technology has been used in several applications, such as ship recognition and tracking based on video information [16, 17] and vehicle tracking based on aerial videos [18]. These technologies are usually based on detection-based tracking (DBT) [19], mainly because of the excellent target detection ability of CNNs. The tracking principle of DBT is to use a CNN to detect the target in each image and then use an association algorithm to link the same target across frames [20]. This approach achieves good tracking results, shifting the main problem of vision tracking from detection to association.

This study proposes a truck-lifting prevention system based on vision-based detection and tracking algorithms to provide a low-cost, easy-to-retrofit automated accident detection system for container-lifting operations. The system combines CNN detection with traditional image processing algorithms for target detection and uses a DBT multitarget tracking algorithm. It calculates the displacement of the truck wheel hubs and determines whether an accident has occurred. Because it uses cameras to capture operation information, the system supports real-time remote monitoring; moreover, it can switch to manual monitoring when the accident detection algorithm fails, a function that laser scanner solutions cannot provide.

2. System Design and Control Principle

This system uses cameras as information capture devices, which makes it suitable for installation on most container-lifting equipment. Figure 2 shows the installation on a rail-mounted gantry crane (RMG), a typical piece of container-lifting equipment in a container terminal, and Figure 3 shows the actual installation of the cameras. At the operation site, container trucks travel only on the truck road; therefore, the cameras were installed on the RMG leg to capture image information of the trucks. Because a container truck has a long chassis, two sets of cameras were installed on the RMG leg to cover the whole area.

The lifting prevention process is shown in Figure 4. When the operation starts, the cameras capture images of the side of the truck and send them to the image processing unit (IPU) for calculation. In the IPU, the wheel hubs of the truck are detected first, and then the movement trajectory of each wheel hub during the operation is tracked to determine whether the truck has been lifted. When an accident is detected, the IPU sends the accident information to the automated crane control system (ACCS), which stops the container spreader by controlling the programmable logic controller (PLC).
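The process above can be sketched as a simple polling loop. All interfaces below (camera, IPU, ACCS) and the threshold name are hypothetical placeholders, not the actual crane control API.

```python
# Sketch of the lifting-prevention loop described above. The camera,
# IPU, and ACCS interfaces are hypothetical placeholders.
def monitor(camera, ipu, accs, lift_threshold_px):
    while accs.operation_active():
        frame = camera.read()
        hubs = ipu.detect_and_track(frame)   # wheel-hub trajectories so far
        if any(h.vertical_displacement > lift_threshold_px for h in hubs):
            accs.stop_spreader()             # halt lifting via the PLC
            return True                      # accident detected
    return False                             # operation finished normally
```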

The reliability of the equipment was also considered. In the traditional operation process, the container-lifting operation is guided by on-site workers. We did not install a backup system because it would add cost and complicate the communication systems; when the system fails, falling back to the traditional method is considered acceptable. Because the cameras are installed at a low position to capture images of the wheels, their lenses can easily be wiped clean when stained.

3. Truck-Lifting Detection Algorithm

There are several types of container trucks, so it is difficult to directly recognize the truck chassis and measure its displacement. Truck tires, however, have standard specifications; because tires deform under normal loading conditions, we calculated the displacement of the truck chassis from the image coordinates of the wheel hubs instead. Because the work site is an open-air environment, the light conditions are unstable, and the color and contamination conditions of different trucks also vary. We therefore first used neural network detection to obtain a high target recognition rate, then used traditional image algorithms to improve the detection accuracy, and finally used a Deep-Sort-based tracking algorithm to distinguish and track the different wheel hubs.

3.1. First-Stage Wheel Hub Detection Based on the Modified SSD

The SSD (Single Shot MultiBox Detector) [21] is a feedforward convolutional network. It uses anchor boxes with different aspect ratios and sizes to sample the image, and several feature layers with different receptive fields extract and classify features. Owing to this design, the SSD has a higher detection speed than two-stage methods such as the Fast region-based convolutional network (Fast R-CNN) [22], making it suitable for real-time detection.

To achieve the best detection performance, we made some modifications to the SSD network. The original SSD uses VGG-16 [23] as its convolutional backbone; we replaced it with ResNet [24], a newer CNN model that uses a deeper network to extract more feature information. The structure of the modified SSD model is illustrated in Figure 5.

3.2. The Second Detection Stage Based on Traditional Image Processing

The result of SSD detection is not the target itself but the area that most probably contains the target, so the detection results usually have a positioning error relative to the actual target position. Traditional image processing algorithms have pixel accuracy, but applied to an entire image they take too long to meet real-time requirements. Therefore, after the SSD detection, we perform a second wheel hub detection based on traditional image processing to improve the detection accuracy.

A flowchart of the second detection is shown in Figure 6. The input data are the wheel hub image regions detected by the SSD. The result of the SSD detection is defined in (1), where $(x_1, y_1)$ are the center coordinates of the detection result and $s_1$ and $r_1$ are its size and aspect ratio, respectively:

$$D_1 = (x_1, y_1, s_1, r_1). \tag{1}$$

The first part is the preprocessing operation. We used the single-scale Retinex (SSR) algorithm to enhance the information in the dark areas of the image because the operation site is open air and the light conditions are unstable. SSR was proposed by Jobson et al. [25] and is based on Land's Retinex theory [26]. This enhancement algorithm convolves the image with a Gaussian wrap function:

$$R_i(x, y) = \log I_i(x, y) - \log\left[F(x, y) * I_i(x, y)\right], \tag{2}$$

where $I_i(x, y)$ is the original value of point $(x, y)$ on color channel $i$, $R_i(x, y)$ is the enhanced value, and $F(x, y)$ is the Gaussian wrap function given in (3). The scale value $c$ determines the neighborhood size of $(x, y)$ in the convolution, and the scale parameter $K$ must make (4) hold. The enhanced image is the merged result of the individual color channels:

$$F(x, y) = K e^{-(x^2 + y^2)/c^2}, \tag{3}$$

$$\iint F(x, y)\, dx\, dy = 1. \tag{4}$$

Next, we used an adaptive HSV threshold to filter out the wheel hub area in the image. The HSV color space separates colors by hue H, saturation S, and brightness value V. Since the wheel hub area is usually the brightest part of the image, it can be extracted by filtering out the lower-brightness pixels. The HSV thresholding is given in (5), where $V(x, y)$ is the value of pixel $(x, y)$ in the HSV V channel and $V'(x, y)$ is the new pixel value. The threshold $T$ is calculated in (6) from the average pixel value $\bar{V}$ of the image and an adjustment value $\Delta$:

$$V'(x, y) = \begin{cases} V(x, y), & V(x, y) \ge T, \\ 0, & V(x, y) < T, \end{cases} \tag{5}$$

$$T = \bar{V} + \Delta. \tag{6}$$

The preprocessed image is shown in Figure 6(c); most of the noise pixels have been removed by the HSV thresholding.

The second part detects the contours in the preprocessed image and finds the largest circle among them by Hough circle detection. Owing to the light conditions and lens distortion, wheel hub images near the edge of the frame exhibit some deformation and defects. We used the adaptive method shown in (7) to adjust the threshold of the Hough circle detection accumulator so that the detection can still find the largest circles even when their shapes are imperfect. In (7), $W$ is the horizontal resolution of the image, $T_o$ and $T_n$ respectively represent the original and adjusted thresholds of the accumulator, and $k$ is the adjustment ratio:

The second detection result is defined as $D_2$, as shown in (8), where $(x_2, y_2)$ and $r_2$ are the center and radius of the detected circle. Because the second detection is unstable, a large deviation between the second and first detection results should be treated as a failure of the second detection. Therefore, the final detection result $D$ is re-evaluated using (9), where $E_{\max}$ is the maximum error of the first detection at 95% confidence, calculated by normal fitting:

$$D_2 = (x_2, y_2, r_2), \tag{8}$$

$$D = \begin{cases} D_2, & \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \le E_{\max}, \\ D_1, & \text{otherwise.} \end{cases} \tag{9}$$
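The re-evaluation in (9) amounts to a simple distance gate on the two detection results; a minimal sketch (with hypothetical tuple layouts for the detections):

```python
import math

def fuse_detections(d1, d2, e_max):
    """Accept the second (Hough) detection only if its center lies
    within the first detector's 95% error bound e_max of the SSD
    result; otherwise fall back to the SSD result."""
    x1, y1 = d1[0], d1[1]
    x2, y2 = d2[0], d2[1]
    if math.hypot(x2 - x1, y2 - y1) <= e_max:
        return d2      # second detection refines the first
    return d1          # second detection judged a failure
```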

3.3. Trajectory Tracking Based on the Modified Deep Sort

Deep Sort [27] is an online multiobject tracking algorithm proposed by Wojke et al. in 2017. As a DBT algorithm, Deep Sort tracks based on detection result data, making it suitable for combination with CNN detection or traditional image processing detection. We modified the tracking process of Deep Sort to improve the tracking speed; the new tracking process is shown in Figure 7.

Deep Sort uses the state vector shown in (10) as the description model of the targets, where $u$ and $v$ are the center coordinates of the target detection result, $\gamma$ and $h$ represent its aspect ratio and height, and $(\dot{u}, \dot{v}, \dot{\gamma}, \dot{h})$ are the corresponding velocities, which are used to predict the target position in the next frame by a Kalman filter, an algorithm that uses a series of measurements observed over time to produce estimates of unknown variables:

$$\mathbf{x} = (u, v, \gamma, h, \dot{u}, \dot{v}, \dot{\gamma}, \dot{h})^{T}. \tag{10}$$

The predicted result is used to match the detection results in the next frame. The matching algorithm is based on the Kuhn–Munkres algorithm, which uses the IOU value between the prediction and detection results as the weight to classify different tracking targets. The IOU is calculated in (11), where Dete is the detection result and Pred is the prediction result; the detection result closest to the prediction result is classified as the same target:

$$\mathrm{IOU} = \frac{|\mathrm{Dete} \cap \mathrm{Pred}|}{|\mathrm{Dete} \cup \mathrm{Pred}|}. \tag{11}$$
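The IOU-weighted assignment can be sketched with SciPy's Hungarian (Kuhn–Munkres) solver; the box layout (x1, y1, x2, y2) is an assumed convention.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IOU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def associate(preds, dets):
    """Kuhn-Munkres assignment minimizing total (1 - IOU) cost between
    predicted and detected boxes; returns (pred_idx, det_idx) pairs."""
    cost = np.array([[1.0 - iou(p, d) for d in dets] for p in preds])
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))
```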

To solve the problem of target loss when the target passes behind obstacles, the original Deep Sort uses the Mahalanobis distance and a convolutional appearance descriptor of the target to match detection results to existing trajectories. However, the calculation of the convolutional descriptor takes considerable time, making Deep Sort much slower than Sort [28]. Therefore, we used only the Mahalanobis distance as the standard for trajectory matching.

The Mahalanobis distance is calculated in (12), where $d^{(1)}(i, j)$ is the motion matching value between trajectory $i$ and detection result $j$, $y_i$ and $S_i$ are the projected mean and covariance matrix of trajectory $i$ in the observation space of this frame, obtained from the Kalman filter, and $d_j$ is detection result $j$:

$$d^{(1)}(i, j) = (d_j - y_i)^{T} S_i^{-1} (d_j - y_i). \tag{12}$$
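The gating in (12) and (13) can be sketched as follows; the bound 9.4877 is the 0.95 quantile of the chi-square distribution with 4 degrees of freedom, the value used in the original Deep Sort for its 4-dimensional measurement space.

```python
import numpy as np

# 0.95 quantile of the chi-square distribution with 4 degrees of
# freedom (the measurement dimension used by Deep Sort).
CHI2_GATE = 9.4877

def mahalanobis_gate(y, S, d, gate=CHI2_GATE):
    """Squared Mahalanobis distance between a track's projected mean y
    (covariance S) and a detection d, plus the chi-square gate test."""
    diff = np.asarray(d, float) - np.asarray(y, float)
    dist = float(diff @ np.linalg.inv(S) @ diff)
    return dist, dist <= gate
```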

Because the motion of the target is continuous, the Mahalanobis distance can be used to screen the detection results, as shown in (13). The threshold $t^{(1)}$ is the 0.95 quantile of the corresponding chi-square distribution; when $d^{(1)}(i, j)$ is below the threshold, trajectory $i$ is associated with detection result $j$:

$$b_{i, j} = \mathbb{1}\left[d^{(1)}(i, j) \le t^{(1)}\right]. \tag{13}$$


4. Experiment

The core of this lifting prevention system is the wheel hub detection and tracking method. We used a typical industrial computer configuration to verify our method; its specifications are as follows: CPU: Intel i7-6700; GPU: Nvidia GeForce GTX 970 (4 GB).

The first detection algorithm was implemented in PyTorch [29], and the second detection and tracking algorithms were implemented using OpenCV [30] in a Python environment. The images used in the experiment were captured by a camera installed as shown in Figure 2; the image resolution was 1920 × 1080 at 24 fps.

4.1. Evaluation of Wheel Hub Detection

The modified SSD was trained on 3000 images of the side of container trucks. The trucks in these images were driving on the truck road next to the RMG, and the distance between the camera and the trucks was approximately 4–6 m. The first detection result is presented in Figure 8.

The performance evaluation of the first detection used 500 test images, and the evaluation of the second detection used 500 images containing only the tire region. The test results are listed in Table 1. The horizontal error is the distance between the detection result and the center of the wheel hub along the direction of the truck road, and the vertical error is the corresponding distance in the vertical direction; both error values are the 95% confidence values after normal fitting. The actual distance was estimated with reference to the pixel size of the wheel hub.

4.2. Evaluation of Wheel Hub Tracking

The tracking algorithm was evaluated on several videos of trucks passing through the camera area at normal speed and several videos of trucks during container-lifting operations. The former were used to test tracking of the horizontal displacement of the truck, and the latter were used to test the vertical displacement. These videos were recorded under normal light conditions during the daytime and at night; the tracking results are shown in Figures 9 and 10.

Table 2 lists the performance of the target tracking algorithm. The tracking error is defined as the distance between the detection result and the prediction result, and it is the maximum error at the 95% confidence level after normal fitting.

4.3. Discussion

The experimental results showed a detection error of 6.31 pixels (approximately 52 mm in the experimental environment), and the total tracking rate (including detection time) reached 10 fps with an average of 2.5 tires per image. Because the maximum vertical displacement in a container-lifting operation is approximately 100 mm, the detection accuracy of this system meets the requirements of truck-lifting prevention. However, the experimental results also revealed some issues.

In the detection experiment, certain detection failures were observed, concentrated in the second detection stage. These failures were caused by defaced tires and low-light environments, which obscure the details of the wheel hub. In this study, we processed pixel values in the HSV space to mitigate this problem, but the experimental results show that this is not sufficient.

When a tire appeared at the edge of the image, the error of the second detection increased. This is because the camera lenses exhibit some distortion, which deforms the edge regions of the image.

5. Conclusion

To solve the problem of automated accident prevention in container-lifting operations, this study designed a vision-based truck-lifting prevention system that calculates the displacement of the truck wheel hubs to determine whether the truck is lifted. The experiments showed that the detection accuracy of the system reaches 6.31 pixels and the average processing rate is 10 fps, which is sufficient to detect truck-lifting accidents in time.

However, certain limitations were also observed. An algorithm to extract contour characteristics from images of defaced tires in low-light environments should be explored. Considering that convolutional neural networks are relatively insensitive to defacement and lighting conditions, it may be possible to use convolution operations to extract detailed information from the image and thus avoid the interference of light and defacement. However, the more complex calculations would increase the computation time and reduce the efficiency of the system; this trade-off needs to be resolved.

Data Availability

The experiment data used to support the findings of this study have been deposited in the Google Drive repository (https://drive.google.com/file/d/1mqZrmlnOMwxeLsM9pBxItZ_jMj4qsRrV/view?usp=sharing).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the Science and Technology Commission of Shanghai Municipality (no. 202H1101900) and China (Shanghai) Pilot Free Trade Zone Lin-gang Special Area Administration (no. SH-LG-GK-2020-21).