Abstract

The robustness and stability of lane detection are vital for advanced driver assistance systems and even autonomous driving technology. To meet the challenges of real-time lane detection in complex traffic scenes, a simple but robust multilane detection method is proposed in this paper. The proposed method breaks the lane detection task into two stages: a lane line detection algorithm based on instance segmentation and a lane modeling algorithm based on an adaptive perspective transformation. First, the instance segmentation task is decomposed into two subtasks, and a multitask network based on MobileNet is designed with two branches: a lane line semantic segmentation branch and a lane line Id embedding branch. The semantic segmentation branch obtains the segmentation of lane pixels and reconstructs the lane line binary image. The Id embedding branch determines which pixels belong to the same lane line, so that pixels of different lane lines can be separated into different instances by clustering. Second, an adaptive perspective transformation model is adopted, in which motion information is used to accurately convert the original image into a bird's-eye view; a least-squares second-order polynomial is then fitted to the lane line pixels. Finally, experiments on the CULane dataset show that the proposed method achieves similar or better performance than several state-of-the-art methods: its F1 score on the normal test set and on most of the challenge test sets is higher than that of the compared algorithms, which verifies the effectiveness of the proposed method. Field experiments further show that the proposed method has good practical value in various complex traffic scenes.

1. Introduction

Vehicle and road safety has long been a key issue for communities and governments [1]. With emerging technologies and knowledge, advanced driver assistance systems (ADAS) have been developed to reduce road accidents and improve vehicle safety [2]. In ADAS and even autonomous driving vehicles, the main technical bottleneck is perception, which has two elements: road and lane perception and obstacle detection [3].

The robustness and stability of lane detection are vital for advanced driver assistance technology and even unmanned driving technology [4]. Firstly, lane detection and tracking aid in localizing the ego-vehicle, which is one of the first and most fundamental steps of most ADAS functions, such as lane departure warning (LDW) and lane change assistance. Furthermore, lane detection can also support other ADAS modules such as vehicle detection and driver intention perception.

A large body of lane detection research has been produced both domestically and internationally. Current lane detection methods usually contain two main steps: (1) lane feature extraction and (2) lane modeling [3, 5, 6]. Generally, lane detection methods can be divided into three categories: (1) traditional lane detection methods, (2) image processing methods combined with deep learning, and (3) end-to-end lane line detection methods.

The traditional lane line detection method can be divided into three steps [7, 8]. Firstly, the road image is preprocessed to remove noise and extract lane line features; then, lane lines are detected from the preprocessed image by feature-based or model-based methods; finally, the detection results are fitted to convert the lane lines from image coordinates into world coordinates. Aly [9] proposed a real-time and robust model-based detection method for urban lane lines. Inverse perspective mapping (IPM) is used to convert the front view into a top view, selective oriented Gaussian filters are then applied to detect line candidates and reject spurious lines, and finally the detected lane lines are mapped back into the front view. Bounini et al. [10] proposed a feature-based lane line detection method in a virtual simulation environment. In the initialization stage, the method integrates the Hough transform, Canny edge detection, and a Kalman filter to greatly reduce the region of interest and predict the lane line position in subsequent frames. Overall, feature-based methods are computationally efficient but lack robustness. Compared with feature-based methods, model-based methods are more stable, but they are more difficult to implement and computationally expensive.

Lane line detection methods combined with deep learning [11] can extract features in various complex environments, which breaks through the limitations of traditional image processing methods. Amayo et al. [12] designed a target classification method based on extracted lane line geometric elements. Firstly, a weakly supervised neural network is used to extract lane line elements under different conditions; then, a global energy optimization method is used to retrieve and cluster the lane geometry, and the corresponding semantic categories are assigned; finally, tracking over consecutive frames is optimized to improve the stability of the classification results. Song et al. [13] designed a lane line detection and classification system based on 3D vision. Firstly, lane line candidates are extracted from the ROI by a Sobel filter; then, an adaptive lane line model is designed in Hough space to further extract lane lines; finally, a convolutional neural network produces the specific classification results.

Due to the rapid development of deep learning [14–16], researchers increasingly prefer end-to-end methods for computer vision problems [17]. As shown in Figure 1, the original image is fed into the model, and the desired result is obtained directly by the end-to-end method. End-to-end deep learning lane line detection can increase recognition accuracy in complex environments and simplify the transformation between the pixel coordinate system and the world coordinate system, thereby increasing the robustness of lane line detection. Li et al. [18] proposed an end-to-end system called Line-CNN, which uses straight-line anchors to locate the precise positions of lane lines; in that work, the semantic information of the whole lane line is considered at the global level. Van Gansbeke et al. [19] presented a method that casts lane line detection as a weighted least-squares problem with two parts: a deep convolutional network predicts a weight for each pixel of each lane line, and a differentiable least-squares fitting module regresses the parameters of the best-fitting lane curve. Generally, end-to-end methods require neither image preprocessing nor manual feature extraction, and their experimental results show clear advantages in accuracy and robustness.

Although many novel end-to-end methods have been proposed for lane detection and have achieved very good performance, they are usually implemented on high-performance PCs or embedded systems, where a complicated algorithm and large storage space are acceptable. However, automotive companies are very sensitive to hardware cost, so they prefer lane detection methods that can run on low-cost, resource-limited platforms. Meanwhile, the complexity and uncertainty of road conditions mean that existing methods still leave room for improvement. The challenges include shadows cast by trees, vehicles, and buildings; invisible or defective lane markings; highlights; unusual lane shapes; and poor-quality lines.

To meet the challenges of real-time lane detection in complex traffic scenes, a simple but robust multilane detection method is proposed for resource-limited automotive embedded platforms in this paper. The proposed method breaks the lane detection task into two stages: a lane line detection algorithm based on instance segmentation and a lane modeling algorithm based on an adaptive perspective transformation model. The main contributions can be listed as follows:

(1) Multilane line detection is treated as an instance segmentation problem, and a deep learning network is designed to handle an uncertain number of lanes and lane changes. The lane detection problem is decomposed into two tasks, a lane line semantic segmentation branch and a lane line Id embedding branch, and a multitask network based on MobileNet [20] is designed to shorten the forward propagation time without loss of detection accuracy. In MobileNet, a convolution method called depthwise separable convolution replaces traditional convolution to reduce the number of network weight parameters. By sharing the first four encoding stages between the two tasks, both the calculation speed and the accuracy of lane line segmentation are improved. Meanwhile, by exploiting the semantic characteristics of the encoding and decoding processes, the corresponding encoded information is fused during decoding to further improve segmentation accuracy.

(2) An adaptive perspective transformation model is designed to overcome the limitation that the conventional perspective transformation model is only suitable for smooth roads: when the vehicle drives on an uneven road or the pitch angle of the camera changes due to turbulence, the image is distorted. The adaptive model removes the fixed-parameter transformation matrix of the traditional model and realizes accurate conversion from the camera image to the bird's-eye view while the camera is in motion, thereby improving the robustness of lane line fitting.

The remainder of the paper proceeds as follows. Section 2 gives an overview of the proposed multilane line detection method. Section 3 presents the lane line detection algorithm based on instance segmentation. Section 4 is devoted to the adaptive perspective transformation model. Section 5 shows the experimental results and analysis. Section 6 gives the conclusion.

2. Overview of the Proposed Multilane Line Detection Method

In this paper, the multilane line detection is considered as an instance segmentation problem. Each lane line is formed into a separate category, and then each lane line is fitted. In order to increase the running speed, improve the detection accuracy, and meet the requirements of real vehicle applications, the lane detection problem is decomposed into two tasks, and a multitask network based on MobileNet is designed for lane line segmentation. The multitask network includes two branches: lane line semantic segmentation branch and lane line Id embedding branch. The lane line semantic segmentation branch is mainly used to obtain the segmentation results of lane pixels and reconstruct the lane line binary image. The lane line Id embedding branch mainly determines which pixels belong to the same lane line, thereby classifying different lane lines into different categories and then clustering these different categories.

After the lane line instance segmentation is completed, the pixels belonging to each lane line need to be fitted with a parametric curve. Curve fitting refers to finding an appropriate function that fits a finite set of data points. Commonly used curve fitting models include Bayesian fitting [21], B-spline curve fitting [22], and least-squares curve fitting [23]. Generally, to improve fitting accuracy, a fixed perspective transformation model [9, 24] is commonly used to convert the image into a bird's-eye view before curve fitting. However, this approach is only suitable for smooth roads: when the road is uneven or the vehicle bumps, the image moves and is distorted, resulting in large errors in the bird's-eye view. To solve this problem, an adaptive perspective transformation model is designed. In this model, the motion information is used to accurately convert the original image into a bird's-eye view image, and a least-squares second-order polynomial is then fitted to the lane line pixels. Compared with the traditional fixed perspective transformation, this model is more robust. The overall lane line detection process is shown in Figure 2.

3. Lane Line Detection Algorithm Based on Instance Segmentation

In this paper, a MobileNet-based network is designed and trained as an instance segmentation model for lane line detection. Its advantage is that it can handle an uncertain number of lanes and lane changes. The instance segmentation is composed of two branches: (1) lane semantic segmentation, which is used to obtain the lane line binary image; (2) lane line Id allocation, which is used to determine the pixels that belong to the same lane line.

To improve the calculation speed and the accuracy of lane line segmentation, the two branches of the instance segmentation network share the first four encoding stages. To ensure the real-time performance of the multilane line detection algorithm, the input image is resized to a fixed 800 × 288 pixel format for network learning. The encoding-decoding structure of the network is shown in Figure 3.

To meet the cost requirements of in-vehicle computing, MobileNet is adopted as the encoder to shorten the forward propagation time without loss of detection accuracy. In the encoding stages, the semantic feature information is weaker but the lane line positions are accurate, whereas in the decoding stages the semantic feature information is richer but the lane line position information is relatively coarse. Therefore, similar to ENet [25], the corresponding encoded information is fused during decoding to improve the accuracy of lane line segmentation, as illustrated by the sketch below.
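To make the shared-encoder, two-branch structure concrete, a minimal PyTorch sketch is given below. It shows four shared encoding stages whose feature maps are fused into two decoders, one for binary segmentation and one for per-pixel embeddings; the layer widths, the embedding dimension, and the fusion-by-addition scheme are illustrative assumptions rather than the paper's exact architecture, which uses MobileNet blocks as described in Section 3.1.

```python
import torch
import torch.nn as nn

class LaneMultiTaskNet(nn.Module):
    """Schematic two-branch lane network: a shared encoder whose intermediate
    feature maps are fused into both decoders during upsampling."""
    def __init__(self, embed_dim=4):
        super().__init__()
        chs = [3, 16, 32, 64, 128]
        self.encoder = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chs[i], chs[i + 1], 3, stride=2, padding=1),
                          nn.BatchNorm2d(chs[i + 1]), nn.ReLU(inplace=True))
            for i in range(4)])                                  # four shared encoding stages
        self.seg_decoder = self._decoder(chs, out_ch=2)          # lane / background scores
        self.emb_decoder = self._decoder(chs, out_ch=embed_dim)  # per-pixel embedding vectors

    @staticmethod
    def _decoder(chs, out_ch):
        ups = [nn.Sequential(nn.ConvTranspose2d(chs[i], chs[i - 1], 4, stride=2, padding=1),
                             nn.BatchNorm2d(chs[i - 1]), nn.ReLU(inplace=True))
               for i in range(4, 0, -1)]
        return nn.ModuleList(ups + [nn.Conv2d(chs[0], out_ch, 1)])

    def _decode(self, decoder, skips):
        x = skips[-1]
        for i, up in enumerate(decoder[:-1]):
            x = up(x)
            if i < 3:
                x = x + skips[-2 - i]          # fuse the corresponding encoder features
        return decoder[-1](x)

    def forward(self, image):
        skips, x = [], image
        for stage in self.encoder:
            x = stage(x)
            skips.append(x)
        return self._decode(self.seg_decoder, skips), self._decode(self.emb_decoder, skips)

# Usage: seg_logits, embeddings = LaneMultiTaskNet()(torch.randn(1, 3, 288, 800))
```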

3.1. MobileNet

As a representative lightweight convolutional network, MobileNet is built around a convolution method called depthwise separable convolution, which replaces traditional convolution to reduce the number of network weight parameters. The depthwise separable convolution divides a standard convolution into two steps. The first step is depthwise convolution, that is, channel-by-channel convolution: one convolution kernel is responsible for exactly one channel, and each channel is "filtered" by only one convolution kernel. The second step is pointwise convolution, which "strings together" the per-channel results of the first step. Assume that, in a standard convolution, the size of the input feature map is $D_F \times D_F$, the number of input channels is $M$, the number of output channels is $N$, and the size of the convolution kernel is $D_K \times D_K$; then the computational cost of the standard convolution is

$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F. \qquad (1)$$

In MobileNet, the depthwise separable convolution splits this standard convolution into the two parts above, with kernel sizes of $D_K \times D_K$ (depthwise) and $1 \times 1$ (pointwise), respectively. The computational cost of the depthwise separable convolution is therefore

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F. \qquad (2)$$

As an example, Figure 4 is a schematic diagram of a standard convolution with 3 input channels and 4 output channels, and Figure 5 is a schematic diagram of the corresponding depthwise separable convolution with the same input feature map size and the same numbers of input and output channels, in which the depthwise step uses one $D_K \times D_K$ kernel per channel and the pointwise step uses $1 \times 1$ kernels.

From (1) and (2), the ratio between the computational cost of the depthwise separable convolution and that of the standard convolution, for the same input and output feature map sizes, is

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^{2}}. \qquad (3)$$

To compare the performance of convolutional neural networks using standard convolution and depthwise separable convolution, the two networks were trained and tested on the ImageNet dataset in [20]. The results show that, in terms of accuracy, standard convolution is 1.1% higher than depthwise separable convolution, but in terms of computation and parameter count, the former is 8–9 times the latter. From [20], we can conclude that depthwise separable convolution greatly reduces the amount of computation and the number of parameters while essentially preserving accuracy. Correspondingly, it reduces the difficulty of training the network model, shortens training time, and lowers the performance requirements on hardware devices.
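To make the two-step structure concrete, the following PyTorch sketch implements a depthwise separable convolution block and compares its parameter count with that of a standard convolution; the class name, channel numbers, and the inclusion of batch normalization are illustrative assumptions rather than the paper's exact implementation.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise (per-channel) convolution followed by a 1x1 pointwise
    convolution, the two-step replacement for standard convolution
    described above."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   stride=stride, padding=kernel_size // 2,
                                   groups=in_channels, bias=False)   # one kernel per channel
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_channels), nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

def count_params(m):
    return sum(p.numel() for p in m.parameters())

standard = nn.Conv2d(32, 64, kernel_size=3, padding=1, bias=False)
separable = DepthwiseSeparableConv(32, 64, kernel_size=3)
# The ratio of the two counts is roughly 1/N + 1/D_K^2 from equation (3)
# (batch-norm parameters make the separable count slightly larger).
print(count_params(standard), count_params(separable))
```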

3.2. Lane Line Semantic Segmentation Branch

The purpose of lane line semantic segmentation is to obtain the segmentation result of lane pixels, reconstruct the binary image of the lane lines, and thus determine which pixels belong to a lane line. Because lane pixels and background pixels are extremely imbalanced, the focal loss function [26] is used for model training. The lane line pixel segmentation loss is

$$L_{seg} = -\sum_{i=1}^{H}\sum_{j=1}^{W}\sum_{c=1}^{C}\alpha_{c}\, y_{ijc}\left(1 - p_{ijc}\right)^{\gamma}\log\left(p_{ijc}\right), \qquad (4)$$

where $y_{ijc}$ indicates whether pixel $(i, j)$ belongs to category $c$, that is, $y_{ijc} \in \{0, 1\}$; $p_{ijc}$ is the probability that pixel $(i, j)$ is predicted to be category $c$; $C$ is the number of segmentation categories; $H$ and $W$ are the height and width of the output prediction image, respectively; and $\gamma$ is the focusing parameter. By increasing the relative weight of pixels that are difficult to classify and shrinking the weight of pixels that are easy to classify, the network pays more attention to the pixels that are hard to learn. $\alpha_{c}$ is the weighting parameter of the lane line category and is mainly used to compensate for the uneven numbers of pixels in the two categories; it is computed as

$$\alpha = 1 - \frac{N_{lane}}{N_{total}}, \qquad (5)$$

where $N_{lane}$ is the number of lane line pixels and $N_{total}$ is the total number of pixels.
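A minimal sketch of this class-balanced focal loss in PyTorch follows; the function name, the choice of gamma, and the batch-wise computation of the weight alpha are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def lane_focal_loss(logits, target, gamma=2.0, eps=1e-6):
    """Focal loss for binary lane/background segmentation.

    logits: (B, 2, H, W) raw outputs of the segmentation branch
            (channel 0 = background, channel 1 = lane).
    target: (B, H, W) integer labels, 1 for lane pixels, 0 for background.
    """
    probs = F.softmax(logits, dim=1)
    p_lane, p_bg = probs[:, 1], probs[:, 0]
    lane_mask = (target == 1).float()

    alpha = 1.0 - lane_mask.mean()   # weight of the rare lane class, as in (5)

    loss_lane = -alpha * lane_mask * (1.0 - p_lane).pow(gamma) * torch.log(p_lane + eps)
    loss_bg = -(1.0 - alpha) * (1.0 - lane_mask) * (1.0 - p_bg).pow(gamma) * torch.log(p_bg + eps)
    return (loss_lane + loss_bg).mean()
```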

3.3. Lane Line Id Embedding Branch

The lane line Id embedding branch mainly determines which pixels belong to the same lane line. First, an embedding vector for each lane pixel is output by the network. Then the pixels are clustered, exploiting the property that the embedding vectors of pixels on the same lane line are close to each other, while those of pixels on different lane lines are far apart. Finally, the pixels belonging to the same lane line are identified. The clustering loss consists of two parts, $L_{var}$ and $L_{dist}$. $L_{var}$ pulls the embedding vectors of pixels belonging to the same lane line toward their center, and $L_{dist}$ pushes the centers of different lane lines away from each other. For $C$ different lane lines, the clustering loss $L$ is

$$L_{var} = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_{c}}\sum_{i=1}^{N_{c}}\left\lVert \mu_{c} - x_{i} \right\rVert^{2},\qquad L_{dist} = \frac{1}{C(C-1)}\sum_{c_{A}=1}^{C}\sum_{\substack{c_{B}=1\\ c_{B}\neq c_{A}}}^{C}\left[\delta_{d} - \left\lVert \mu_{c_{A}} - \mu_{c_{B}} \right\rVert\right]_{+}^{2},\qquad L = L_{var} + L_{dist}, \qquad (6)$$

where $C$ is the number of lane lines; $N_{c}$ is the number of pixels of lane line $c$; $\mu_{c}$ is the mean of the embedding vectors of the pixels of lane line $c$; $x_{i}$ is the embedding vector of a pixel; $\mu_{c_{A}}$ and $\mu_{c_{B}}$ are the mean embedding vectors of two different lane lines $c_{A}$ and $c_{B}$; $[\cdot]_{+} = \max(0, \cdot)$; and $\delta_{d}$ is the margin beyond which two cluster centers are no longer pushed apart.
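A sketch of this pull/push loss on the pixel embeddings is given below; the margin value, the per-image computation, and the double loop over cluster centers are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def embedding_cluster_loss(embeddings, instance_labels, delta_d=3.0):
    """Pull pixels of the same lane toward their center (L_var) and push the
    centers of different lanes apart (L_dist), as in equation (6).

    embeddings: (D, N) embedding vectors of the N lane-line pixels.
    instance_labels: (N,) integer lane id for each pixel.
    """
    lane_ids = instance_labels.unique()
    centers, l_var = [], embeddings.new_zeros(())
    for lane in lane_ids:
        pts = embeddings[:, instance_labels == lane]           # (D, N_c)
        mu = pts.mean(dim=1, keepdim=True)                     # cluster center
        centers.append(mu)
        l_var = l_var + ((pts - mu).norm(dim=0) ** 2).mean()   # pull term
    l_var = l_var / len(lane_ids)

    l_dist = embeddings.new_zeros(())
    if len(centers) > 1:
        mus = torch.cat(centers, dim=1)                        # (D, C)
        c = mus.shape[1]
        for a in range(c):
            for b in range(c):
                if a != b:
                    gap = (mus[:, a] - mus[:, b]).norm()
                    l_dist = l_dist + torch.clamp(delta_d - gap, min=0) ** 2  # push term
        l_dist = l_dist / (c * (c - 1))
    return l_var + l_dist
```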

In the inference stage, the DBSCAN clustering algorithm [27] is used to cluster the pixels. DBSCAN is a clustering algorithm based on high-density connected regions: it groups regions of sufficiently high density into clusters and can find clusters of arbitrary shape in noisy data. DBSCAN is applied until all lane line pixels are assigned to the corresponding lanes. The cluster center is taken as the center of the circle, the radius is set to 0.28 m, and the minimum number of points in the neighborhood is 180.
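The clustering step can be sketched with scikit-learn's DBSCAN as shown below; the eps and min_samples values mirror the radius and minimum-point settings quoted above, while the assumption that they are applied directly to the embedding vectors is ours.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def assign_lane_ids(lane_pixel_embeddings, eps=0.28, min_samples=180):
    """Cluster per-pixel embedding vectors into lane instances.

    lane_pixel_embeddings: (N, D) array, one embedding per pixel classified
    as lane by the semantic segmentation branch. Returns an (N,) array of
    lane ids; DBSCAN marks noise points with -1.
    """
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(lane_pixel_embeddings)

# Usage: pixels sharing a returned id belong to the same lane line and are
# passed as one group to the perspective transformation and fitting stage.
```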

4. Adaptive Perspective Transform Model

After the clustered pixel set of each lane line is obtained from the instance segmentation, the pixels need to be fitted. According to the imaging principle of the camera, lane lines gradually converge toward a point in the distance, so fitting the lane lines directly on the original image greatly reduces accuracy. A common remedy is to convert the image into a bird's-eye view through a perspective transformation so that the lane lines become parallel and the fitting accuracy improves. However, the conventional perspective transformation uses a transformation matrix with fixed parameters, which is only suitable for smooth roads. If the vehicle drives on an uneven road or the pitch angle of the camera changes due to turbulence, the image is distorted, which corrupts the generated bird's-eye view. Therefore, an adaptive perspective transformation model is adopted that can accurately convert the camera image into a bird's-eye view while the camera is in motion.

As shown in Figure 6, the origin of the world coordinate system is defined as the projection of the vehicle's center of mass onto the ground, and the vertical distance between the camera installation position and the origin of the world coordinate system is $h$. According to the relationship among the world coordinate system $(x, y, z)$, the camera coordinate system $(x_{c}, y_{c}, z_{c})$, and the image pixel coordinate system $(u, v)$, the position of a pixel in the camera coordinate system can be expressed as

$$x_{c} = k\left(u - \frac{W}{2}\right),\qquad y_{c} = k\left(v - \frac{H}{2}\right), \qquad (7)$$

where $k$ is the conversion scale factor between the camera coordinate system and the image pixel coordinate system, $W$ is the width of the image, and $H$ is the height of the image.

In order to better explain the parameters of the adaptive perspective transformation model, the lateral (side) view structure of the model is established, as shown in Figure 7. The $x$-coordinate of a ground point in the world coordinate system can be written as a function of the pixel row $v$, the camera tilt, and the pitch variation:

$$x = h \cdot \cot\left(\theta_{0} + \Delta\theta + \varphi\right), \qquad (8)$$

where the angle $\varphi$ between the line of sight to the ground pixel and the camera's optical axis satisfies

$$\tan\varphi = \frac{y_{c}}{f_{v}} = \frac{k\left(v - \frac{H}{2}\right)}{f_{v}}. \qquad (9)$$

When $v = H$, $\varphi = \alpha_{v}$. Substituting formula (7) into formula (9) under this condition, the vertical focal length can be obtained:

$$f_{v} = \frac{kH}{2\tan\alpha_{v}}. \qquad (10)$$

Substituting formulas (9) and (10) into (8),

$$x = h \cdot \cot\left[\theta_{0} + \Delta\theta + \arctan\left(\frac{2v - H}{H}\tan\alpha_{v}\right)\right], \qquad (11)$$

where $\theta_{0}$ is the tilt angle of the camera; $\Delta\theta$ is the change of the pitch angle when the camera bumps; $\varphi$ is the angle between the ground pixel and the camera; $\alpha_{v}$ is half of the vertical field of view of the camera; and $f_{v}$ is the vertical focal length of the camera.

As shown in Figure 8, the vertical (top) view structure of the adaptive perspective transformation model is established, and the $y$-coordinate of the ground point in the world coordinate system can be expressed as

$$y = x \cdot \tan\psi,\qquad \tan\psi = \frac{x_{c}}{f_{u}} = \frac{k\left(u - \frac{W}{2}\right)}{f_{u}}. \qquad (12)$$

When $u = W$, $\psi = \alpha_{u}$, and the horizontal focal length can be obtained from formula (12):

$$f_{u} = \frac{kW}{2\tan\alpha_{u}}. \qquad (13)$$

Substituting formulas (11) and (13) into (12),

$$y = h \cdot \cot\left[\theta_{0} + \Delta\theta + \arctan\left(\frac{2v - H}{H}\tan\alpha_{v}\right)\right] \cdot \frac{2u - W}{W}\tan\alpha_{u}, \qquad (14)$$

where $\alpha_{u}$ is half of the horizontal field of view angle of the camera and $f_{u}$ is the horizontal focal length of the camera.

Assuming $z = 0$ (points on the ground plane), the relationships established among the world, camera, and pixel coordinate systems allow the $x$- and $y$-coordinates of every pixel in the world coordinate system to be updated whenever vehicle motion causes the camera pitch angle to change by $\Delta\theta$. The influence of motion-induced image distortion on the bird's-eye view is thereby reduced, and the adaptive perspective transformation model is more robust to camera motion.
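A numerical sketch of this pixel-to-ground mapping, following the form of equations (11) and (14), is given below; the camera parameter names, the example values, and the way the pitch change delta_theta is supplied are illustrative assumptions.

```python
import numpy as np

def pixel_to_ground(u, v, cam, delta_theta=0.0):
    """Map an image pixel (u, v) to ground-plane coordinates (x, y) while
    compensating for a pitch change delta_theta caused by vehicle motion."""
    h, theta0 = cam["h"], cam["theta0"]
    alpha_u, alpha_v = cam["alpha_u"], cam["alpha_v"]
    W, H = cam["W"], cam["H"]

    phi = np.arctan((2.0 * v - H) / H * np.tan(alpha_v))   # vertical ray angle, eqs. (9)-(10)
    x = h / np.tan(theta0 + delta_theta + phi)             # longitudinal distance, eq. (11)
    y = x * (2.0 * u - W) / W * np.tan(alpha_u)            # lateral offset, eq. (14)
    return x, y

# Hypothetical camera: 1.5 m high, 8 degrees tilt, 800 x 288 input image.
cam = {"h": 1.5, "theta0": np.deg2rad(8.0),
       "alpha_u": np.deg2rad(30.0), "alpha_v": np.deg2rad(20.0),
       "W": 800, "H": 288}
print(pixel_to_ground(400, 250, cam, delta_theta=np.deg2rad(1.0)))
```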

In the transformed bird's-eye view, the pixels of each lane line are fitted with a least-squares second-order polynomial, and the fitted curve is projected back onto the original image so that the quadratic polynomial of the lane line in the real road scene is obtained.
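The fitting step can be sketched with NumPy's least-squares polynomial fit; treating the lateral coordinate as a quadratic function of the longitudinal coordinate is an assumption of this sketch.

```python
import numpy as np

def fit_lane(ground_points):
    """Fit y = a*x^2 + b*x + c to one lane's bird's-eye-view points.

    ground_points: (N, 2) array with columns (x, y) produced by the adaptive
    perspective transformation. Returns a callable polynomial.
    """
    pts = np.asarray(ground_points, dtype=float)
    coeffs = np.polyfit(pts[:, 0], pts[:, 1], deg=2)   # least-squares fit
    return np.poly1d(coeffs)

# Usage: lane = fit_lane(points); lane(10.0) gives the lateral offset of the
# lane line 10 m ahead, which can then be projected back into the image.
```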

5. Experimental Results and Analysis

The CULane dataset [13] was used to train and validate the proposed multilane line detection model, and then field experiments were conducted to verify the performance of the proposed model.

5.1. Experimental Environment Configuration

The algorithm design, training, and testing are based on the deep learning framework PyTorch. The experimental configuration used in this experiment is shown in Table 1.

5.2. CULane Dataset

Although deep-learning-based multilane detection networks can already extract lane line pixels effectively in simple traffic scenes under good weather conditions, lane line detection in complex scenes remains challenging. Therefore, it is very important for supervised learning that the lane dataset covers a wide range of traffic scenes and has high annotation quality.

The challenging CULane dataset is used to train and validate the model, as shown in Figure 9. CULane is a large-scale, challenging dataset for academic research on traffic lane detection. CULane only considers the two lane lines on each side of the current driving lane, which receive the most attention in real applications; therefore, at most four real lane lines are labeled per image. The dataset is divided into 88880 images for the training set, 9675 for the validation set, and 34680 for the test set. Table 2 summarizes the scene distribution of the CULane dataset.

It can be seen from Table 2 that a total of 9 types of traffic scenes are included in CULane. Among them, images of normal traffic scenes with clear lane lines account for 27.7% of the dataset, while images of complex road scenes, in which lane lines are unclear or interfered with for various reasons, account for 72.3%. This shows that CULane emphasizes complex road scenes, which is also consistent with the proportions of the various traffic scenes encountered during actual driving. Therefore, this dataset is well suited to verifying the ability of the proposed method to detect lane lines in complex road scenes.

5.3. Model Training Process
5.3.1. Data Processing Method

In order to ensure the real-time performance of the multilane line detection algorithm, the CULane input images are resized to an 800 × 288 pixel format for network learning. The input data is normalized so that the pixel values lie in the interval [0, 1], which improves the computational efficiency of model fitting. Data augmentation is used, including brightness conversion, horizontal flipping, and overall image translation of 0∼2 pixels; an example is shown in Figure 10.
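A sketch of this preprocessing and augmentation pipeline is given below; the brightness range, flip probability, and the application to a single frame (the label mask would need the same geometric transforms) are illustrative assumptions rather than the paper's exact settings.

```python
import random
import cv2
import numpy as np

def preprocess(image, target_size=(800, 288), augment=True):
    """Resize a CULane frame to 800 x 288, optionally augment it with
    brightness conversion, horizontal flip, and a 0-2 pixel translation,
    then normalize pixel values to [0, 1]."""
    image = cv2.resize(image, target_size)                  # (width, height)
    if augment:
        if random.random() < 0.5:
            image = cv2.flip(image, 1)                      # horizontal flip
        gain = random.uniform(0.7, 1.3)                     # brightness conversion
        image = np.clip(image.astype(np.float32) * gain, 0, 255)
        dx, dy = random.randint(0, 2), random.randint(0, 2)
        shift = np.float32([[1, 0, dx], [0, 1, dy]])        # translate 0-2 pixels
        image = cv2.warpAffine(image, shift, target_size)
    return image.astype(np.float32) / 255.0                 # normalize to [0, 1]
```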

5.3.2. Model Training Parameters Setting

Taking into account the performance of the hardware platform used in this study, the training parameters are set as shown in Table 3. In Table 3, the batch size is the number of training samples input at each iteration; an epoch is one full pass over the training set; and the learning rate decays exponentially to prevent gradient dispersion. In addition, in order to accelerate network convergence, the MobileNet pretrained model is used to initialize the weight parameters of the unmodified part of the feature extraction network, and the mainstream Glorot uniform distribution method is used to initialize the remaining parameters.
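The initialization and learning-rate schedule described above can be sketched as follows; the optimizer choice, initial learning rate, and decay factor are placeholders rather than the values in Table 3, and the encoder is assumed to have already been loaded from a pretrained MobileNet checkpoint.

```python
import torch
import torch.nn as nn

def configure_training(model, lr=1e-3, decay=0.95):
    """Glorot (Xavier) uniform initialization for the layers that were not
    taken from the pretrained encoder, plus an exponentially decaying
    learning rate."""
    def init_weights(m):
        if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
            nn.init.xavier_uniform_(m.weight)        # Glorot uniform distribution
            if m.bias is not None:
                nn.init.zeros_(m.bias)

    for name, module in model.named_children():
        if name != "encoder":                        # keep pretrained encoder weights
            module.apply(init_weights)

    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=decay)
    return optimizer, scheduler
```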

5.4. Training Result Analysis

To verify the performance of the proposed deep-learning-based multilane line detection algorithm, the $F_{1}$ measure is used to judge whether a lane marking is successfully detected. The lane markings are viewed as lines with widths equal to 30 pixels, and the intersection over union (IoU) between the ground truth and the prediction is calculated. Predictions whose IoU is larger than 0.5 are counted as correctly predicted lane lines. The $F_{1}$ measure is defined as

$$F_{1} = \frac{2 \times Precision \times Recall}{Precision + Recall},\qquad Precision = \frac{TP}{TP + FP},\qquad Recall = \frac{TP}{TP + FN}, \qquad (15)$$

where $TP$ (true positives) is the number of correctly predicted lane lines, $FP$ (false positives) is the number of wrongly predicted lane lines, and $FN$ (false negatives) is the number of missed ground-truth lane lines.
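The evaluation protocol can be sketched as follows; drawing the lanes with OpenCV polylines and the simple counting of matches (IoU > 0.5) are assumptions of this sketch, not the official CULane evaluation code.

```python
import numpy as np
import cv2

def lane_iou(pred_pts, gt_pts, shape, width=30):
    """IoU between a predicted and a ground-truth lane, each drawn as a
    30-pixel-wide polyline on a canvas of size shape = (H, W)."""
    pred_mask = np.zeros(shape, dtype=np.uint8)
    gt_mask = np.zeros(shape, dtype=np.uint8)
    cv2.polylines(pred_mask, [np.asarray(pred_pts, np.int32)], False, 1, width)
    cv2.polylines(gt_mask, [np.asarray(gt_pts, np.int32)], False, 1, width)
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union else 0.0

def f1_score(tp, fp, fn):
    """F1 from the numbers of matched (IoU > 0.5), spurious, and missed lanes."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```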

Based on the $F_{1}$ evaluation index, the effectiveness of the proposed algorithm is compared with that of other multilane detection algorithms. The experimental results are shown in Table 4.

From Table 4, it can be seen that the $F_{1}$ score of the proposed method on the normal test set and most challenge test sets is better than that of the other algorithms, which verifies the effectiveness of the proposed method.

5.5. Examples of Multilane Line Detection Results

Figure 11 shows the multilane line detection results of the CULane dataset, and the curves of different colors represent the detected different lane lines. As can be seen from Figure 11, the proposed method can accurately detect multilane lines in various scenarios.

5.6. Real Vehicle Experimental Verification

To verify the performance of the proposed strategy in practice, experiments were conducted on an intelligent driving vehicle platform. The platform is a modified Chery pure electric vehicle [31], as shown in Figure 12.

5.6.1. Camera Parameters

The HD industrial camera with a USB driver interface is selected as the visual acquisition device. The specific parameters are shown in Table 5.

The HD industrial camera can maintain a good, distortion-free shooting effect under strong light, weak light, or no light, so the image acquisition requirements of various driving conditions can be met.

5.6.2. Industrial Computer

The Advantech MIC-7700 industrial computer is selected as the data computing platform, as shown in Figure 13. The MIC-7700 is a compact, fanless computing platform designed for the industrial market. It can work in ambient temperatures of 0∼50°C, attenuates external vibration and shock before they reach the inside of the chassis, and can be used around the clock in bad weather.

5.6.3. Software Platform

The real vehicle data acquisition and experiments in this paper are carried out in ROS (Robot Operating System). Ubuntu 16.04 is used as the operating system on which ROS is installed and deployed, and the ROS version is Kinetic.

To achieve multitask road target detection and drivable area segmentation on the ROS platform, the topic structure of the multitask detection and segmentation node and its network connections is established, as shown in Figure 14.

Firstly, the node /lane_node is created. It subscribes to the real-time road image topic /usb_cam/image_raw published by the HD camera installed on the vehicle and converts the image format into one recognized by the OpenCV library by calling the cv_bridge module of the ROS library. The image data is then run through the multitask road target detection and segmentation model, and finally the results are published in real time on the topic /lane_image. The ROS graphical tool RViz can display the multitask detection and drivable area segmentation predictions by subscribing to this topic.
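A minimal rospy sketch of the node structure described above follows; the run_lane_detection placeholder stands in for inference with the trained multitask model and is an assumption of this sketch.

```python
#!/usr/bin/env python
import rospy
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

def run_lane_detection(frame):
    return frame   # placeholder: replace with inference of the trained model

class LaneNode:
    """Subscribes to /usb_cam/image_raw, runs the detection model on each
    frame, and republishes the visualized result on /lane_image."""
    def __init__(self):
        self.bridge = CvBridge()
        self.pub = rospy.Publisher("/lane_image", Image, queue_size=1)
        rospy.Subscriber("/usb_cam/image_raw", Image, self.callback, queue_size=1)

    def callback(self, msg):
        frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
        result = run_lane_detection(frame)
        self.pub.publish(self.bridge.cv2_to_imgmsg(result, encoding="bgr8"))

if __name__ == "__main__":
    rospy.init_node("lane_node")
    LaneNode()
    rospy.spin()   # RViz can display the /lane_image topic for visualization
```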

5.6.4. Real Vehicle Test of Multilane Line Detection Algorithm

Figure 15 shows the multilane line detection results of the proposed algorithm on data collected by the onboard vision system of the intelligent driving vehicle platform. It can be seen that the proposed method achieves good detection results in various complex scenarios. The reason is that the lane line detection problem is cast as an instance segmentation problem, which can handle an arbitrary number of lane lines and lane-change situations. Even if a lane line is occluded by vehicles or shadows, or the detected lane line is dashed, its position can be predicted accurately.

6. Conclusion and Discussion

Aiming at the difficulty of lane line detection in complex urban traffic scenes, a new lane line detection method based on instance segmentation and adaptive perspective transformation is proposed in this paper. A two-branch multitask instance segmentation network, consisting of a lane line semantic segmentation part and a lane line Id embedding part, is designed; it solves the problems of multilane and lane-change detection while satisfying real-time requirements, and is trained on the CULane dataset. At the same time, to handle changes of the camera inclination angle caused by vehicle turbulence, the instance segmentation results are transformed with the adaptive perspective model to obtain the lane line pixel set in the bird's-eye view, where the lane line pixels are fitted. The CULane test results show that the F1 score in normal scenes reaches 91.2%. Integrating the model into the ROS platform enables real-time detection of multiple lane lines in various complex traffic scenes, which has good practical application value. We also conclude that the proposed strategies can be embedded into other advanced driver assistance approaches with slight modifications.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

Xiang Song and Hai Wang contributed to methodology; Xiang Song, Xiaoyu Che, and Huilin Jiang contributed to software; Xiang Song, Ling Li, Chunxiao Ren, and Hai Wang contributed to validation; Xiang Song, Shun Yan, and Hai Wang contributed to investigation; Xiang Song was responsible for original draft preparation; Xiang Song, Xiaoyu Che, Chunxiao Ren, and Hai Wang were responsible for review and editing; Xiang Song, Ling Li, and Huilin Jiang contributed to funding acquisition. All authors have read and agreed to the published version of the paper.

Acknowledgments

This research was funded by the National Natural Science Foundation of China (Grant no. 61801227), the Future Network Scientific Research Fund Project (Grants nos. FNSRFP-2021-YB-29 and FNSRFP-2021-YB-28), the Qing Lan Project of Jiangsu (Grant no. QLGC-2020), and the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (Grant no. 20KJB130005).