Abstract

Lane mark detection is an important task for autonomous driving, and many models have been proposed for it. However, real driving environments are complex, with challenging scenarios such as vehicle occlusion, severe mark degradation, and heavy shadow. Under these conditions, it is difficult to detect lane marks within a limited local receptive field. For that reason, we propose a lane mark detection network based on multihead self-attention. It captures spatial relationships among lane mark points from a global viewpoint and thereby enlarges the effective receptive field of its feature maps. To further exploit global and contextual features, it fuses global and local information to predict classification and location regression, which greatly improves the accuracy of lane mark detection, especially in challenging scenarios. On the TuSimple benchmark, its accuracy is 95.76%, surpassing all other methods, and its FPS is 170.2, the second highest. On the CULane benchmark, its F1 reaches 75.55% and its FPS reaches 170.5, both the highest among the compared methods. Our proposed model establishes a new state of the art among real-time methods.

1. Introduction

Lane detection [1, 2] based on vision sensors is one of the core technologies in the autonomous driving field. It is not only an important foundation for lane departure warning and lane keeping functions but also a key technology for ADAS (advanced driver assistance systems) [3, 4]. However, there are many kinds of lanes in the real world, for example, solid, broken, dashed, merging, and splitting lanes, so lane patterns are diverse. In addition, there are challenging driving scenarios, including heavy shadows, severe vehicle occlusion, and severe road mark degradation, as well as corner cases such as merging and splitting. In urban environments, lanes are susceptible to illumination changes, road wear and tear, occlusion, and so on. This makes the task more challenging and places higher demands on the generalization and robustness of algorithms.

To resolve these problems, researchers have put forward different technical solutions. Traditional computer vision methods heavily depend on assumptions, such as that lanes and boundaries are continuous and parallel [5]. They utilize edge detection operators, histograms, prior knowledge, and recognition to extract lane candidate points, and finally apply line fitting or the Hough transformation [6–9] to obtain the lane line parameters. More recently, CNN-based semantic segmentation [10–16] and instance segmentation [17–21] have received the most attention. These approaches extract spatial or structural information between pixels or from slice to slice in the process of lane detection [22–26]. Although they can handle some challenging scenarios, such as vehicle occlusion, severe road mark degradation, and heavy shadows, their huge computational cost and much slower speed hinder real-time application, as shown in Figure 1. Consequently, recurrent neural networks, long short-term memory, gated recurrent units, and attention mechanisms have become firmly established. They do well in time series signal processing and sequence modeling; especially for lane line occlusion, they can extract textual or semantic information from continuous frames.

In this work, we present a lane mark detection network based on multihead self-attention [27]. It is a lightweight model suited to real-time application, and its accuracy is better than that of most state-of-the-art models. TuSimple and CULane are used as benchmarks to evaluate our experimental results. This paper makes three contributions, as follows:
(i) A lane mark detection network based on anchors and multihead self-attention: we propose a new network architecture combining row anchors with multihead self-attention. It improves accuracy considerably compared with [17, 28–32].
(ii) Multihead self-attention mechanism: we propose a multihead self-attention method to extract global information, which further improves performance.
(iii) Presentations and experiments: two datasets, TuSimple and CULane, are used for quantitative evaluation across different scenarios, such as city and rural lanes, in day and night conditions. This can promote the research and development of autonomous driving.

2. Related Work

In the past two decades, researchers have made great efforts on lane detection technology. Especially since DCNNs, LSTMs, and attention mechanisms emerged, new viewpoints have been brought to lane detection. Overall, these methods can be sorted into categories such as traditional methods, segmentation networks, anchor-based methods, and attention-based methods. In this section, we briefly summarize each category.

2.1. Traditional Computer Vision-Based Lane Detection

Generally speaking, traditional computer vision methods rely mainly on gray images, edge detection operators, and regions of interest (ROIs) to detect lane edges. Lane detection is typically divided into two stages. The first stage is lane edge searching and detection, which applies the IPM transformation, the Sobel operator, Gaussian filters, steerable filters [33], and Gabor filters [34] with kernels in different directions, together with gradient, color, and texture cues. The second stage is lane fitting. Many methods have been exploited to fit lane lines; since the input at this stage is a gray image rather than the original RGB image, multiple preprocessing and fitting methods are used, such as template matching [35], the Hough transformation, polar randomized HT [36], curve fitting, the Catmull-Rom spline [37], B-snake [38], and so on.

2.2. Lane Detection Based on Segmentation

Global information, local information, textual information, and semantic information are very important for lane detection, especially in vehicle occlusion scenarios. Segmentation networks intensify communication among pixels within a larger receptive field. The main research directions are as follows:
(1) Pixel-wise segmentation: the authors in [39] propose atrous convolution and bilinear interpolation to acquire a larger receptive field and thus higher classification accuracy. Their method utilizes atrous spatial pyramid pooling with different sampling rates to aggregate multiscale feature maps and employs a fully connected CRF [18] to let pixels interact, so that lane edge localization and classification are accomplished precisely. However, its huge computational cost is prohibitive for real-time applications. For better efficiency, the authors in [17] propose spatial CNN (SCNN), which restricts communication to slice-to-slice rather than pixel-to-pixel message passing. Every layer takes the previous layer's output, applies convolution and a nonlinear activation, and sends the result to the next layer sequentially. In this way, SCNN lets rows or columns of feature maps communicate with each other, reducing computation greatly compared with [39]; even so, its speed is lower than 10 frames per second.
(2) Row-wise or column-wise segmentation based on anchors: lane detection based on pixel-wise segmentation [40–42] requires more computational cost and cannot cope with challenging conditions such as severe occlusion and extreme lighting because of its limited receptive field. For that reason, the authors in [43] propose a row-wise DNN built on row anchors, with a ResNet backbone. Lane detection is formulated as selecting certain cells along predefined row anchors (a minimal decoding sketch is given after this list). Its loss functions include classification loss, location loss, and structure loss. The row anchors are predefined and have fixed dimensions, so the network can pay more attention to global and contextual information. The computational cost depends on the number of anchors, the anchor dimensions, and the number of lanes, not on the number of image pixels; therefore, it reduces computation greatly and improves lane detection accuracy in no-visual-clue conditions [44]. In other studies [28, 45–47], the authors put forward a sparse top-down formulation with a large receptive field, as opposed to the bottom-up formulation of segmentation networks, because traditional segmentation networks are much slower and suffer from the no-visual-clue problem. To resolve this, a hybrid anchor framework combining row-anchor-driven and column-anchor-driven representations is proposed, where the former is better for ego lane detection and the latter suits side lane detection. To cope with global information, ordinal classification losses are proposed, including a base classification loss and a mathematical expectation loss, so that the space between classes is continuous.
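To make the row-anchor formulation above concrete, the following minimal sketch (our own illustration, not code from the cited works) decodes lanes by selecting one horizontal cell per predefined row anchor. The grid sizes, tensor names, and image resolution are assumptions chosen only for the example.

```python
import numpy as np

# Hypothetical sizes: 72 predefined row anchors, 100 horizontal cells per row,
# plus one extra "no lane" cell, and up to 4 lanes.
NUM_ROWS, NUM_CELLS, NUM_LANES = 72, 100, 4

# logits: per-lane, per-row scores over the horizontal cells (+1 background cell),
# e.g. the raw output of a row-wise classification head.
logits = np.random.randn(NUM_LANES, NUM_ROWS, NUM_CELLS + 1)

def decode_lanes(logits, img_w=1280, img_h=720):
    """Pick the most likely horizontal cell for every row anchor of every lane."""
    lanes = []
    for lane_logits in logits:                      # (NUM_ROWS, NUM_CELLS + 1)
        points = []
        for row, row_logits in enumerate(lane_logits):
            cell = int(row_logits.argmax())
            if cell == NUM_CELLS:                   # last index = "no lane on this row"
                continue
            x = (cell + 0.5) / NUM_CELLS * img_w    # cell index -> image x coordinate
            y = (row + 0.5) / NUM_ROWS * img_h      # row anchor -> image y coordinate
            points.append((x, y))
        lanes.append(points)
    return lanes

print(len(decode_lanes(logits)), "lanes decoded")
```

Because only one cell per row anchor is scored and selected, the cost scales with the number of anchors and lanes rather than with the number of image pixels, which is the efficiency argument made above.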

2.3. Lane Detection Based on Attention Mechanism

The authors in [48] propose an attention-guided lane detection model. It utilizes different backbones, such as ResNet-18, ResNet-34, and ResNet-101, to extract features. Because extracting feature maps with a DCNN such as ResNet easily results in a narrow receptive field, it adopts a self-attention mechanism to produce a weight vector for every local feature vector and then applies matrix multiplication to obtain a global feature map. By doing this, it can predict a lane's existence and position under occlusion. The authors in [49] propose an expanded self-attention (ESA) module to extract global contextual information. ESA is divided into HESA (horizontal expanded self-attention) and VESA (vertical expanded self-attention), which predict the probability of lanes along the horizontal and vertical directions, respectively. This enlarges the receptive field and acquires global contextual information, so it improves lane detection accuracy, especially in occlusion scenarios.

3. Proposed Approach

In this section, we present our lane detection network based on multihead self-attention. It combines a typical DCNN backbone, such as ResNet-34, with two prediction subnetworks, one for classification and the other for regression.

3.1. System Overview

Lane lines come in many different shapes, types, and colors, including solid lines, dotted lines, straight lines, curves of different curvatures, merging lines, and splitting lines. Besides these, some challenging conditions are difficult to handle, such as heavy shadow, severe mark degradation, and vehicle occlusion. Although a DCNN is capable of extracting feature maps with convolution and pooling operations using different kernel sizes and strides, pooling operations enlarge the receptive field while causing large position offsets. A trade-off is therefore required between receptive field, classification accuracy, and position accuracy, especially in challenging conditions.

For that reason, we design a multihead self-attention mechanism that takes the feature maps of the DCNN as inputs. To obtain global information, we utilize multiple heads to match anchor vectors at different spatial positions. Every head captures global contextual and semantic information among anchors, as shown in Figure 2. The mechanism can then summarize and fuse all the global information to expand the receptive field equally, so classification and location accuracy improve after the globally enriched anchor features are sent to the prediction networks (a simplified sketch is given below).
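The following PyTorch-style sketch illustrates this idea: every anchor feature attends to all other anchors, and the attended (global) features are fused with the original (local) ones before prediction. The tensor shapes, the use of nn.MultiheadAttention, and the fusion by concatenation are simplified assumptions for illustration, not the exact implementation of the paper.

```python
import torch
import torch.nn as nn

class AnchorSelfAttention(nn.Module):
    """Illustrative only: lets every anchor feature attend to all other anchors,
    then fuses the attended (global) features with the original (local) ones."""

    def __init__(self, embed_dim=64, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, anchor_feats):                 # (batch, num_anchors, embed_dim)
        global_feats, _ = self.attn(anchor_feats, anchor_feats, anchor_feats)
        # Concatenate local and global information for the prediction heads.
        return torch.cat([anchor_feats, global_feats], dim=-1)

# Example: 1000 anchors, each described by a 64-dimensional feature vector.
x = torch.randn(2, 1000, 64)
fused = AnchorSelfAttention()(x)
print(fused.shape)                                   # torch.Size([2, 1000, 128])
```

Concatenating local and global features, rather than replacing one with the other, matches the fusion of local and global information described above.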

3.2. Network Design
(1) Backbone: the backbone is ResNet-34, imported from torchvision.models (resnet34). It has four layers and one fully connected layer, and the layers contain three, four, six, and three residual blocks, respectively, with 3 × 3 convolution kernels. The channel numbers are 64, 128, 256, and 512, respectively. The output of ResNet-34 is a feature map; to reduce dimensionality and computational cost, a 1 × 1 convolution is applied to it to generate a channel-reduced feature map.
(2) Multihead self-attention: the points of the feature map are grouped into a predefined number of anchors. Every row anchor is represented in a coordinate frame whose vertical positions are equally spaced and predefined; the offset is the horizontal distance between the predicted line and the anchor line, and the number of offsets in the vertical direction is predefined. The multihead mechanism projects the queries, keys, and values $h$ times with different learned linear projection matrices, such that
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}).$$
In the self-attention mechanism, we compute scaled dot-product attention, which scales the dot products by $\sqrt{d_k}$:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V.$$
After the attention functions are performed in parallel, their outputs are concatenated and linearly projected:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O},$$
where the projections $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learned parameter matrices and $h$ is the number of heads, as shown in Figure 3. We also note that $h \times d_v$ equals the input feature dimension, so the multihead output has the same dimension as its input.
(3) Classification model and regression model: before entering the classification and regression models, the local and global anchor features are concatenated into an augmented feature vector, which is pushed into the classification model and the regression model. The classification model predicts the lane line probabilities, with one class per lane line and one additional class for background or invalid proposals. The regression model predicts the set of offsets, whose size is the number of valid offsets in the vertical direction.
(4) Loss function: during training, we find that easy negatives can overwhelm training and lead to degenerate models. To resolve this, we adopt the focal loss [49, 50] as the loss function of the classification model:
$$\mathrm{FL}(p_t) = -\alpha\,(1 - p_t)^{\gamma}\,\log(p_t).$$

In our paper, we fix the values of $\alpha$ and $\gamma$. For the regression model, we adopt the Smooth L1 loss. Our training objective combines these two loss functions:
$$L = L_{cls} + \lambda\,L_{reg},$$
where $L_{cls}$ and $L_{reg}$ are computed from the predicted classification and regression outputs for each anchor and the corresponding ground truth, and for the balancing factor $\lambda$ we use a fixed value.
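A minimal sketch of this training objective is given below, assuming the commonly used focal loss defaults (alpha = 0.25, gamma = 2) and a unit balancing weight, since the paper's exact values are not reproduced here; the function names and tensor shapes are ours.

```python
import torch
import torch.nn.functional as F

def focal_loss(cls_logits, cls_targets, alpha=0.25, gamma=2.0):
    """Focal loss for the classification head (alpha and gamma are assumed defaults)."""
    ce = F.cross_entropy(cls_logits, cls_targets, reduction="none")
    p_t = torch.exp(-ce)                    # probability of the true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()

def detection_loss(cls_logits, cls_targets, reg_preds, reg_targets, lam=1.0):
    """Total loss: focal classification loss plus Smooth L1 regression loss,
    weighted by an assumed balancing factor `lam`."""
    l_cls = focal_loss(cls_logits, cls_targets)
    l_reg = F.smooth_l1_loss(reg_preds, reg_targets)
    return l_cls + lam * l_reg

# Example with 8 anchor proposals, 5 classes, and 72 offsets per anchor.
cls_logits, cls_targets = torch.randn(8, 5), torch.randint(0, 5, (8,))
reg_preds, reg_targets = torch.randn(8, 72), torch.randn(8, 72)
print(detection_loss(cls_logits, cls_targets, reg_preds, reg_targets))
```

The down-weighting term (1 - p_t)^gamma suppresses the contribution of easy, confidently classified negatives, which is exactly the degeneration problem described above.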

4. Experiments

The widely used TuSimple [51] and CULane [29, 52] lane detection datasets are used to evaluate our model. The TuSimple dataset contains 6,408 annotated images, which we split into a training set (3,268), a validation set (358), and a test set (2,782); the maximum number of lane markings is 5. The CULane dataset is split into a training set (88,880), a validation set (9,675), and a test set (34,680); the maximum number of lane markings is 4.

4.1. Implementation Details

Every input image is resized to a fixed resolution. Training takes 15 epochs on CULane and 100 epochs on TuSimple, which has fewer images. The learning rate is set to 0.0003, the batch size to 8, the total number of anchors to 1000, and the number of offsets to 72. All experiments are run on a personal computer with an 11th Gen Intel(R) Core(TM) CPU and an NVIDIA GeForce GTX 1660 SUPER GPU.
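For reference, the training hyperparameters reported above can be gathered into a single configuration; the snippet below is only an illustrative summary (the optimizer, input resolution, and learning-rate schedule are not stated in this section, so they are omitted).

```python
# Hyperparameters reported in this section; anything not listed here
# (optimizer, input size, schedule) is unspecified and intentionally omitted.
config = {
    "epochs_culane": 15,
    "epochs_tusimple": 100,
    "learning_rate": 3e-4,
    "batch_size": 8,
    "num_anchors": 1000,
    "num_offsets": 72,
}
```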

4.2. TuSimple Dataset
4.2.1. Dataset Introduction

The TuSimple dataset includes 6,408 clips, where every clip consists of 20 frames collected over one second, and the last frame is labeled with the lane ground truth. All the images show forward-facing driving scenarios on the highway. The annotations and testing focus on the current lane and the left/right lanes.

4.2.2. Evaluation and Testing Metrics

To compare performance with other methods, we calculate accuracy using the default TuSimple metric:
$$\mathrm{Accuracy} = \frac{\sum_{clip} C_{clip}}{\sum_{clip} S_{clip}},$$
where $C_{clip}$ is the number of correctly predicted lane points in the current clip and $S_{clip}$ is the total number of ground-truth lane points. A lane point is taken as a true positive if its distance from the corresponding labeled lane point is less than or equal to 15 pixels, while lane points with a distance greater than 20 pixels are taken as negatives. False positives and false negatives are also reported, and the corresponding anchors are dropped. The testing results of the multihead lane detection model on the TuSimple dataset are shown in Figure 4.
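As an illustration of this metric, the sketch below computes accuracy for a single clip. The y-aligned point matching, the -2 "no point" convention, and the single 20-pixel threshold follow the public TuSimple evaluation convention and are assumptions for illustration, not code from this paper.

```python
import numpy as np

def clip_accuracy(pred_xs, gt_xs, thresh=20):
    """TuSimple-style accuracy for one clip.

    pred_xs, gt_xs: x coordinates sampled at the same fixed y positions
    (one value per row; -2 conventionally marks "no point").
    """
    pred_xs, gt_xs = np.asarray(pred_xs), np.asarray(gt_xs)
    valid = gt_xs != -2                          # rows where a ground-truth point exists
    correct = np.abs(pred_xs[valid] - gt_xs[valid]) <= thresh
    return correct.sum() / max(valid.sum(), 1)   # C_clip / S_clip

print(clip_accuracy([100, 150, -2, 210], [105, 160, -2, 260]))  # 2 of 3 points correct
```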

4.2.3. Results

To verify the accuracy of our model, we compare it with several state-of-the-art models, using different backbones such as ResNet-18 and ResNet-34. The quantitative results are shown in Table 1. Lane marker detection is widely applied under real-time constraints, so high runtime speed is required. From Table 1, we can see that the runtime speed of our proposed model reaches 167.5 to 170.2 FPS. Camera frame rates are generally around 30 to 60 FPS, so the model can keep up without becoming a bottleneck. More importantly, the algorithmic pipeline of autonomous driving consists of perception, prediction, planning, and control; the latency from perception to planning generally cannot exceed 100 ms, so perception itself should preferably not exceed 25 ms. The per-frame latency of our proposed models is between 5.875 ms and 5.970 ms, only 23.5% to 23.88% of that budget. Consequently, the model satisfies the real-time requirements. However, because the scenarios in the TuSimple dataset are relatively simple, our proposed model still has considerable room for improvement.

4.3. CULane Dataset
4.3.1. Dataset Introduction

The CULane dataset [52] comprises 55 hours of video covering urban, highway, and rural scenarios. All the images have a resolution of 1640 × 590. There are 133,235 frames in total, split into a training set of 88,880 frames, a validation set of 9,675 frames, and a test set of 34,680 frames. The test set covers 9 driving scenarios, namely normal, crowd, highlight, shadow, arrow, curve, cross, night, and no line.

4.3.2. Evaluation and Testing Metrics

To judge whether a model detects a lane marker correctly, we use the F1 measure, following the CULane dataset's official protocol. Each lane marking is treated as a line with a 30-pixel width, and predictions whose IoU with the ground truth is greater than 0.5 are treated as true positives. The testing results of the multihead lane detection model on the CULane dataset are shown in Figure 5. The metric is given as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
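A minimal sketch of the F1 computation from matched-lane counts follows; the counting of TP/FP/FN via IoU matching of 30-pixel-wide lane masks is assumed to be done upstream, as in the official CULane evaluation, and the example numbers are invented for illustration.

```python
def f1_score(tp, fp, fn):
    """F1 from counts of matched lanes (IoU > 0.5 between 30-pixel-wide lane masks)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 90 correctly matched lanes, 10 spurious predictions, 15 missed lanes.
print(round(f1_score(90, 10, 15), 4))   # ~0.878
```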

5. Results

The results of our model, along with those of other state-of-the-art models, are shown in Table 2. The CULane dataset is much more complex than the TuSimple dataset, with more challenging scenarios such as crowd, highlight, shadow, and night. We can see that our proposed model performs best in challenging scenarios such as crowd, highlight, and night; indeed, it achieves the best results in all challenging scenarios except shadow. Lane marker detection is also time sensitive. From the CULane results, the FPS of our proposed models is between about 167.8 and 170.5; that is, it takes 5.865 to 5.959 milliseconds from receiving an image to outputting lane marker points. From the earlier analysis, this satisfies not only the camera frame rate but also the real-time requirements of autonomous driving.

5.1. Ablation Study

This experiment evaluates the impact of the number of self-attention heads in our proposed model. In Table 3, we can see that the 2-head self-attention model achieves the highest accuracy, 95.76%, but the different head counts show no obvious difference in accuracy: the gap between the highest and the lowest is only 0.33%. In Table 4, we see that 8-head self-attention outperforms the other proposed models, improving F1 by 0.12%, while the 2-head model achieves the highest recall, leading the other two by 0.12%. In precision, the 8-head model outperforms the others by about 0.47%. Weighing these results, we choose 8-head self-attention for our lane detection model, since our main criteria are F1 and precision, although the differences are not large.

6. Conclusion

In this paper, we propose a lane marker detection network based on multihead self-attention. It combines row anchors with multihead self-attention to extract global information and thereby resolve challenging scenarios such as vehicle occlusion, and it achieves state-of-the-art performance. On the TuSimple dataset, our proposed method achieves the second-highest accuracy while being much faster than the top-ranked method [28]. On the CULane dataset, our proposed method outperforms the other methods. In addition, we find that our approach could be applied more widely to image classification problems. In [53], the pap smear image is segmented using an appropriate threshold, a texture descriptor called modified uniform local ternary patterns (MULTP) is proposed, and an optimized multilayer feed-forward neural network, whose numbers of hidden layers and hidden nodes are tuned with a genetic algorithm, is used to classify the pap smear images. In [54], a new version of the local binary pattern, called completed local quartet patterns, is proposed to extract local texture features of fabric images. These works [53, 54] are closely related to such classification problems. Although we have put forward a lane mark detection model, it still has some limitations: how to make every head independent so that it focuses on a different subspace, and how to set a rational number of anchors and offsets, both need further research. Besides that, the trade-off between computational efficiency and model complexity also needs to be studied. In the future, we will search for a new architecture that synthetically combines encoder-decoders, RNNs, and GANs.

Data Availability

Previously reported TuSimple and CULane data were used to support the findings of this study and are available at https://github.com/TuSimple/tusimple-benchmark and https://xingangpan.github.io/projects/CULane.html. These prior studies and datasets are cited at the relevant places within the text as references [16, 41, 47–52].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments