Abstract

Automotive intelligence has become a major trend in automotive technology. Complex road conditions directly affect driving safety and comfort, so improving the recognition accuracy of road type or road adhesion coefficient enhances the ability of vehicles to perceive the surrounding environment and contributes to vehicle intelligence. Because manually extracting image features is complicated and the choice of extraction method varies from one practitioner to another, this paper proposes a road surface condition identification method based on an improved AlexNet model, namely, the road surface recognition model (RSRM). First, the AlexNet network model is pretrained on the ImageNet dataset offline. Second, the weights of the shallow network structure after training, including the convolutional layers, are saved and migrated to the proposed model. In addition, the number of fully connected layers attached to the shallow network is reduced from three to two, which improves the training accuracy and shortens the training time. Finally, the traditional machine learning methods and the improved AlexNet model are compared in terms of adaptability, prediction output, and error performance, among others. The results show that the accuracy of the proposed model is better than that of the traditional machine learning method by 10% and that of the AlexNet model by 3%, and that it trains 0.3 h faster than AlexNet. These results verify that RSRM effectively improves the network training speed and the accuracy of road image recognition.

1. Introduction

As car ownership has risen continuously, traffic jams, delays, and accidents have increased as well. According to statistics [1, 2], 16.12% of traffic accidents on highways are attributed to slippery road conditions and the driver’s response to changes in terrain caused by road damage. To improve vehicle safety, research on vehicle safety control has gradually shifted its focus from passive safety to active safety. As an important part of the vehicle’s perception of the surrounding environment, road surface type recognition plays an important role not only in the power, smoothness, and comfort of intelligent driving vehicles but also in vehicle safety.

In the 1960s, while studying the neurons responsible for local sensitivity and direction selection in the cat cortex, Wiesel and Hubel [3] found that the unique network structure of these neurons could effectively reduce the complexity of feedback neural networks, which led to the proposal of the convolutional neural network (CNN). LeCun et al. [4] made a great breakthrough in optical character recognition by using a CNN, which promoted the development of computer vision. In recent years, CNNs have been widely used in many fields and have shown excellent performance in image target detection [5–7] and classification [8, 9]. The emergence of CNNs also provides a new solution for road condition recognition.

Several researchers and institutions have focused on pavement type identification and adhesion coefficient prediction. Chen [10] extracted the feature parameters of the gray-level cooccurrence texture matrix of the pavement image, studied the selection of pavement texture features, and achieved certain results. However, this method has the disadvantages of fewer image features and lower recognition accuracy. Bekhtike and Kobayashi [11] used a camera to collect pavement images and evaluated the texture attributes obtained from the fractal dimension using Gaussian process regression for function approximation and predicted road types by fusing road texture features and vibration data received from motion. This method still has some limitations. For instance, when the background lighting changes obviously, motion blur occurs, or if the road is covered by rain, snow, or ice, it is difficult to accurately identify the road type.

Ward and Iagnemma [12] successfully classified asphalt, paved, and gravel roads with acceleration sensors. This method has a drawback: when the roughness of different road surfaces is similar, acceleration data alone are insufficient to distinguish the road type. Alonso et al. [13] proposed a real-time acoustic pavement state recognition system based on tire noise, using a noise measurement system and a signal processing algorithm to classify the pavement state, and achieved accurate classification of wet and dry pavement states.

Neupane and Gharaibeh [14] proposed a method for detecting pavement types based on heuristic lidar and identified the pavement type by the mean and variance of the laser reflection intensity. This method is mainly used for asphalt pavement. Jonsson et al. [15] proposed road classification based on near infrared camera image spectral analysis, using KNN and support vector machine (SVM) methods to classify dry, wet, icy, and snowy roads and achieved certain results. Bystrov et al. [16] used automotive ultrasonic sensors to analyze reflected ultrasonic signals for road classification, with a recognition accuracy of up to 89%.

Meng [17] proposed a method based on the basic principles of machine learning to classify pavement types by combining data from vertical acceleration sensor signals and camera features. The accuracy of using an acceleration sensor or image data to identify road type was only 62% and 88%, respectively. When the two were combined, because of the small sample size, accuracy reached only 90%. Wang [18] classified and discriminated road images based on high-dimensional features and RBF neural networks and performed recognition experiments on eight different road images with an accuracy of approximately 78.4%. Based on the SVM, Zhao et al. [19] obtained the best classification model by PSO parameter optimization, classified the road types, and improved the recognition accuracy of the test image, achieving an accuracy rate of over 90% for the five basic road types.

Casselgren et al. [20] studied the light performance of asphalt pavements covered by water, ice, or snow. They conducted a detailed study of how light intensity changes with the angle of incidence and with the spectrum and proposed two different wavebands to classify road conditions. Linton and Fu [21] described a connected-vehicle-based winter road surface condition (RSC) monitoring solution that combines vehicle-based image data with data from road weather information systems. Jokela et al. [22] presented a method and evaluation results for monitoring and detecting road conditions (ice, water, snow, and dry asphalt).

The device developed by Jokela et al. is based on light polarization changes when light is reflected from the road surface. Its recognition capability was improved with texture analysis, which estimates the contrast content of an image, but the results show that the proposed solution does not yet adapt well to different conditions. Yeong [23] and Yu and Salari [24] developed pothole detection systems and methods using 2D LiDAR. Caltagirone [25] developed a method for road detection in point cloud top-view images using a fully convolutional neural network. However, according to the material presented in [26, 27], even LiDAR, which is regarded as the safest laser, can cause damage to the human eye during longer exposure (e.g., cataracts and burns of the retina). In the future, with the popularity of smart cars, this type of laser may become a problem. We therefore consider a method that improves road condition recognition through image vision.

To summarize, most road recognition algorithms are based on traditional machine learning, which takes manually extracted image features as the algorithm input. This extraction process involves a certain randomness, and the whole pipeline, including the classification algorithm, is complex. To solve these problems, this paper proposes a road surface condition identification method based on an improved AlexNet model, namely, the road surface recognition model (RSRM).

The main contributions of this paper can be summarized as follows:

(1) The AlexNet [28] network model is pretrained on the ImageNet [29] dataset offline. The weights of the shallow network structure after training, including the convolutional layers, are saved and migrated to the proposed model. In addition, the number of fully connected layers attached to the shallow network is reduced from three to two, which improves the training accuracy and shortens the training time.

(2) The traditional machine learning methods and the improved AlexNet model are compared in terms of adaptability, prediction output, and error performance, among others.

2. Research Method for Identifying Road Surface Conditions Based on Improved AlexNet Model (RSRM)

The traditional road type identification method has some limitations, such as a complex extraction process, weak adaptability, poor light robustness, low recognition accuracy, and difficulty in practical application. Meanwhile, the rapid development of artificial neural networks has also given birth to the progress of deep learning [30] in recent years. Common deep learning networks include autoencoders [31], deep belief networks [32], and CNNs. In deep learning, CNNs play a key role in image recognition. Road condition recognition belongs to the field of image recognition; therefore, in this study, the road image recognition model is built by combining CNNs and deep learning theory. With the help of CNN’s self-learning and training of road image features, the actual road types can be identified.

2.1. Convolutional Neural Networks (CNN)

A CNN [33] achieves high efficiency and accuracy in image recognition because the convolutional kernels in the hidden layers share parameters and the connections between layers are sparse. A CNN model is generally formed by alternately stacking convolutional layers and pooling layers, and the operation that each layer applies to its input is stored in the weights of that layer. The loss function is used to evaluate the difference between the output and target values. The optimizer uses this difference as the feedback signal to update the weights through the backpropagation algorithm [34], which reduces the loss value for the current target and makes the network prediction more accurate. The feature values output by the last pooling layer pass through the fully connected layers to form a feature vector, which is input to SoftMax [35] for classification and recognition. The CNN training process is shown in Figure 1.

2.1.1. Convolutional Layers

The convolutional layers principally perform the convolution operation on the image or feature map that is input into the layer, in order to extract features and output the convolved feature map. As shown in equation (1), each feature map of a convolutional layer is obtained by combining multiple feature maps output from the previous layer:

$$x_j^{l} = f\left(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\right), \qquad (1)$$

where $M_j$ is the feature map set filtered from the input feature maps, $x_j^{l}$ is the $j$th feature map in the $l$th layer, $k_{ij}^{l}$ is the $i$th element of the $j$th convolution kernel in the $l$th layer, $b_j^{l}$ is the $j$th offset of the $l$th layer, $f(\cdot)$ is the activation function, and “$*$” denotes the convolution operation.
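The computation of one output feature map can be illustrated with a short sketch. It is a minimal pure-Python illustration rather than the implementation used in this study: the function names `conv2d_valid` and `conv_layer_output`, the stride of 1, the absence of padding, and the use of ReLU as the activation are all assumptions made for the example. As in most deep learning toolboxes, the kernel is applied without flipping (cross-correlation).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding.

    Both arguments are lists of rows. The kernel is not flipped, so this is
    the cross-correlation that CNN toolboxes usually call convolution.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def relu(x):
    """Activation f(.) assumed for this example."""
    return max(0.0, x)


def conv_layer_output(input_maps, kernels, bias):
    """One output map: f(sum_i x_i * k_i + b), summing over the input maps."""
    summed = None
    for x, k in zip(input_maps, kernels):
        partial = conv2d_valid(x, k)
        if summed is None:
            summed = partial
        else:
            summed = [[a + b for a, b in zip(ra, rb)]
                      for ra, rb in zip(summed, partial)]
    return [[relu(v + bias) for v in row] for row in summed]


# Toy data (hypothetical values): two 3 x 3 input maps and two 2 x 2 kernels.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
diag_kernel = [[1, 0], [0, -1]]
pick_kernel = [[0, 0], [0, 1]]
feature_map = conv_layer_output([image, image], [diag_kernel, pick_kernel], bias=-2.0)
print(feature_map)
```

Each output element sums the contributions of all input maps before the offset and the activation are applied, which is the structure of equation (1).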

2.1.2. Pooling Layers

The pooling layer, also known as the lower sampling (downsampling) layer, is mainly used to reduce the amount of computation in feature extraction. The pooling layer retains the number of feature maps but changes their size; equation (2) represents the calculation process of the sampling layers:

$$x_j^{l} = f\left(\beta_j^{l}\, \mathrm{down}\left(x_j^{l-1}\right) + b_j^{l}\right), \qquad (2)$$

where $\mathrm{down}(\cdot)$ is the lower sampling (pooling) function, $\beta_j^{l}$ is the $j$th multiplication offset of the $l$th layer, and $b_j^{l}$ is the $j$th offset of the $l$th layer.

The lower sampling function is largely divided into mean-pooling and max-pooling. Mean-pooling calculates the average of all elements in the pooling area:

$$y_k = \frac{1}{|R_k|} \sum_{i \in R_k} a_i.$$

Max-pooling selects the maximum element in the pooling area:

$$y_k = \max_{i \in R_k} a_i,$$

where $R_k$ is the $k$th pooling area in the feature map, $a_i$ is the $i$th pixel value in $R_k$, and $y_k$ is the pooled output for that area.
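The two pooling functions can be compared on a small feature map. The sketch below is illustrative only; the function name `pool2d`, the 2 × 2 window, and the non-overlapping stride are assumptions chosen for readability and are not taken from the network configuration of this study.

```python
def pool2d(feature_map, size=2, stride=None, mode="max"):
    """Pool a 2D feature map (list of rows) over size x size areas.

    `stride` defaults to `size`, which gives non-overlapping pooling areas.
    """
    if stride is None:
        stride = size
    pooled = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            # Collect the pixel values a_i of one pooling area R_k.
            area = [feature_map[r + i][c + j]
                    for i in range(size) for j in range(size)]
            if mode == "max":
                row.append(max(area))
            elif mode == "mean":
                row.append(sum(area) / len(area))
            else:
                raise ValueError("mode must be 'max' or 'mean'")
        pooled.append(row)
    return pooled


# Hypothetical 4 x 4 feature map.
fmap = [[1, 3, 2, 4],
        [5, 7, 6, 8],
        [9, 11, 10, 12],
        [13, 15, 14, 16]]
print(pool2d(fmap, mode="max"))   # one value per 2 x 2 area
print(pool2d(fmap, mode="mean"))
```

In both modes the number of feature maps is unchanged while each map shrinks from 4 × 4 to 2 × 2, which is the size reduction described above.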

2.1.3. Fully Connected Layers

The fully connected layers are generally located at the end of the hidden layers of a CNN. Like a shallow neural network, they form a multilayer perceptron that nonlinearly combines the feature vectors output by the convolutional and pooling layers to produce the output.

2.1.4. Output Layers

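The output layers in a CNN usually follow the fully connected layers. For image classification problems, the output layers use a logistic function or a normalized exponential function (SoftMax function) to output classification labels. In SoftMax regression, the multiclassification label $y$ can take $k \geq 2$ different values. The training sample set is composed of $m$ labeled samples:

$$T = \left\{\left(x^{(1)}, y^{(1)}\right), \left(x^{(2)}, y^{(2)}\right), \ldots, \left(x^{(m)}, y^{(m)}\right)\right\},$$

where $y^{(i)} \in \{1, 2, \ldots, k\}$ are the classification labels and $x^{(i)}$ are the samples. The index $j$ denotes the different classes, and $p(y = j \mid x)$ is the estimated probability of class $j$. The probability that a single sample is classified into class $j$ is

$$p\left(y^{(i)} = j \mid x^{(i)}; \theta\right) = \frac{e^{\theta_j^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^{T} x^{(i)}}}.$$

For each sample, the regression produces a $k$-dimensional probability vector, given by

$$h_\theta\left(x^{(i)}\right) = \begin{bmatrix} p\left(y^{(i)} = 1 \mid x^{(i)}; \theta\right) \\ \vdots \\ p\left(y^{(i)} = k \mid x^{(i)}; \theta\right) \end{bmatrix} = \frac{1}{\sum_{l=1}^{k} e^{\theta_l^{T} x^{(i)}}} \begin{bmatrix} e^{\theta_1^{T} x^{(i)}} \\ \vdots \\ e^{\theta_k^{T} x^{(i)}} \end{bmatrix},$$

where $\theta_1, \theta_2, \ldots, \theta_k$ are the learning parameters, and the purpose of the factor $1 / \sum_{l=1}^{k} e^{\theta_l^{T} x^{(i)}}$ is to normalize the probabilities so that they sum to 1.

Through training on the sample set, the optimizer adjusts the parameters $\theta$ to minimize the value of the model loss function, which is defined as

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^{T} x^{(i)}}},$$

where $1\{\cdot\}$ is the indicator function, which equals 1 when its argument is true and 0 otherwise.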
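The SoftMax probabilities and the loss can be evaluated numerically with a few lines of code. The sketch below is a minimal illustration under stated assumptions: the helper names `class_scores`, `softmax`, and `softmax_loss` are hypothetical, class labels are zero-based in the code, and the maximum score is subtracted before exponentiation purely for numerical stability, which leaves the probabilities unchanged.

```python
import math


def class_scores(theta, x):
    """Linear scores theta_j^T x, one per class."""
    return [sum(t * xi for t, xi in zip(theta_j, x)) for theta_j in theta]


def softmax(scores):
    """Normalized exponential function; the outputs sum to 1."""
    top = max(scores)  # shifting by the maximum avoids overflow
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(theta, samples, labels):
    """Average negative log-probability of the true class over all samples."""
    total = 0.0
    for x, y in zip(samples, labels):
        probs = softmax(class_scores(theta, x))
        total -= math.log(probs[y])  # the indicator keeps only the true class
    return total / len(samples)


# Hypothetical toy problem: 3 classes, 2 features, all parameters zero.
theta = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
samples = [[1.0, 2.0], [0.5, -1.0]]
labels = [0, 2]
print(softmax(class_scores(theta, samples[0])))
print(softmax_loss(theta, samples, labels))
```

With all parameters equal to zero, every class receives the same probability, so the loss equals the logarithm of the number of classes; training lowers the loss from this starting point.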

2.2. AlexNet Network

Typical CNN models include GoogLeNet [36], VGG-16 [37], and AlexNet. GoogLeNet and VGG-16 have 22 and 16 layers, respectively. In theory, the deeper the model, the better the classification effect. Nevertheless, AlexNet was chosen for this study for three reasons.

First, as a classic model, AlexNet accelerated the development of deep learning and is a milestone in image recognition. Before this research, we compared AlexNet with VGG, GoogLeNet, and other networks, and AlexNet reached a higher recognition accuracy in a shorter time.

Second, although deeper models classify better in theory, the training process of a deep convolutional network is extremely difficult. For example, a large number of parameters leads to vanishing backpropagation gradients and to overfitting, and deeper networks consume more computing resources. Because the diversity of road images is low, AlexNet can meet the accuracy requirements of road image recognition while occupying fewer computing resources.

Third, road images are relatively simple, whereas the latest networks are usually designed to solve more complex image classification problems. AlexNet already solves the problem of road condition image recognition extremely well.

This is due to several advantages of the AlexNet network:

(1) During training, dropout is used to randomly ignore some neurons to avoid overfitting the model.

(2) Data augmentation [38] is applied to expand the sample set when training images are insufficient.

(3) Rectified linear units (ReLUs) [39] are used as the activation function of the network, which improves its nonlinearity and alleviates the problem of gradient dispersion (vanishing gradients).

ReLU is defined as follows:

$$f(x) = \max(0, x).$$

Figure 2 compares the ReLU and sigmoid [40] activation function curves. When x is greater than 0, the ReLU gradient is always the constant 1. The derivative of the sigmoid function, by contrast, has a bell shape similar to that of a Gaussian function and is not constant: it becomes smaller toward both ends of the sigmoid curve. Therefore, a network with ReLU as its activation function converges quickly, which helps accelerate training.
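The difference between the two gradients can be checked numerically. The following sketch is only an illustration of the comparison described above; the function names are hypothetical, and the sample points are arbitrary.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), largest at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, -2.0, 0.5, 2.0, 10.0):
    print(f"x={x:6.1f}  relu'={relu_grad(x):.1f}  sigmoid'={sigmoid_grad(x):.6f}")
```

The sigmoid derivative never exceeds 0.25 and shrinks rapidly for inputs of large magnitude, so the product of many such factors during backpropagation tends toward zero, whereas the ReLU derivative stays at 1 for every positive input.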

Figure 3 and Table 1 show the structure and parameters of AlexNet. The model is mainly composed of five convolutional layers and three fully connected layers. The numbers of convolution kernels in the five convolutional layers are 96, 256, 384, 384, and 256, respectively. The role of the pooling layers is mainly to reduce the size of the feature maps after convolution. The three fully connected layers have 4096, 4096, and 1000 nodes, respectively, so the SoftMax output can distinguish 1000 categories.
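To give a sense of scale for the fully connected layers listed above, the following sketch counts their weights and biases. The input size of 9216 (a 6 × 6 × 256 feature map after the last pooling layer) is the value of the standard AlexNet and is an assumption here, since Table 1 is not reproduced in the text; `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Assumed flattened input to the first fully connected layer: 6 * 6 * 256.
FLATTENED_FEATURES = 6 * 6 * 256
layer_sizes = [FLATTENED_FEATURES, 4096, 4096, 1000]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

for index, count in enumerate(per_layer, start=1):
    print(f"fully connected layer {index}: {count:,} parameters")
print(f"total: {total:,} parameters")
```

Under this assumption the three fully connected layers hold tens of millions of parameters, which is consistent with the motivation for reducing the number of fully connected layers to shorten training.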

2.3. Road Surface Recognition Model Based on RSRM

The AlexNet network was pretrained offline on the ImageNet database, which contains at least one million images, and the weights and parameters of each layer were obtained after training. The trained network has a strong ability to learn features, especially the curves, edges, and contours of an image. To improve the efficiency of network training and reduce the training time, this study takes the trained AlexNet network as the pretrained model and transfers its parameters to the RSRM using fine-tuning transfer learning [41]. The AlexNet, SVM, and BP models used for comparison keep their classic structures. The SVM algorithm relies on color and texture features extracted from the road images.

Like AlexNet, RSRM consists of convolutional layers, pooling layers, fully connected layers, and a SoftMax classification layer. By analyzing the characteristics of actual pavement images, nine typical pavement types were selected, as shown in Figure 4; therefore, a 9-label SoftMax is used to replace the original classifier in the AlexNet network. In addition, as shown in Figure 5, two fully connected layers are trained on the actual road pavement image set to replace the original three fully connected layers. The numbers of nodes in the two fully connected layers are 4096 and 1000, respectively. Through the above steps, the problem of road surface image classification and recognition is solved.

3. Experimental Settings

3.1. Road Surface Data Acquisition System

The road collection test vehicle was a sedan with a length of 3564 mm, width of 1620 mm, and height of 1527 mm. Its wheelbase was 2340 mm. The camera model was LeTMC-520. As shown in Figure 6, the camera was installed at the air intake grille in the front of the vehicle, at an angle of −10° from the horizontal grille. The installation height from the ground was 350 mm. In this study, considering the complex weather conditions in the actual driving process, three typical weather conditions, namely, cloudy, sunny, and rainy, were selected for road image data collection. Note that the images of the actual road test set are all taken by the vehicle during driving.

In addition, the road surface data analysis was performed on a desktop computer with a 64 bit operating system, 16 GB of memory, an AMD Ryzen 5 3600 6-Core Processor, and a GeForce GTX 1660 graphics processing unit.

3.2. Establishment of Road Surface Image Database

The image categories were selected according to the typical pavement types: asphalt, concrete, grass, mud, rain, rock, soil, wet asphalt, and wet concrete, and only images of clear quality were included in the road surface image database (RSID). The sample size for each pavement type was 2000, and the samples were divided into a training set and a test set in a ratio of 7 : 3.

3.3. Experimental Procedure
Step 1: Image preprocessing. Scaling and cropping operations are performed on all road surface images to ensure a uniform image size that meets the requirements of the neural network module in MATLAB.

Step 2: Building the training set and test set. RSID is divided into the training set and test set in a ratio of 7 : 3.

Step 3: Building the RSRM. Focusing on nine typical road surface types, the 9-label SoftMax is used to replace the original classifier in the AlexNet network. The trained AlexNet network is then used as the pretrained model, and its parameters are transferred to the RSRM using fine-tuning transfer learning.

Step 4: Model training. The model parameters are initialized using the stochastic approach; the momentum parameters, learning rate, and training time are set; and the parameters of the five convolutional layers and pooling layers are frozen. The parameters of the two fully connected layers and the 9-label SoftMax are replaced with fresh ones, which are learned during training.

Step 5: Model testing. The remaining 30% of the RSID is used as a test set to verify the accuracy and speed of the RSRM.
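Step 2 of the procedure above can be sketched as a per-class random split. This is a minimal illustration, not the MATLAB workflow used in the study: the function name `split_per_class`, the file-name pattern, and the fixed random seed are all hypothetical, and splitting each class separately is an assumption that keeps the 7 : 3 ratio within every pavement type.

```python
import random

PAVEMENT_TYPES = ["asphalt", "concrete", "grass", "mud", "rain",
                  "rock", "soil", "wet asphalt", "wet concrete"]


def split_per_class(items_by_class, train_ratio=0.7, seed=0):
    """Shuffle each class separately and cut it at `train_ratio`."""
    rng = random.Random(seed)
    train, test = [], []
    for label, items in items_by_class.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_ratio)
        train += [(item, label) for item in shuffled[:cut]]
        test += [(item, label) for item in shuffled[cut:]]
    return train, test


# Hypothetical file names standing in for 2000 images per pavement type.
database = {label: [f"{label}_{i:04d}.jpg" for i in range(2000)]
            for label in PAVEMENT_TYPES}
train_set, test_set = split_per_class(database)
print(len(train_set), len(test_set))
```

With 2000 images in each of the nine classes, this split yields 1400 training and 600 test images per class.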

4. Results and Analysis of Experiment

4.1. Experiment of Road Image Feature Extraction

The role of the convolutional layer is to extract features by performing a convolution operation on the image or feature map. First, we pretrained the improved AlexNet network model (five convolutional layers) on the ImageNet dataset. Second, the weights of the shallow network structure after training were saved and transferred to the RSRM. Finally, to observe the feature extraction effect of RSRM more clearly, the output features of each convolutional layer were visualized, taking the mud image as an example. Figure 7 shows the mud pavement image after preprocessing.

As shown in Figure 8(a), 96 feature maps are extracted from the preprocessed mud image by Conv1. This convolutional layer mainly extracts the edges and details of the image, and after its convolution kernel operations the output retains most of the information of the original image. As shown in Figure 8(b), after the convolved image is processed by the ReLU1 activation function, the edge and detail information of the mud road surface image becomes more obvious. Figures 8(c) and 8(d) show the feature maps after Conv3 and ReLU3. It can be seen that the convolution kernels extract more edge information and that the outline of the mud road surface image is clearer. Figures 8(e) and 8(f) show the feature maps after Conv5 and ReLU5. They also reveal that, as the network deepens from the first convolutional layer to the fifth, the resolution of the feature maps decreases and the output of the convolution kernels becomes increasingly abstract.

According to the above image feature extraction experiments, the convolutional layers integrate shallow, low-level features to form more abstract features. This makes the representation of road information more comprehensive and allows high-level abstract features to be used for pavement classification and recognition.

4.2. Experiments of Road Surface Type Recognition Based on RSRM

RSID contains 18000 images of nine pavement types, namely, asphalt, wet asphalt, rain, concrete, wet concrete, soil, mud, grass, and rock. To verify the validity of the RSRM proposed in this paper, 70% of RSID, a total of 12600 images, was randomly selected as the training set, and the remaining 30% was used as the test set. There were 600 images for each type of pavement in the test set, for a total of 5400 pavement images.

Table 2 shows the RSRM training parameter settings. The test tolerance is the number of iterations for which the test-set loss may be greater than or equal to the previously smallest loss before network training stops. Setting the test tolerance therefore stops training once the test loss is no longer decreasing, which avoids overfitting, saves computer memory, and improves training speed.
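The stopping rule defined by the test tolerance can be sketched as follows. The function name `stopping_iteration` and the loss values are hypothetical, and the sketch assumes that the counter is reset whenever a new smallest loss is observed, which the text does not state explicitly.

```python
def stopping_iteration(test_losses, tolerance):
    """Return the 1-based evaluation at which training stops, or None.

    Training stops once the test loss has been greater than or equal to the
    smallest loss seen so far for `tolerance` evaluations in a row.
    """
    best = float("inf")
    stalled = 0
    for step, loss in enumerate(test_losses, start=1):
        if loss < best:
            best = loss
            stalled = 0  # a new smallest loss resets the counter
        else:
            stalled += 1
            if stalled >= tolerance:
                return step
    return None


# Hypothetical sequence of test losses, for illustration only.
losses = [1.00, 0.80, 0.70, 0.72, 0.70, 0.71, 0.69]
print(stopping_iteration(losses, tolerance=3))
print(stopping_iteration(losses, tolerance=5))
```

A small tolerance ends training soon after the loss plateaus, whereas a larger tolerance allows the model to recover from a temporary plateau, as in the second call, where the final evaluation improves on the best loss before the tolerance is exhausted.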

Table 3 shows the classification results of the test samples based on RSRM. Table 4 lists the recognition results of some images based on RSRM.

According to Step 2, there are 600 samples per category in the test set. As can be seen in Table 3, one of the 600 asphalt pavement images was misidentified, giving a recognition accuracy of 99.8%. This is because part of the asphalt pavement was in a partly wet state, which makes its image characteristics extremely similar to those of wet asphalt pavement; thus, it was misidentified as wet asphalt. A total of 598 concrete pavement samples were correctly identified, and the remaining two were identified as soil and mud pavements, for an accuracy rate of 99.7%. The reason is that the color and image texture of some concrete, soil, and mud pavements are similar under dry conditions. For rain pavement, 598 samples were correctly identified, and the remaining two were misidentified as wet asphalt and wet concrete pavements, for an accuracy rate of 99.7%. For soil pavement, 599 samples were correctly identified, and the remaining one was classified as mud. A wet soil road surface often turns into a mud surface, and the high probability of these two pavement features co-occurring in a single image is the main factor leading to this misclassification. Of the 600 wet concrete samples, 580 were correctly recognized, 9 were identified as soil, 1 as rock pavement, and 10 as mud pavement. Thus, the recognition accuracy of wet concrete pavement is 96.7%. This is because concrete pavement takes on a brown-gray color after being wet. The recognition accuracies of grass, rock, and wet asphalt are higher than those of the other surfaces, owing to the significant differences in their color and texture features compared with other road images.
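The per-class accuracies quoted above follow directly from the counts of correctly identified samples. The short sketch below reproduces the arithmetic for the five classes whose counts are stated in the text; the variable and function names are hypothetical, and the classes whose counts are not given (grass, rock, and wet asphalt) are deliberately left out.

```python
TEST_SAMPLES_PER_CLASS = 600

# Correctly identified test samples, as stated in the text.
correct = {
    "asphalt": 599,
    "concrete": 598,
    "rain": 598,
    "soil": 599,
    "wet concrete": 580,
}

# Misclassified wet concrete samples, by predicted class.
wet_concrete_errors = {"soil": 9, "rock": 1, "mud": 10}


def accuracy_percent(n_correct, n_total=TEST_SAMPLES_PER_CLASS):
    """Recognition accuracy in percent, rounded to one decimal place."""
    return round(100.0 * n_correct / n_total, 1)


for label, n_correct in correct.items():
    print(f"{label:13s} {accuracy_percent(n_correct):5.1f}%")
```

The wet concrete counts are also internally consistent: 580 correct samples plus 20 misclassified ones account for all 600 test images of that class.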

4.3. Experiments of Classification Method Comparison

In this study, RSRM is compared against the AlexNet model, support vector machines (SVM), and backpropagation (BP) neural networks. The results are shown in Table 5. According to previous research, color and texture are the main features of road images. The SVM [42] classification model requires road image features to be extracted manually. In the three color spaces HSV, RGB, and YCM, there are nine color components of the road image, namely, H, S, V, R, G, B, Y, C, and M. The gray-level co-occurrence matrix is used to extract four texture features of the road surface images: contrast, correlation, energy, and entropy. The BP neural network [43–45] has five layers with 100 nodes in each layer, and its optimization algorithm is stochastic gradient descent.

In this section, RSRM is compared with the BP neural network, SVM, and AlexNet models in terms of prediction output, error performance, training time, and detection time.

As shown in Figures 9 and 10, RSRM significantly improves the accuracy of road surface identification compared with AlexNet. Specifically, RSRM converged at 216 iterations with an accuracy of 96.38%, whereas the AlexNet network had not yet converged after 500 iterations. Therefore, transfer learning and optimization of the fully connected layers can effectively improve the training efficiency and accuracy of the model. The AlexNet model requires a longer training time and a larger dataset to match the accuracy of RSRM.

Figure 11 and Table 5 illustrate the identification accuracy of the different methods. The average recognition accuracies of BP, SVM, AlexNet, and RSRM were 92.84%, 89.59%, 97.57%, and 99.48%, respectively. The accuracies of RSRM and AlexNet both exceed 95%, which shows the superiority of deep learning methods over traditional machine learning methods. Traditional methods such as SVM and BP neural networks rely on manually designed features and are therefore poorly suited to representing variations in illumination intensity. The SVM classifier is effective on small-scale datasets but difficult to apply to large-scale training samples, which is why it did not achieve good results on the road dataset. In image classification based on BP, the neurons of adjacent layers are fully connected, which leads to an excessive number of training weights and to overfitting. A CNN, by contrast, effectively reduces the number of training weights and improves the training speed through the convolution operation. Table 6 shows the average time taken by each learning model to classify a test image; the test times of all models for a given road image are almost the same.

In addition, this study uses a test tolerance threshold to stop model training and reduces the number of fully connected layers. On this basis, the SoftMax classifier for nine labels is designed. The training time was 1.6 h for SVM, 2.2 h for BP, 1 h for AlexNet, and 0.4 h for RSRM (excluding pretraining time), and classifying a test image took 0.14 s. The training speed of RSRM is thus four times that of SVM and about five times that of BP. Meanwhile, the recognition accuracy of RSRM was 1.91% higher than that of AlexNet, 6.64% higher than that of BP, and 9.89% higher than that of SVM. RSRM can therefore effectively improve the training efficiency and accuracy of the model.
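The accuracy margins and speed ratios quoted above can be reproduced from the reported figures. The sketch below is only an arithmetic check; the dictionary names are hypothetical, and the 1.6 h figure is read as the SVM training time, which is what the stated fourfold speed-up over SVM implies.

```python
# Average recognition accuracy in percent, as reported for each method.
accuracy = {"BP": 92.84, "SVM": 89.59, "AlexNet": 97.57, "RSRM": 99.48}

# Training time in hours; 1.6 h is assumed to belong to SVM (see lead-in).
train_hours = {"SVM": 1.6, "BP": 2.2, "AlexNet": 1.0, "RSRM": 0.4}

# Accuracy gain of RSRM over each baseline, in percentage points.
gain = {name: round(accuracy["RSRM"] - value, 2)
        for name, value in accuracy.items() if name != "RSRM"}

# How many times faster RSRM trains than each baseline.
speedup = {name: round(hours / train_hours["RSRM"], 2)
           for name, hours in train_hours.items() if name != "RSRM"}

# Training time saved relative to AlexNet, in hours.
saved_vs_alexnet = round(train_hours["AlexNet"] - train_hours["RSRM"], 2)

print(gain)
print(speedup)
print(saved_vs_alexnet)
```

The ratio for BP evaluates to 5.5, which the text rounds down to five, and the difference between the AlexNet and RSRM training times is 0.6 h.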

In summary, the BP neural network is not suitable for recognizing road image databases that cover many road types: because of its large number of neurons, the number of network layers cannot be too large and the computing time is long, which easily leads to overfitting and makes high-dimensional data inconvenient to process. SVM feature extraction is complex and only suitable for small datasets. The proposed method not only achieves fast and high-precision recognition of road surface types in a short training time but also meets the perception requirements of actual road conditions.

5. Conclusion

This paper presents a pavement identification method based on an improved AlexNet model. First, the AlexNet network model is pretrained on the ImageNet dataset offline. Second, the weights of the shallow network structure after training, including the convolutional layers, are saved and migrated to the proposed model. In addition, the number of fully connected layers attached to the shallow network is reduced from three to two, which improves the training accuracy and shortens the training time, and a 9-label SoftMax replaces the original classifier in the AlexNet network. The proposed method is then compared with the BP neural network, SVM, and AlexNet models in terms of prediction output, error performance, and speed. The results show that the recognition accuracy of RSRM is 99.48%, which is higher than that of AlexNet, BP, and SVM by 1.91%, 6.64%, and 9.89%, respectively. Moreover, using a test tolerance threshold to stop model training and reducing the number of fully connected layers saves 0.6 h of training time and increases the training speed to four times that of SVM and five times that of BP. In conclusion, the deep learning model not only has higher accuracy than the traditional machine learning methods but also achieves this accuracy in a shorter time, which meets the perception requirements of actual road conditions. The research method is suitable not only for road recognition but also for human-vehicle-road collaborative perception of the vehicle environment.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the National Natural Science Foundation of China, under grant no. 51575232, and the Shanghai Youth Science and Technology Talents Sailing Project, under grant no. 19YF1434600.