Abstract

Automated recognition of road surface objects is vital for efficient and reliable road condition assessment. Despite recent advances in computer vision algorithms, it is still challenging to analyze road images due to low contrast, background noise, object diversity, and the variety of lighting conditions. Motivated by the need for improved pavement object classification, we present the Dual Attention Convolutional Neural Network (DACNN) to improve the performance of multiclass classification using intensity and range images collected with 3D laser imaging devices. DACNN fuses heterogeneous information in intensity and range images to better distinguish foreground from background, as well as to improve object classification in noisy images under various illumination conditions. DACNN also leverages multiscale input images to capture contextual information for classifying objects of different sizes and shapes. DACNN contains an attention mechanism that (i) considers semantic interdependencies in spatial and channel dimensions and (ii) adaptively fuses scale-specific and mode-specific features so that each feature has its own level of contribution to the final decision. As a practical engineering project, a dataset is collected from road surfaces using 3D laser imaging. DACNN is compared with four deep classifiers that are widely used in transportation applications. Experiments show that DACNN consistently outperforms the baselines by 22–35% on average in terms of the F-score. A comprehensive discussion is also presented regarding computational costs and how robustly the investigated classifiers perform on each road object.

1. Introduction

Automation in road condition assessment is a crucial yet challenging task in smart transportation management. The goal is to label various road objects in pavement images and to establish appropriate maintenance and repair strategies to ensure road serviceability and safety. Manual road assessment, however, is labor intensive, time-consuming, and inconsistent. Automated road object detection is an alternative way for objective and scalable assessment of road networks. Fast and accurate automated road assessment can be used as quantitative data for optimal maintenance and rehabilitation practices to improve road performance and decrease the overall life-cycle cost.

To automate the road condition assessment, data are usually collected by surveying vehicles equipped with digital cameras that acquire images from pavement surfaces at high speed. There are two main high-resolution imaging techniques frequently used in road survey projects: (i) two-dimensional (2D) imaging technology in which line-scanning cameras are used to generate 2D intensity images; (ii) three-dimensional (3D) imaging technology that provides additional range (depth) images in addition to the intensity images. Recently, the 3D imaging technology has been increasingly adopted by state and local transportation agencies for data collection of road networks [1, 2]. The 3D imaging equipment employs high-resolution laser imaging devices associated with a high-precision inertial measurement unit (IMU) to capture 3D pavement surface profile data at highway speed. One of the main advantages of the 3D technology is that it is less sensitive to light effects and less prone to noises coming from oil or water stains, dirt or sand, skid marks, etc. Furthermore, the combination of intensity and range images provides additional information to model object boundaries and global layouts and to better recognize pavement defects.

Despite the advantages of the new 3D imaging technology, the existing literature [3–6] lacks investigations that quantify how much the additional range images of 3D technology improve road object detection compared to traditional 2D technology relying on intensity images only. Existing studies address the recognition of pavement defects, mostly cracks, using intensity images by employing deep convolutional neural networks (CNNs) [7–9]. CNNs have been successfully employed for various visual recognition tasks including image classification [10, 11], object detection [12], and semantic segmentation [13]. Although CNNs have demonstrated good performance on pavement defect recognition using intensity images, the performance tends to degrade when detecting defects in complex scenes. The complexity comes from intensity inhomogeneity, low contrast, background noise, object diversity in terms of shape and size, variety of lighting conditions, etc., when using intensity images only. For example, when there is low contrast between cracks (as the foreground) and asphalt (as the background) or when dealing with thin cracks, it is difficult to distinguish between background and foreground based on intensity data alone. In the case of objects with similar color and texture (such as crack seals and patches), it is easy to misclassify those objects into the same category. Moreover, intensity-based features extracted from 2D pavement images are sensitive to illumination differences among images. The abovementioned limitations motivate the joint use of range and intensity images to enhance the classification of pavement objects. Figure 1 shows a surveying vehicle equipped with the 3D laser imaging device developed by the Korea Institute of Civil Engineering and Building Technology (KICT) and used in this study, and a sample of intensity and range images collected by the system.

We present the novel Dual Attention Convolutional Neural Network (DACNN) to utilize additional range input images along with intensity images to improve pavement object classification. In this paper, DACNN classifies pavement tiles into 8 classes, including crack, crack seal, patch, pothole, marker, manhole, curbing, and asphalt. DACNN leverages multiscale input tiles that capture scale-sensitive information for multiclass classification of various road objects with different sizes and shapes. Furthermore, DACNN adopts two attention modules to effectively fuse heterogeneous features in terms of (i) scales (multiscale input tiles) and (ii) modes (range and intensity tiles). The scale and mode attention modules focus on spatial and channel-related informative features and suppress the noninformative ones for performance improvement. The dual attention mechanism is designed to identify semantic image regions relevant to specific pavement objects. Pruning feature maps in both spatial and channel dimensions enhances the quality of feature representation, contributing to more accurate and efficient object classification.

The contribution of this study is not limited to the architectural design of DACNN. We also evaluate the effectiveness of the additional range data in 3D technology over 2D technology through quantitative comparison using different CNN models, including VGG16, VGG19, ResNet50, DenseNet121, as well as the DACNN. The goals of the above comparisons are (i) to understand the effects of the additional range data on object classification, (ii) to understand how the scale and mode attention modules can effectively fuse heterogeneous information to improve object classification, and (iii) to understand the effects of CNN model selection on the number of trainable variables, training time, inference time, and classification accuracy. Our main contributions in this paper are summarized as follows:

(i) We present the new DACNN framework to systematically utilize both intensity and range images collected with 3D imaging devices for multiclass classification of pavement images. Considering the variety of pavement objects and surveying field conditions, DACNN extracts scale-specific and mode-specific features from images robustly. The dual attention mechanism used in DACNN is designed to adaptively fuse multiscale multimodal features, helping the network to capture discriminative object-specific features related to their spatial and channel information.

(ii) The classification performance comparison is conducted for 8 different pavement objects using CNN models. The results show that our DACNN outperforms other models for all road object classes. We also present quantitative comparisons to understand how the additional range images in 3D technology can improve object classification performance for the compared CNN models.

2.1. Deep Learning in Pavement Assessment

Conventional image processing and more recent deep learning methods are two main approaches for automated pavement image analysis. The image processing methods can be considered as feature engineering techniques in which images are represented with human-specified feature vectors. They can be sorted into intensity-thresholding [14], edge detection [15], wavelet transforms [16, 17], and texture-analysis [18, 19]. A major problem with the conventional methods is that the prediction performance mainly relies on the validity of human-specified features. Extracting those features can be subjective, domain-specific, and inefficient, which makes the detection process ungeneralizable and tedious. Especially in pavement applications, hand-crafted features are not robust enough to detect distresses in the complex background with high variations. For instance, thresholding approaches for crack detection only achieve acceptable results under certain scenarios. If there exists a complex background or the illumination changes, either the parameters should be adjusted or the method is not applicable to the new scene.

Deep learning methods overcome the drawbacks of conventional image processing methods by automatically capturing complex structures of data with multiple processing layers. CNNs are the most studied deep learning models using vision-based input data in which automated feature learning is done at many different levels of abstraction to catch the topology of input images. Partial connections, weight sharing, and pooling layers in CNNs not only decrease the computations but also demonstrate state-of-the-art results in computer vision tasks [20, 21]. Detection, classification, and segmentation of pavement distress, especially cracks, are the three main branches of deep learning research in automated pavement assessment. Alfarrarjeh et al. [22] employed YOLO [23] as the object detection method to detect distresses, including cracks, potholes, and rutting, in pavement images. Maeda et al. [24] adopted SSD [25] as the training algorithm to detect the same defects on pavement surfaces. Song et al. [26] utilized the Faster R-CNN [27] algorithm to detect pavement distresses, including cracks, potholes, and bleeding. Li et al. [28] presented a CNN model to classify pavement tiles into different types of cracks including longitudinal, transverse, alligator, and block cracks. Gopalakrishnan et al. [29] utilized a VGG16 [30] pretrained on ImageNet and then fine-tuned it on a pavement dataset for binary crack classification. Lau et al. [31] proposed a U-Net [32] based model in which the encoder is a pretrained ResNet34 [33] to segment pavement crack images. Inspired by SegNet [34], Chen et al. [35] proposed a fully convolutional neural network (FCNN) to detect pavement cracks at pixel level.

2.2. Attention in Deep Learning

The performance of deep learning-based approaches has been constantly improving through new architectural designs, and the attention mechanism is one of them. The main idea behind an attention mechanism is to give higher weights to relevant features while suppressing the irrelevant ones by giving them lower weights. By focusing on the distinctive parts when processing large amounts of information, the attention mechanism enhances the quality of feature representation, contributing to a more accurate and efficient performance of the designed network. Attention was initially proposed by [36] for machine translation. Then, it was employed for various tasks, such as action recognition [37–39], speech recognition [40, 41], image captioning [42, 43], and recommendation [44, 45]. More specifically, the attention mechanism is investigated in the computer vision community in three aspects: (i) spatial attention in which the network learns the locations that should be focused on [46, 47]; (ii) channel attention in which the network adaptively recalibrates channel-wise features by modeling interdependencies between channels [48, 49]; and (iii) self-attention in which long-range dependencies are captured by the network [50, 51]. In pavement applications, attention modules have also been applied for defect detection. Song et al. [52] presented a channel attention module to detect and classify different types of cracks in pavement images. Wan et al. [53] proposed an encoder-decoder network, called CrackResAttentionNet, containing spatial and channel attention modules after each block in the encoder to segment pavement cracks. Similarly, Qiao et al. [54] proposed CrackDFANet in which a channel-spatial attention module is designed to increase the generalization ability of the model in predicting cracks under different road conditions. Wang et al. [55] proposed using DenseNet121 as an encoder and a spatial attention module to combine multiscale features. Eslami et al. [56] designed a channel-spatial attention module to adaptively fuse multiscale features for pavement image classification. Zhou et al. [57] presented a VGG16-based network to predict crack maps, and employed spatial and channel attention modules to further refine the model. Qu et al. [58] employed Res2Net [59] along with an attention module to capture global context and long-range dependency for better pavement segmentation. Pan et al. [60] proposed SCHNet with VGG19 as the base net in which a self-attention module is designed to capture global as well as semantic interdependencies in the channel and spatial dimensions. Finally, Li et al. [61] proposed a self-attention module along with a scale-attention module to enhance feature representation for pavement crack segmentation.

In this study, we propose a dual attention approach to capture semantic interdependencies in both spatial and channel dimensions for the scale and type of input images. The dual attention mechanism quickly focuses on the more important features and enhances the representation of more relevant features for better classification performance. The dual attention approach enables modeling global context as well as multimodal features to improve classification performance for both small objects (e.g., cracks) and large objects (e.g., patches), between which other CNN models face a trade-off.

2.3. 3D Image Data in Pavement Assessment

Most of the existing deep learning studies in transportation applications were based only on intensity images acquired with 2D imaging devices. With 2D intensity input images, CNNs suffer from some important limitations. The complexity of scenes, diversity of objects, background noise (stains, oil spills, and tire marks), and surrounding changes (light and shadow) make it difficult to distinguish foreground objects (defects) from the background (asphalt) in 2D images. With the advances in sensor technology, 3D imaging systems are available and increasingly employed by state and local transportation agencies for automated road condition assessment. A survey showed that 18 states in the U.S. adopted a 3D data collection system by 2017, and 17 states intended to utilize this technology by 2019 [1]. Different approaches have been studied for transportation applications such as GPR, LiDAR, Microsoft Kinect, and laser profilers [3]. In pavement applications, laser profilers are commonly used in surveying road roughness and megatexture (ASTM E950, ASTM E1926, and ISO 13473–5) [62–64]. Other techniques have limitations such as relatively low resolution (in the case of LiDAR) or low frequency (in the case of Microsoft Kinect) when collecting road surface profiles. The 3D laser imaging technique, such as the Laser Crack Measurement System (LCMS) [65], is commercially available to collect high-resolution road surface profiles. This system utilizes surveying vehicles equipped with two laser imaging devices (left and right) and an IMU. Using the 3D imaging system, intensity and range images can be acquired at speeds up to 100 km/h on road lanes with 4 m width under various lighting conditions. The 3D laser imaging technology has been used to evaluate crack [66, 67], pothole [68], raveling [69], rutting [70], joint [71], and texture [72]. Ghosh et al. [73] employed YOLO and Faster R-CNN to detect cracks in range images collected by the 3D imaging system. Yang et al. [74] utilized 3D laser technology to measure the growth of crack lengths when they are sealed and non-sealed to quantify the crack sealing benefit. Li et al. [28] proposed a CNN framework to classify range images into transverse cracks, longitudinal cracks, block cracks, and alligator cracks. Lang et al. [67] proposed a clustering-based algorithm to classify range images into the same categories of cracks as Li et al. [28]. Fei et al. [75] presented a deep CNN, called CrackNet-V, to segment cracks on asphalt range images. Li et al. [76] applied a filter-based method to segment cracks using 3D pavement images. Zhang et al. [77] proposed a recurrent neural network (RNN), called CrackNet-R, to detect pavement cracks at pixel level in range images. Gui et al. [78] utilized 3D laser scanning to detect pavement cracks by extracting hand-crafted features. Tsai and Chatterjee [68] proposed a threshold-based method to detect pavement potholes in range images collected by 3D laser technology. Zhang et al. [79] proposed a CNN-based architecture, called CrackNet, to segment cracks in 3D pavement images. Zhang et al. [80] improved the crack segmentation results on 3D pavement images by proposing a deeper network, CrackNetII, in which the need for hand-crafted features is eliminated. Li et al. [81] presented a frequency analysis to detect pavement cracks from background texture in range images.

While there are existing studies using 3D laser imaging technology, they are limited to the use of either range or intensity images. In this study, we show that extracting features from both intensity and range (depth) images can significantly improve the CNN performance. We also show that by fusing intensity-specific and depth-specific features systematically, one can robustly and accurately classify not only cracks but also other pavement objects, including crack seals, patches, potholes, markers, manholes, and curbing in multiclass classification.

3. Data Preparation

3.1. Ground-Truth Labeling

The dataset used in this study contains 296 intensity images and the same number of range images with a size of 3700 × 10000 pixels and a spatial resolution of 1 mm/pixel. The gray-scale intensity and range images are collected by the 3D laser imaging device developed by the Korea Institute of Civil Engineering and Building Technology (KICT) shown in Figure 1. The technical specifications of this device are provided in Table 1.

We provide pixel-level annotations of road objects for 8 categories, including 4 distress classes (crack, crack seal, patch, and pothole), 3 nondistress classes (marker, manhole, and curbing), and 1 pavement class (asphalt) as the background. We annotate the intensity images using an in-house developed semiautomated software that makes the annotation process fast yet accurate. The annotation procedure is performed in two steps: (i) labeling area objects (all classes except for cracks) and (ii) labeling linear objects (i.e., cracks). To label area objects, the original image, shown in Figure 2(a), is grouped into homogeneous regions, called superpixels [82, 83]. As shown in Figure 2(b), superpixel segmentation preserves the edges and boundaries of objects. Therefore, superpixel-level labeling, rather than pixel-level labeling, can be performed, which reduces the labeling work significantly. To further facilitate the annotation process, an unsupervised mean shift clustering is applied, which groups the neighboring superpixels into a bigger cluster. The result of the superpixel clustering procedure is shown in Figure 2(c). Then, the human annotator can easily select the clusters that belong to the same object and label them. Also, the annotator is able to define new segments that are missed by the clustering algorithm. Figure 2(d) demonstrates the final pixel-level labeling mask. Although the superpixel segmentation technique is beneficial for labeling area objects in the dataset, it is not effective for labeling linear objects such as cracks. To label cracks, a morphological technique, called MorphLink-C, is employed to extract crack pixels in original images. MorphLink-C, proposed by Wu et al. [84], consists of a series of morphological operations. The original image in Figure 3(a) is a zoomed-in pavement image for better visualization of the existing crack. The cracks detected by MorphLink-C are shown in Figure 3(b) with the bounding boxes. From the detected candidates, the human annotator can select the correctly detected cracks within the image, as shown in Figure 3(c).

Figure 4 demonstrates the contents of different objects in the dataset. We observe that the population of road object pixels is highly imbalanced; for example, there are more than three million asphalt pixels but only slightly more than 4000 crack seal pixels in the dataset. Detecting objects with high variations in shape and size within a highly imbalanced dataset is a major challenge in pavement applications.

3.2. Data Preprocessing

In road surveying projects, the depth information in range images is often used to measure the macrotexture of the pavement surface (ISO 13473–1) [64]. Although the depth resolution of the laser device on an absolute millimeter scale is important to determine the mean profile depth (MPD) in macrotexture surveying, a small variation in surface profile (e.g., crack depth) and low contrast in range images could be a disadvantage in road object detection. To enhance the contrast, histogram equalization (HE) can be applied to range images. HE enhances the contrast by effectively spreading out the most frequent intensity values (stretching out the intensity range of the image). It allows areas with lower local contrast to obtain a higher contrast. In this study, Contrast Limited Adaptive Histogram Equalization (CLAHE) [85] is applied to the range images. CLAHE differs from ordinary HE algorithms in two ways: (i) An adaptive HE computes several histograms, each corresponding to a small region of the image, rather than computing one histogram for the entire image. Therefore, it improves the local contrast and edges in each region of the image. (ii) CLAHE sets a threshold to limit the contrast in each small region. The contrast limiting procedure prevents over-enhancement and amplification of noise in the image. Figure 5(a) shows a range image with cracks spreading all over the image. Also, the intensity distribution of the image and the cumulative distribution are presented for the range image as histogram and cdf, respectively. Figure 5(b) demonstrates the range image after CLAHE enhancement and its corresponding histogram and cdf. We can see that the visibility of cracks is improved by redistributing the lightness values of the image without introducing noise to the image. Comparing the histograms before and after applying CLAHE to the image, the intensity range of the road image is expanded within the lower range (dark pixels 0–50) by redistribution of the values, as shown in Figure 5(b).
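The contrast-limiting step of CLAHE can be illustrated on a single region. The following is a minimal NumPy sketch, not the implementation used in this study: it clips the histogram of one 8-bit tile, redistributes the clipped counts uniformly, and maps pixels through the resulting cdf. The function name `clip_limited_equalize` and the default `clip_limit` are assumptions made for illustration, and the sketch omits the division of the image into regions and the interpolation between neighboring regions that a full CLAHE performs.

```python
import numpy as np

def clip_limited_equalize(tile, clip_limit=4.0, n_bins=256):
    """Equalize one uint8 tile with a clipped histogram (per-region step of CLAHE).

    clip_limit is a hypothetical default, expressed as a multiple of the
    mean bin height; the source does not report the value it used.
    """
    hist = np.bincount(tile.ravel(), minlength=n_bins).astype(np.float64)
    # Bins taller than the limit are clipped; this bounds the slope of the cdf
    # and therefore the amount of contrast amplification.
    limit = clip_limit * tile.size / n_bins
    excess = np.sum(np.maximum(hist - limit, 0.0))
    hist = np.minimum(hist, limit)
    # Redistribute the clipped counts uniformly so the total count is preserved.
    hist += excess / n_bins
    cdf = np.cumsum(hist) / tile.size
    lut = np.round(cdf * (n_bins - 1)).astype(np.uint8)
    return lut[tile]

# Usage: a synthetic low-contrast tile whose values lie in the dark range 20..40.
rng = np.random.default_rng(0)
dark_tile = rng.integers(20, 41, size=(64, 64), dtype=np.uint8)
enhanced = clip_limited_equalize(dark_tile)
```

A lower `clip_limit` flattens the mapping toward the identity, while a very large value reduces the function to ordinary histogram equalization of the tile.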

After the contrast enhancement of range images, we divide the original images into nonoverlapping 50 × 50 tiles to conduct multiclass classification experiments on pavement images. Then, each image tile is assigned to one of 8 categories of road objects. When a 50 × 50 tile has more than one class of pixels, the tile class is determined by a majority vote over the pixel counts of the nonbackground classes, if any exist; otherwise, the tile is classified as the background (asphalt). By aggregating the assigned classes for all tiles generated from an original image, a segmentation mask with a resolution of 50 × 50 mm² can be produced. The choice of 50 × 50 tiles has two reasons: (i) Due to the large size of the original images (3700 × 10000), the segmentation task on the whole image is memory intensive and not practical; (ii) a 50 × 50-pixel tile, equivalent to 50 × 50 mm², is small enough to contain only one pavement object for the classification task. Therefore, assembling the classification results into the whole image produces a segmentation mask with a high resolution, which is satisfactory in pavement applications. Although having small input tiles results in high-resolution segmentation masks, it sacrifices the contextual information required by the deep networks to perform well. Due to the importance of contextual information for the classification task, we generate 250 × 250 and 500 × 500 tiles surrounding each 50 × 50 tile with the same center. Feeding multiscale tiles into the deep networks improves the classification performance of the smallest tile, which will be explained in Section 4.1.
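The tile labeling rule described above can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions rather than the authors' code: class id 0 is assumed to denote asphalt, the function names `tile_label` and `label_tiles` are hypothetical, and ties between nonbackground classes are broken in favor of the lowest class id, which the text does not specify.

```python
import numpy as np

BACKGROUND = 0  # assumed class id for asphalt

def tile_label(mask_tile, background=BACKGROUND, n_classes=8):
    """Majority vote among nonbackground pixels; background only if nothing else is present."""
    counts = np.bincount(mask_tile.ravel(), minlength=n_classes)
    counts[background] = 0  # background pixels never compete in the vote
    if counts.sum() == 0:
        return background
    return int(np.argmax(counts))

def label_tiles(mask, tile=50):
    """Assign one class to every nonoverlapping tile of a pixel-level mask."""
    rows, cols = mask.shape[0] // tile, mask.shape[1] // tile
    labels = np.empty((rows, cols), dtype=np.int64)
    for r in range(rows):
        for c in range(cols):
            block = mask[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
            labels[r, c] = tile_label(block)
    return labels

# Usage: a small synthetic mask with two object classes inside the first tile.
demo_mask = np.zeros((100, 150), dtype=np.int64)
demo_mask[0:10, 0:10] = 1   # 100 pixels of class 1
demo_mask[0:5, 10:20] = 3   # 50 pixels of class 3
demo_labels = label_tiles(demo_mask)
```

In the first tile, class 1 wins even though asphalt covers most of the pixels, which reflects the rule that background is assigned only when no object pixel is present.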

4. Method

4.1. Dual Attention Convolutional Neural Network Architecture

The Dual Attention Convolutional Neural Network (DACNN), illustrated in Figure 6, is presented to classify pavement image tiles into one of the 8 existing classes in the dataset. The DACNN provides a systematic way of fusing heterogeneous input images that differ in (i) mode (i.e., intensity and range images) and (ii) scale (i.e., 50 × 50, 250 × 250, and 500 × 500), which is more effective than a simple feature concatenation. For this, the DACNN consists of two main streams for the intensity and range modes, which are merged later by a mid-fusion strategy (i.e., mode-level attention module). Each mode stream consists of three scale streams to extract multiscale features, which are combined later using a mid-fusion strategy (i.e., scale-level attention module). The high-level architecture of the DACNN is shown in Figure 6.

Multiscale Input Tiles. Input tiles are extracted from the original intensity and range images at three scales, 50 × 50, 250 × 250, and 500 × 500. All the input tiles are resized to 50 × 50 before they are fed to the DACNN.
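The extraction of concentric multiscale tiles can be sketched as below. This is a hedged NumPy illustration, not the pipeline used in this study: the function name `multiscale_tiles` is hypothetical, reflect padding at the image border is an assumption, and block-mean downsampling stands in for the resize step because the text does not state which interpolation is used.

```python
import numpy as np

def multiscale_tiles(image, row, col, base=50, scales=(50, 250, 500)):
    """Return one base x base tile per scale, all centered on the base tile at (row, col)."""
    pad = max(scales) // 2
    # Assumption: reflect-pad so that context windows near the border stay inside the array.
    padded = np.pad(image, pad, mode="reflect")
    cy = row * base + base // 2 + pad
    cx = col * base + base // 2 + pad
    tiles = []
    for s in scales:
        half = s // 2
        window = padded[cy - half:cy + half, cx - half:cx + half]
        f = s // base  # integer downsampling factor (1, 5, 10 for the default scales)
        # Assumption: block-mean downsampling to base x base.
        tiles.append(window.reshape(base, f, base, f).mean(axis=(1, 3)))
    return np.stack(tiles)

# Usage: a synthetic ramp image; tile (5, 6) covers rows 250..299 and columns 300..349.
ramp = np.arange(600 * 700, dtype=np.float64).reshape(600, 700)
stack = multiscale_tiles(ramp, row=5, col=6)
```

The three returned tiles share the same center, so the coarser tiles add surrounding context to the 50 × 50 tile at the cost of spatial detail.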

Feature Extraction (Scale). A conventional way to combine multiscale multimodal input data is to concatenate them directly at the input level. This approach has the disadvantage that only similar patterns will be captured across the scales and modes. Instead of concatenating heterogeneous input data in an early fusion, we propose to feed input tiles to 6 separate CNNs to extract scale-specific and mode-specific features. Each CNN consists of three convolution layers with the filter numbers 32, 32, and 64, respectively. The filter size is 3 × 3 pixels for all convolution layers. Each convolution layer is followed by a Batch Normalization layer and a rectified linear unit (ReLU) activation, which are not shown in Figure 6 because of space limitations. It should be noted that up to this point the extracted feature maps are processed independently at each scale and mode level.
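As a rough worked example, the number of trainable parameters in one such feature-extraction stream can be estimated from the layer description above. The sketch assumes a single-channel (gray-scale) input, convolution layers with bias terms, and two trainable parameters per channel in each Batch Normalization layer; none of these details is confirmed by the text, and the helper names are hypothetical.

```python
def conv_params(k, c_in, c_out, bias=True):
    """Weights of a k x k convolution plus an optional bias per output channel."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def bn_params(channels):
    """Trainable scale (gamma) and shift (beta) per channel."""
    return 2 * channels

def stream_params(c_in=1, filters=(32, 32, 64), k=3):
    """Trainable parameters of one scale-specific, mode-specific CNN stream."""
    total = 0
    for c_out in filters:
        total += conv_params(k, c_in, c_out) + bn_params(c_out)
        c_in = c_out
    return total

per_stream = stream_params()     # 384 + 9312 + 18624
all_streams = 6 * per_stream     # 3 scales x 2 modes
print(per_stream, all_streams)   # 28320 169920
```

Under these assumptions the six independent streams together account for about 170 thousand parameters, which is small relative to the shared layers that follow.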

Mid-Fusion with Scale-Level Attention Module. The main idea of using multiscale input tiles is to allow features extracted from different levels of spatial context around the smallest tile (50 × 50) to contribute to the classification decision. The level of contribution of each scale varies for different objects. For example, scale 1 is more informative for small objects (e.g., cracks), while scale 3 is more informative for classifying large objects (e.g., patches). Therefore, we use a scale-level attention module that decides how much attention to pay to scale-sensitive features. Unlike simple concatenation of multiscale features, the scale-level attention module weights the features from different input scales at each mode. The scale-level attention module consists of three convolution layers of 1 × 1 × 64, and one sigmoid layer to generate the weight scores for each scale. The generated score maps reflect the importance of scale-specific features at a specific position and scale for classifying the object in the tile.

Feature Extraction (Mode). After the mid-fusion with the scale-level attention module, the weighted feature maps get concatenated in intensity and range modes, separately. Then, they are passed through three convolution layers with the filter number of 128 and max-pooling layers. At this stage, the network is expected to extract more complex multiscale features in each mode. Depth-specific patterns can complement intensity patterns and help the overall model with this useful information.

Mid-Fusion with Mode-Level Attention Module. For the effective mid-fusion of complementary information of intensity and range data, we use a mode-level attention module that weights the mode-sensitive features extracted from intensity and range images, determining the contribution level of mode-sensitive features to the final classification output. In this way, the feature maps can be fused with different weights based on the contribution levels of road object classes, instead of being treated uniformly.

Feature Extraction (Classification). For each mode, the mode-level attention module outputs weight maps that are multiplied by the feature maps. The weighted feature maps are concatenated and passed through shared layers. Four convolution layers with filter numbers of 256, 512, 512, and 1024 and two max-pooling layers are applied to extract higher-level multimodal features. Then the feature maps are flattened and passed to six fully-connected layers with the sizes 2048, 1024, 512, 256, 128, and 8.

Classifier’s Output. The last fully-connected layer generates 8 numbers showing the probability of the 50 × 50 tile belonging to the 8 existing classes in the dataset. The higher the number is, the more probable it is that the tile belongs to that specific pavement class. By assembling the predicted labels for the smallest tiles into the whole image, the segmentation mask with the spatial resolution of 50 × 50 mm² is created.
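The assembly of per-tile predictions into a coarse segmentation mask can be sketched as follows. The sketch assumes that the 8 outputs are turned into probabilities with a softmax, that tiles are stored in row-major order, and that the first image dimension is 3700 pixels; the text does not confirm these details, and the function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def assemble_mask(logits, image_shape=(3700, 10000), tile=50):
    """Turn (n_tiles, 8) classifier outputs into a (rows, cols) label mask."""
    rows, cols = image_shape[0] // tile, image_shape[1] // tile
    probs = softmax(logits)
    # Each tile takes the class with the highest probability.
    return probs.argmax(axis=-1).reshape(rows, cols)

# Usage: random outputs for the 74 x 200 = 14800 tiles of one image.
rng = np.random.default_rng(0)
demo_logits = rng.normal(size=(74 * 200, 8))
demo_logits[0] = [0, 0, 0, 9, 0, 0, 0, 0]  # first tile strongly predicted as class 3
tile_mask = assemble_mask(demo_logits)
```

Each entry of the resulting 74 × 200 mask corresponds to a 50 × 50 mm² patch of the road surface.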

Effects of Range and Intensity Input Image Tiles. Range and intensity input images provide complementary information about road objects, which can improve object classification performance compared to intensity-only input images. Depth is a key feature for road object classification, such as cracks and potholes. These objects can be small or have a similar color and texture to the clean asphalt, and it makes them difficult to detect in gray-scale intensity images. However, they appear more clearly in range images due to their depth differences. Other pavement objects, such as markers, that have a distinct color or texture or do not have a significant depth can be easier to detect from intensity images. Figure 7 demonstrates the advantage of using intensity and range images over intensity images only containing markers, patches, and cracks.

4.2. Attention Modules

We design two types of attention modules as a mid-fusion strategy to adaptively aggregate multiscale multimodal features extracted from intensity and range image tiles. The mechanism of an attention module is to attend to relevant parts of input features, which is important for having a robust classification. The scale-level and mode-level attention modules enable the deep network to focus on visual representations that are more informative for the classification of the object in the input tile. Scale-level and mode-level modules incorporate both spatial and channel-wise attention into the network.

As illustrated in Figure 8(a), the scale-level attention module generates the score maps $A^{s}$ with the dimension of $C \times W \times H$ for each scale, where $s$ is the scale number, $C$ is the number of channels, $W$ is the width, and $H$ is the height of the input features $F^{s}$. The weighted feature maps, $\hat{F}^{s}$, are generated by the inner (element-wise) product of the score maps and the input features:

\[ \hat{F}^{s} = A^{s} \odot F^{s}, \]

or

\[ \hat{f}^{s}_{c,i,j} = a^{s}_{c,i,j} \, f^{s}_{c,i,j}, \]

where $\hat{f}^{s}_{c,i,j}$ is the weighted feature at the spatial position $(i, j)$ for the channel number $c$ at the scale $s$; and $a^{s}_{c,i,j}$ is the score corresponding to the input feature $f^{s}_{c,i,j}$ at the spatial position $(i, j)$ for the channel number $c$ at the scale $s$. The attention module assigns a score between 0 and 1 to the feature maps of each scale in each channel and spatial position. Therefore, each element $f^{s}_{c,i,j}$ in the feature map is revised to $a^{s}_{c,i,j} \, f^{s}_{c,i,j}$, in which scale, channel, and spatial information is considered. This module not only localizes the object spatially but also selects the most discriminative channel.
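The weighting performed by the scale-level attention module can be illustrated with a short NumPy sketch. It is a sketch under stated assumptions, not the trained module: a 1 × 1 convolution is written as a per-pixel linear map over channels, a ReLU between the 1 × 1 layers is assumed, the same attention weights are assumed to be applied to every scale, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, w, b):
    """1 x 1 convolution on x of shape (H, W, C_in) with w of shape (C_in, C_out)."""
    return x @ w + b

def attention_scores(features, weights):
    """Stack of 1 x 1 convolutions followed by a sigmoid; output has the shape of the input."""
    h = features
    for i, (w, b) in enumerate(weights):
        h = conv1x1(h, w, b)
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)  # assumed ReLU between the 1 x 1 layers
    return sigmoid(h)

def apply_attention(features, weights):
    """Element-wise weighting: every feature is multiplied by its own score in [0, 1]."""
    scores = attention_scores(features, weights)
    return scores * features, scores

def fuse_scales(scale_features, weights):
    """Weight each scale separately, then concatenate along the channel axis."""
    weighted = [apply_attention(f, weights)[0] for f in scale_features]
    return np.concatenate(weighted, axis=-1)

# Usage: three scales of 50 x 50 x 64 features and three 1 x 1 x 64 layers.
rng = np.random.default_rng(0)
layers = [(0.1 * rng.normal(size=(64, 64)), np.zeros(64)) for _ in range(3)]
scales = [rng.normal(size=(50, 50, 64)) for _ in range(3)]
weighted0, scores0 = apply_attention(scales[0], layers)
fused = fuse_scales(scales, layers)
```

Because the scores depend on position and channel, the module can suppress a feature at one location while preserving it at another, which a single scalar weight per scale could not do.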

The mechanism of the mode-level attention module, shown in Figure 8(b), is similar to the scale-level one. In this module, the shared module among the modes generates the score maps for each mode to focus on the most discriminative part of visual representations. The attention module assigns higher weights to the channel and regions of the mode features that are more relevant and informative for the classification step of that particular object.

4.3. Implementation Details

We train the classifiers in a fully supervised manner. The Adam optimizer with a learning rate of $\alpha = 0.0001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$ is used, where $\beta_1$ and $\beta_2$ are exponential decay rates, and $\epsilon$ is a constant for numerical stability. The Adam optimizer inherits the advantages of other optimization algorithms, including the momentum feature of SGD and the adaptive learning feature of AdaDelta. The Adam optimizer also provides faster computation time and requires fewer parameters for tuning. The networks are trained for 800 epochs with a mini-batch size of 200. In each epoch, the network uses 60,000 random tiles out of more than 6 million tiles in the training dataset. The model with the lowest loss on the validation dataset is selected as the model used in the testing mode. The training is conducted on an NVIDIA TitanX GPU with 12 GB of memory. The code is implemented in Python 3.7.3 and TensorFlow 1.14.0.
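For reference, the standard Adam update with the hyperparameters reported above can be written out for a single scalar parameter. The sketch below is illustrative only: the training in this study relies on the TensorFlow optimizer, and the toy objective (a quadratic) and the function name `adam_step` are assumptions made for the example.

```python
import math

def adam_step(theta, grad, m, v, t,
              alpha=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update for a scalar parameter at step t >= 1."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective (assumption): f(theta) = theta^2, so grad = 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)

# Schedule arithmetic from the reported settings.
steps_per_epoch = 60_000 // 200   # 300 mini-batches per epoch
```

With 60,000 tiles per epoch and a mini-batch size of 200, each epoch consists of 300 parameter updates.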

5. Experiments

5.1. Baseline Models with Single-Scale Input Images and Results

Four different baseline classifiers, widely used in pavement applications, are trained to classify pavement image tiles into one of the existing 8 classes in the dataset. The deep CNNs compared in this study can be divided into three categories. (i) VGGNet was proposed by Simonyan and Zisserman [30] for the ImageNet challenge 2014. The main idea behind VGGNet is to use small filters (3 × 3), which decreases the number of parameters, and to stack more of them to achieve the same receptive field as a larger filter. VGG16 and VGG19 have a total of 16 and 19 convolutional and fully-connected layers, respectively. The deep architecture of VGGs has proved beneficial for image classification tasks. However, the vanishing gradient problem appears with deeper architectures. (ii) ResNet, proposed by He et al. [33] for the ImageNet challenge 2015, alleviates the vanishing gradient problem by introducing skip-connections so that the input of each layer is passed to the next layer. Using identity skip-connections as well as batch normalization allows for training deep networks. ResNet50 has a total of 50 convolutional and fully-connected layers. (iii) DenseNet, proposed by Huang et al. [86] in 2017, extends ResNet’s idea by including skip-connections from all previous layers. The dense concatenation to all subsequent layers preserves the features in preceding layers and allows for the classification of images in a wide range of scales. DenseNet121 has a total of 121 convolutional and fully-connected layers.
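The trade-off behind stacking small filters can be checked with a short calculation. The sketch below assumes stride-1 convolutions with the same number of input and output channels and ignores biases; the channel count of 64 is a hypothetical value chosen only for illustration.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def conv_weights(kernel_size, channels, num_layers=1):
    """Weight count (biases ignored) with `channels` inputs and outputs."""
    return num_layers * kernel_size * kernel_size * channels * channels

channels = 64                                          # hypothetical width
rf = stacked_receptive_field(3, 3)                     # 7
stacked = conv_weights(3, channels, num_layers=3)      # 110592
single = conv_weights(7, channels)                     # 200704
```

Three stacked 3 × 3 layers cover the same 7 × 7 receptive field as a single 7 × 7 layer with roughly 45% fewer weights, while adding two extra nonlinearities.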

Figure 9 shows an overview of the deep networks used for pavement object classification in this study. The classifiers are trained with only intensity input tiles as well as intensity and range input tiles to evaluate the effect of exploiting depth information along with intensity information. As shown in Figure 9(a), 50 × 50 image tiles are generated and are concatenated as a 3-channel image to train the deep networks with only intensity images. When training the networks with both intensity and range images, as shown in Figure 9(b), 50 × 50 image tiles of each mode are concatenated at the input level as a 2-channel image (early fusion) and fed to the network.
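Early fusion of the two modes amounts to stacking co-registered tiles along a channel axis. The following sketch assumes a channel-last layout and uses synthetic tiles; the function name `early_fusion` is illustrative.

```python
import numpy as np

def early_fusion(intensity_tile, range_tile):
    """Stack co-registered intensity and range tiles as a 2-channel image."""
    if intensity_tile.shape != range_tile.shape:
        raise ValueError("tiles must cover the same pixels")
    # Channel-last layout is an assumption made for this sketch.
    return np.stack([intensity_tile, range_tile], axis=-1)

intensity = np.zeros((50, 50), dtype=np.float32)   # synthetic intensity tile
depth = np.ones((50, 50), dtype=np.float32)        # synthetic range tile
fused = early_fusion(intensity, depth)             # shape (50, 50, 2)
```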

Table 2 summarizes the results for all classifiers using (i) only intensity and (ii) intensity and range input pavement tiles. The performance of each classifier is evaluated on each pavement object and on average in terms of precision, recall, and F-score:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$

where TP, FP, and FN are true positives, false positives, and false negatives, respectively. The precision determines how many of the positive predictions are really positive, while the recall shows the ability of the network to predict all the relevant instances. The F-score is the harmonic mean of precision and recall and is a useful measure of the balance between these two metrics. The results show that using both range and intensity images improves the performance of all classifiers in terms of overall precision, recall, and F-score.
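As a worked example, the per-class metrics can be computed from tile labels in a one-vs-rest fashion, using the standard definitions precision = TP / (TP + FP), recall = TP / (TP + FN), and F-score as their harmonic mean. The labels below are made up for illustration and are not drawn from the dataset.

```python
def per_class_scores(y_true, y_pred, label):
    """One-vs-rest precision, recall, and F-score for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# Illustrative labels only: 2 TPs, 1 FP, and 1 FN for the "crack" class.
y_true = ["crack", "crack", "crack", "asphalt", "asphalt", "patch"]
y_pred = ["crack", "crack", "asphalt", "crack", "asphalt", "patch"]
precision, recall, f_score = per_class_scores(y_true, y_pred, "crack")
```

Here precision, recall, and F-score for the crack class all equal 2/3, since one crack tile is missed and one asphalt tile is mistaken for a crack.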

In more detail, we compare the baseline models’ performances for different classes when they are trained with intensity-only images and intensity-range images. To interpret the results, we divide the classes into two categories: (i) pavement objects having a height difference with adjacent pixels, including crack, crack seal, pothole, manhole, and patch; (ii) pavement objects having no significant height difference with adjacent pixels, including marker, curbing, and asphalt. Using range-intensity input images improves the F-score of VGG16, VGG19, ResNet50, and DenseNet121 on the first category of objects by 18.8%, 20.6%, 11.9%, and 14.5% on average, respectively. Averaged over the baseline models, the F-score improvements on crack, crack seal, patch, pothole, and manhole are 12.6%, 22.5%, 21.6%, 22.1%, and 3.6%, respectively. The lower improvement in manhole classification compared to the other four objects comes from the fact that manholes have distinct shapes and textures in intensity images. Therefore, providing range data as complementary information to the network has a milder effect. Incorporating range images into the network barely changes the performance of baseline models on the classification of pavement objects in the second category. In fact, the range images of marker, curbing, and asphalt provide no extra information to the networks for the classification task.

Providing depth information to the DACNN improves the classification results on the first category of objects by 3.2% in terms of F-score. In more detail, utilizing range-intensity images increases the performance of the DACNN on the classification of crack, crack seal, patch, pothole, and manhole by 2.4%, 7.8%, 1.2%, 2.6%, and 2.3% in terms of F-score, respectively. The improvement in DACNN performance from adding depth information is smaller than the corresponding improvement in the baseline models. This is because the DACNN trained with intensity-only images already performs well, which leaves less room for improvement. As shown in Table 2, the average F-score for DACNN with intensity-only images is 92.9%, while the values for VGG16, VGG19, ResNet50, and DenseNet121 are 59.9%, 59.9%, 62%, and 63.4%, respectively. The DACNN also outperforms VGG16, VGG19, ResNet50, and DenseNet121 on average by 23.3%, 22%, 25.4%, and 22.4%, respectively, in terms of F-score when the networks are trained with range-intensity input data. The significant improvement of DACNN classification performance over the baseline models comes from encoding contextual information into the network and adaptively fusing the features through the attention modules. In Section 5.2, we show that the performance of baseline models improves when multiscale input tiles are provided to the networks. However, DACNN still outperforms those models by having an effective fusion strategy for combining multiscale multimodal features.

Figure 10 demonstrates sample segmentation at a spatial resolution of 50 × 50 mm² for different algorithms when trained with intensity-only and intensity-range pavement tiles. It can be seen that DACNN achieves the best results by extracting a robust representation of range and intensity images. In more detail, we can see that cracks at the top left corner of the image are identified better when the depth information is encoded into the networks. Range data provide more distinctive features, helping the networks to distinguish between foreground and background when intensity values are not distinctive.

5.2. Baseline Models with Multiscale Input Images and Results

Figure 11 shows an overview of the deep networks trained with multiscale input tiles to classify pavement objects. The multiscale image tiles are generated at three scales, 50 × 50, 250 × 250, and 500 × 500, for each mode of intensity and depth. As shown in Figure 11(a), multiscale tiles are concatenated as a 3-channel image to train the deep networks with only intensity images. When training the networks with both intensity and range images, as shown in Figure 11(b), the 3-channel images of the two modes are merged at the input level (early fusion) and fed to the network.
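One plausible way to build such a multiscale input is sketched below. It assumes that the three windows share the same center and that the larger windows are reduced to 50 × 50 by block averaging before stacking; the resizing method is not specified in this section, so this choice, the function names, and the synthetic image are assumptions. Windows that would cross the image border are not handled.

```python
import numpy as np

def center_crop(image, center, size):
    """Crop a size x size window centered at (row, col)."""
    r, c = center
    half = size // 2
    return image[r - half:r + half, c - half:c + half]

def block_average(tile, out_size):
    """Downsample a square tile by averaging non-overlapping blocks."""
    factor = tile.shape[0] // out_size
    return tile.reshape(out_size, factor, out_size, factor).mean(axis=(1, 3))

def multiscale_tile(image, center, scales=(50, 250, 500), out_size=50):
    """Stack same-center crops at several scales as image channels."""
    channels = [block_average(center_crop(image, center, s), out_size)
                for s in scales]
    return np.stack(channels, axis=-1)

# Synthetic single-mode image (assumption), large enough for all windows.
image = np.arange(1000 * 1000, dtype=np.float64).reshape(1000, 1000)
tile = multiscale_tile(image, center=(500, 500))   # shape (50, 50, 3)
```

The first channel keeps the local detail of the 50 × 50 window, while the other two channels summarize progressively wider context around the same location.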

Table 3 summarizes the performance of baseline models on the classification of 8 pavement classes in terms of precision, recall, and F-score. Comparing the results with the single-scale version of the networks, incorporating the contextual information into the networks improves the average F-score of VGG16, VGG19, ResNet50, and DenseNet121 by 28.3%, 29.3%, 24.3%, and 24.4%, respectively, when trained with intensity-only images. Furthermore, extracting depth features along with intensity features increases the average F-score of the VGG16, VGG19, ResNet50, and DenseNet121 by 4.1%, 3.4%, 4%, and 5.1%, respectively.

Although encoding the contextual information and incorporating the depth data into the network significantly enhances the performance of the baseline models, the DACNN classifies the objects more robustly by having an effective mid-fusion strategy. The DACNN outperforms VGG16, VGG19, ResNet50, and DenseNet121 trained with multiscale multimodal features by 2.8%, 2.5%, 4.8%, and 2.2%, respectively, on average in terms of F-score. More specifically, the DACNN improves the crack classification (as one of the most important distress types in pavement condition assessment) by 8.8%, 7.2%, 8.7%, and 7% in terms of F-score compared to VGG16, VGG19, ResNet50, and DenseNet121, respectively. This demonstrates the effectiveness of attention modules for pavement object classification.

6. Discussion

6.1. Qualitative and Quantitative Analysis of DACNN

One of the most important comparison metrics for evaluating multiclass classification models is their capability to distinguish between classes. The AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve is a measure of how strongly the classifier separates the classes. The higher the AUC, the better the model is at predicting the true classes. To evaluate the DACNN performance, ROC curves for all investigated methods are plotted in Figure 12. Comparing the AUC values, DACNN demonstrates a stronger ability to separate classes while predicting the pavement objects.
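For a single class treated one-vs-rest, the AUC equals the probability that a randomly chosen positive tile receives a higher score than a randomly chosen negative tile, with ties counted as one half. The sketch below computes it directly from that definition; the scores and labels are invented for illustration, and the quadratic pairwise loop is chosen for clarity rather than efficiency.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# Illustrative scores for one class (1 = tile belongs to the class).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)   # 8 of 9 pairs ranked correctly
```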

Figure 13 shows segmentation samples of DACNN generated by integrating classified pavement tiles. The corresponding heatmaps for the pavement classes are also demonstrated for qualitative comparisons. A hotter color means a greater probability that the pixels belong to the corresponding class. The heatmaps reveal that the DACNN predicts the pavement object robustly with a strong separation from the rest of the objects.
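Assembling tile-level predictions into image-level maps can be sketched as follows. The example assumes non-overlapping tiles laid out on a regular grid and a per-tile probability vector over the classes; the function name `stitch_tiles` and the two-class toy input are illustrative.

```python
import numpy as np

def stitch_tiles(tile_probs, tile_size=50):
    """Expand a grid of per-tile class probabilities to pixel resolution.

    tile_probs: array of shape (rows, cols, num_classes).
    Returns per-class heatmaps and the most likely class per pixel.
    """
    heatmaps = np.repeat(np.repeat(tile_probs, tile_size, axis=0),
                         tile_size, axis=1)
    label_map = heatmaps.argmax(axis=-1)
    return heatmaps, label_map

# Toy grid (assumption): 1 row, 2 tiles, 2 classes.
tile_probs = np.array([[[0.9, 0.1], [0.2, 0.8]]])
heatmaps, label_map = stitch_tiles(tile_probs)
```

Each slice `heatmaps[..., k]` corresponds to the heatmap of class `k`, and `label_map` gives the tile-resolution segmentation.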

Figure 14 visualizes the performance of the classifiers in terms of TP, TN, FP, and FN. Having the networks’ predictions, we are able to analyze their performance in more detail. Especially in pavement applications, we care about not only increasing TPs but also decreasing FNs and FPs simultaneously. The reason is twofold: (i) a high FN count means that actual distresses are missed, leading to an underestimation in road condition assessment, which is dangerous for safety; (ii) a high FP count means that pavement tiles are misclassified as distresses, leading to an overestimation, which is not cost-efficient for road assessment. As we can see in Figure 14, DACNN not only increases TPs but also significantly reduces FPs and FNs compared to all other methods. Besides DACNN, which presents the best results, encoding depth information into the other networks also increases TPs and reduces FPs and FNs. For the pavement objects with a more distinctive representation in range images, including cracks, crack seals, patches, potholes, and manholes, the improvements are more significant after combining the range data with intensity images. Figure 14 also shows that, among all classes, DACNN produces its largest number of FPs and FNs on crack classification. The reason mainly comes from the low contrast between cracks and the background within pavement images. Figure 15 demonstrates examples of DACNN predictions with FPs and FNs on crack classification.

6.2. Contrast Enhancement

As described in Section 3.2, a histogram equalization technique, CLAHE, is employed to adjust the intensity values and improve the contrast in range images. CLAHE is a modified version of adaptive histogram equalization that limits the contrast to avoid overamplification of noise in the images. The cliplimit value is the threshold that caps the contrast amplification. In this study, we conducted a grid search to optimize this hyperparameter for the DACNN algorithm. Table 4 summarizes the DACNN performance for different cliplimit values. Considering the F-score values, cliplimit = 4 is used as the threshold value for CLAHE.
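The role of the clip limit can be illustrated with a simplified, single-tile version of clipped histogram equalization. This is a sketch, not the CLAHE routine used in this study: full CLAHE applies the same idea to each tile of a grid and interpolates between neighboring tiles, which is omitted here. Expressing the clip limit as a multiple of the mean bin count is an assumption of this sketch, as are the function name and the synthetic low-contrast tile.

```python
import numpy as np

def clipped_equalize(tile, clip_limit=4.0, bins=256):
    """Equalize one 8-bit tile using a clipped histogram."""
    hist = np.bincount(tile.ravel(), minlength=bins).astype(np.float64)
    # Assumption: the limit is a multiple of the mean count per bin.
    ceiling = clip_limit * tile.size / bins
    excess = np.clip(hist - ceiling, 0, None).sum()
    # Cap each bin and spread the clipped mass evenly over all bins.
    hist = np.minimum(hist, ceiling) + excess / bins
    cdf = np.cumsum(hist) / hist.sum()
    lut = np.round(cdf * (bins - 1)).astype(np.uint8)
    return lut[tile]

rng = np.random.default_rng(0)
# Synthetic low-contrast tile: gray levels confined to 100..119.
tile = rng.integers(100, 120, size=(64, 64), dtype=np.uint8)
enhanced = clipped_equalize(tile, clip_limit=4.0)
```

A smaller clip limit flattens the mapping and limits how far the gray levels are stretched, whereas a larger one approaches plain histogram equalization and amplifies noise along with the signal.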

6.3. Computational Cost

We compare the computational cost of the algorithms investigated in this study in two cases: (i) the networks are trained with only intensity input tiles; (ii) the networks are trained with both intensity and range input images. This way, we can examine how encoding depth information into the networks affects the computational costs. To highlight the trade-off between performance and speed, our proposed method, DACNN, is also compared to the baseline approaches. Table 5 summarizes the computational costs for the different classification approaches used in this study, in terms of the number of trainable variables, training time per epoch, and inference time for 100 batches. While the first column presents the costs for intensity-only trained networks, including VGG16, VGG19, ResNet50, and DenseNet121, the second column presents the costs for the same networks trained with both intensity and range images. Comparing the first two columns reveals that the extra computational costs brought by encoding depth information into the baseline models are almost negligible. However, the average F-score increases by 16.5% for objects with discriminative features in range images (crack, crack seal, pothole, manhole, and patch). The third column shows the computational costs for DACNN when the depth branch is removed, and the last column shows the cost for DACNN trained with both intensity and range images. It can be concluded that the classification results can be improved at a limited extra computational cost. Trained with intensity-only images, DACNN improves the F-score by 31.6% compared to the baseline methods by capturing contextual information (first vs. third column). Trained with both intensity and range images, DACNN improves the F-score by 23.3% compared to the baseline methods through its adaptive fusion strategy (second vs. fourth column).

It should be noted that DACNN is not developed with the goal of real-time classification. In most practices, automated assessments of road conditions are performed offline, where accuracy and robustness are the most important factors.

7. Conclusions

A deep learning-based model termed DACNN is presented to improve the performance of multiclass classification for road objects. Both intensity and range images are fed to the DACNN to enrich the image representation learned by the network. Discriminant feature representations obtained by encoding range images help the network to capture complex topology and to handle noise and illumination variations. Furthermore, feeding multiscale input images into the DACNN enables the network to capture both local and global fields of view, which is beneficial for classifying pavement objects with various sizes and shapes. We designed dual attention modules as an effective way to fuse scale-specific and mode-specific features and to model the semantic interdependencies in spatial and channel dimensions. The position attention selectively aggregates the feature at each position by a weighted sum of the features at all positions, and channel attention selectively emphasizes interdependent channel maps by integrating associated features among all channel maps. This way, the network better learns the relevant content for each specific object at each scale and mode, contributing to more precise classification results.

The effectiveness and feasibility of the DACNN were compared with four baseline CNN models. The comparison results showed that the DACNN outperforms all compared CNNs. The results also showed that encoding depth information into the networks improves the classification results of VGG16, VGG19, ResNet50, DenseNet121, and the DACNN by 11.9%, 13.2%, 7.7%, 9.3%, and 2.2% in terms of averaged F-score, respectively, compared to when these models are trained with intensity-only images. The classification improvements are even more significant for pavement objects that are distinctive in range images by having height differences with neighboring pixels. For example, incorporating depth data with intensity information improves the crack classification by 17.9%, 10.2%, 10.9%, 11.2%, and 2.4% in terms of averaged F-score in VGG16, VGG19, ResNet50, DenseNet121, and the DACNN, respectively. In addition to encoding depth data, DACNN yields more improvements by capturing global context through multiscale input tiles, as well as focusing on the most important feature representations through attention modules. The DACNN outperforms VGG16, VGG19, ResNet50, and DenseNet121 by 23.3%, 22%, 25.4%, and 22.4%, respectively, in terms of averaged F-score, while they are all trained with range-intensity tiles.

Although the developed DACNN achieves strong performance in pavement object classification, some limitations still exist in our model. Therefore, extra effort is required to make our model more practical and effective. Firstly, our model classifies 50 × 50 pavement tiles into different categories. Although a 50 × 50 mm² spatial resolution is acceptable in most road surveys, pixel-level segmentation is required for some pavement applications such as crack width measurements. Secondly, quantifying the severity of pavement distresses is necessary for road condition assessment, but it cannot be obtained directly from our model. Lastly, self-attention mechanisms capturing long-range dependencies in the network can be explored for further improvements. Furthermore, one can conduct hyperparameter studies for the training of the network and provide quantitative comparisons.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no potential conflicts of interest.

Acknowledgments

This project was partially supported by Korea Institute of Civil Engineering and Building Technology (KICT). Data Transfer Solution (DTS) partially helped in the preparation of the ground-truth dataset used in this study.