Abstract

Automatic crack detection with minimal manpower has become a crucial task in inspecting and evaluating the performance of concrete structures in civil engineering. Although many concrete crack detection models based on convolutional neural networks (CNNs) have been developed recently, their accuracy varies, and which CNN architecture performs best for detecting concrete cracks is still debated. In this paper, we choose three established open-source CNN models (Model1, Model2, and Model3), which have been well illustrated and verified in previous studies, and test them for crack detection in concrete structures. The three models are trained on a concrete crack dataset containing 40,000 images of 227 × 227 pixels, and their performance is then evaluated. The comparison indicates that Model2, which uses batch normalization, performs best among the three selected models for concrete crack detection, recording the highest classification accuracy and the lowest loss. In conclusion, we recommend Model2 for concrete crack detection tasks.

1. Introduction

A significant number of civil infrastructures have progressively approached their expected life span; thus, the integrity of these structural systems needs to be checked. At the same time, with the aging of the population and the rise in labor costs, how to inspect structures continuously and automatically with a smaller workforce has become a vital research direction [1–4].

Visually observed cracks are often a concern for inspection engineers because cracks not only give dangerous and corrosive chemicals access to the interior of the concrete but also allow water and deicing salts to infiltrate the concrete and damage the integrity of the structure [5, 6]. For instance, more than 100,000 bridges in the United States have early age deck cracks [7]. Moreover, among the 570,000 bridges in the USA, 40% were listed as deficient, requiring repair or reconstruction according to the requirements of the Federal Highway Administration (FHWA), with an expected cost of 50 billion dollars [8, 9]. More than 510,000 new bridges have been constructed in China over the last 40 years, 15% of which have reached the end of their service life early because of cracks. Because cracks are a significant predictor in assessing structural damage, crack identification has high practical value for triggering early bridge repair alerts, assuring safety, and mitigating losses [10].

Human-based visual inspection is a common method for inspecting and evaluating the health of concrete structures. However, it is time-consuming and subjective [11], and the accuracy of the damage diagnosis depends mainly on the skill level and experience of the inspectors. Therefore, automatic damage detection is crucial for achieving objective and efficient damage assessment [12].

Advances in digital technology have produced computer models that automatically process images gathered from structures to detect structural damage and cracks. In this context, many previous studies have developed models for defect detection, most of which focus on detecting cracks in civil structures using techniques such as image processing, clustering, edge detection, filter-based models, and image filtering-based machine learning (ML) [13–25]. Most of these works used computer vision-based methods to improve crack detection performance. Computer vision-based methods use image processing techniques (IPTs) to extract crack features and then evaluate the extracted features. However, because the performance of these methods depends on the extracted crack features, their results are degraded when IPTs extract false features [26, 27]. To overcome these challenges, convolutional neural networks (CNNs) are used to detect concrete cracks without a separate crack feature extraction step [28, 29]. Yu et al. [30] proposed a deep convolutional neural network (DCNN) capable of automatically extracting high-level features from low-level features and optimally selecting the combination of extracted features to satisfy any damage recognition objective.

Recently, remarkable progress has been achieved in crack detection using convolutional neural networks (CNNs) [26, 29, 31–35]. For example, studies applying CNNs to concrete crack detection have used a four-layer CNN [36], a deep fully convolutional network (FCN) [29], a fully convolutional network named Ci-Net [37], and a deep convolutional neural network (DCNN) architecture with eight layers [38]. The accuracy of these proposed models is 87%, 90%, 93.6%, and 98%, respectively. Li et al. [39] proposed an approach for detecting safety helmets on construction sites in real time using a CNN-based SSD-MobileNet algorithm. The recorded precision of the trained model was 95%.

Researchers have also proposed crack detection methods based on well-known CNN architectures such as VGG16, AlexNet, ResNet50, and InceptionV3, used either pretrained or trained from scratch for classification [40–43]. Deep CNNs pretrained on large general-purpose datasets (e.g., ImageNet) have proven beneficial for several computer vision problems, including new problems involving different classes [36].

Han et al. [44] proposed a solution that integrates Monte Carlo simulations into a local thresholding processing method to address the fundamental problems of traditional methods in terms of complex background, spatially varying illumination, and block size uncertainties. Kang et al. [45] proposed an automated crack detection, localization, and quantification approach that detects crack regions by integrating a faster region proposal convolutional neural network (Faster R-CNN) algorithm. The Faster R-CNN-based crack damage detection obtained results with an average precision of 95%. Choi and Cha designed the SDDNet semantic segmentation network to detect cracks, with a processing speed of 36 FPS (frames per second) for images with a resolution of 1025 × 512 pixels [46].

CNN-based crack detection models have the outstanding advantage that they avoid the laborious feature pre-extraction and computation required by traditional methods. In addition, a CNN does not need to convert the format of the input images but learns the crack features from the images automatically, thus reducing the workload of crack detection [47]. Moreover, CNNs are particularly powerful for detecting thin cracks under illumination conditions in which they are difficult to detect with conventional methods [29]. Although many concrete crack detection models based on CNNs have been developed recently, their accuracy varies, and which CNN architecture performs best for detecting concrete cracks is still debated.

In this paper, we choose three established open-source CNN models (Model1, Model2, and Model3), which have been well illustrated and verified in previous studies, and test them for crack detection in concrete structures. The three models are trained on a concrete crack dataset containing 40,000 images of 227 × 227 pixels, and their performance is then evaluated. The comparison indicates that Model2 performs best on the concrete crack detection task.

This paper is divided into six sections. Section 1 provides a brief review of research on concrete crack detection. Section 2 introduces the research methodology, and Section 3 describes the concrete crack image dataset. Section 4 presents and compares the experimental results. Section 5 discusses the results of the study. Finally, Section 6 presents the conclusions of the study.

2. Methodology

The convolutional neural network (CNN) is a type of deep learning method for image classification and recognition tasks. A CNN is a multilayer neural network composed of an input layer, convolutional layers, activation functions, fully connected layers, and an output layer. In addition, batch normalization (BN) and dropout layers can be added, depending on the purpose of the network.

In this paper, the CNNs are built by modifying three models taken from Kaggle. Those models were originally trained on image data of natural scenes from around the world with 6 output classes, whereas the number of output classes in this paper is 2 (cracked and noncracked).

The structures of the models were modified by adding a dropout layer to the first and third models to avoid overfitting. In addition, the number of training epochs was changed from the original 30, 30, and 50 to a unified 10 epochs for all three models. These changes yielded robust models that obtain better results in automatic crack detection.

The CNN architectures of the three models are shown in Figures 1–3. The input images are 3-channel RGB images with a size of 227 × 227 pixels. The detailed operations and dimensions of each layer for the three models are shown in Tables 1–3.

2.1. Data Preprocessing

In Model1, OpenCV (cv2) is used to load images from files, whereas Model2 and Model3 use the ImageDataGenerator class provided by Keras in TensorFlow. This class reads images from disk and preprocesses them into suitable tensors. It helps prevent overfitting and helps the model generalize better.
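As a rough illustration of what such a preprocessing step produces, the sketch below groups 8-bit RGB images into floating-point tensors scaled to [0, 1]. It uses NumPy only, with random arrays standing in for real images. The function name `batch_to_tensors`, the 1/255 rescaling, and the batch size of 4 are assumptions made for illustration and are not settings reported for the three models.

```python
import numpy as np

def batch_to_tensors(images, batch_size):
    """Yield float32 batches of shape (n, H, W, 3) with values in [0, 1]."""
    for start in range(0, len(images), batch_size):
        batch = np.stack(images[start:start + batch_size]).astype(np.float32)
        yield batch / 255.0  # assumed rescaling, not taken from the paper

# Stand-in data: 10 random 227 x 227 RGB images (not real crack images).
rng = np.random.default_rng(0)
fake_images = [rng.integers(0, 256, size=(227, 227, 3), dtype=np.uint8)
               for _ in range(10)]

batches = list(batch_to_tensors(fake_images, batch_size=4))
print([b.shape for b in batches])
```

The last batch is smaller than the others when the number of images is not a multiple of the batch size.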

2.2. Convolution Layer

Each model uses 3 × 3 filters with the ReLU activation function. The ReLU activation function is superior to the tanh and sigmoid functions in almost all applications [51]. In addition, networks using ReLU converge faster during training than networks using sigmoid functions.
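The combination of a 3 × 3 filter and ReLU can be sketched in a few lines of NumPy. The example below is a minimal single-channel illustration, not the layers of the three models: the input is a toy 5 × 5 array and the vertical-edge kernel is a hypothetical filter chosen so the result is easy to check by hand.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Single-channel cross-correlation (the operation CNN 'convolution'
    layers compute), stride 1, no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Keep positive responses, set negative ones to zero."""
    return np.maximum(x, 0.0)

image = np.arange(25, dtype=float).reshape(5, 5)   # toy input, brighter to the right
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)       # hypothetical 3 x 3 filter

feature_map = relu(conv2d_valid(image, kernel))    # positive response is kept
flipped_map = relu(conv2d_valid(image, -kernel))   # negative response is zeroed
print(feature_map.shape)
```

Without padding, a 3 × 3 filter shrinks each spatial dimension by 2, so the 5 × 5 input gives a 3 × 3 feature map.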

2.3. Pooling Layer

The main function of the pooling layer is downsampling: it progressively reduces the size of the feature maps while preserving their distinctive features, which makes it useful for dimensionality reduction. A pooling layer is usually added after a convolutional layer to decrease the number of parameters, filter out redundant information, and avoid overfitting when training the network. In this study, pool sizes of 4 × 4, 2 × 2, and 2 × 2 are used for Model1, Model2, and Model3, respectively.
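The downsampling effect of these pool sizes can be sketched as follows. This is a minimal NumPy illustration that assumes max pooling with a stride equal to the pool size and no padding; the pooling type and stride are assumptions, and the pooling is applied directly to a 227 × 227 map purely to show the size reduction, whereas in the actual models it follows the convolutional layers.

```python
import numpy as np

def max_pool2d(x, pool):
    """Non-overlapping max pooling (stride = pool size). Trailing rows and
    columns that do not fill a complete window are dropped."""
    h, w = x.shape[0] // pool, x.shape[1] // pool
    trimmed = x[:h * pool, :w * pool]
    return trimmed.reshape(h, pool, w, pool).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool2d(x, 2)            # 4 x 4 -> 2 x 2, keeps the largest value per window

# Spatial size of a 227 x 227 map after a single pooling layer
size_pool4 = max_pool2d(np.zeros((227, 227)), 4).shape   # pool size used in Model1
size_pool2 = max_pool2d(np.zeros((227, 227)), 2).shape   # pool size used in Model2 and Model3
print(pooled, size_pool4, size_pool2)
```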

2.4. Padding

Padding is one of the parameters that control the spatial size of the output volume. It is useful when the output size is required to be the same as the input size. In Model2, the padding type is set to SAME so that the output of each convolution has the same spatial size as its input.
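The effect of the padding type on the output size can be checked with a short calculation. The sketch below follows the usual TensorFlow/Keras conventions for SAME and VALID padding; the helper name `conv_output_size` is hypothetical, and a stride of 1 is assumed because the stride is not stated in the text.

```python
import math

def conv_output_size(n, kernel, stride=1, padding="valid"):
    """Output length along one spatial dimension of a convolution."""
    if padding == "same":
        # Input is zero-padded so that only the stride changes the size.
        return math.ceil(n / stride)
    if padding == "valid":
        # No padding: the filter must fit entirely inside the input.
        return (n - kernel) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")

same_size = conv_output_size(227, 3, padding="same")     # size is preserved
valid_size = conv_output_size(227, 3, padding="valid")   # size shrinks by kernel - 1
print(same_size, valid_size)
```

With a 3 × 3 filter and stride 1, SAME padding adds one row or column of zeros on each side of the input.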

2.5. Dropout

Dropout is used to avoid overfitting in neural networks by reducing the co-adaptation between neurons, which makes training more effective. Dropout layers were used in all three models. In addition, we added a dropout layer after the third convolutional layer in Model1 and a dropout layer after the second convolutional layer in Model3.
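A minimal NumPy sketch of how a dropout layer acts on activations is given below. It implements the "inverted dropout" variant used by Keras, in which the surviving activations are rescaled during training so that no change is needed at inference. The all-ones activation array is stand-in data chosen for illustration.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during
    training and rescale the survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return x                       # dropout is inactive at inference
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))     # stand-in activations
dropped = dropout(activations, rate=0.5, rng=rng)
unchanged = dropout(activations, rate=0.5, rng=rng, training=False)
print(dropped.mean())
```

With a rate of 0.5, roughly half of the units are zeroed and the rest are doubled, so the expected value of each activation is unchanged.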

2.6. Batch Normalization (BN)

Batch normalization (BN) can improve the training speed and efficiency of a neural network and helps produce more reliable models [29]. It can also reduce the number of epochs needed to train a network. When applied to network architectures, batch normalization helps avoid overfitting and achieves greatly improved classification accuracy in fewer epochs. In Model2, batch normalization is used before each pooling layer. The mathematical expression of batch normalization is as follows:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_i - \mu_B\right)^2,$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta,$$

where $\mu_B$ and $\sigma_B^2$ are the mean and variance of the current training batch of $m$ inputs $x_i$, respectively; $\hat{x}_i$ represents the normalized output after subtracting the batch mean and dividing by the batch standard deviation ($\varepsilon$ is a small constant for numerical stability); and $y_i$ represents the result of scaling and translation, where the parameters that need to be learned are the scaling factor $\gamma$ and the translation coefficient $\beta$.
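The batch normalization step described above can be sketched in NumPy as follows. This is a minimal training-mode forward pass over a batch of feature vectors, using the standard formulation (normalize with the batch mean and variance, then scale and shift); the variable names, the value of `eps`, and the random input are assumptions for illustration. For convolutional feature maps the statistics are normally computed per channel over the batch and spatial dimensions, and at inference running averages replace the batch statistics, neither of which is shown here.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization for x of shape (batch, features)."""
    mu = x.mean(axis=0)                      # batch mean per feature
    var = x.var(axis=0)                      # batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * x_hat + beta              # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))   # stand-in batch of 64 samples
y = batch_norm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))
```

With `gamma = 1` and `beta = 0`, each feature of the output has approximately zero mean and unit standard deviation regardless of the scale of the input.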

2.7. Hyperparameters

The Adam optimizer was used in all three models. The dropout rate of the dropout layer located before the ReLU is 0.5. The models are trained for 10 epochs with a batch size of 64. Model2 sets a lower bound of 1e−8 on the learning rate.
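To make these settings concrete, the sketch below computes the number of weight updates implied by the batch size and epoch count, and shows how a lower bound on the learning rate behaves. The 28,000 training images correspond to the 70% training split of the 40,000-image dataset. The reduce-by-a-factor schedule, the factor of 0.1, and the initial learning rate of 1e-3 (the Keras default for Adam) are assumptions, since the text only reports the lower bound of 1e−8.

```python
import math

TRAIN_IMAGES = 28_000   # 70% of the 40,000-image dataset
BATCH_SIZE = 64
EPOCHS = 10
MIN_LR = 1e-8           # lower bound on the learning rate reported for Model2

steps_per_epoch = math.ceil(TRAIN_IMAGES / BATCH_SIZE)
total_updates = steps_per_epoch * EPOCHS

def reduce_lr(lr, factor=0.1, min_lr=MIN_LR):
    """Shrink the learning rate by `factor`, never going below `min_lr`.
    The factor is a hypothetical value chosen for illustration."""
    return max(lr * factor, min_lr)

lr = 1e-3               # assumed initial learning rate
lr_history = [lr]
for _ in range(8):      # eight hypothetical reductions
    lr = reduce_lr(lr)
    lr_history.append(lr)

print(steps_per_epoch, total_updates, lr_history[-1])
```

Once the floor is reached, further reductions leave the learning rate unchanged.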

3. Concrete Crack Image Dataset

This paper used a publicly available dataset of concrete crack images [52]. The dataset is split into cracked and noncracked images for image classification. It contains a total of 40,000 RGB images of 227 × 227 pixels, comprising 20,000 positive (cracked) images and 20,000 negative (noncracked) images of concrete structures. The images were generated from 458 high-resolution images (4032 × 3024 pixels) with the method proposed by Zhang et al. [36]. The dataset was prepared as follows. Training dataset: 70% of the images, randomly selected from the base dataset, giving 28 K images. Validation dataset: 10% of the images, randomly selected from the base dataset, giving 4 K images; throughout training, the validation dataset is used to monitor the network's learning curve. Testing dataset: 20% of the images, randomly selected from the base dataset, giving 8 K images. Some images from the dataset are shown in Figure 4.
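The 70 : 10 : 20 split described above can be sketched with the Python standard library. This is a minimal illustration that shuffles image indices and cuts them into three disjoint subsets; the function name `split_indices` and the random seed are assumptions, and whether the original split was stratified by class is not stated, so no stratification is applied here.

```python
import random

def split_indices(n_items, fractions=(0.7, 0.1, 0.2), seed=0):
    """Randomly split indices 0..n_items-1 into train/validation/test subsets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_train = round(n_items * fractions[0])
    n_val = round(n_items * fractions[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]          # remainder goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(40_000)
print(len(train_idx), len(val_idx), len(test_idx))
```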

4. Results

In this study, the performance of three CNN models was compared on the same dataset for concrete crack detection. The models were trained for 10 epochs on Google Colab, a cloud computing platform for machine learning. Accuracy and the categorical cross-entropy loss were used to measure classification performance. Figure 5 shows the training and validation accuracy of the models. Model2 achieved superior performance, as shown in Table 4. The recorded training accuracies were 93.23%, 99.85%, and 99.44% for Model1, Model2, and Model3, respectively, and the recorded testing accuracies were 98.04%, 99.89%, and 99.69%, respectively. The second metric used to evaluate network performance is the loss value. The loss function is used to optimize the parameters of the neural network; here, the categorical cross-entropy loss measures the performance of a classification model whose outputs are probability values between 0 and 1. The low loss values obtained on both the training and testing sets suggest that the models were not affected by overfitting. Model1 has loss values of 18.45% and 7.16% for training and testing, respectively. Model3 has lower loss values than Model1, at 2.12% and 1.34% for training and testing, respectively. Model2 achieves a lower loss than the other two models, with values of 0.42% and 0.38% for training and testing, respectively. A comparison of the training times of the models is shown in Table 5. The incorrectly predicted images in a selected batch were also checked, and the total number of errors was recorded. Model2 performed best, with no incorrect predictions; Model3 was next, with a few false predictions; and Model1 produced the most incorrect predictions. Some of the incorrectly predicted images are shown in Figure 6.
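The two metrics used in this section can be computed as in the following NumPy sketch. The four one-hot labels and predicted probabilities are hypothetical values invented purely to show the calculation and are unrelated to the reported results; the column order (noncracked, cracked) is also an assumption.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean categorical cross-entropy for one-hot labels and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0)        # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

def accuracy(y_true, y_prob):
    """Fraction of samples whose most probable class matches the label."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1)))

# Hypothetical predictions for four images; columns = (noncracked, cracked).
y_true = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)
y_prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]])

acc = accuracy(y_true, y_prob)
loss = categorical_cross_entropy(y_true, y_prob)
print(acc, round(loss, 4))
```

The third sample is misclassified, which lowers the accuracy and contributes the largest term to the loss, since the loss penalizes a low probability assigned to the true class.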

5. Discussion

5.1. The Effect of the Total Number of Parameters and Network Complexity in terms of the Number of Convolutional Layers

Model1 has more layers than Model3 and fewer total parameters than the other models, but it starts with a deep convolutional layer and its depth decreases, whereas Model3 starts with a shallow convolutional layer and gets progressively deeper. Model2 has more layers and more parameters than the other models and also starts shallow and gets deeper. In addition, the last four convolutional layers in Model2 are deeper than those of the other models. Although Model2 has the largest total number of parameters, the computation time it requires to train on the 28 K training images per epoch is close to that of Model3. It appears that increasing the number of convolutional layers, starting from a shallow layer and getting progressively deeper, together with a larger number of parameters, helps to improve performance in the crack detection task compared with the other models. This suggests that the CNN structure of Model2 is better suited to this task than the other two. The test results of this study are shown in Table 2. The accuracy of Model2 was 99.85% and 99.89% for training and testing, respectively, and its loss values were lower than those of the other two models, at 0.42% and 0.38% for training and testing, respectively. Furthermore, Model2 made no incorrect predictions; some of its predictions are shown in Figure 7.
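How the number of parameters grows as the convolutional layers get deeper can be illustrated with the standard parameter-count formulas for convolutional and fully connected layers. In the sketch below the filter counts (32, 64, 128) are hypothetical and are not the configurations of Model1, Model2, or Model3; only the 3 × 3 filter size, the 3-channel input, and the 2 output classes come from the text.

```python
def conv2d_params(kernel_size, in_channels, out_channels, bias=True):
    """Trainable parameters of a 2D convolutional layer with square filters."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)

def dense_params(in_features, out_features, bias=True):
    """Trainable parameters of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

# Hypothetical shallow-to-deep stack: RGB input, then 32, 64, and 128 filters.
channels = [3, 32, 64, 128]
layer_params = [conv2d_params(3, c_in, c_out)
                for c_in, c_out in zip(channels, channels[1:])]
total_conv_params = sum(layer_params)
output_layer_params = dense_params(128, 2)   # 2 classes: cracked, noncracked

print(layer_params, total_conv_params, output_layer_params)
```

Because the count scales with the product of input and output channels, the deeper layers at the end of such a stack account for most of the convolutional parameters.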

5.2. The Effect of Batch Normalization (BN) on the Training Speed and Accuracy in Concrete Crack Detection Models

The results show that the type of network used affects the speed and performance of the model, as well as the accuracy obtained in training and testing. Although Model2 has the largest number of trainable parameters of the three models, using batch normalization (BN) improved the performance and speed of the network and produced a more reliable model [53]. Batch normalization can also reduce the number of epochs needed to train a neural network. When applied to network architectures, batch normalization helps avoid overfitting and can achieve a significant improvement in classification accuracy in fewer epochs than models that do not use it. The second model achieves more than 99% training accuracy in the second epoch, whereas the first model achieves more than 99% training accuracy after eight epochs and the third model after the fifth epoch. Hence, BN performed well in improving the accuracy of CNN models for detecting concrete cracks [4, 9, 54].

5.3. Limitations

However, this study has some limitations. For example, the study was applied only to the most commonly used dataset for training and testing computer vision-based crack detection [51]. Similarly, physical parameters, such as the depth, width, and length of the concrete crack, the damage size, and the minimum and maximum crack width, were not studied. In this study, the three CNN models were compared using an approximately average input size. Mishkina et al. [55] showed that using 128 × 128 pixel images is sufficient to draw qualitative conclusions about the optimal network structure and is an order of magnitude faster than using standard 224-pixel images. Although accuracy depends roughly linearly on image size, the required computation grows quadratically, so increasing the input size is an expensive way to gain performance. The input sizes used in some studies on infrastructure monitoring are shown in Figure 8.

6. Conclusions

This study compared the performance of three CNN models on the same concrete crack dataset, which contains 40,000 images with a resolution of 227 × 227 pixels. The models were trained on Google Colab, a cloud computing platform for machine learning. The dataset was split into training, validation, and testing sets in a 70 : 10 : 20 ratio. We used open-source models that had been trained on other image data and applied them to the concrete crack image domain for comparative analysis. Based on the results of the study, the following conclusions can be drawn:
(1) For preprocessing, the ImageDataGenerator class provided by Keras in TensorFlow was used in Model2 and Model3 and showed good performance. It reads images from disk and preprocesses them into suitable tensors. It helps prevent overfitting and helps the model generalize better.
(2) Based on the comparative analysis, Model2 achieved superior performance; the recorded accuracy was 99.85% and 99.89% for training and testing, respectively.
(3) Based on the comparative analysis, Model2 achieves a lower loss than the other two models, with loss values of 0.42% and 0.38% for training and testing, respectively.
(4) It is better to use ReLU activation, padding, batch normalization, and pooling layers in the neural network architecture for concrete crack detection.
(5) Model2 has 7 convolutional layers, starting from a shallow convolutional layer and getting deeper. In addition, its last 4 convolutional layers are deeper than the convolutional layers in the other models; this indicates that increasing the number of convolutional layers and gradually going deeper helps to improve the performance of the network in the crack detection task.

In future studies, we will focus on improving the accuracy of models that detect physical parameters and defects such as crack depth, width, and length, corrosion, spalling, and voids in concrete structures. However, developing deep learning-based algorithms for detecting the physical parameters of concrete cracks and for autonomous crack size estimation remains difficult, particularly when the test image contains many noisy crack-like features. Therefore, future research should focus on improving the proposed methods to make autonomous crack density estimations.

Data Availability

The Concrete Crack Images data supporting this study are from previously reported studies and datasets, which have been cited. The processed data are available at https://data.mendeley.com/datasets/5y9wdsg2zt/2.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Acknowledgments

This study was financially supported by the National Key R&D Program of China (grant no. 2018YFD1100401); the National Natural Science Foundation of China (grant no. 52078493); the Natural Science Foundation for Excellent Young Scholars of Hunan (grant no. 2021JJ20057); and the Innovation Driven Program of Central South University (grant no. 2019CX011). These financial supports are gratefully acknowledged.