Abstract
This work presents a new method for sleeper crack identification based on a cascade convolutional neural network (CNN) to address the low efficiency and poor accuracy of traditional sleeper crack detection methods. The proposed algorithm mainly comprises an improved You Only Look Once version 3 (YOLOv3) and a crack recognition network, where the crack recognition network includes two modules, the crack encoder-decoder network (CEDNet) and the crack residual refinement network (CRRNet). After the sleeper is extracted from the ballast bed by the gray projection method, the improved YOLOv3 network is used to identify, locate, and segment the cracks on the sleeper. The sleeper image is then input into CEDNet for crack feature extraction to predict a coarse crack saliency map. The predicted map is fed into CRRNet to refine its edge information and local regions. The accuracy of the crack identification model is improved by using a mixed loss function of binary cross-entropy (BCE), structural similarity index measure (SSIM), and intersection over union (IOU). Results show that this method can accurately detect cracks in sleeper images. In object detection, the proposed method is compared with YOLOv3 for directly locating sleeper cracks; it achieves an accuracy of 96.3%, a recall of 91.2%, a mean average precision (mAP) of 91.5%, and 76.6 frames per second (FPS). In the crack extraction part, the F-weighted is 0.831, the mean absolute error (MAE) is 0.0157, and the area under the curve (AUC) is 0.9453. The proposed method offers better recognition, higher efficiency, and stronger robustness than the other network models.
1. Introduction
China’s total railroad mileage is expected to exceed 128,000 km by the end of 2020, prompting researchers to improve maintenance techniques for railroad infrastructure [1]. As shown in Figure 1, the sleeper supports the rail and transfers the large impact loads from the train to the roadbed. Accordingly, the sleeper needs a certain degree of flexibility so that it can deform slightly to cushion the pressure. However, when the load bending moment exceeds the cracking strength, cracks and other internal damage undermine the integrity of the sleeper and diminish the support it provides to the train above. This situation poses a safety hazard to trains passing at high speed. In recent years, nondestructive testing techniques, such as those in the literature [2], have been widely used in the maintenance of track facilities. Detecting sleeper cracks in this way can prevent accidents quickly and efficiently.

At present, the main method of sleeper crack detection has shifted from manual identification to a series of physical detection means, such as ultrasonic, eddy current, and ray detection. Although these physical methods are well developed, they are still limited in how they can be applied and share the common problem of poor crack detection performance. The efficiency and accuracy of crack detection have been enhanced with the development of computer vision technology. The main methods applied to this field are image processing-based methods [3], machine learning-based methods [4], and deep convolutional neural network (DCNN)-based methods [5]. The DCNN-based methods are subdivided into methods based on image classification [5], object detection [6], and pixel-level segmentation [7], depending on how the crack detection problem is handled. The cascade network used in this work to detect sleeper cracks is based on the latter two types of methods.
The main crack detection methods based on object detection include Faster R-CNN [8], the single-shot multibox detector (SSD) [9], and You Only Look Once (YOLO) [10], which determine the location of cracks in the input image and localize them with bounding boxes. Cha et al. [11] proposed a concrete crack detection method based on Faster R-CNN; the network is improved to quickly detect and locate multiple types of cracks in real time, allowing for more accurate detection results. Mandal et al. [12] proposed an automated detection method based on DCNNs for road concrete cracks, but the achieved detection accuracy is low. Li et al. [13] proposed an improved YOLO network to increase the detection accuracy of track plate cracks; however, the method is less versatile owing to the single background of the track plate. Bao et al. [14] proposed a triplet graph reasoning network to address the problem of insufficient samples of metal surface defects.
Crack detection methods based on pixel-level segmentation mainly include fully convolutional networks (FCNs) [15], U-Net [16], and Seg-Net [17]. Labels can be assigned to crack pixel points to determine the presence of cracks and to obtain important features, such as the location, size, and shape of cracks. Cheng et al. [18] proposed an automatic U-Net-based road crack detection method and tested it in a crack dataset to obtain a high pixel-level segmentation accuracy. Islam and Kim [19] proposed a full CNN-based concrete crack detection method. This network consisting of encoder and decoder patterns is tested and exhibits good detection results on publicly available crack datasets. Dung [20] designed a full CNN with Visual Geometry Group-16 (VGG-16) based on a codec framework. This network further improves the accuracy of crack detection. Literature [21] compared three U-Net algorithms of different depths for automatic pavement crack detection systems. The objective is to verify whether a model architecture with greater depth necessarily results in better detection accuracy. Experiments prove that choosing a network architecture with the right depth can guarantee the detection accuracy and improve the detection speed.
Although great progress has been made in DCNN-based crack detection, how to obtain more detailed crack features still needs to be explored. In sleeper crack detection, the cracks are small and similar to the sleeper background, their boundaries are unclear, and their regional information is incomplete. This paper therefore proposes a new cascade network for crack detection. YOLOv3 is one of the mainstream frameworks for object detection, and later versions of the YOLO series are improved on its basis. Given that YOLOv3 uses a residual network in the feature extraction part, three feature layers of different depths are extracted simultaneously, and a stacked stitching approach is used to obtain the prediction results [22]. This design allows cracks of different sizes to be detected. However, the crack detection effect is unsatisfactory against the complex background of the rail sleeper. Accordingly, we add the squeeze and excitation (SE) module at the end of the YOLOv3 backbone network to improve the accuracy of crack region extraction. Further quantitative parameter detection of cracks is needed to complete high-precision crack identification and provide more scientific detection data. The crack encoder-decoder network (CEDNet) and the crack residual refinement network (CRRNet) are used to extract and optimize the features of rail sleeper cracks. After the encoding part of CEDNet extracts features from the input crack image, the shallow information of the image is passed to the corresponding decoding stage, so that low-level detail features are fused with high-level complex semantics to improve the feature extraction performance of the network. CRRNet is added because the coarse saliency map obtained in the previous step has deficiencies, such as blurred crack boundaries and missing important regions; it optimizes the map by learning the residuals between the coarse saliency map and the ground truth.
The main contributions of this paper are summarized as follows:
(1) A two-level cascade network based on DCNN is proposed. This network fuses CEDNet and CRRNet, which perform crack feature extraction and optimization in one pass. Its F-weighted is 0.831, mean absolute error (MAE) is 0.0157, and area under the curve (AUC) is 0.9453.
(2) An improved YOLOv3 network is proposed to localize the cracks, with the attention mechanism, the SE module, added at the end of the backbone network. The mean average precision (mAP) is improved by 6.9% compared with YOLOv3.
(3) The optimization effects of the binary cross-entropy (BCE), intersection over union (IOU), and structural similarity index measure (SSIM) loss functions on crack recognition are superimposed to propose a new hybrid loss function for crack recognition. In particular, our method achieves improvements of 68.4%, 74.8%, 84.1%, and 99.0%, respectively, over the compared loss-function configurations.
The rest of this paper is organized as follows: Section 2 introduces the method overview, including the overall steps and the specific theory for each step. Section 3 shows some experimental results of our method and compares them with other methods. Section 4 gives the conclusion and outlook.
2. Method Overview
In the acquired images of rail sleeper cracks, the ballast edges can interfere with the recognition of sleeper cracks because the ballast and the concrete sleeper look similar in the image. Given that the sleeper edge has obvious features, a strictly regulated size, and a grayscale different from that of the ballast, the sleeper area can be segmented first, and the cracks on the sleeper can then be located and identified by the network. The proposed crack detection algorithm is divided into two parts: crack localization and crack identification. The crack recognition part incorporates a feature extraction network and a boundary refinement network. The overall methodological flow is shown in Figure 2. In the first step, the gray projection method is chosen to extract the sleeper area because the large amount of ballast in the background affects crack detection. In the second step, a modified YOLOv3 is used to locate and segment the cracks on the basis of the extracted sleeper area. In the third step, further quantitative parameter detection of cracks is needed to complete high-precision crack identification and provide more scientific detection data; hence, CEDNet is used for feature extraction, and a boundary refinement network is designed for further optimization because the extracted cracks have partially incomplete boundary and region information. The three steps are summarized as follows:
(1) The location of the sleeper is extracted by using the gray projection method [23] combined with the empirical value of the sleeper pixel width, and then SE [24] and spatial pyramid pooling (SPP) [25] modules are added at the end of the YOLOv3 backbone network to locate the sleeper cracks.
(2) CEDNet, a coarse crack saliency feature extraction network, obtains more detailed saliency information by fusing low-level and high-level features of crack images through its encoder-decoder structure.
(3) CRRNet, a crack boundary refinement network, learns the residuals between the coarse prediction and the ground truth map of the crack for optimization purposes by fusing the outputs of the network feature layers.

2.1. Crack Location Module
The dimensions of the sleeper are strictly defined, its edge features are obvious, and its grayscale differs from that of the ballast. The gray projection method combined with the empirical values of the sleeper pixel width can therefore be used to locate the position of the sleeper. The gray projection method performs well for object edge detection in complex backgrounds, relying mainly on the peaks and valleys of the gray projection curve to determine the coordinates of the object edges. Assuming that the image is represented as $f(x, y)$, where $(x, y)$ are the coordinates of a pixel point in the image, the gray projection function in the x-direction is denoted as $G(x)$, and its value in the horizontal direction is

$$G(x) = \frac{1}{N}\sum_{y=1}^{N} f(x, y),$$

where $N$ is the number of pixels in each row of the image.
The edge coordinates of the horizontal direction of the sleeper can be obtained in accordance with the gray projection method. The pixel width of the edge of the sleeper is relatively fixed in the captured roadbed images. Figure 3(a) shows the original drawing of the ballasted roadbed. The valley of the horizontal projection in Figure 3(b) depicts the contact edge between the sleeper and the ballast. Figure 3(c) presents the segmentation results.
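As a concrete illustration, the following is a minimal sketch of the horizontal gray projection and sleeper localization step, assuming an 8-bit grayscale roadbed image; the valley-selection heuristic and the sleeper pixel width used here are illustrative values, not the paper's calibrated parameters.

```python
import numpy as np

def horizontal_gray_projection(image: np.ndarray) -> np.ndarray:
    """Mean gray value of each image row: G(x) = (1/N) * sum_y f(x, y)."""
    return image.mean(axis=1)

def locate_sleeper_rows(image: np.ndarray, sleeper_width_px: int = 400) -> tuple:
    """Roughly locate the sleeper rows from the valleys of the horizontal projection."""
    proj = horizontal_gray_projection(image)
    # Valleys of the projection correspond to the dark sleeper/ballast contact edges;
    # taking the darkest rows as valley candidates is a simplified, illustrative heuristic.
    valley_rows = np.argsort(proj)[:20]
    top = int(valley_rows.min())
    # Empirical sleeper pixel width bounds the lower edge of the sleeper region.
    bottom = min(top + sleeper_width_px, image.shape[0])
    return top, bottom

# Usage (e.g., with an image loaded as a 2D uint8 array):
#   top, bottom = locate_sleeper_rows(roadbed_image); sleeper = roadbed_image[top:bottom, :]
```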

The prediction results are obtained by stacking and splicing after simultaneously extracting three feature layers of different depths because YOLOv3 uses a residual network in the feature extraction part. This network can therefore detect cracks of different sizes. However, against the complex background of the sleeper, the crack detection effect is poor. Inspired by the literature [24–26], we note that the SE module suppresses interference from the background and other noise, and the SPP module improves operational efficiency by relieving the network of the input-size requirement while ensuring that the images are not distorted. The end-to-end semisupervised object detection method, the object detection head unified from the attention perspective, and Composite Backbone Network Version 2 (CBNetV2), which eliminates the pretraining process, can avoid more complex multistage training approaches [27–29]. However, these algorithms still have shortcomings, such as slow detection speed, large consumption of network resources, and low accuracy and recall. Therefore, we add the SE and SPP modules at the end of the backbone network to keep the training process simple and to improve the accuracy of crack region extraction while minimizing additional overhead. An improved algorithm based on YOLOv3 is designed in this paper, and its overall structure is shown in Figure 4.

The SE module is one of the more classical attention mechanisms. The accuracy of crack detection can be significantly improved by designing parameters capable of removing the invalid information extracted by the YOLOv3 network [25]. This module compresses the sleeper crack feature map to a size of 1 × 1 × 1024 through a global average pooling layer. Excitation is then performed by two fully connected layers with activation functions, and the crack feature channels are weighted accordingly. The designed residual module ensures effective training so that the network extracts more accurate crack feature information and suppresses interference from other noise in the sleeper images.
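The following is a minimal PyTorch sketch of the squeeze-and-excitation block described above; the channel count of 1024 matches the 1 × 1 × 1024 squeeze mentioned in the text, while the reduction ratio of 16 is a typical value and an assumption here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int = 1024, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)          # global average pooling -> 1 x 1 x C
        self.excite = nn.Sequential(                    # two fully connected layers + activations
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)                  # squeeze: B x C channel descriptor
        w = self.excite(w).view(b, c, 1, 1)             # excitation: per-channel weights in [0, 1]
        return x * w                                    # reweight the crack feature channels

# Example: SEBlock(1024)(torch.randn(1, 1024, 13, 13)) reweights a 13 x 13 x 1024 backbone feature map.
```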
When predicting prior boxes at three scales of the crack image, YOLOv3 requires the crack feature maps output by the backbone feature extraction network to have consistent sizes. Cropping or reshaping the image tends to cause partial loss of information, resulting in biased crack detection results. Accordingly, the SPP module is added after the SE module to remove the restriction of a fixed input image size [26]. In this module, the sleeper crack features output from the backbone network are pooled simultaneously at three scales after one convolution operation. The output crack features are fused and fed to the fully connected layer. A fixed-size crack feature output can thus be obtained without losing the original information, regardless of the size and scale of the input crack image.
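The sketch below shows one common realization of such an SPP block in PyTorch, following the YOLOv3-SPP convention of three stride-1 max-pooling branches concatenated with the input; the kernel sizes (5, 9, 13) and the channel-reduction convolution are assumptions, not values confirmed by the paper.

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    def __init__(self, channels: int = 1024, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels // 2, kernel_size=1)  # one convolution before pooling
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        # Pool the same feature map at three scales and fuse the results with the original.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

# Output channel count is (channels // 2) * 4, independent of the input spatial size.
```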
2.2. Crack Recognition Module
After locating and segmenting the cracked area of the rail sleeper, this paper proposes a crack identification module to obtain more detailed crack characteristics. The module uses a crack boundary refinement network to optimize the predicted saliency map because the extracted crack information is incomplete. The final crack saliency map is obtained by fusing the crack boundary refinement network with the feature extraction network, and the general block diagram of this module is shown in Figure 5.

2.2.1. Feature Extraction Module
The backbone network used for feature extraction is the crack coarse saliency feature extraction network CEDNet, which is a codec network focusing on crack regions and boundaries. The network is built on the basis of ResNet-34 (Residual Network with 34 parameter layers) [30] using a codec form. After feature extraction of the input sleeper cracks in the encoding part, the resulting image features are further optimized and processed by the decoding part. The shallow information of the cracked image is passed to the corresponding decoding process, which enables the fusion of low-level detailed features with high-level complex semantics as a method to improve the network feature extraction performance. The structure is shown in Figure 6.

The specific structure and operational steps of the network are as follows (a minimal sketch of one decoder stage is given after this list):
(1) The coding part consists of an input convolutional layer and six stages built from basic residual blocks, with a modified ResNet-34 structure for the input convolutional layer and the first four convolutional stages. The improvements mainly include the use of a 3 × 3 convolution filter and a convolution stride of 1. The pooling operation after the input convolutional layer is removed to guarantee that the feature map in the first stage has the same spatial resolution as the input image; by contrast, the first feature map in the original ResNet has only one-quarter of the resolution of the input map. This change allows the network to obtain higher-resolution feature maps in the early layers, although it reduces the overall receptive field. Consequently, Conv5 and Conv6, two convolutional stages each consisting of 512 filters and three basic residual blocks, are added to cover a larger extent of the object detection region on the original map and achieve the same receptive field as the original ResNet.
(2) A bridge connection structure is used to further obtain the global information of cracks. It contains three modules, each consisting of a Conv layer, a batch normalization (BN) layer [31], and a rectified linear unit (ReLU) activation function [32], where each convolutional layer consists of 512 dilated 3 × 3 convolutions [33].
(3) The input of each level of the decoding section is the cascade of the output of the previous level and the pooled output of the corresponding level in the encoding section. A sigmoid function is added to each layer after bilinear up-sampling to map the predicted values to [0, 1]. Seven saliency maps are generated in this module, comprising six post-cascade feature maps and the final output map; however, only the last feature map, which has the highest accuracy, is input into CRRNet. Supervision with the ground truth map is applied at the last layer of each decoding stage to reduce overfitting, as in holistically nested edge detection [34].
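As an illustration of the decoder pattern described in item (3), the following is a minimal PyTorch sketch of one decoder stage with a supervised side output; the concatenate-then-convolve layout, the 3 × 3 side-output kernel, and the channel widths are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """One CEDNet-style decoder stage: fuse the previous decoder output with the
    corresponding encoder feature, then emit a sigmoid side output for supervision."""

    def __init__(self, prev_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(prev_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.side = nn.Conv2d(out_ch, 1, kernel_size=3, padding=1)  # 1-channel saliency side output

    def forward(self, x_prev: torch.Tensor, enc_skip: torch.Tensor):
        # Bilinear up-sampling so the previous decoder output matches the encoder feature size.
        x_prev = F.interpolate(x_prev, size=enc_skip.shape[-2:], mode="bilinear", align_corners=False)
        x = self.conv(torch.cat([x_prev, enc_skip], dim=1))   # cascade decoder and encoder features
        side = torch.sigmoid(self.side(x))                     # side map supervised against the ground truth
        return x, side
```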
2.2.2. Edge Refinement Module
After the object detection and feature extraction, the predicted crack coarse saliency map can be obtained for the sleeper cracks. Figure 7 shows the original map of cracks, the ground truth map, and the coarse saliency map after the CEDNet extraction.

In the coarse saliency map, the crack boundary is blurred, some salient regions are missing, and parts of the background are incorrectly marked as the object or inaccurately located. The boundary information and local details of the extracted crack feature map are therefore incomplete, so the extracted feature map is fed into CRRNet for further optimization.
The network is built in codec form and achieves optimization by learning the residuals between the coarse prediction and the ground truth map. It uses two 1D filters (i.e., 3 × 1 and 1 × 3 convolutional layers) rather than 3 × 3 convolutions, which improves the optimization performance of the network while avoiding a large computational effort [35]. The coarse feature map of the input and the stacked outputs are fused by residual propagation with identity mapping branches to facilitate training, and iterations are conducted to improve the accuracy of the coarse saliency map. The boundary refinement map under the sigmoid mapping is used as the final output of the network, as shown in Figure 8.

The network structure consists of three parts: encoder, decoder, and bridge connection.
The coding section consists of four stages, each with two 1D filters and a maximum pooling layer for down-sampling and reducing computational effort. The convolutional layers are arranged with the 3 × 1 convolution in front and the 1 × 3 convolution behind it. Only one ReLU layer is added after the former, whereas a BN layer and a ReLU layer are placed after the convolutional layer of the latter [36]. This design allows the network to be built deeper with less degradation in performance and mitigates to a certain extent the effect of gradient diffusion on network training, balancing optimization performance and computational efficiency.
The decoding part is composed of a bilinear interpolation unit for up-sampling to match the feature dimensions and two 1D filters identical to those in the encoding part, built in the reverse order of the coding part. This part also consists of four stages, and the codec pattern is reflected in the decoding part, where the 1 × 3 convolution in each stage is cascaded with the 3 × 1 convolution in the corresponding stage of the coding part.
The bridge connection part contains a Conv layer, a BN layer, and a ReLU layer. The convolutional layer in this structure has 64 filters with a kernel size of 3 × 3.
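A minimal PyTorch sketch of the factorized 3 × 1 / 1 × 3 convolution unit described above is given below; the placement of the single ReLU after the 3 × 1 convolution and of BN + ReLU after the 1 × 3 convolution follows the text, while the channel widths are assumptions.

```python
import torch.nn as nn

def one_d_filter_unit(in_ch: int, out_ch: int) -> nn.Sequential:
    """Factorized 3x1 + 1x3 convolution unit used in the CRRNet encoder/decoder stages."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=(3, 1), padding=(1, 0)),   # 3x1 convolution in front
        nn.ReLU(inplace=True),                                           # single ReLU after the 3x1 conv
        nn.Conv2d(out_ch, out_ch, kernel_size=(1, 3), padding=(0, 1)),   # 1x3 convolution behind
        nn.BatchNorm2d(out_ch),                                          # BN + ReLU after the 1x3 conv
        nn.ReLU(inplace=True),
    )
```

The two 1D convolutions together cover the same 3 × 3 neighborhood with roughly two-thirds of the parameters of a single 3 × 3 convolution, which is where the computational saving comes from.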
2.3. Hybrid Loss Function
The training loss function in this paper is defined as the weighted sum of the losses of all saliency feature mappings:

$$L = \sum_{k=1}^{K} \alpha_k \ell^{(k)},$$

where $\ell^{(k)}$ is the loss of the kth lateral output and $\alpha_k$ is the weight of each loss. $K$ is taken as 8, indicating that the supervised sleeper crack detection network has 8 outputs, 7 of which come from CEDNet and the remaining one from CRRNet. A hybrid loss function that mixes the three losses of BCE, SSIM, and IOU is used to obtain a high-quality detection object with complete information:

$$\ell^{(k)} = \ell_{bce} + \ell_{ssim} + \ell_{iou},$$

where $\ell_{bce}$, $\ell_{ssim}$, and $\ell_{iou}$ denote the BCE [37], SSIM [38], and IOU [39] losses, respectively.
BCE is used as a loss function in this network to supervise the training accuracy of object detection at the pixel level and is computed pixel by pixel. Foreground and background pixels are treated as equally important, and the labels of neighboring regions are ignored, so all pixels can be converged. BCE is mainly applied to binary classification and segmentation tasks and is defined as

$$\ell_{bce} = -\sum_{(r,c)} \big[G(r,c)\log S(r,c) + (1 - G(r,c))\log(1 - S(r,c))\big],$$

where $G(r,c) \in \{0, 1\}$ is the ground truth label of the pixel $(r,c)$ and $S(r,c)$ is the predicted probability of the saliency object.
SSIM is used as a loss function to supervise object detection at the local patch level and to evaluate image quality. This loss assigns a higher weight to the boundary, making the loss near the boundary larger, that is, it focuses attention on the boundary between foreground and background. The background loss becomes progressively more important as the prediction of background pixels approaches the ground truth, which makes the crack boundaries in the prediction clearer. SSIM captures structural information in the image; therefore, it is integrated into the hybrid loss to learn the structural information of the saliency object. It is defined as

$$\ell_{ssim} = 1 - \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$

where $x$ and $y$ are the pixel values of two corresponding patches cropped from the predicted probability map $S$ and the binary ground truth mask $G$, respectively; $\mu_x, \mu_y$ and $\sigma_x, \sigma_y$ are the means and standard deviations of $x$ and $y$, respectively; $\sigma_{xy}$ is their covariance; and $C_1$ and $C_2$ are small constants added to avoid dividing by zero.
IOU was originally used to calculate the similarity between two sets and has been extended to a standard measure for evaluating the effectiveness of object detection and segmentation. Combining the three loss functions, the BCE maintains the gradients of all pixels after the foreground loss has decreased toward zero, whereas the IOU term makes the network focus more on the foreground as the prediction confidence of the foreground gradually increases. At the feature map level, the following formula is used to supervise the training of object detection while remaining differentiable in the training loss function:

$$\ell_{iou} = 1 - \frac{\sum_{r=1}^{H}\sum_{c=1}^{W} S(r,c)\,G(r,c)}{\sum_{r=1}^{H}\sum_{c=1}^{W} \big[S(r,c) + G(r,c) - S(r,c)\,G(r,c)\big]},$$

where $G(r,c)$ is the ground truth label of the pixel and $S(r,c)$ is the predicted probability of the saliency object.
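The following is a minimal PyTorch sketch of this hybrid loss; the uniform 11 × 11 SSIM window (a Gaussian window is also common) and the equal 1 : 1 : 1 weighting of the three terms are assumptions consistent with the description above, not confirmed implementation details.

```python
import torch
import torch.nn.functional as F

def ssim_loss(pred, target, window: int = 11, C1: float = 0.01 ** 2, C2: float = 0.03 ** 2):
    # Local means, variances, and covariance over sliding patches (uniform window).
    pad = window // 2
    mu_x = F.avg_pool2d(pred, window, stride=1, padding=pad)
    mu_y = F.avg_pool2d(target, window, stride=1, padding=pad)
    sigma_x = F.avg_pool2d(pred * pred, window, stride=1, padding=pad) - mu_x ** 2
    sigma_y = F.avg_pool2d(target * target, window, stride=1, padding=pad) - mu_y ** 2
    sigma_xy = F.avg_pool2d(pred * target, window, stride=1, padding=pad) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2))
    return 1.0 - ssim.mean()

def iou_loss(pred, target, eps: float = 1e-7):
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = (pred + target - pred * target).sum(dim=(1, 2, 3))
    return (1.0 - inter / (union + eps)).mean()

def hybrid_loss(pred, target):
    """pred and target are B x 1 x H x W tensors with values in [0, 1] (pred from a sigmoid)."""
    bce = F.binary_cross_entropy(pred, target)
    return bce + ssim_loss(pred, target) + iou_loss(pred, target)

# Training loss: sum hybrid_loss over the 7 CEDNet side outputs and the CRRNet output.
```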
3. Experiment and Results
3.1. Dataset
The image acquisition device used in this paper mainly consists of an industrial high-speed line array camera and a lens chosen in accordance with the field design requirements. As shown in Figure 9, the image acquisition system consists of an industrial computer and the LQ-H3X module, where the LQ-H3X module mainly comprises a laser light source and a line array camera. The main parameters of the LQ-H3X module are listed in Table 1.

3.2. Experimental Setup
The model in this paper runs under the Windows 10 operating system with dual Intel Xeon Silver 4214 2.2 GHz CPUs and an NVIDIA RTX 2080Ti 11 GB graphics card. The three networks for object localization, coarse saliency feature extraction, and boundary refinement are built and run with the PyTorch framework in the PyCharm integrated development environment.
3.3. Hyperparameter Configuration
For the saliency detection part, several parameters with a strong influence, such as the initial learning rate, batch size, and number of epochs, are adjusted during model training. The initial learning rate is closely related to the update of the weight parameters: if it is too large, the loss value increases and the network model diverges; if it is too small, the loss value decreases extremely slowly and the parameters are updated very slowly. Choosing minibatch stochastic gradient descent and an appropriate number of epochs can improve the running speed of the neural network and allow the model to converge properly. Different combinations of the important parameters are compared through several experiments to improve the model training speed, and the results are shown in Table 2.
Initially, with the batch size and epochs unchanged, the loss value decreases faster and faster as the learning rate (lr) is adjusted downward. After the lr is fixed at 0.001, a batch size of 4 is selected first in accordance with the performance of the graphics card and the GPU memory size. Given the limited diversity of the rail crack dataset, the number of epochs is adjusted downward from 300 to 100, and the parameter combination with the lowest loss of 0.046 is identified. In consideration of the running speed of the neural network, when the batch size is adjusted to 5, the epochs must be increased from 200 to 300 to achieve the same accuracy, and the loss value drops more slowly than in the former case throughout the process.
In summary, the optimal combination of parameters selected for the crack recognition module in this paper is as follows: initial learning rate, batch size, and epochs are set to 0.001, 4, and 200, respectively, and the results are shown in Table 3.
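As a reference, the following is a sketch of the training configuration implied by Table 3; the use of SGD with momentum 0.9 is an assumption (the text only states minibatch stochastic gradient descent), and the model argument is a placeholder.

```python
import torch

# Hyperparameters selected in Table 3 for the crack recognition module.
INITIAL_LR = 0.001
BATCH_SIZE = 4      # limited by the 11 GB memory of the RTX 2080Ti
EPOCHS = 200

def build_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Minibatch SGD; the momentum value is an assumption, not stated in the paper.
    return torch.optim.SGD(model.parameters(), lr=INITIAL_LR, momentum=0.9)
```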
3.4. Evaluation Metric
The selected evaluation metrics are the F-measure, mAP, F-weighted [40], MAE [41], and AUC [42]. The F-measure is a comprehensive index for evaluating the final crack detection results. mAP, the mean average precision, measures the recognition accuracy, with larger values indicating higher accuracy. F-weighted is calculated from weighted precision and recall values, where the weight of each PR value is the proportion of the corresponding samples in the total number of samples; the larger the value, the stronger the network performance. MAE measures the error of the test results. The AUC value indicates how well the network separates cracks from the rail background; the closer it is to 1, the better the classification.
The F-measure is calculated as

$$F_\beta = \frac{(1 + \beta^2) \times P \times R}{\beta^2 \times P + R},$$

where $P$ denotes the precision, $R$ denotes the recall, and $\beta^2$ is set to 0.3, similar to reference [40]. For the F-weighted measure, the precision and recall are replaced by their weighted counterparts, whose weights are the proportions of the corresponding samples in the total number of samples, and the F-weighted is then obtained from the same formula. MAE is computed as

$$\text{MAE} = \frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\lvert S(x, y) - G(x, y)\rvert,$$

where $W$ and $H$ represent the width and length of the input sleeper crack image to be processed, $S$ is the predicted saliency map, and $G$ is the ground truth.
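A minimal sketch of the F-measure and MAE computations is given below; the value of 0.3 for the squared beta follows the text, while the 0.5 binarization threshold used here for precision and recall is an assumption.

```python
import numpy as np

def f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3, thr: float = 0.5) -> float:
    """F-measure of a predicted saliency map against a binary ground truth mask."""
    p = pred >= thr                    # binarize the prediction (threshold is illustrative)
    g = gt >= 0.5
    tp = np.logical_and(p, g).sum()
    precision = tp / (p.sum() + 1e-8)
    recall = tp / (g.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error over all W x H pixels of the saliency map."""
    return float(np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean())
```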
3.5. Hybrid Loss Function
This work compares the performance of the proposed hybrid loss function with that of single loss functions and of other combinations when used with the same network model. As shown in Figure 10, the saliency map predicted by the proposed algorithm is the closest to the ground truth, and the completeness of the crack region and the clarity of its boundary are the best among the compared settings.

The quantitative analysis is shown in Table 4. After the comparison experiments on the individual loss functions, the two more effective losses are selected for the combined analysis. The table shows that the network performance is optimal only when all three loss functions are used simultaneously. In particular, our method achieves improvements of 68.4%, 74.8%, 84.1%, and 99.0%, respectively, over the compared loss-function configurations.
3.6. Object Detection
In this experiment, we compare YOLOv3, YOLOv4, and YOLOv5 under the same conditions. The experimental parameter settings are shown in Table 5, which provides the initial values of the input size, initial learning rate, number of classes, batch size, and epochs for training on the rail crack images.
On the basis of these experimental conditions, tests are performed for Tiny YOLOv3, YOLOv3, YOLOv4, and YOLOv5. The model accuracy is verified in terms of three metrics, namely precision, recall, and mAP, and the model speed is verified in terms of frames per second (FPS), as shown in Table 6.
YOLOv3 has higher recognition accuracy than Tiny YOLOv3 and a faster recognition speed than YOLOv4 and YOLOv5. Its recognition accuracy can be further optimized with the help of the SE and SPP modules; according to the experimental results, it can then reach or even exceed the level of YOLOv4 and YOLOv5.
Therefore, a preliminary conclusion is that YOLOv3 is the more suitable base network for optimization, which is verified by the final optimized test results.
Owing to the added attention mechanism, which improves the ability to capture crack locations, the prediction boxes produced by the improved YOLOv3 when locating sleeper cracks are more accurate than those of the original YOLOv3. The detection effect is shown in Figure 11.

YOLOv3 and the proposed algorithm are used to detect cracks in the overall roadbed image and in the sleeper image segmented by the gray projection method. The comparison of experimental results is shown in Table 7. The comparison of the two inputs, the overall roadbed and the sleeper area, shows that after sleeper area extraction, the mAP of crack detection is improved by 35.4% on YOLOv3 and by 38.8% on the improved YOLOv3, proving the necessity of sleeper area extraction for crack detection. The data in the sleeper region column show that the improved YOLOv3 raises the mAP by 6.9% compared with the original network, proving the significant superiority of the present algorithm for sleeper crack detection.
3.7. Feature Extraction
With regard to the sleeper crack dataset constructed in this work, the results of sleeper crack saliency detection obtained using the method of this work are compared with those of several other network models. The models include BAS [43], R2Net [44], SOD100k [45], EDR [46], PFA [47], HED [34], and POOLNet [48]. Figure 12 shows that the proposed algorithm has a good detection of cracks in a variety of situations, including low contrast (1st, 4th, and 6th columns), small target (4th and 6th columns), and complex background (2nd, 3rd, 5th, and 7th columns).

The above evaluation metrics are applied to make a quantitative analysis of all network performance, as shown in Figures 13 and 14. In terms of AUC, the proposed algorithm improves by 6.0%, 0.2%, 1.2%, 2.8%, 3.8%, 10.4%, 15.5%, and 50.9% compared with CEDNet, EDR, BAS, POOLNet, R2Net, PFA, SOD 100 k, and HED, respectively. This result indicates that the proposed algorithm has better classification prediction performance. The MAE value of this work is 0.015, verifying that the algorithm has a small error and high accuracy rate compared with the other networks. The closer the curve composed of precision and recall to the upper-right corner, the better the network classification, and the larger the area enclosed by the F curve and the horizontal axis, the stronger the performance of the network.

The proposed algorithm produces cracks with better integrity and clarity than the other algorithms, which is attributable to the cascade network form adopted herein. A more complete crack feature can be obtained after cascading the two codec-style residual networks (i.e., CEDNet and CRRNet). In comparison with EDR, the pooling operation after the input convolutional layer is removed in the feature extraction stage to improve the image resolution in this work, and Conv5 and Conv6 are designed to restore the network receptive field, so the crack information obtained at this stage is more detailed. In contrast with BAS, a 1D filter is used in the optimization part to balance refinement performance and computational efficiency. In FPN-based U-Net structures, such as POOLNet and R2Net, the high-level semantic features are continuously diluted when fused with low-level image features because of their structural limitations, and the different receptive fields in each layer of the network lead to the loss of local information in the crack saliency map.
4. Conclusion and Expectations
We propose a DCNN-based method for detecting cracks in rail sleepers to address the insufficient accuracy of existing crack recognition methods. The CNN used consists of a modified YOLOv3 network for localization and of CEDNet and CRRNet for extracting and optimizing the rail sleeper crack features, respectively. When locating the rail sleeper crack region, the cracks on the concrete sleeper have some similarity with the ballast edges in the captured images owing to lighting and other causes; however, a grayscale difference can be observed between the sleeper and the ballast, so the sleeper area is first segmented for the next step. The attention module SE is added at the end of the original YOLOv3 network to extract the crack areas, thereby improving the accuracy of rail sleeper crack detection while preserving the network computation speed. CEDNet is constructed to extract more crack information by fusing the high- and low-level features of crack images. The crack boundary refinement network CRRNet is added to optimize the stacked output of the coarse crack saliency feature map by learning the residuals from the ground truth. The two networks are cascaded to obtain a crack saliency map with more complete boundary and region information. The conclusions of this work are as follows:
(1) A new crack detection method is designed. A cascade network combining CEDNet and CRRNet is used to improve the integrity of crack detection. Its F-weighted is 0.831, MAE is 0.0157, and AUC is 0.9453.
(2) An improved YOLOv3 network is proposed to localize the cracks, with the attention mechanism SE module added at the end of the backbone network. The mAP is improved by 6.9% compared with that of YOLOv3.
(3) The optimization effects of the BCE, IOU, and SSIM loss functions on crack recognition are superimposed to propose a new hybrid loss function for crack recognition. In particular, our method achieves improvements of 68.4%, 74.8%, 84.1%, and 99.0%, respectively, over the compared loss-function configurations.
(4) A comprehensive evaluation of the proposed methodology is conducted. Our method has strong robustness and a high level of crack detection efficiency compared with the seven state-of-the-art methods.
The proposed crack recognition module consists of two parts. In the optimization stage, the crack boundary refinement is performed directly on the basis of the first output. Compared with end-to-end learning, this approach requires a secondary adjustment of model parameters, which increases the time cost and requires more manual processing. Therefore, if the optimization part can be encapsulated into a plug-and-play module, the efficiency of model operation will be greatly improved, which is the next optimization goal of this paper. This paper effectively improves the accuracy of rail sleeper crack identification but does not measure the geometric parameters of the cracks. How to calculate the actual size of the cracks from the existing data is a direction for our future efforts and would be extremely helpful for practical engineering applications.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (nos. 51975347 and 51907117).