Abstract

Condition rating of bridges is specified in many countries since it provides a basis for decisions on maintenance actions such as repair, strengthening, or limitation of passing vehicle weight. In practice, professional engineers check the textual descriptions of damage to bridge members, such as girders, bearings, expansion joints, and piers, that are acquired from periodic inspections, and then rate the bridge condition. The task is time-consuming and labor-intensive due to the large amount of detailed data buried in the inspection reports. In this paper, a natural language processing- (NLP-) based machine learning (ML) approach is proposed for automated and fast bridge condition rating, which can efficiently extract the information on deficiencies in bridge members. The proposed approach involves three major steps: data repository establishment, NLP-based textual data processing, and ML-based bridge condition rating prediction. The data repository is established with the inspection reports of 263 concrete bridges, which fall into four condition levels. Then, the NLP-based textual data processing approach is implemented to calculate the word frequency and to generate word clouds that visualize the characteristics of bridges in different condition levels. Finally, four typical ML techniques are adopted to generate the predictive model of the bridge condition rating. The results indicate that the NLP-based ML prediction model has an accuracy of 89% and is efficient enough to be used for large-scale applications such as condition rating for regional-level bridges.

1. Introduction

The safety of bridges is essential to the functioning of the transportation network and to economic development in our society. During their service life, bridges are subjected to various loads such as temperature [1], wind [2], traffic [3–5], and earthquakes [6], which aggravate the defects of bridges. These defects could cause bridge failures if they are not detected and repaired. Wardhana and Hadipriono [7] studied more than 500 bridge failures in the United States between 1989 and 2000, revealing that 76% of them occurred during operation. Xu et al. [8] analyzed 302 catastrophic highway bridge collapses in China between 2000 and 2014, out of which 171 were in-service bridges and had an average service life of only 18.7 years.

Generally, periodic inspection bridge by bridge with visual and nondestructive techniques is widely adopted around the world, such as in the United States [9, 10], Japan [11, 12], Korea [13], China [14, 15], Australia [16], and European countries [17]. In practice, conditions of bridges are rated into different levels [18], and the deficient bridges are given more attention for load-carrying capacity evaluation [19] or the implementation of a vehicle weight limit [4]. Condition rating is mainly performed based on the textual description of damages to bridge members such as girders, bearings, expansion joints, and piers, and then the results are combined to obtain the overall rating of the bridge. The engineers for condition rating must be familiar with relevant condition rating standards and be experienced with bridge inspection practice. Considering that a condition rating is required for each bridge with a large amount of textual data buried in the inspection reports, it is a time-consuming and labor-intensive task. Therefore, it is paramount to propose methodologies for efficient and accurate condition rating of bridges.

To achieve fast and automated bridge condition rating, machine learning (ML) methods that draw on bridge databases or inspection records have developed rapidly in the past few years [20–22]. State-of-the-art algorithms include artificial neural networks (ANN) [23–25], clustering [26, 27], decision trees (DT) [28, 29], support vector machines [30–33], ensemble learning methods [34–38], and unsupervised learning methods [39], which have shown both high accuracy and efficiency. Using the relevant historical inspection and inventory data and records of maintenance, Huang [40] developed an ANN model to predict bridge deck deterioration with 11 features such as bridge age, deck length, number of lanes, number of spans, and design load. The predicted condition rating showed an accuracy of around 75%. Li and Burgueno [41] adopted several ANN methods to evaluate bridge abutment conditions in the state of Michigan with 8 input variables: bridge length, bridge width, skew angle, age, annual temperature difference, average daily truck traffic, approach surface type, and structural type. Liu and Zhang [42] presented a deep learning-based approach for predicting the condition rating of highway bridge components with 24 input features including geographic region, structural configuration, and other bridge attributes. A case study on the condition rating of deck, superstructure, and substructure demonstrated an accuracy of over 85%. Xia et al. [43–45] developed a condition assessment approach for network-level bridges using machine learning-based techniques. The maintenance scheme was also optimized based on a deterioration model [46].

These studies indicate that ML-based methods have the potential to achieve efficient and reliable condition ratings for bridges. However, they mainly utilized the structural parameters and geographic or environmental features as input variables instead of directly utilizing the detailed textual inspection data. In fact, detailed bridge inspection reports contain rich information on bridge deficiencies, which is also usually required by bridge operators. However, this information is difficult to use directly in ML methods because the textual descriptions in bridge inspection reports are usually nonstandard and follow various writing patterns, which hinders effective information extraction. Moreover, the textual description in the inspection reports is at the component level, while the final condition rating is at the structure level. Hence, a highly nonlinear and complicated relationship exists between the input variables (textual description) and the output variable (bridge condition rating). These two major issues make it difficult to build simple ML models for the prediction of the bridge rating.

To address similar challenges involving huge amounts of textual data, natural language processing (NLP) technology has been adopted recently to convert text data into numerical vectors for better data mining. For instance, Zhang and El-Gohary [47] utilized semantic NLP and machine learning techniques to extract concepts from documents such as building codes and match them to concepts in the industry foundation classes for building information modelling. Le and Jeong [48] implemented NLP to detect data elements from text documents and used machine learning to extract and classify roadway data items. A human-encoded test showed that the precision and recall rate are 92.76% and 81.02%, respectively. Liu and El-Gohary [49] used semisupervised conditional random fields for extracting information on deficiencies and maintenance actions from bridge inspection reports, which achieved a precision of 94.1%. Mangalathu and Burton [50] applied an NLP-based postdisaster damage evaluation approach using the long short-term memory (LSTM) deep learning algorithm to classify building damages after an earthquake.

Based on the above-mentioned background, the present paper aims to develop a bridge condition rating approach that uses textual data directly from inspection reports by combining NLP and ML methods. In total, 263 bridge inspection reports are collected, and NLP is utilized to process the textual data in the inspection reports and convert them into vectors. Afterwards, the vectors are fed to state-of-the-art ML algorithms to train the condition rating predictive model. The novelty of the present paper is threefold:
(1) With the NLP, the textual description in the inspection reports can be directly used as input, and the term frequency (TF), term frequency-inverse document frequency (TF-IDF), and word clouds can be obtained to illustrate the extracted information of the bridges and show a direct visualization of the reports.
(2) The ML methods can achieve high accuracy and automated prediction of the condition rating of the bridges, avoiding the time-consuming and labor-intensive work of experienced engineers.
(3) Since the approach is accurate, efficient, and automatic, it can be extended to large-scale applications, such as condition rating of a regional bridge network.

A novel approach for estimating bridge condition rating is introduced, utilizing NLP and ML techniques. Findings in this study suggest that the NLP-based ML approach can capture the information buried in the inspection reports and provide a promising tool for efficient and accurate bridge condition rating.

2. The Proposed Method for Textual Data-Based Bridge Condition Assessment

With tens of thousands of in-service bridges being inspected every year, it is a challenging task to implement the condition rating. This paper proposes an approach that combines NLP with ML techniques for rapid, automatic, and reliable condition rating. As illustrated in Figure 1, the approach is composed of three steps, i.e., data repository establishment, NLP-based textual data processing, and an ML-based bridge condition rating prediction model. Herein, the datasets are established with inspection reports of 263 highway bridges in China, which are classified into four condition ratings according to the condition levels, say, A, B, C, and D, with the damage level increasing from light to severe. The three steps are described as follows.

In step 1, the data repository is established with inspection reports of a regional network of 263 bridges. Each bridge consists of different components, such as the deck, girder, bearing, expansion joints, pier, and foundation. The defects in each component of the bridge are recorded in the report after inspection by engineers using visual or nondestructive techniques. For example, textual descriptions of concrete cracks could be expressed as “A longitudinal crack is observed in the deck bottom plate with a width of 0.15 mm and a length of 2.2 m.” The condition rating of each bridge is determined as the weighted sum of the damage extent of all the components that are rated by professional engineers, and it is extracted as the output for prediction in the ML model. Note that the guidelines to rate the damage extent of the components are defined by national standards.

In step 2, NLP-based textual data processing is performed on the inspection reports to convert the text into vectors, which can be directly used as the input of the ML model. The text in reports is preprocessed to remove redundant information such as punctuation marks, excessive space, and stop words. Two word frequency analysis parameters, i.e., TF and TF-IDF, will illustrate the extracted information in the inspection reports. Word clouds with TF or TF-IDF can visualize the condition-level information in the inspection reports of bridges in different condition ratings.

In step 3, the textual data processed in step 2 is fed into the ML model to build up the predictive model of the bridge condition ratings. The datasets are divided into training and testing sets as 70% and 30%, respectively. In the training process, the grid search method is adopted to find the hyperparameters with optimal performance for the ML models, while the 10-fold cross-validation (CV) is conducted to avoid any bias due to random sampling of the training dataset. Then the state-of-the-art ML models are evaluated on the testing dataset, and four typical measures, i.e., precision, recall, F1-score, and accuracy, are used to estimate the model performance.
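The four performance measures named in step 3 can be illustrated with a minimal sketch. It assumes per-class precision, recall, and F1-score together with an overall accuracy; the function name `classification_report` and the eight example labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter

def classification_report(y_true, y_pred, labels):
    """Return overall accuracy and per-class precision, recall, and F1-score."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    true_counts = Counter(y_true)   # actual samples per class
    pred_counts = Counter(y_pred)   # predicted samples per class
    # True positives per class: samples whose prediction matches the label.
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    per_class = {}
    for label in labels:
        precision = tp[label] / pred_counts[label] if pred_counts[label] else 0.0
        recall = tp[label] / true_counts[label] if true_counts[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, per_class

# Hypothetical condition levels for eight bridges (not data from the paper).
y_true = ["A", "A", "A", "B", "B", "C", "D", "B"]
y_pred = ["A", "A", "B", "B", "B", "C", "D", "A"]
accuracy, per_class = classification_report(y_true, y_pred, ["A", "B", "C", "D"])
```

Reporting the measures per class matters here because the condition levels are strongly imbalanced, so a high overall accuracy alone could hide poor recall on the rare severe levels.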

3. Data Repository for the Bridge Condition Assessment

3.1. Introduction on the Regional Network of Bridges with Inspection Reports

As mentioned above, the data repository is established with inspection reports of a regional network of highway bridges located in Jiangsu Province, China. A total of 263 bridges in different health conditions are collected, which are reinforced concrete or prestressed concrete bridges. The construction of these bridges ranges from the 1980s to the 2000s, and they are mostly in satisfactory condition. The spatial distributions of the inspected bridges on the highway network are illustrated in Figure 2, in which we can find that they are located on 8 different highways, say, highways G2, G15, G25, G2501, G36, G40, G205, and S102, where G and S represent national and provincial highway, respectively. They almost cover the whole area of Jiangsu Province, and are representative of regional bridges.

The variations of bridge types and condition ratings are further illustrated in Figure 3. Figure 3(a) shows the distribution of the bridges with different span lengths, i.e., super long-span bridges (>150 m), long-span bridges (40–150 m), medium-span bridges (20–40 m), small-span bridges (5–20 m), and culverts (<5 m). Among them, medium-span bridges and culverts account for the largest and smallest shares, 50.57% and 6.46%, respectively. The distribution of bridges with condition ratings is displayed in Figure 3(b) according to the results of the inspection reports. Among the 263 bridges, 164, 82, 13, and 4 bridges were labelled as Levels A, B, C, and D, respectively. Note that the damage becomes more severe from Level A to Level D, and each level requires different maintenance actions. The dataset covers the major bridge types and condition ratings, which can potentially contribute to predicting the condition rating for a new bridge.

For an illustration of the bridge inspection reports, an example is shown herein, and the textual descriptions are used as input in the machine learning model. The contents of the inspection report are shown in Figure 4 which is translated into English. Chapter 1 usually describes the overview of the inspected bridges, such as the location and type. It is followed by two chapters showing the purpose and basis of the inspection and inspection instruments and components, respectively. Chapter 4 introduces the inspection content, method, and condition rating procedures. Chapter 5 provides the statistical analysis of the inspection results, while Chapter 6 is focused on analysis and maintenance suggestions about the main typical defects. The conclusion is given, and the annexe shows the summary of information on the bridges and defects. Figure 5 illustrates an example of the translated description of defects in the expansion joints extracted from Chapter 6 of the inspection report. The main defects are described, and the causes are then explained. Suggestions are given finally for treatment of the defects.

3.2. Condition Rating Rules of the Bridges

The inspection reports were completed after the field inspection of the bridges, carried out visually or with nondestructive devices, and the condition rating of the bridges was then conducted by experienced engineers through information processing of the textual descriptions of bridge member damages. The engineers should have professional knowledge about the bridges and be familiar with relevant inspection and maintenance standards. Note that the number of condition rating levels differs between countries; for example, 10, 4, and 5 rating levels are adopted in the United States, Japan, and China, respectively. In this paper, the condition rating was conducted following the Chinese code for maintenance of highway bridges [51], where the bridge health condition is divided into five levels, i.e., Levels 1 to 5. Bridges of Level 1 are in good condition. Bridges of Level 2 require only minor repairs or daily maintenance to ensure their safety. Bridges of Level 3 are in a relatively poor state with many defects, and a small number of functional defects have appeared that need to be repaired accordingly. In bridges of Levels 4 and 5, many severe material defects have appeared and the structural functionality has greatly deteriorated; thus, the safety of these bridges is questionable and they should be closed to traffic for special inspection and major repairs. Since only a limited number of inspected bridges in the dataset are rated as Levels 4 and 5, these two levels are combined, and four condition levels are finally defined, i.e., Levels A, B, C, and D, with the damage increasing from light to severe. A detailed description of the four condition levels is presented in Table 1.

According to the current maintenance standard of highway bridges [51], the detailed condition rating rule is briefly described below. Note that no cable-supported bridges are included in the 263 bridges, so the rating method is applicable to all of them. The condition rating of a bridge is defined as follows:

$$D_r = 100 - \frac{1}{5}\sum_{i=1}^{n} R_i W_i, \tag{1}$$

where $D_r$ is the final rating score of the bridge, ranging from 0 to 100; $R_i$ is the rating of each member, ranging from 0 to 5; $W_i$ is the weighting coefficient of each member, which satisfies $\sum_{i=1}^{n} W_i = 100$; and $n$ is the total number of the components. According to the rating score, the condition is classified into the above-mentioned four levels as follows:
(1) Level A:
(2) Level B:
(3) Level C:
(4) Level D:
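The scoring rule can be sketched in a few lines of code. This is only an illustration: it assumes that the score equals 100 minus one fifth of the weighted sum of the member ratings, a form chosen because it maps all-zero ratings to 100 and all-five ratings to 0 when the weights sum to 100. The function names, the four example weights, and the level cut-offs are hypothetical placeholders; the actual weights are those of Table 2, and the actual cut-offs are those of the standard [51].

```python
def bridge_score(ratings, weights):
    """Overall score on a 0-100 scale from member ratings (0-5) and weights summing to 100.

    Assumed form: 100 - (1/5) * sum(rating_i * weight_i).
    """
    if len(ratings) != len(weights):
        raise ValueError("ratings and weights must have the same length")
    if abs(sum(weights) - 100) > 1e-9:
        raise ValueError("weights must sum to 100")
    if any(not 0 <= r <= 5 for r in ratings):
        raise ValueError("each member rating must lie in [0, 5]")
    return 100 - sum(r * w for r, w in zip(ratings, weights)) / 5

def condition_level(score, thresholds):
    """Map a score to a level, given (lower_bound, level) pairs sorted by descending bound."""
    for lower_bound, level in thresholds:
        if score >= lower_bound:
            return level
    raise ValueError("score lies below every threshold")

# Hypothetical weights and ratings for four members (Table 2 holds the real weights).
weights = [40, 30, 20, 10]
ratings = [1, 2, 0, 3]
score = bridge_score(ratings, weights)

# Placeholder cut-offs chosen only for illustration, not the values of the standard.
example_thresholds = [(75, "A"), (50, "B"), (25, "C"), (0, "D")]
level = condition_level(score, example_thresholds)
```

Keeping the cut-offs as an argument rather than hard-coding them makes it explicit that they come from the governing standard and can be swapped for a different rating scheme.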

The weighting coefficient of each member is shown in Table 2. It can be observed that different members in the bridge have different weights owing to their different roles in the load-carrying mechanism. The substructures such as abutment, pier, and foundation have the highest weighting coefficient, followed by the main and ordinary load-bearing members in superstructures. In addition, the rating criterion of each member is given in Table 3, which is determined according to the defect condition evaluated by experienced engineers. The specific rule involves three steps. First, the member is rated as 0, 1, or 2 according to the degree of the observed defects (from small to big and/or few to many and/or light to severe). Second, the rating is modified considering the effect of the member defects on the structural function, which is divided into three levels, i.e., no or unimportant effect, small or secondary effect, and big or important effect. Third, the combined rating is modified again to reflect the time-dependent evolution of the defects: if the defects are stable, the score is reduced by 1; if the defects develop fast, the score is increased by 1; if the defects develop slowly, the score remains the same. Finally, the score of the member ranges from 0 to 5, covering member conditions from perfect through good, fair, and poor to dangerous. In summary, this rating procedure is complicated and requires considerable prior professional knowledge, which makes it time-consuming and labor-intensive.

4. NLP-Based Textual Data Processing and Information Extraction

4.1. NLP-Based Processing of the Textual Data

The bridge reports in the dataset are written in human language, so they are usually very long to include as much information as possible. Only experienced engineers can extract key information from the report. In order to develop fast and automatic predictive models for condition rating, a reliable information extraction method is required to process a large amount of textual description data in the reports. Herein, we adopted NLP to conduct this task, in which NLP refers to transforming human language into machine language that computers can process.

The NLP process involves tasks such as automatic information retrieval, text and speech processing, and language translation. NLP methods fall mainly into three categories: rule-based, statistics-based, and machine learning-based. Early NLP methods were usually based on hand-crafted rules for extracting information from a large amount of textual data [52]. The common rules are usually related to syntax or semantics, which refer to grammar and meaning, respectively. Statistics-based NLP methods apply probability to sequences of words and usually use TF and TF-IDF as measures in the framework. ML-based NLP makes use of algorithms that learn from input textual data and focus on the most common patterns in the data.

In this paper, NLP is used to analyze the lexical and grammatical structure on the basis of text in the reports and then convert the whole text into discretized word vectors. To this end, the following preprocessing procedures are conducted on the text data in the reports: (1) removing punctuation marks such as periods, commas, and brackets; (2) removing excessive space; and (3) removing stop words such as articles, conjunctions, pronouns, and prepositions. By removal of these terms, little information is lost, and more focus can be given to the important words. In addition, the size of the dataset can be reduced which saves much training time in the next step. A few typical examples are demonstrated in Table 4.

It is observed that stop words such as “the,” “is,” “and,” and “in” are removed by the processing procedure, which also avoids the mistaken separation of meaningful descriptions. For example, the word vectors extracted by NLP are “superstructure,” “box girder,” “diaphragm,” “vertical crack,” “diagonal crack,” “wet joints,” “transversely cracked,” and “efflorescence” in condition Level B, which correctly represent the information in the original textual descriptions.
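The three preprocessing operations can be sketched as follows. This is a minimal illustration on an English sentence, not the pipeline used for the Chinese reports: the stop-word list is a small hypothetical sample, tokens are single words (so multiword terms such as “box girder” are not kept together), and the choice to preserve decimal points inside numbers is an assumption made so that measurements like 0.15 mm survive.

```python
import re

# Small illustrative stop-word list; a real pipeline would use a fuller,
# language-specific list.
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "in", "of", "with", "on", "at"}

def preprocess(text):
    """Lowercase, strip punctuation, collapse whitespace, and drop stop words."""
    text = text.lower()
    # (1) Replace punctuation with spaces, except a mark sitting between two
    #     digits, so that decimal numbers such as "0.15" stay intact.
    text = re.sub(r"(?<!\d)[^\w\s]|[^\w\s](?!\d)", " ", text)
    # (2) split() collapses runs of whitespace and trims both ends.
    tokens = text.split()
    # (3) Remove stop words.
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess(
    "A longitudinal crack is observed in the deck bottom plate, "
    "with a width of 0.15 mm and a length of 2.2 m."
)
```

For Chinese text, step (2) would be replaced by a word segmentation step, since words are not separated by spaces.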

4.2. Word Vectorization and Frequency Analysis

After preprocessing of the text datasets, the textual descriptions are converted to word vectors that can be directly fed into the ML algorithms, and word frequency analysis can also be conducted to obtain the frequently used words. Here, two measures are adopted to reflect the importance of words in the textual data of inspection reports, i.e., TF and TF-IDF. Term frequency (TF) is defined as the frequency with which a word occurs in the text, which reflects its importance and is expressed as

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \tag{2}$$

where $n_{i,j}$ represents the number of times a word $w_i$ occurs in a text $d_j$; and $\sum_{k} n_{k,j}$ represents the total number of words in text $d_j$.

It can be seen from equation (2) that a word will have a large TF value if it often occurs in the report. However, some words that are not so relevant to the defects of the member will also have a large TF value, e.g., the words about the member types. To solve this problem, a second index, inverse document frequency (IDF), is introduced to measure the informativeness of a word $w_i$, i.e.,

$$\mathrm{IDF}_{i} = \log\frac{N}{\mathrm{df}_i}, \tag{3}$$

where $N$ represents the total number of documents in the data repository; and $\mathrm{df}_i$ represents the number of documents in the document set $D$ that contain the word $w_i$.

Evidently, the more common a word is across the documents, the larger the denominator, the smaller the IDF, and the closer it gets to zero. Then, the index TF-IDF is the product of TF and IDF as follows:

$$\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_{i}. \tag{4}$$

It can be seen that TF-IDF increases with the number of occurrences of a word in the document and decreases with the number of documents in the whole document set that contain the word. Therefore, TF-IDF tends to filter out common words and retain important ones. TF alone is sometimes insufficient to measure the importance of a word in a report; TF-IDF addresses this while remaining simple, fast, and easy to understand.
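The two measures can be computed with a short sketch. It assumes TF is the word count divided by the number of words in the text and IDF is the logarithm of the number of documents divided by the number of documents containing the word, without smoothing; library implementations often add smoothing terms, so their values may differ. The function names and the three toy reports are hypothetical.

```python
import math
from collections import Counter

def term_frequency(tokens):
    """TF of each word: its count divided by the total number of words in the text."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}

def inverse_document_frequency(documents):
    """IDF of each word: log(N / df), where df counts the documents containing it."""
    n_docs = len(documents)
    df = Counter(word for doc in documents for word in set(doc))
    return {word: math.log(n_docs / d) for word, d in df.items()}

def tf_idf(documents):
    """One {word: TF * IDF} dictionary per document."""
    idf = inverse_document_frequency(documents)
    return [{w: tf * idf[w] for w, tf in term_frequency(doc).items()}
            for doc in documents]

# Three hypothetical, already preprocessed reports (not data from the paper).
docs = [
    ["deck", "crack", "deck", "bearing"],
    ["deck", "rebar", "rusting", "crack"],
    ["deck", "pier", "spalling", "pier"],
]
weights = tf_idf(docs)
```

In this toy set, “deck” appears in every report, so its IDF and hence its TF-IDF are zero even though its TF is the highest in the first report, which is the filtering effect described above.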

After the vectorization of the bridge reports and calculation of the TF and TF-IDF measures, we can obtain the word cloud, where the key words are extracted to demonstrate their importance. The larger the words shown in the word cloud, the higher TF they have. Figure 6 gives the word clouds for the four condition Levels A, B, C, and D to make a direct comparison between the key words in the textual data of the four different levels. It is illustrated that the frequent words are highly related to the health condition of the inspected bridges regarding the existing deficiencies, which means the information on bridge deficiencies can be efficiently captured and visualized with the NLP. Note that though the bridge inspection reports were in Chinese, the proposed approach is also applicable to inspection reports in other languages.

In addition, the top 10 words in each of the four condition levels, ranked according to their TF values and TF-IDF values, are also analyzed herein and listed in Tables 5–8. It is noticed that in bridges with condition Level A (Table 5), the bridge component names such as “deck,” “superstructure,” and “substructure” appear the most, with TF values of 1.45, 0.99, and 0.99, respectively. The top TF-IDF values of 0.426 and 0.413 belong to “inspection” and “deck.” In comparison, words describing the bridge deficiency appear less frequently, and “crack” has TF and TF-IDF values of 0.98 and 0.244, respectively. However, in bridges with condition Level B (Table 6), words that describe the bridge deficiency become more frequent, e.g., “crack” and “damage” have TF values of 3.05 and 2.83, respectively, and are the two most frequent words. The top three words ranked by TF-IDF value are “rebar rusting,” “damage,” and “crack,” whose TF-IDF values are 0.427, 0.413, and 0.409, respectively.

In Table 7, the two most frequent words are “damage” and “crack,” whose TF values are 3.08 and 2.85 and TF-IDF values are 0.421 and 0.358, respectively. It is found that the frequency of words describing deficiencies increases markedly in Levels B and C compared with Level A. Furthermore, for bridges with Level D (Table 8), the TF values for “cracking” and “crack” reach 8.75 and 8.00, and the corresponding TF-IDF values are 0.567 and 0.487. The TF and TF-IDF values for other words describing deficiencies are also very high. For example, the TF values are 6.50, 5.75, and 5.00 for “rebar rusting,” “damage,” and “exposed rebar,” respectively.

In general, it is observed that words of bridge member names are most frequent in Level A, while words describing bridge defects such as “crack” and “damage” appear fewer times. In comparison, the words of bridge deficiency are more frequent in bridges of the other three condition levels. The frequent words describing defects observed in Level D clearly stand out. However, Levels B and C can hardly be separated due to the close frequency of words representing defects with TF or TF-IDF. The tabulated word frequencies offer a comprehensive overview of the damages across all bridges at each level. It presents an integrated perspective on the words associated with the condition of the bridges. However, it cannot aid in identifying the damage level of each specific bridge. Therefore, the TF-IDF of the inspection report for each bridge is utilized as the input in the machine learning model for the prediction of the condition rating of the bridges.

5. Adopted Machine Learning Techniques

Four commonly used algorithms, i.e., DT, SVM, RF, and GBRT, are employed to build up accurate and reliable prediction models for the condition rating of bridges. A brief introduction of the backgrounds is first presented herein.

5.1. Decision Tree (DT)

DT is a tree-structured model that can be used for both classification and regression problems [53]. A DT is made up of tree nodes and directed edges. Generally, there are three kinds of nodes in a DT, i.e., a root node, several internal nodes, and several leaf nodes. The decision process of the model starts from the root node, where a certain feature of the sample is tested and the sample is passed to a child internal node according to the result. The sample is tested and split recursively until it reaches a leaf node, which is taken as the final decision result. The path from the root node to each leaf node corresponds to a decision path. According to the split criterion, different algorithms for building DT can be used. Here, the classification and regression tree (CART) algorithm, which is based on the mean square error, is adopted. Supposing the input space can be divided into $M$ subspaces $R_1, R_2, \ldots, R_M$, the DT model can be expressed as

$$f(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m), \tag{5}$$

where $x$ is the input variables; $I(\cdot)$ is the indicator function; and $c_m$ is the prediction value by the leaf node.

Moreover, to avoid the over-fitting issue, tree pruning should be also carried out using some specific loss function to make the tree simple and robust. The major model parameters are the maximum depth of the tree; maximum leaf nodes; minimum samples for split; minimum samples of leaf node; etc.

5.2. Support Vector Machine (SVM)

Support vector machine (SVM) is a kind of supervised ML method proposed by Vapnik with colleagues at AT&T Bell Laboratories [54]. In essence, the input variables are mapped to a high-dimensional feature space by nonlinear transformation, and then a hyperplane is found in the space for linear classification or regression. This mapping process generally adopts the kernel function method. The learning strategy of SVM is to maximize the margin between the hyperplane and the nearest samples on either side of it, which can be abstracted into a convex quadratic programming problem, so the learning algorithm of SVM is an optimization algorithm for solving convex quadratic programming. The main parameters are the parameters of the kernel function and the penalty coefficient in the optimization. Taking the binary classification problem as an example, the SVM model can be defined as

$$f(x) = \operatorname{sign}\left(w^{T} x + b\right), \tag{6}$$

where $w$ and $b$ denote the normal vector of the hyperplane and the bias constant (its offset from zero), respectively.

5.3. Random Forest (RF)

The DT and SVM, like some other supervised learning methods, are individual-type algorithms, which only generate a single predictive model. In ML, there are also ensemble learning methods that generate several predictive models and then combine them into one according to some rules, which has been shown to be superior to individual learning methods. RF is one typical ensemble method that belongs to the bagging family [55]. It takes advantage of two powerful techniques, i.e., bootstrap and random feature selection. At every step of building a tree, samples are randomly selected as a subset with bootstrap from the entire database, and features are randomly selected and used as the split node to build a DT with the subset. After several (or $T$) steps, one obtains $T$ DTs from the above procedure, and the final output is the average of all the outputs by the trees as follows:

$$F(x) = \frac{1}{T}\sum_{t=1}^{T} f_t(x), \tag{7}$$

where $f_t(x)$ is the weak learner by the DT method at step $t$. In addition to the original parameters of DT, the number of trees is an extra yet important parameter.

5.4. Gradient Boosting Regression Tree (GBRT)

GBRT is another ensemble learning method [56], which is also developed based on the DT method. Unlike RF, which builds single predictive models in parallel using bootstrap, it belongs to the boosting family, which generates single predictive models in sequence according to the model performance at every step. The initial database is used to train a single model, and then each sample in the database gets a weight according to the prediction accuracy of this model. The weighted new dataset is used to train a new model in the next step. After several steps, one attains several single models, and each model has a weight computed based on its loss function. The final result is the weighted sum of these models, i.e.,

$$F(x) = \sum_{t=1}^{T} \alpha_t f_t(x), \tag{8}$$

where $\alpha_t$ is the weight of the weak learner $f_t(x)$ at step $t$. Similar to RF, a major parameter for GBRT is the number of trees. In addition, a learning rate is usually employed to overcome the overfitting issue.
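The difference between the two combination rules, averaging for RF and a weighted sum for boosting, can be shown with a small sketch. It covers only the aggregation step, not the training of the trees; the three one-split “stumps” and the weights are hypothetical stand-ins for trained weak learners.

```python
def average_ensemble(learners, x):
    """Bagging-style output: the plain mean of the weak learners' predictions."""
    return sum(f(x) for f in learners) / len(learners)

def weighted_ensemble(learners, weights, x):
    """Boosting-style output: the weighted sum of the weak learners' predictions."""
    if len(learners) != len(weights):
        raise ValueError("one weight is required per learner")
    return sum(a * f(x) for a, f in zip(weights, learners))

# Three hypothetical weak learners, each a one-split "stump" on a scalar input.
stumps = [
    lambda x: 1.0 if x > 0.2 else 0.0,
    lambda x: 1.0 if x > 0.5 else 0.0,
    lambda x: 1.0 if x > 0.8 else 0.0,
]

# Two of the three stumps fire at x = 0.6.
rf_like = average_ensemble(stumps, 0.6)
boost_like = weighted_ensemble(stumps, [0.5, 0.3, 0.2], 0.6)
```

With equal weights of one third, the weighted sum reduces to the average, so the averaging rule can be read as a special case of the weighted rule.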

6. Implementation and Application of the Proposed Method

To implement the state-of-the-art machine learning models, i.e., the single learner DT, SVM, and the ensemble learner RF, GBRT, the whole database is firstly divided into 70% training and 30% testing sets. Then the hyperparameters are determined for the learners. Finally, the models are adopted on the testing sets to predict the condition rating of the bridges.

6.1. Tuning of Hyperparameters

In the machine learning models, hyperparameters should be specified in advance. In single learners like DT, key parameters include the maximum depth of the tree, the minimum number of samples in a leaf node, and the minimum number of samples required for a split. In SVM, one important parameter is the kernel, for which the radial basis function (rbf) is adopted. Two other parameters are C and gamma: C penalizes the classification error, and gamma controls the curvature of the decision boundary. Similarly, hyperparameters are also required in the ensemble learners, i.e., RF and GBRT, such as the number of estimators, the maximum depth of the trees, the minimum number of samples in a leaf node, and the learning rate.

Inappropriate selection of hyperparameters will lead to problems of underfitting or overfitting. Therefore, the following strategy is adopted. Firstly, a range of parameter values is selected based on previous experience, and the grid search method is then used to determine the values in an iterative manner. 10-fold cross-validation is carried out to reduce the bias induced by random sampling of the training set [57, 58]. With this strategy, the optimal hyperparameters are obtained as shown in Table 9.
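The combination of grid search and 10-fold cross-validation can be sketched as follows. This is a minimal sketch rather than the procedure actually run in this study: the helper names `k_fold_indices` and `grid_search`, the candidate grid, and the scoring function `toy_score` are hypothetical. In a real run, the scoring function would train a learner on the training fold and return its accuracy on the validation fold.

```python
import itertools
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val

def grid_search(param_grid, fit_and_score, n_samples, k=10):
    """Return the parameter combination with the best mean k-fold score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = [fit_and_score(params, tr, va)
                  for tr, va in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

def toy_score(params, train_idx, val_idx):
    """Hypothetical stand-in for 'train on train_idx, score on val_idx'."""
    return (1.0 - 0.1 * abs(params["max_depth"] - 5)
                - 0.05 * abs(params["min_samples_leaf"] - 2))

# Hypothetical candidate values; the tuned values are those reported in Table 9.
grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 2, 4]}
best_params, best_score = grid_search(grid, toy_score, n_samples=184)
print(best_params, best_score)
```

Averaging the score over the folds means that each candidate is judged on every training sample exactly once, which makes the choice less sensitive to one particular random split.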

Moreover, the learning curves of the four ML models are provided in Figure 7 to consider the effects of the size of the training data. For the DT learner, the training accuracy is low at the beginning but increases quickly and becomes stable. In comparison, the other three learners share a similar trend: the training accuracy starts at around 1.0 but decreases gradually before stabilizing. This is because overfitting usually occurs when the training samples are very few. The accuracy then becomes stable as the amount of training data increases. On the other hand, the accuracy on the testing sets starts at low values for all models and improves with the number of training samples. It can be observed that the model accuracy becomes stable when the training samples reach around 180 (around 70% of the whole dataset). This supports the split of the whole dataset into 70% and 30% as reasonable.

6.2. Prediction Results and Discussion

After choosing the hyperparameters, the prediction is carried out. The performance of the different machine learning models is evaluated using a confusion matrix. The confusion matrix is a plot of actual versus predicted classes, in which an element $c_{ij}$ is the number of samples known to be in class $i$ but predicted as class $j$. It is hence clear that the diagonal and off-diagonal elements represent the number of correctly and incorrectly classified samples, respectively.

Four common measures of model performance are used for the evaluation of the prediction results. The accuracy of a machine learning model is defined as the ratio between the number of correctly classified samples and the total number of samples. Precision is defined as the percentage of the samples predicted to be in a class that actually belong to it, while recall is defined as the percentage of the actual samples of a class that are correctly predicted. Higher precision and recall indicate better prediction performance of the machine learning algorithm. When precision and recall rank the models differently, a direct comparison is difficult, so the F1-score, the harmonic mean of precision and recall, is used as a single measure:

$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}.$$
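The four measures can be computed directly from a confusion matrix, as in the minimal sketch below. The function names and the six toy labels are made up for illustration and are not the data of this study. When a class is never predicted, its precision is undefined, which is one possible reason for an F1-score to be reported as N/A.

```python
def confusion_matrix(actual, predicted, labels):
    """matrix[i][j] = number of samples in class i predicted as class j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Correctly classified samples (diagonal) over all samples."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total

def class_metrics(matrix, k):
    """Precision, recall, and F1-score of class k (None when undefined)."""
    tp = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)  # column sum
    actual_k = sum(matrix[k])                    # row sum
    precision = tp / predicted_k if predicted_k else None
    recall = tp / actual_k if actual_k else None
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy labels (made up): Level C is never predicted.
labels = ["A", "B", "C"]
actual = ["A", "A", "A", "B", "B", "C"]
predicted = ["A", "A", "B", "B", "B", "B"]
matrix = confusion_matrix(actual, predicted, labels)
print(matrix)  # prints [[2, 1, 0], [0, 2, 0], [0, 1, 0]]
for k, label in enumerate(labels):
    print(label, class_metrics(matrix, k))
```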

The confusion matrices for the training and testing sets with the four machine learning algorithms are shown in Figure 8. It can be seen that the DT, SVM, RF, and GBRT models have an accuracy of 82%, 84%, 81%, and 89% on the testing set, respectively. They all display acceptable prediction accuracy. The model performance on Levels A and B is usually better than on Levels C and D, which is due to class imbalance, i.e., there are fewer data in Levels C and D. This issue was shown in previous studies [59] to impair the classification performance. Although remedies for class imbalance have been investigated in the machine learning field, the fundamental solution is to incorporate more inspection data on the more deficient (Levels C and D) bridges.

The results of the prediction with the four machine learning models are summarized in Table 10. On condition rating Level A, the F1-scores of the four models are 0.88, 0.89, 0.88, and 0.93, while on Level B, the results are 0.78, 0.78, 0.76, and 0.88. Similarly, for Level C samples, the F1-scores are 0.75, 0.67, N/A, and 0.40 for the four models, respectively. The results indicate that the models generally perform best on Level A, followed by Level B and Level C. This is due to the difference in the number of training samples, since more training data generally yield better prediction performance. There are only 9 and 2 training samples in condition rating Levels C and D, respectively, and the prediction on the testing sets is poor compared with Levels A and B. Besides, GBRT demonstrates the highest F1-scores on Levels A and B, which contain most of the samples, although DT and SVM score higher on Level C. These results are instructive for future work, which should emphasize the collection of inspection reports of bridges at the more deficient levels.

Previous studies were usually devoted to the prediction of the health conditions of bridge components such as concrete slabs [60], concrete decks [40], bridge abutment walls [41], and the superstructure and substructure [42]. In contrast, in this study the condition rating of the whole bridge structure is carried out to provide comprehensive guidance for bridge stakeholders. Furthermore, a prediction accuracy of 89% is demonstrated with the case study of 263 highway bridges, suggesting that the approach is applicable to larger bridge inventories.

7. Conclusions

Condition rating of existing bridges is an essential indicator of bridge health, which is closely connected with potential maintenance procedures. As professional background in inspection practice and familiarity with the relevant maintenance standards are involved in the condition rating process, the traditional human-based assessment is often time-consuming and laborious.

This paper explores a rapid condition rating method using a machine learning model based on textual inspection data of bridges. The NLP technique is used to extract the information in the text description of the bridge inspection reports and convert it into numerical vectors. The methodology is verified with the condition rating of 263 highway bridges of different types and health conditions in Jiangsu, China. The main contributions of this study are summarized as follows:
(1) Automatic information extraction is achieved using an NLP-based technique, i.e., word vectorization, which can capture the most relevant features of the existing bridge deficiencies buried in the enormous and complicated textual data in inspection reports. This has been clearly demonstrated with the word clouds of the bridges in four condition rating levels.
(2) Based on knowledge about bridge deficiencies and inspection practice, TF and TF-IDF can illustrate the hidden influence of important words in the condition rating of bridges. With the measures of the top 10 frequent words, the abundant textual data in inspection reports are adaptively correlated with the condition rating of the bridges. This "explainable" nature of the approach can provide guidance for inspectors to improve efficiency in the inspection practice.
(3) The ML-based approach utilizes the valuable yet nonstandard textual information in the bridge inspection report to achieve efficient and accurate bridge condition rating prediction. This is significant as the details of component-level defects are intelligently exploited to realize bridge-level assessment.
(4) With state-of-the-art machine learning models, a close agreement is obtained between the predicted and the actual condition ratings. The results indicate that the most accurate model in this study, i.e., GBRT, can reach an accuracy of 89% on the testing sets.

The proposed method saves much time and labor in condition rating and can significantly complement the traditional human-based assessment. It is also worth noting that the proposed method uses words and frequencies as input and can be applied to inspection reports in different languages. A limitation is the imbalanced dataset, with fewer samples in classes such as Level D, which may affect the performance of the trained model. Future work will focus on collecting inspection reports of bridges with severe damage conditions. With rapid and reliable condition ratings, maintenance actions can be focused more effectively on bridges with poor health conditions.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Disclosure

The conclusions and opinions in this paper are those of the authors and do not necessarily reflect those of the mentioned institution.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

The authors greatly appreciate the financial support from the Natural Science Foundation of Jiangsu Province (Grant no. BK20211564), the National Natural Science Foundation of China (Grant no. 52078119), the Zhi-Shan Scholarship from Southeast University, FCT/MCTES (PIDDAC) under project EXPL/ECI-EGC/1324/2021, and the Start-up Research Fund of Southeast University (RF1028623304). We also acknowledge the bridge inspection company for providing the highway bridge inspection reports in this study.