Abstract

Automatic classification of femur trochanteric fractures is very valuable in clinical diagnostic practice. However, developing a system with high classification performance remains challenging due to the varied locations, shapes, and contextual information of the fracture regions. To tackle this challenge, we propose a novel dense dilated attention (DDA) network for more accurate classification of 31A1/31A2/31A3 fractures from X-ray images by incorporating a DDA layer. Through this layer, multiscale, contextual, and attentive features are encoded at different depths of the network, thereby improving the feature learning ability of the classification network and yielding better classification performance. To validate the effectiveness of the DDA network, we conduct extensive experiments on annotated femur trochanteric fracture data samples, and the experimental results demonstrate that the proposed DDA network achieves competitive classification performance compared with other methods.

1. Introduction

Femur trochanteric fracture is one of the most common fractures among elderly people. In particular, with the rapid growth of the aging population worldwide, the incidence of this fracture is increasing rapidly, which severely threatens the health of elderly people. Moreover, since this fracture can lead to high mortality rates and dramatically affect patients' quality of life, effective and timely treatment is essential to relieve patients' pain during clinical diagnosis. Currently, the most effective way to diagnose this condition is to use medical imaging such as X-rays or computed tomography (CT) to classify the type of fracture and then apply an appropriate treatment plan based on the corresponding diagnosis. Typically, the AO/OTA classification criterion has been the most frequently used and reliable method for assessing the condition of the fracture in clinical diagnosis. In this criterion, there are three types, i.e., 31A1, 31A2, and 31A3 (as shown in Figure 1), where 31A1 represents the simple pertrochanteric fracture, 31A2 denotes the multifragmentary pertrochanteric fracture with an incompetent lateral wall (≤20.5 mm), and 31A3 is the intertrochanteric (reverse obliquity) fracture [1]. Nevertheless, the conventional diagnosis method inspects patient images slice by slice, which is usually tedious and time-consuming for radiologists; moreover, since radiologists differ in clinical experience, the final diagnosis is liable to be empirical and subjective, which may hamper the follow-up treatment plan. To tackle this challenge, a practicable way is to design a computer-aided fracture diagnosis system [2–6] that helps radiologists classify fracture types automatically. In the past, a considerable number of studies have been proposed. For example, Demir et al. [7] developed a novel exemplar pyramid method for humerus fractures; it extracted histogram of oriented gradients and local binary pattern features from the input images and then combined them with four conventional classifiers to classify the fractures. Boudissa et al. [8] explored the influence of semiautomatic bone-fragment segmentation on the reproducibility of fracture classification and claimed that, with the assistance of this technique, the classification accuracy could be effectively improved. Additionally, in [9], the authors proposed a 3D intertrochanteric fracture classification system that used a Hausdorff distance-based K-means method to classify the fractures into five types; the experimental results showed that the unsupervised K-means method could gain promising classification performance with clinical significance. Cho et al. [10] evaluated 3D CT images for boosting the diagnosis performance of femur trochanteric fractures and showed that incorporating CT could efficiently improve the reproducibility of stability assessment for these fractures. Mall et al. [11] utilized different machine learning methods with the gray level cooccurrence matrix (GLCM) to classify fracture versus no-fracture categories and showed that the proposed method could gain significant improvement on different evaluation metrics. Despite the success those methods have achieved on fracture classification tasks, they have deficiencies in capturing robust and high-level semantic features due to their reliance on predefined hand-crafted features.

Recently, deep convolutional neural networks (CNNs) have proved their effectiveness in many computer vision tasks [12–16]. For instance, Lindsey et al. [17] proposed a deep CNN model based on the U-Net structure to achieve automatic detection of wrist fractures and evaluated it on two different datasets; the results demonstrated that the proposed model could boost clinical diagnosis performance. Krogue et al. [18] labeled 3026 hip fractures and trained a DenseNet on them to achieve automatic detection of hip fractures. Similarly, in [19], the authors utilized Faster R-CNN [20] to locate and classify distal radius fractures automatically, obtaining a mean average precision of 0.866. To learn more high-level features, the authors of [21] employed a cropping process with the Inception V3 network to filter out unnecessary parts, leading to an improvement in fracture detection. Besides, in [22], the authors used an Inception-ResNet Faster R-CNN architecture to construct a wrist fracture detection model and tested it on an unseen dataset, proving that the designed model could gain high sensitivity and specificity.

Although those methods, especially the CNN-based ones, have gained promising results on fracture classification tasks, an automatic fracture classification model should be simple and stable and provide effective information for the follow-up treatment plan. Specifically, femur trochanteric fractures usually vary in location, shape, and contextual information in clinical practice, which makes it challenging to achieve higher classification performance. Moreover, few works have considered contextual information at different scales, which may further limit the capability of classification models. To tackle those challenges and efficiently improve the ability to learn strong representations of the fracture regions, in this paper we develop a dense dilated attention (DDA) network to aggregate multiscale, contextual, and attentive features from the femur trochanteric fracture region. Specifically, in our DDA network, we combine the dense connection with dilated convolutions of different dilated rates to learn multiscale representations; meanwhile, the dense connection also alleviates the vanishing gradient problem and enables the network to reuse hierarchical features. Furthermore, a dilated attention (DA) module is designed to encourage the network to encode more contextual and attentive representations automatically. To validate the effectiveness of the DDA network, we perform extensive experiments on femur trochanteric fracture images, and the experimental results show that our proposed DDA network can efficiently improve the classification performance by successfully extracting discriminative features from the input image.

The rest of the paper is organized as follows: Section 2 presents the details of our proposed DDA network, and in Section 3, we first introduce the experimental data and evaluation metrics and then show the comparison results of different experiment settings. Finally, an elaborate discussion and conclusion of this paper are given in Section 4.

2. Methodology

The automatic classification of femur trochanteric fracture is a challenging task due to the complex contextual information and varied fracture regions. Hence, improving the network's ability to extract multiscale representations, contextual information, and intensity details is particularly important for accurate femur trochanteric fracture classification. To address these challenges, a DDA network is developed for accurate classification of the fracture categories; in the following subsections, we provide detailed descriptions of the network architecture and the DDA layer.

2.1. Network Architecture

As illustrated in Figure 2, given an X-ray image as the input of the DDA network, it first passes through a series of convolution and max-pooling layers; a DDA layer is then inserted in the middle of the network to refine the feature representations, which will be described in detail in Subsection 2.3. After that, the final prediction category is output by a fully connected (FC) layer with softmax activation in an end-to-end manner. The detailed parameters of the network are shown in Table 1. Notably, to preserve more spatial information of the image, we do not use strided convolutions. The ReLU activation is used to learn more nonlinear information, and a batch normalization layer is utilized after each nonlinear activation to accelerate the convergence of the network.
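Under assumed layer sizes (channel counts, two pooling stages, and a 224×224 input, none of which are specified in the text), the overall pipeline can be sketched as follows; the DDA layer of Subsection 2.3 would sit between the two convolution stages:

```python
import torch
from torch import nn

# Minimal skeleton of the Figure 2 pipeline: stride-1 convolutions with ReLU
# and batch normalization, max pooling, then an FC layer with softmax.
# Channel counts, depth, and the 224x224 input size are illustrative assumptions.
net = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=1, padding=1), nn.ReLU(), nn.BatchNorm2d(16),
    nn.MaxPool2d(2),                       # 224 -> 112
    nn.Conv2d(16, 32, 3, stride=1, padding=1), nn.ReLU(), nn.BatchNorm2d(32),
    nn.MaxPool2d(2),                       # 112 -> 56
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 3),            # three classes: 31A1/31A2/31A3
    nn.Softmax(dim=1),
)

x = torch.randn(1, 1, 224, 224)            # one grayscale X-ray image
print(net(x).shape)                        # torch.Size([1, 3])
```

Each row of the output sums to 1, i.e., a probability distribution over the three AO/OTA subtypes.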

2.2. Dilated Convolution

In our DDA module, we employ the dilated convolution to enlarge the receptive field without losing feature map resolution. Moreover, as the receptive field increases, it also provides more multiscale contextual features from the input [23]. Specifically, the dilated convolution can be divided into three steps: (1) sampling the input feature map according to the dilated rate; (2) conducting the convolution operation on the sampled values; (3) merging the obtained values into a new feature map. Here, we denote the kernel size of the convolution layer as k; then the output feature map dimension o of a traditional convolution layer can be calculated as

o = ⌊(i + 2p − k) / s⌋ + 1,

where i is the dimension of the input feature map, and p and s are the padding size and stride, respectively. For the dilated convolution with dilated rate r, the effective kernel size becomes k + (k − 1)(r − 1), so its output dimension can be defined as

o = ⌊(i + 2p − k − (k − 1)(r − 1)) / s⌋ + 1.

Notably, when the stride is set to 1, the receptive field RF_l of the l-th dilated convolution layer (with kernel size k) can be formulated as

RF_l = RF_{l−1} + (k − 1) · r_l,

where r_l is the dilated rate of the l-th layer; through this operation, the receptive field increases rapidly as layers are stacked. Finally, the dilated convolution operation can be given as

y[i] = Σ_j x[i + r · j] · w[j],

where x[·] and y[i] are the input and the output at the i-th position, respectively, and w[j] represents the learnable parameter of the j-th filter weight.
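The output-size formula above can be checked numerically against PyTorch's convolution; the sizes i = 32, k = 3, p = 1, s = 1 are illustrative:

```python
import torch
import torch.nn as nn

# Check the dilated-convolution output-size formula
#   o = floor((i + 2p - k - (k-1)(r-1)) / s) + 1
# against an actual nn.Conv2d for several dilated rates.
i, k, p, s = 32, 3, 1, 1
for r in (1, 2, 3):
    conv = nn.Conv2d(1, 1, kernel_size=k, padding=p, stride=s, dilation=r)
    out = conv(torch.zeros(1, 1, i, i))
    expected = (i + 2 * p - k - (k - 1) * (r - 1)) // s + 1
    assert out.shape[-1] == expected
    print(r, out.shape[-1])  # 1 32 / 2 30 / 3 28
```

Note that with p = r the spatial size is preserved exactly (as in the r = 1 case here), which is why dilated convolutions can enlarge the receptive field without shrinking the feature map.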

2.3. Dense Dilated Attention Layer

As illustrated in Figure 3, the aim of our dense dilated attention (DDA) layer is to learn more contextual and multiscale features across different layers to leverage the classification performance of the DDA network [24]. Note that shallow layers usually contain the position information of the input, while deep layers carry high-level semantic representations; therefore, combining features across different layers can enhance the discrimination capability on fracture regions. Additionally, to guide the network to focus on the most salient regions under different receptive fields, a DA module is integrated into the DDA layer. Specifically, the feature map from the previous layer first passes through the DA module to learn attentive and contextual features, and the obtained features are then concatenated with the previous inputs to form the input of the next layer. Note that the DDA layer mainly contains three DA modules with dilated rates of 1, 2, and 3, respectively, and its detailed structure is shown in Figure 4.

Mathematically, we denote the input of each DA module as X ∈ R^{W×H×C} with dilated rate r, where W, H, and C represent the width, height, and channel number of X. Then, we adopt three convolution layers to transform X into three embeddings Q, K, and V, separately:

Q = W_q(X), K = W_k(X), V = W_v(X),

where W_q, W_k, and W_v denote the corresponding convolution operations, and C′ represents the channel number of those embeddings. After that, Q, K, and V are flattened to the dimension of N × C′, where N = W × H. To gain the contextual relation of X, a matrix multiplication between Q and the transpose of K is applied, which can be given as

S = Q Kᵀ,

where S ∈ R^{N×N} is the similarity matrix. Next, a softmax activation is employed to normalize each row of S to the interval [0, 1], which can be formulated as

A_{ij} = exp(S_{ij}) / Σ_j exp(S_{ij}).

Then, the attentive feature map F is obtained by multiplying the normalized similarity matrix A with the value embedding V, formulated as

F = A V.

Therefore, the final output Y of the DA module is defined as

Y = W_o(F),

where W_o is a convolution operation and F is the attentive feature map. By adopting the hierarchical DA modules with the dense connection, the DDA layer can not only extract multiscale features from different receptive fields but also learn attentive and contextual information from the input.
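The DA module above can be sketched as a small PyTorch module; the reduced embedding channel number C′ and the choice to apply dilation to all three embedding convolutions are illustrative assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DAModule(nn.Module):
    """Sketch of the dilated attention (DA) module described above."""

    def __init__(self, channels: int, reduced: int, rate: int):
        super().__init__()
        # Three convolutions produce the Q, K, V embeddings; dilation with
        # matching padding enlarges the receptive field at constant resolution.
        self.q = nn.Conv2d(channels, reduced, 3, padding=rate, dilation=rate)
        self.k = nn.Conv2d(channels, reduced, 3, padding=rate, dilation=rate)
        self.v = nn.Conv2d(channels, reduced, 3, padding=rate, dilation=rate)
        self.out = nn.Conv2d(reduced, channels, 1)  # W_o in the text

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)  # B x N x C', N = H*W
        k = self.k(x).flatten(2)                  # B x C' x N
        v = self.v(x).flatten(2).transpose(1, 2)  # B x N x C'
        attn = F.softmax(q @ k, dim=-1)           # A = softmax(Q K^T), B x N x N
        f = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)  # F = A V
        return self.out(f)                        # Y = W_o(F)

# In the DDA layer, each module's output would then be concatenated with its
# input (dense connection) before feeding the next DA module.
m = DAModule(channels=8, reduced=4, rate=2)
y = m(torch.randn(2, 8, 16, 16))
print(y.shape)  # torch.Size([2, 8, 16, 16])
```

Because the attention matrix is N × N, this sketch is only practical on the reduced mid-network feature maps where the DDA layer is placed.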

2.4. Training Loss Function

Denote the output vector of the fully connected layer as z = (z_1, …, z_K) and the corresponding label as y, where K = 3 is the number of fracture classes. To gain the predicted score of each class, we apply a softmax activation function, which can be given as

p_k = exp(z_k) / Σ_{j=1}^{K} exp(z_j).

Additionally, to optimize the network, we use the cross-entropy loss, which can be formulated as

L = −(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} y_{i,k} log p_{i,k},

where N is the number of data samples, and p_{i,k} and y_{i,k} are the predicted probability (the softmax output) and the corresponding true label of the k-th class for the i-th sample.
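As a quick sketch, the softmax and the averaged cross-entropy can be computed directly and checked against PyTorch's fused implementation; the logits and labels below are toy values:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 1.5, 0.3]])   # FC outputs for N=2 samples, K=3 classes
labels = torch.tensor([0, 1])              # true classes (e.g., 31A1, 31A2)

probs = F.softmax(logits, dim=1)           # p_k = exp(z_k) / sum_j exp(z_j)
loss = -torch.log(probs[torch.arange(2), labels]).mean()  # L = -(1/N) sum_i log p_{i,y_i}

# F.cross_entropy fuses the softmax and the negative log-likelihood:
assert torch.isclose(loss, F.cross_entropy(logits, labels))
print(float(loss))
```

In practice one passes raw logits to `F.cross_entropy` for numerical stability rather than taking the log of explicit softmax probabilities.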

3. Experiment

In this section, we conduct extensive experiments to validate the effectiveness of the DDA network. Specifically, we first introduce the experimental data, implementation details, and evaluation metrics. Then, we compare the performance obtained with different amounts of training data. Next, an ablation analysis of the DDA layer and the DA module is conducted to validate their effectiveness for this classification task. Finally, we reimplement several other fracture classification methods and compare them with our proposed DDA network.

3.1. Dataset

The experimental dataset contains 390 images in total, consisting of three categories, 31A1, 31A2, and 31A3, with 117, 125, and 128 samples, respectively. The mean age of the patients is 65 years, the maximum age is 91, and the minimum age is 26. All categories in the experimental data were annotated by three traumatic orthopedic specialists with more than 15 years of experience, and the final category of each sample is based on the AO/OTA criterion. Since the initial resolution of the images is large, we crop the region of interest (ROI) with the maximum bounding box and then resize the ROIs to a fixed resolution before inputting them into the DDA network to accelerate the training process.

3.2. Implementation Details

In our experiments, since the initial resolution of the images is large, we first resize the input images to a smaller fixed size. Moreover, to alleviate overfitting, data augmentation is utilized to generate more data samples, including random rotation, flipping, and contrast adjustment. The whole network is implemented with the PyTorch deep learning library and optimized by Adam. The initial learning rate is set to 0.001 and decayed by a factor of 0.1 every 10 epochs. Training is carried out on an NVIDIA GeForce RTX 2070 graphics card with a batch size of 2.
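The optimization setup can be sketched as follows; the tiny linear model and dummy batches are placeholders for the DDA network and the X-ray data, and the multiplicative interpretation of the "decreases by 0.1" learning-rate schedule is an assumption:

```python
import torch
from torch import nn, optim

# Stand-in model; the real network is the DDA architecture of Section 2.1.
model = nn.Sequential(nn.Flatten(), nn.Linear(16, 3))
optimizer = optim.Adam(model.parameters(), lr=0.001)   # initial lr 0.001
# Decay the lr by a factor of 0.1 every 10 epochs (assumed multiplicative decay).
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(20):
    x = torch.randn(2, 16)                 # dummy batch, batch size 2
    y = torch.randint(0, 3, (2,))          # dummy labels for the 3 classes
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()

print(optimizer.param_groups[0]["lr"])     # ~1e-5 after two decays
```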

3.3. Evaluation Metrics

To evaluate the performance of the proposed DDA network, we apply four evaluation metrics. Here, we denote the numbers of true positive, false positive, true negative, and false negative predictions as TP, FP, TN, and FN. Then, the accuracy, which measures the proportion of correct predictions among the total number of samples, can be calculated as

Accuracy = (TP + TN) / (TP + TN + FP + FN).

The sensitivity measures the ratio of correct TP predictions to the whole number of actual positive samples. It can be formulated as

Sensitivity = TP / (TP + FN).

The specificity measures the ratio of correct TN predictions to the whole number of actual negative samples, and it is given as

Specificity = TN / (TN + FP).
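The three formulas can be sketched as small helper functions; the confusion counts below are illustrative, not results from the paper:

```python
# Metrics computed from confusion counts (TP, FP, TN, FN).
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

tp, fp, tn, fn = 40, 5, 45, 10             # illustrative counts
print(accuracy(tp, fp, tn, fn))            # 0.85
print(sensitivity(tp, fn))                 # 0.8
print(specificity(tn, fp))                 # 0.9
```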

The receiver operating characteristic (ROC) curve is the most widely used graphical plot for measuring classifier performance, and the area under the curve (AUC) summarizes it as a single score, where a higher value indicates better discrimination.

3.4. Data Sample Analysis of DDA Network

In this section, we first explore the influence of the amount of training data on the classification performance of the DDA network. Here, we train the network with different fractions of the training data, while the amount of testing data is kept unchanged. The comparison result is shown in Figure 5: the more data samples are available, the better the performance of the network. This suggests that with more data samples, the network can extract image features more sufficiently and effectively.

3.5. The Effectiveness of DDA Layer

The aim of our DDA layer is to learn multiscale, attentive, and contextual information from the input image. Therefore, in this section, we conduct experiments with and without the DDA layer to explore its effect on the final classification performance. Moreover, since layers at different depths contain different discriminative representations, we also test the effectiveness of the DDA layer at different depth locations. As reported in Table 2, DDA-1, DDA-2, and DDA-3 represent the depth at which the DDA layer is located, where a smaller value denotes a shallower location in the network, and DDA-W and DDA-O denote the network with and without the DDA layer, respectively. From the result, we observe that the best performance is achieved by "DDA-W," which can be explained by the DDA layer enabling the network to learn more high-level representations and thus boosting the classification performance.

3.6. Impact of DA Module

Different from the conventional dense connection, we develop a DA module that guides the network to capture more attentive information via self-attention over different receptive fields. To validate the effectiveness of the proposed DA module, we compare five different network settings: without the DA module; with the DA module (dilated rate 1); with the DA module (dilated rate 2); with the DA module (dilated rate 3); and with the full DA module combining all three rates. The comparison result is shown in Table 3; it demonstrates that adopting the designed DA module can efficiently improve the classification performance compared with the setting without it. Moreover, the network gains different performance under different dilated rates; however, the best result is achieved by combining the three dilated rate settings, with an AUC of 0.97 (the corresponding accuracy, sensitivity, and specificity are reported in Table 3).

3.7. Comparison with Other Methods

To further evaluate the performance of the DDA network, in this section we compare it with different classification methods. As shown in Figure 6, we first compare our method with several baseline classification networks: Inception V4 [25], ResNet [26], DenseNet [27], and SKNet [28]. Note that we reimplement those methods with their default parameter settings. From the result, we observe that our proposed network gains the highest AUC score of 0.97, which proves its effectiveness. Furthermore, we also report a comparison with other fracture classification methods; although some of them were not designed for the same classification task, we reimplement them on the same dataset. As shown in Table 4, our method gains the best performance on all the evaluation metrics.

4. Conclusion

In this paper, a DDA network is designed to automatically classify femur trochanteric fractures from X-ray images. Since these fractures vary in location, shape, and contextual information in clinical practice, a novel DDA layer is developed that automatically extracts multiscale, contextual, and attentive features to enhance the feature learning ability of the network and achieve more accurate classification. Extensive experiments on the annotated femur trochanteric fracture data demonstrate that the proposed DDA network gains competitive performance on this classification task. In future work, we will extend our method to other fracture classification tasks and collect more data samples to make the model more robust.

Data Availability

The experimental data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was sponsored by the Science and Technology Talent Cultivation Project of Tianjin Health Commission (KJ20215) and Research Project of Tianjin Sports Bureau (21DY014).