Abstract

Osteoporosis is a significant global health concern that can be difficult to detect early owing to its lack of symptoms. At present, the examination of osteoporosis relies mainly on methods such as dual-energy X-ray absorptiometry and quantitative CT, which are costly in terms of equipment and clinician time. Therefore, a more efficient and economical method for diagnosing osteoporosis is urgently needed. With the development of deep learning, automatic diagnosis models for various diseases have been proposed. However, building these models generally requires images containing only the lesion areas, and annotating those areas is time-consuming. To address this challenge, we propose a joint learning framework for osteoporosis diagnosis that combines localization, segmentation, and classification to enhance diagnostic accuracy. Our method includes a boundary heat map regression branch for refining segmentation and a gated convolution module for adjusting context features in the classification module. We also integrate segmentation and classification features and propose a feature fusion module to adjust the weights of vertebrae at different levels. We trained our model on a self-built dataset and achieved an overall accuracy of 93.3% across the three label categories (normal, osteopenia, and osteoporosis) on the testing dataset. The area under the curve is 0.973 for the normal category, 0.965 for the osteopenia category, and 0.985 for the osteoporosis category. Our method provides a promising alternative for the diagnosis of osteoporosis.

1. Introduction

Osteoporosis (OP) is a disease characterized by impaired bone microstructure and decreased bone mineral density (BMD). With the acceleration of population aging, OP has become an increasingly serious global health problem [1]. Fragility fracture is the most serious complication of OP [2]; OP causes more than 8.9 million fragility fractures each year worldwide [3]. In the US, fragility fractures are more than four times more common than stroke, acute myocardial infarction, and breast cancer [4]. According to a meeting of the World Health Organization, osteoporotic fractures account for longer hospitalization time than these diseases in several developed countries [5]. The number of fragility fractures is expected to increase from 3.5 million in 2010 to 4.5 million by 2025, a 28% increase [6]. Therefore, reliable technology for the early detection and prevention of OP is urgently needed.

Currently, although dual-energy X-ray absorptiometry (DXA) is the gold standard for measuring bone mineral density in the diagnosis of OP, it is not widely used as a screening tool owing to its high cost and the limited availability of equipment [7]. To overcome these limitations, a variety of osteoporosis screening tools have emerged. One of them is quantitative ultrasound (QUS), which has developed into an alternative to DXA for osteoporosis screening. Its benefits include portability and low cost; however, it may not be available in all primary medical settings [8]. In addition, a variety of clinical risk assessment tools have been developed to predict osteoporosis, including the fracture risk assessment tool (FRAX), the QFracture algorithm, the Garvan Fracture Risk Calculator, and the osteoporosis self-assessment tool [9]. Unfortunately, these tools combine known risk factors to estimate a patient's fracture risk and have limited efficiency.

Artificial intelligence and machine learning algorithms have recently been applied to the diagnosis and prediction of osteoporosis [10]. Existing methods have achieved some success on the binary classification problem (osteoporosis versus nonosteoporosis), whose main purpose is to identify whether a patient has osteoporosis [11]. However, these methods have some obvious shortcomings: (1) existing algorithms treat segmentation and classification as two separate tasks, ignoring the information fusion and complementarity between the two tasks; (2) averaging two lumbar cancellous bone mineral density measurements (commonly the first and second lumbar vertebrae) is widely acknowledged as the best diagnostic criterion for osteoporosis in lumbar QCT [12], yet the inputs of current models tend to be CT images of a single vertebral body, disregarding the information fusion and complementarity between images of multiple vertebrae; (3) class imbalance in the collected data is prevalent because standard public datasets are lacking; (4) most methods treat osteoporosis diagnosis as a binary problem, despite the urgent need and strong incentive to turn it into a three-class problem (osteoporosis, osteopenia, and normal). Although three-class classification is more difficult, identifying osteopenia adds predictive value for the prevention and treatment of osteoporosis. In this paper, we address the challenges above to facilitate timely detection of the condition and propose an instance-based and class-based multilevel joint learning framework for bone state classification. The innovation of this method lies in the following steps. Firstly, we locate the vertebral body and remove redundant information from the image. Secondly, on the segmentation branch of the shared encoder, a boundary heat map regression auxiliary branch refines the vertebral edge and improves segmentation performance. In addition, low-level and high-level features from the segmentation branch and the auxiliary branch, including the shape and boundary of the vertebral body, are fused with the feature layers of the diagnostic classifier. Finally, considering that different vertebral bodies affect bone state classification differently, we design a feature fusion module to adaptively learn feature fusion weights. The proposed method is novel because it addresses the high dimensionality, multimodality, and multiclass challenges of osteoporosis diagnosis, which earlier methods have not resolved.
The contributions of the research are as follows:
(i) A joint learning framework is proposed to segment vertebral bodies from CT images and classify bone states (normal, osteopenia, and osteoporosis).
(ii) An instance-based and class-based data sampling balancing strategy is introduced to solve the problem of poor model prediction caused by imbalanced training data.
(iii) A boundary heat map regression branch is proposed, which uses the Gaussian function for "soft labeling," accelerating network convergence and improving vertebral segmentation performance in both joint learning and single-task learning settings.
(iv) The effectiveness of segmentation features in guiding a deep classification network is verified by hierarchically fusing the decoder features of the two segmentation tasks with the classifier features.
(v) A feature fusion module is proposed to adaptively learn the feature weights of lumbar vertebrae 1 and 2 and balance the influence of the two vertebral images on the classification results.

To our knowledge, there are many studies [13–16] on classifying bone status from vertebral images, but few on multitask joint learning for detecting bone status from soft tissue window images at the central level of the lumbar 1 and lumbar 2 vertebrae. Experimental results show that multitask joint learning can improve the accuracy of disease classification.

2. Related Work

In this section, we briefly review related research on bone state classification, organized into three subareas of medical image analysis: vertebral positioning, vertebral CT image segmentation, and vertebral CT image classification.

2.1. Vertebral Positioning

With the development of deep learning, convolutional neural networks are increasingly used for positioning tasks. However, most of these works formulate vertebral recognition as a centroid point detection task. Chen et al. used the advanced features of convolutional neural networks to represent vertebrae from 3D CT volumes and eliminated misplaced centroid detections with a random forest classifier [17]. Dong et al. iterated the centroid probability map of a convolutional neural network using a message-passing scheme based on the relationship between vertebral centroids and used sparse regularization to optimize the localization results, obtaining a pixel-level probability for each vertebral centroid [18]. However, it may be more meaningful to directly identify the labels and bounding boxes of vertebrae rather than centroid probability maps. Zhao et al. proposed a category-consistent self-calibration recognition system to accurately predict the bounding boxes of all vertebrae, improving the discrimination of vertebra categories and the self-awareness of false positive detections [19]. All of these methods identify vertebrae from the coronal plane, whereas our goal is to obtain a small transverse-view image containing only the vertebral body.

2.2. Vertebral Segmentation

Recently, machine learning has been increasingly used in the recognition and segmentation of vertebral bodies. Michael Kelm et al. used iterative variants of marginal space learning to find the bounding boxes of intervertebral discs and utilized Markov random fields and graph cuts to initialize and guide the segmentation of the vertebrae [20]. Zukić et al. employed the AdaBoost-based Viola–Jones object detection framework to find the bounding boxes of the vertebrae and then segmented them by inflating a mesh from the center of each vertebra [21]. Chu et al. applied random forest regression to detect vertebral centers and used these to define target regions for vertebra segmentation by random forest voxel classification [22]. Although these methods can find certain vertebral bodies with specific appearances, they still require empirically set parameters and fail to handle complex pathological cases. Many recent segmentation methods are instead based on deep learning, using convolutional neural networks in place of traditional explicit modeling of spine shape and appearance. For example, Sekuboyina et al. used a multiclass convolutional neural network for pixel labeling, segmenting the lumbar spine on 2D sagittal slices, and estimated the bounding box of the lumbar region with a simple multilayer perceptron to identify regions of interest in the image [23]. Janssens et al. relied on two consecutive networks to realize this task: first, a regression convolutional neural network estimates the bounding box of the lumbar region, and then a classification convolutional neural network performs voxel labeling within the bounding box to segment the vertebral bodies [24]. Mushtaq et al. used ResNet-UNet to semantically segment the lumbar vertebral bodies, achieving 0.97 DSC and 0.86 IOU [25].

2.3. Vertebral Medical Image Classification

In studies establishing osteoporosis models, Yoo et al. built a support vector machine model using age, height, weight, body mass index, hypertension, hyperlipidemia, and other factors to identify osteoporosis in postmenopausal women and found it more accurate than traditional osteoporosis self-assessment tools [26]. Pedrassani de Lira et al. established a J48 decision tree model to identify osteoporosis from multiple indicators such as age, previous fracture, number of previous fractures, and previous spinal fractures [27]. Tafraouti et al. extracted features from X-ray images and used a support vector machine model to distinguish osteoporosis patients from normal subjects [28]. Kilic and Hosgormez studied the identification of osteoporosis based on a random subspace method and a random forest ensemble model. Jang et al. used a deep learning method to identify osteoporosis [29]. In the internal and external test sets, the area under the curve (AUC) for osteoporosis screening was 0.91 (95% confidence interval (CI), 0.90–0.92) and 0.88 (95% CI, 0.85–0.90), respectively. These results illustrate that deep learning models applied to chest radiographs may enable opportunistic automatic screening of osteoporosis patients in the clinical environment [30]. In the latest studies, Xue et al. labeled the L1–L4 vertebral bodies in CT images and divided them into three categories based on bone mineral density (osteoporosis, osteopenia, and normal), achieving a prediction accuracy of 83.4% and a recall of 90.0% [31]. Dzierżak and Omiotek developed a method for diagnosing osteoporosis from spine CT images with deep convolutional neural networks; to address the small sample size, they pretrained their model on a large dataset and successfully classified osteoporosis and normal cases, showing promising results for accurate CT-based diagnosis [32]. In these methods, both traditional machine learning algorithms and current deep learning algorithms use images containing only the region of interest as the data source; the step-by-step preprocessing is tedious, time-consuming, and inefficient. Therefore, integrating positioning, segmentation, and classification into one network should improve efficiency, and no research has shown that explicit or implicit features related to the first 3/4 of the vertebral body can be used effectively and interpretably in deep classification networks.

3. Proposed Methods

3.1. Overview

Our proposed method aims to classify vertebral images within a joint framework to enable a more flexible diagnosis of osteoporotic lesions. To achieve this goal, as shown in Figure 1, we propose an instance-based and class-based end-to-end multitask joint learning framework. It comprises a strategy for handling class imbalance and four deep learning modules: a vertebral positioning module, a vertebral segmentation module, a cascade feature extraction module combined with gated attention, and a feature fusion module. As shown in Figure 2, the framework integrates positioning, segmentation, and classification in a multilayer, multilevel design. Firstly, we accurately locate the target lesion (the vertebral body) and remove redundant image information by reducing the resolution from 512 × 512 to 224 × 224. Secondly, a boundary heat map auxiliary branch refines the edges to improve segmentation performance; meanwhile, segmentation features are cascaded with classification features to improve classification accuracy. Finally, we propose a feature fusion module that adaptively assigns feature weights to fuse the features of lumbar L1 and lumbar L2. In multitask learning, losses of different magnitudes tend to harm the other tasks when the model overfits one task; to balance this, we assign a weight to each loss and let the network update these weight parameters by gradient descent.

3.2. Instance and Class-Based Sampling Methods

In actual clinical scenes, collected imaging data are unbalanced owing to the inherent difficulty of obtaining labels for rare diseases or other unusual cases. When training on extremely unbalanced data, the model is therefore likely to be biased by the differing category counts, underfitting categories that may effectively be ignored. Current methods to address data imbalance include data resampling [33], adaptive loss functions [34], and curriculum learning [35]. Inspired by [36, 37], we introduce a method to address the extreme imbalance of our category images. It combines unbalanced (instance-based) and balanced (class-based) sampling of the data, and we extend it to our practical three-category problem.

We define the training set as $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is a sample and $y_i$ is its category. Assuming a multiclassification problem with $C$ categories, where category $j$ has $n_j$ samples and $N = \sum_{j=1}^{C} n_j$ is the total number of samples, the general sampling strategy can be described as

$$p_j = \frac{n_j^{\,q}}{\sum_{k=1}^{C} n_k^{\,q}},$$

where $p_j$ is the probability of sampling from the $j$th category. If we set $q = 0$, the probability of sampling from each category is equal to $1/C$. This is the class-based sampling method.

If we set $q = 1$, then sampling is proportional to each category's share of all samples, which is instance-based sampling. Here, we introduce a mixed sampling method based on instance and class, which is suitable for imbalanced data. Let $S(D)$ denote sampling from the training dataset $D$ under a given strategy, with instance-based and class-based sampling represented by $S_I(D)$ and $S_C(D)$, respectively. The mixture can then be described as

$$\tilde{x} = (1 - \lambda)\, x_I + \lambda\, x_C, \qquad \tilde{y} = (1 - \lambda)\, y_I + \lambda\, y_C,$$

where $(x_I, y_I) \sim S_I(D)$, $(x_C, y_C) \sim S_C(D)$, and $\lambda \in [0, 1]$. $\tilde{x}$ and $\tilde{y}$ represent random convex combinations of the data and label inputs. As shown in Figure 3, as $\lambda$ grows, examples from minority classes are combined with a greater weight, avoiding overfitting of the minority classes. We set $\lambda$ to induce a more balanced distribution of training samples by creating synthetic data points in spatial regions where the minority classes provide low data density.
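To make the sampling concrete, the following minimal Python sketch draws one mixed sample per call; the function name, the fixed value of λ, and the assumption that labels are indexed 0…C−1 are illustrative rather than taken from our implementation.

```python
import numpy as np

def mixed_sampling_batch(images, labels, lam=0.7, rng=None):
    """One instance/class mixed sample (illustrative sketch).

    images: (N, H, W) array; labels: (N,) int array with classes 0..C-1.
    lam weights the class-balanced draw; larger lam gives minority
    classes a greater weight in the convex combination.
    """
    rng = rng or np.random.default_rng()
    classes = np.unique(labels)
    n = len(labels)

    # Instance-based draw (q = 1): uniform over samples, i.e.,
    # proportional to class frequency.
    i_idx = rng.integers(0, n)

    # Class-based draw (q = 0): pick a class uniformly, then a sample in it.
    c = rng.choice(classes)
    c_idx = rng.choice(np.flatnonzero(labels == c))

    # Random convex combination of the data and one-hot label inputs.
    eye = np.eye(len(classes))
    x = (1 - lam) * images[i_idx] + lam * images[c_idx]
    y = (1 - lam) * eye[labels[i_idx]] + lam * eye[labels[c_idx]]
    return x, y
```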

3.3. Vertebral Positioning Module Based on YOLOv3

The basic step of vertebral CT image classification is to extract robust features from CT images, given that the height $H$ and width $W$ of the original images are 512 pixels. To remove redundant features, we use YOLOv3 [38] to locate the vertebral body, taking the 512 × 512 × 3 image as input. Image features are extracted by DarkNet-53, and target classification and position regression are then performed on the resulting feature maps with the help of the FPN (feature pyramid network) structure.

In this study, we obtain the position of the prediction box $(b_x, b_y, b_w, b_h)$ in the original image. In YOLOv3, a set of anchor boxes is composed of nine initial boxes of different sizes. Assuming that the grid-cell offsets and the width and height of an anchor box are expressed as $(c_x, c_y, p_w, p_h)$, the prediction box can be obtained by reverse calculation from the regression parameters $(t_x, t_y, t_w, t_h)$ output by the network:

$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h},$$

where $\sigma(\cdot)$ represents the sigmoid transformation of the variable, which controls the offset of the center point between 0 and 1.

The main purpose of employing YOLOv3 is to obtain the center coordinates $(b_x, b_y)$ of the prediction box and use this position as the center of a crop, yielding a 224 × 224 image containing the complete vertebral body as input to the subsequent convolution modules. In this way, we discard tens of thousands of useless features and improve the efficiency of the model.
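A minimal sketch of this decode-and-crop step is given below, assuming single-channel pixel arrays; the function name and the clamping behavior at the image border are illustrative, not taken from our implementation.

```python
import numpy as np

def decode_and_crop(t, anchor, cell, stride, image, crop=224):
    """Decode one YOLOv3 box and crop a 224x224 patch around its center.

    t: (tx, ty, tw, th) regression outputs for one anchor.
    anchor: (pw, ph) anchor size in pixels; cell: (cx, cy) grid-cell indices;
    stride: pixels per grid cell at this FPN level.
    """
    tx, ty, tw, th = t
    pw, ph = anchor
    cx, cy = cell
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

    # Center in pixels: (sigma(t) + cell index) scaled by the stride.
    bx = (sigmoid(tx) + cx) * stride
    by = (sigmoid(ty) + cy) * stride
    # Box size b_w = p_w * e^{t_w}, b_h = p_h * e^{t_h} (unused for the crop).
    bw, bh = pw * np.exp(tw), ph * np.exp(th)

    # Fixed-size crop centered on (bx, by), clamped to stay inside the image.
    h, w = image.shape[:2]
    x0 = int(np.clip(bx - crop / 2, 0, w - crop))
    y0 = int(np.clip(by - crop / 2, 0, h - crop))
    return image[y0:y0 + crop, x0:x0 + crop]
```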

3.4. Boundary Regression Auxiliary Branches

We divide the segmentation task into two subtasks: vertebral segmentation and contour determination. Our network is thus composed of a weight-sharing encoder and two decoders, the segmentation branch and the boundary regression branch. In the encoder, we improve the original U-Net [39] by replacing each pair of plain convolutions with residual blocks. In the decoder stage, we cascade the penultimate features of the boundary regression branch with the penultimate features of the segmentation branch, helping the network better perceive and refine the vertebral contour. Since vertebrae in CT images may exhibit hyperostosis or other conditions, it is necessary to reconstruct edges through an auxiliary task, which provides more explicit and implicit topological priors for the encoding layers and assists the segmentation branch in producing more accurate target masks.
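The following PyTorch sketch illustrates the shared-encoder, dual-decoder layout with the penultimate boundary features cascaded into the segmentation branch; the channel widths and the three-level depth are illustrative simplifications of the actual network.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block replacing U-Net's double convolution."""
    def __init__(self, cin, cout):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(),
            nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout))
        self.skip = nn.Conv2d(cin, cout, 1)

    def forward(self, x):
        return torch.relu(self.conv(x) + self.skip(x))

class DualDecoderUNet(nn.Module):
    """Weight-sharing encoder with a segmentation decoder and a boundary
    regression decoder; penultimate boundary features are cascaded into
    the segmentation branch before the final head."""
    def __init__(self, ch=(64, 128, 256)):
        super().__init__()
        self.enc = nn.ModuleList([ResBlock(1, ch[0]),
                                  ResBlock(ch[0], ch[1]),
                                  ResBlock(ch[1], ch[2])])
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)
        self.dec_seg = ResBlock(ch[2] + ch[1], ch[1])
        self.dec_bnd = ResBlock(ch[2] + ch[1], ch[1])
        # The segmentation head sees its own features plus boundary features.
        self.head_seg = nn.Conv2d(ch[1] * 2 + ch[0], 1, 1)
        self.head_bnd = nn.Conv2d(ch[1], 1, 1)

    def forward(self, x):                       # x: (B, 1, H, W)
        e1 = self.enc[0](x)
        e2 = self.enc[1](self.pool(e1))
        e3 = self.enc[2](self.pool(e2))
        u = self.up(e3)
        seg = self.dec_seg(torch.cat([u, e2], 1))
        bnd = self.dec_bnd(torch.cat([u, e2], 1))
        seg_up = self.up(torch.cat([seg, bnd], 1))   # cascade boundary features
        mask = torch.sigmoid(self.head_seg(torch.cat([seg_up, e1], 1)))
        heat = torch.sigmoid(self.head_bnd(self.up(bnd)))  # boundary heat map
        return mask, heat
```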

The problem of boundary inaccuracy is rooted in the similarity of information within the receptive fields of neighboring pixels. When similar features lie entirely inside or outside the segmented region, this similarity is advantageous; conversely, when similar information straddles the segmentation boundary, it undoubtedly increases edge uncertainty. For the boundary regression auxiliary branch of the segmentation module, we propose to delineate the edge from the whole image based on region and graph information, combining spatial proximity and pixel value similarity. In this paper, the accurate boundary for vertebral segmentation should be the inner boundary. We combine the convolutional neural network with a level set, taking the segmentation result produced by the network as prior knowledge for level set segmentation; we then construct a gray level constraint term on the original level set function and improve the edge indicator function to handle intensity inhomogeneity in the image.

3.4.1. Improving the Edge Indicator Function

The well-known Chan–Vese (CV) model, proposed in 2001 [40], uses a region-based segmentation strategy to divide the image into two homogeneous regions, inner and outer, evolving an active contour so that the difference between the segmented image and the original image is minimized through an energy function.

Given the input image $I$, the energy function based on the CV model is shown as follows:

$$E(\phi, c_1, c_2) = \lambda_1 \int_{\Omega_{\mathrm{in}}} |I(x) - c_1|^2\, dx + \lambda_2 \int_{\Omega_{\mathrm{out}}} |I(x) - c_2|^2\, dx + \mu \int_{\Omega} g\, \delta(\phi)\, |\nabla \phi|\, dx + \nu \int_{\Omega} g\, H(\phi)\, dx, \qquad g = \frac{1}{1 + |\nabla (G_\sigma * I)|^2},$$

where $c_1$ and $c_2$ describe the average gray levels of the regions inside and outside the contour, respectively, $\Omega_{\mathrm{in}}$ and $\Omega_{\mathrm{out}}$ represent the inner and outer regions of the contour, $\lambda_1$, $\lambda_2$, $\mu$, and $\nu$ are constants, $g$ is the edge indicator function, which prevents the curve from exceeding the target area, $G_\sigma$ is the Gaussian kernel, $\sigma$ is its standard deviation, and $\delta(\cdot)$ and $H(\cdot)$ represent the Dirac and Heaviside functions, respectively. The contour position $\phi$ and the unknowns $c_1$ and $c_2$ are finally obtained by optimizing formula (4).

The evolution of the CV model is driven by global gray-level information. However, most images, especially medical images, exhibit intensity inhomogeneity. To solve this problem, we improve the edge indicator function and construct a gray-level information constraint term to constrain the evolution direction. Bilateral filtering combines the spatial proximity of images with the similarity of pixel values. Building on Gaussian filtering, it introduces the gray values of pixels into a local weighted average; when smoothing the speckle noise of images, bilateral filtering better preserves edge features.

In the first step, the Gaussian function is used to construct a bilateral filter to obtain a smoothed image:

$$I_f(x) = B(I)(x) = \frac{\sum_{y \in \mathcal{N}(x)} w_s(x, y)\, w_r(x, y)\, I(y)}{\sum_{y \in \mathcal{N}(x)} w_s(x, y)\, w_r(x, y)}, \qquad w_s(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma_s^2}}, \quad w_r(x, y) = e^{-\frac{|I(x) - I(y)|^2}{2\sigma_r^2}}.$$

The image $I$ is filtered with the bilateral filter operator $B(\cdot)$, where $\sigma_s$ and $\sigma_r$ are the standard deviations used to control the smoothness and $w_s$ and $w_r$ are the spatial and range weight coefficients.

In the second step, the optimal threshold $T^{*}$ is calculated from the filtered image $I_f$ using the adaptive threshold principle. The maximized interclass variance is shown in the following equation:

$$T^{*} = \arg\max_{T}\; \omega_0(T)\, \omega_1(T)\, \big[\mu_0(T) - \mu_1(T)\big]^2,$$

where $\omega_0$ represents the ratio of pixels in the target area to the whole image, $\mu_0$ represents the corresponding average gray level, $\omega_1$ is the proportion of background pixels, and $\mu_1$ is the average gray level of the background pixels. Then, the new edge indicator function can be described as

$$g' = \frac{1}{1 + |\nabla I_f|^2},$$

which replaces the Gaussian-smoothed image in the original indicator with the bilaterally filtered image $I_f$.
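A compact OpenCV sketch of this two-step construction follows; the filter parameters are illustrative, and computing the final indicator from the Sobel gradient of the filtered image is one plausible realization.

```python
import cv2
import numpy as np

def improved_edge_indicator(img_u8, d=9, sigma_color=75, sigma_space=75):
    """Sketch of the improved edge indicator: bilateral smoothing, Otsu's
    maximum interclass variance threshold, then g' = 1 / (1 + |grad I_f|^2).

    img_u8: single-channel uint8 CT slice. Parameter values are illustrative.
    """
    # Step 1: bilateral filtering smooths speckle noise while keeping edges.
    smoothed = cv2.bilateralFilter(img_u8, d, sigma_color, sigma_space)

    # Step 2: Otsu's method maximizes the interclass variance
    # w0 * w1 * (mu0 - mu1)^2 over all candidate thresholds.
    t_opt, _ = cv2.threshold(smoothed, 0, 255,
                             cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Edge indicator from the gradient of the filtered image.
    gx = cv2.Sobel(smoothed, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(smoothed, cv2.CV_64F, 0, 1)
    g = 1.0 / (1.0 + gx ** 2 + gy ** 2)
    return g, t_opt
```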

3.4.2. Auxiliary Branch

We use the segmentation result of the convolutional neural network as prior knowledge, namely, as the initial contour of the level set; the contour evolved by the level set is in turn used to guide the neural network to optimize toward the edge of the vertebral body.

The specific expression of the gray level constraint is

$$E_{g}(x) = \big(I(x) - I_{\max}\big)\big(I(x) - I_{\min}\big), \qquad I_{\max} = \mu_v + k\,\sigma_v, \quad I_{\min} = \mu_v - k\,\sigma_v,$$

where $I_{\max}$ is the upper limit of the vertebral gray value obtained from the convolutional neural network model, $I_{\min}$ is the lower limit of the vertebral gray value, $\mu_v$ is the average vertebral gray value, $\sigma_v^2$ is the variance of the vertebral gray value, and $k$ is a constant.

The function of the gray level information constraint term is to make the level set curve evolve inside the vertebral body so as to approximate the inner edge contour. When the gray value of a pixel lies within the upper and lower limits of the initial vertebral gray values, the energy value at that point is negative; otherwise, it is positive. The edge result obtained by the neural network replaces the initial contour in the plane. Gradient descent is used to minimize the energy function, and the final evolution equation after adding the gray constraint term takes the form

$$\frac{\partial \phi}{\partial t} = \delta(\phi)\left[\mu\, \mathrm{div}\!\left(g' \frac{\nabla \phi}{|\nabla \phi|}\right) - \nu g' - \lambda_1 \big(I - c_1\big)^2 + \lambda_2 \big(I - c_2\big)^2 - \alpha E_g\right],$$

where $\alpha$ weights the gray level constraint.

For the labels of the auxiliary branch, we use the Canny operator to detect the edges of the binary label images. Canny is built on a two-dimensional convolution. To improve its calculation speed, the two-dimensional convolution can be decomposed into one-dimensional filters that are each convolved with the image: $G_x = \frac{\partial G_\sigma}{\partial x} * I$ and $G_y = \frac{\partial G_\sigma}{\partial y} * I$. Then, the gradient amplitude and gradient direction can be expressed as

$$M = \sqrt{G_x^2 + G_y^2}, \qquad \theta = \arctan\!\left(\frac{G_y}{G_x}\right).$$

The size of the Gaussian window is adjusted by changing the standard deviation $\sigma$ of the Gaussian function. We first apply nonmaximum suppression and then segment the image with the dual-threshold method: when the gradient of a pixel is greater than the upper threshold, it is considered an edge pixel.

Then, we construct a soft label heat map $\mathrm{Heat}_{\mathrm{sum}}$ from the processed edge images:

$$\mathrm{Heat}_{\mathrm{sum}} = \mathcal{N}\big(G_{\sigma} \odot E\big),$$

where $\odot$ represents the Hadamard product, $E$ is the Canny edge map, and $G_\sigma$ is the Gaussian weighting; note that $\mathrm{Heat}_{\mathrm{sum}}$ is normalized to [0, 1].
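A short sketch of the soft label construction, assuming OpenCV, is given below; the Canny thresholds and the Gaussian kernel size are illustrative choices.

```python
import cv2
import numpy as np

def soft_boundary_heatmap(mask_u8, low=50, high=150, ksize=7, sigma=2.0):
    """Build a Gaussian 'soft label' boundary heat map from a binary mask.

    mask_u8: binary vertebra label (uint8, 0/255). Canny extracts the edge,
    a Gaussian spreads it into a soft band, and the result is normalized
    to [0, 1]. Threshold and kernel values are illustrative.
    """
    edges = cv2.Canny(mask_u8, low, high)              # dual-threshold edges
    heat = cv2.GaussianBlur(edges.astype(np.float32), (ksize, ksize), sigma)
    heat /= max(heat.max(), 1e-8)                      # normalize to [0, 1]
    return heat
```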

Here, the boundary regression branch is utilized to refine the segmented edges. We treat this branch as a regression task trained with the mean squared error, rather than as a second boundary segmentation task coupled with the segmentation branch.

3.5. Cascading Classification Module

In the classification module, we use ResNet-101 as the basic feature extractor. ResNet [41] is a classic deep convolutional neural network in which residual structures are used from the shallow layers onward. The corresponding structure is illustrated in Figure 4(b). By adding the input value x to the output of the unit followed by a ReLU activation, the residual structure achieves better convergence. These steps can be approximated as an identity mapping with equal input and output, which effectively alleviates the decline in learning ability, vanishing gradients, and exploding gradients as the number of convolutional layers increases.

Inspired by gated attention [42] and the residual structure, we designed a gated residual module, shown in Figure 4, to replace the first convolution module of each stage from conv2_x to conv5_x in ResNet-101. The specific network parameters can be found in Figure 5. The gated residual module can be described as follows.

Assuming that $x \in \mathbb{R}^{H \times W \times C}$ is the activation feature of the convolutional neural network, where $H$ and $W$ are the height and width of the feature map and $C$ is the number of channels, the gated attention performs the transformation $\hat{x} = F(x \mid \alpha, \gamma, \beta)$.

Among them, $\alpha$, $\gamma$, and $\beta \in \mathbb{R}^{C}$ are trainable parameters. The embedding weight $\alpha$ is mainly responsible for adjusting the embedding output, while the gating weight $\gamma$ and the gating bias $\beta$ are responsible for adjusting the gating activation.

They determine the behavior of gated attention in each channel.

For the specific process, given the embedding weight $\alpha$, the embedding module can be defined as

$$s_c = \alpha_c\, \|x_c\|_2 = \alpha_c \left[\sum_{i=1}^{H} \sum_{j=1}^{W} \big(x_c^{i,j}\big)^2 + \epsilon\right]^{1/2},$$

where $\epsilon$ is a small constant that mainly avoids derivation at zero. Channel normalization is then performed as

$$\hat{s}_c = \frac{\sqrt{C}\, s_c}{\left[\sum_{k=1}^{C} s_k^2 + \epsilon\right]^{1/2}},$$

where $\epsilon$ again represents a small constant. The scalar $\sqrt{C}$ normalizes the ratio of $s_c$ to the channel norm, preventing $\hat{s}_c$ from becoming too small when $C$ is large. $\alpha_c$ is a trainable parameter that controls the weight of each channel: when $\alpha_c$ is close to 0, the channel does not participate in channel normalization.

Then, supposing the gating weight $\gamma$ and the gating bias $\beta$, the gating function can be depicted as follows:

$$\hat{x}_c = x_c \left[1 + \tanh\big(\gamma_c\, \hat{s}_c + \beta_c\big)\right].$$

Each original channel is adapted by its corresponding gate; $\gamma_c$ and $\beta_c$ are the trainable weight and bias that control the activation of the gate. Finally, $\hat{x}$ is fed into the residual module to obtain the gated-attention feature map after the convolution operation. Supposing the feature map concatenated from the segmentation module is $F_s$, we concatenate the classification network feature $F_c$ with $F_s$ and fuse them by convolution to obtain the final feature map $F$.
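The gated attention computation above can be sketched in PyTorch as follows; the initialization choices (e.g., starting γ and β at zero) are illustrative assumptions, not values taken from our experiments.

```python
import torch
import torch.nn as nn

class GatedChannelAttention(nn.Module):
    """Sketch of the gated attention described above: an l2 embedding scaled
    by alpha, channel normalization with the sqrt(C) factor, and a tanh gate
    controlled by gamma and beta. epsilon avoids division by zero."""
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1))   # embedding weight
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))  # gating weight
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))   # gating bias
        self.eps = eps

    def forward(self, x):                      # x: (B, C, H, W)
        # s_c = alpha_c * ||x_c||_2 over the spatial dimensions.
        s = self.alpha * (x.pow(2).sum(dim=(2, 3), keepdim=True)
                          + self.eps).sqrt()
        # Channel normalization (equivalent to the sqrt(C)-scaled l2 form).
        norm = (s.pow(2).mean(dim=1, keepdim=True) + self.eps).sqrt()
        s_hat = s / norm
        # Gate: x_c * (1 + tanh(gamma_c * s_hat_c + beta_c)).
        return x * (1.0 + torch.tanh(self.gamma * s_hat + self.beta))
```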

Two 1 × 2048-dimensional feature vectors, one per vertebra, can be obtained by flattening the final feature maps.

3.6. Feature Fusion Module

As mentioned above, the detection of bone status is based on the average of lumbar L1 and lumbar L2. To account for the different effects of different lumbar vertebrae on classification, we adaptively learn weights $w_1$ and $w_2$ for the two vertebrae, satisfying $w_1 + w_2 = 1$; $w_1$ and $w_2$ represent the respective fusion weights.

Specifically, we calculate $w_1$ and $w_2$ ($w_1 + w_2 = 1$) from the two vertebral feature vectors $f_1$ and $f_2$ with a two-layer perceptron, that is, two fully connected layers, followed by a softmax layer that eliminates the influence of different feature magnitudes. After obtaining the fused feature $f = w_1 f_1 + w_2 f_2$, the prediction of bone state is given by a fully connected layer and the softmax function.
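A minimal PyTorch sketch of this module is given below; the hidden width and the shared scoring perceptron are illustrative choices rather than the exact configuration of our network.

```python
import torch
import torch.nn as nn

class VertebraFeatureFusion(nn.Module):
    """Adaptive fusion of the L1 and L2 vertebra feature vectors.

    Each 2048-d vector is scored by a shared two-layer perceptron; a softmax
    over the two scores yields fusion weights with w1 + w2 = 1. Layer sizes
    are illustrative."""
    def __init__(self, dim=2048, hidden=256, num_classes=3):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, f1, f2):                 # f1, f2: (B, 2048)
        scores = torch.cat([self.mlp(f1), self.mlp(f2)], dim=1)  # (B, 2)
        w = torch.softmax(scores, dim=1)       # fusion weights, rows sum to 1
        fused = w[:, :1] * f1 + w[:, 1:] * f2  # f = w1*f1 + w2*f2
        return self.classifier(fused)          # bone-state logits
```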

3.7. Multitask Loss Function

To balance the impact of the differently scaled tasks during training, we introduce trade-off parameters $\lambda_1, \ldots, \lambda_6$. The total loss function of multitask learning can be defined as

$$L_{\mathrm{total}} = \lambda_1 L_{\mathrm{box}} + \lambda_2 L_{\mathrm{cat}} + \lambda_3 L_{\mathrm{conf}} + \lambda_4 L_{\mathrm{seg}} + \lambda_5 L_{\mathrm{heat}} + \lambda_6 L_{\mathrm{cls}},$$

where, for a given input image, $L_{\mathrm{box}}$, $L_{\mathrm{cat}}$, and $L_{\mathrm{conf}}$ correspond to the predicted results of the positioning, category, and confidence branches of the positioning module, $L_{\mathrm{seg}}$ to the segmentation branch, $L_{\mathrm{heat}}$ to the boundary heat map regression branch, and $L_{\mathrm{cls}}$ to the classification network. $L_{\mathrm{box}}$ measures the prediction box result, $L_{\mathrm{cat}}$ the category result in the positioning module, $L_{\mathrm{conf}}$ the probability that a vertebral body exists, $L_{\mathrm{heat}}$ the normalized heat map regression result, and $L_{\mathrm{cls}}$ the expected result of the classification network.
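Since the trade-off weights are learned by gradient updates, the following PyTorch sketch shows one common realization, log-variance (uncertainty) weighting; this particular scheme is an assumption for illustration, not necessarily our exact formulation.

```python
import torch
import torch.nn as nn

class AdaptiveMultitaskLoss(nn.Module):
    """Learnable trade-off weights for the six task losses (box, category,
    confidence, segmentation, boundary heat map, diagnosis). A sketch of
    log-variance weighting updated jointly with the model by SGD."""
    def __init__(self, num_tasks=6):
        super().__init__()
        # One log sigma^2 per task; registered as a parameter so the
        # optimizer updates the weights alongside the network.
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):                 # losses: six scalar tensors
        total = 0.0
        for log_var, loss in zip(self.log_vars, losses):
            # exp(-log_var)/2 keeps weights positive; the +log_var/2 term
            # penalizes driving any task weight toward zero.
            total = total + 0.5 * torch.exp(-log_var) * loss + 0.5 * log_var
        return total
```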

4. Experimental Results

4.1. Dataset and Preprocessing

To assess the effectiveness and benefit of the joint learning framework in bone state classification, we conducted experiments on a dataset obtained from the Nantong First People's Hospital from May 2021 to May 2022, consisting of routine-dose CT images of 1048 cases. All images were collected with an Ingenuity Core 128 CT scanner (Philips Health Care, Holland); the tube voltage was 120 kV, inpatient tube current modulation was used, and iDose 4 was used to reconstruct the cross-sectional images of the mediastinal window (standard B reconstruction algorithm). The reconstruction slice thickness and interval were both 2 mm. The soft tissue window images of the lumbar 1 and lumbar 2 central planes of each subject were selected for BMD measurement and deep learning model construction. QCT Pro 4 software (Mindways, CA, USA) was used to place regions of interest (ROIs) of the same size in the central cancellous bone of the lumbar 1 and lumbar 2 vertebral bodies, avoiding cortical bone and visible vascular areas. The software automatically calculated the BMD values of the two vertebral bodies and used their mean as the subject's individual BMD. According to the standard recommended by the "expert consensus on imaging and bone mineral density diagnosis of osteoporosis," an individual BMD > 120 mg/cm3 indicates normal bone mass, 80 mg/cm3 ≤ BMD ≤ 120 mg/cm3 indicates osteopenia, and BMD < 80 mg/cm3 indicates osteoporosis.
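The labeling rule can be stated directly in code; this small sketch simply encodes the consensus thresholds above.

```python
def bone_state(bmd_mg_cm3: float) -> str:
    """Label a subject by the consensus QCT thresholds used in this study."""
    if bmd_mg_cm3 > 120:
        return "normal"        # BMD > 120 mg/cm^3
    if bmd_mg_cm3 >= 80:
        return "osteopenia"    # 80 <= BMD <= 120 mg/cm^3
    return "osteoporosis"      # BMD < 80 mg/cm^3
```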

We divide the dataset into training data (50%), validation data (10%), and test data (40%); the class distributions of the training, validation, and testing datasets are shown in Figure 6. The three datasets share no overlapping images, and the CT images of each category are split across them in strict proportion. All images are then resized to 512 × 512 and normalized from [0, 255] to [0, 1] before being fed into the network.

To increase the amount of training data and improve the generalization ability and robustness of the model, we augment the image data by flipping, rotation, and scaling on top of the instance-based and class-based data balancing strategy.

4.2. Implementation of Framework

We implemented the joint learning framework with Python 3.6.12 and the PyTorch framework on two NVIDIA GeForce 3090Ti GPUs. We apply the SGD optimizer to train the framework for 300 epochs with learning rates from 10e−1 to 10e−5, adding six adaptive parameters to the optimizer to weigh the losses of multitask learning.

4.3. Measurements and Baselines

4.3.1. Measurements

Based on previous work [49–52], accuracy, sensitivity, specificity, and F1-score were used to evaluate classification performance. The accuracy is the ratio of the number of samples correctly classified to the total number of samples. The sensitivity reflects the proportion of positive cases correctly judged among all positive samples. The specificity indicates the proportion of negative cases correctly judged among all negative samples. The F1-score is the harmonic mean of precision and sensitivity. In this paper, the three-category problem is transformed into two-category problems for evaluation; that is, the category under study is treated as the positive class and the other categories as negative.
(i) Accuracy = (TP + TN) / (TP + TN + FP + FN)
(ii) Sensitivity = TP / (TP + FN)
(iii) Specificity = TN / (TN + FP)
(iv) F1-score = 2TP / (2TP + FP + FN)

Here, $\hat{y}$ denotes the model prediction and $y$ denotes the true label. Positive samples predicted as positive are true positives (TP), positive samples predicted as negative are false negatives (FN), negative samples predicted as positive are false positives (FP), and negative samples predicted as negative are true negatives (TN).

Based on previous works [53–55], we use the intersection over union (IOU) and the Dice coefficient (Dice) to evaluate the segmentation task and the average precision (AP) to evaluate the positioning task. With $A$ the predicted mask and $B$ the ground truth mask:
(i) Intersection over union: IOU = |A ∩ B| / |A ∪ B|
(ii) Dice coefficient: Dice = 2|A ∩ B| / (|A| + |B|)
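The following sketch computes the one-vs-rest classification metrics and the segmentation overlap measures; the function names are illustrative.

```python
import numpy as np

def one_vs_rest_metrics(y_true, y_pred, positive):
    """Accuracy, sensitivity, specificity, and F1 for one class treated as
    positive and the remaining classes as negative, as in the paper."""
    t = np.asarray(y_true) == positive
    p = np.asarray(y_pred) == positive
    tp, tn = np.sum(t & p), np.sum(~t & ~p)
    fp, fn = np.sum(~t & p), np.sum(t & ~p)
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return acc, sen, spe, f1

def iou_dice(pred_mask, true_mask):
    """IOU and Dice for binary segmentation masks (boolean arrays)."""
    inter = np.logical_and(pred_mask, true_mask).sum()
    union = np.logical_or(pred_mask, true_mask).sum()
    return inter / union, 2 * inter / (pred_mask.sum() + true_mask.sum())
```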

4.3.2. Baselines

To demonstrate the performance of our joint framework, we compared our work with popular machine learning and deep learning methods, including AlexNet [43], VGG-19 [44], GoogLeNet [45], ResNet [41], DenseNet [46], ShuffleNet [47], and EfficientNet [48].

4.4. Results

We use ten-fold cross-validation to calculate the average results and show the performance of the joint framework in Table 1. We evaluated learning rates from 10e−1 to 10e−5 to assess the classification performance of the joint framework under different settings. Using normal as the positive class and the other categories (osteopenia and osteoporosis) as negative, we achieved an accuracy of 0.971, a sensitivity of 0.964, a specificity of 0.976, and an F1-score of 0.964. With osteopenia as the positive class and the other categories (normal and osteoporosis) as negative, we achieved an accuracy of 0.933, a sensitivity of 0.970, a specificity of 0.836, and an F1-score of 0.954. With osteoporosis as the positive class and the other categories (normal and osteopenia) as negative, we achieved an accuracy of 0.957, a sensitivity of 0.962, a specificity of 0.922, and an F1-score of 0.975. The best performance is obtained at a learning rate of 10e−3, indicating that the classification of bone state CT images can be effectively addressed by tuning the hyperparameters.

In addition, we compare the best results of joint learning with the most advanced baselines. The comparison results are reported in Table 2, where the best comparable performance is shown in bold. As input for the other classification methods, we use CT images (512 × 512) containing only the regions of interest, generated from labels drawn manually by physicians. To compare the classification performance more intuitively, we use the confusion matrix for visual analysis. As shown in Figure 7, joint learning achieves good performance in identifying osteopenia, with only 5 cases misclassified as normal and 2 cases misclassified as osteoporosis, while only 8 normal cases were misclassified as osteopenia; in the task of identifying osteoporosis, only 3 cases were misclassified as osteopenia. These results suggest the absence of overfitting and underfitting and further indicate that the accuracy is not inflated by a bias toward any single category.

The histograms of accuracy and F1-score can be found in Figure 8. Intuitively, the accuracy has increased: compared with the highest accuracy among the advanced baseline methods, joint learning improves accuracy by 6.2% in the osteopenia category, 3.3% in the normal category, and 0.1% in the osteoporosis category. Notably, the overall accuracy of joint learning improved by 3.8% over the best baseline, which again demonstrates the effectiveness of the joint learning strategy.

4.5. Further Discussion

4.5.1. ROC Curve

To better demonstrate the classification ability of our proposed joint learning framework, we use the receiver operating characteristic (ROC) curve and the area under the curve (AUC) as further evaluation indicators. Taking the experimental results at a learning rate of 0.01 as an example, we draw the ROC curves of the three categories in Figure 9; the AUC for each category is also shown. The AUC is 0.965 for the osteopenia state, 0.973 for the normal state, and 0.985 for the osteoporosis state. These values confirm the effectiveness of joint learning in bone CT image classification tasks.

4.5.2. Training Convergence

For model training, we use the accuracy and loss curves to show how accuracy and model cost evolve during training. The accuracy and loss curves of the joint learning framework with a learning rate of 0.01 are shown in Figure 10, which indicates that the model reached satisfactory performance around the 150th epoch and then became stable. These curves show the convergence of the model and support its stability in bone CT image classification. In addition, the total training time of the joint learning framework on our dataset is about 10 hours, with each epoch taking about 2 minutes. In short, the training convergence and time reflect the computational efficiency of our network.

4.5.3. Model Visualization

We further use gradient-weighted class activation mapping (Grad-CAM) to visualize the decision information of the feature extraction module. Figure 11 shows that the feature extraction modules attend to different regions for the different categories (normal, osteopenia, and osteoporosis), and the model automatically focuses on the corresponding regions. Alongside the correctly classified cases, we also list some misclassified cases in Figure 12. The focus areas in the wrong cases differ markedly from those in the correct cases in Figure 11, which may explain the network's decision errors.

Meanwhile, the AP over all testing data in the positioning task is 95%, the average IOU in the segmentation task is 0.972 ± 0.125, and the average Dice is 0.983 ± 0.036, which shows that features are selected efficiently in the positioning and segmentation tasks, although in some cases these features do not benefit classification.

4.5.4. Ablation Experiments

In this section, we conduct an ablation study (learning rate 10e − 3) of our method to demonstrate the effectiveness of the layered fusion (LF) of segmentation and classification features, the gated convolution (GC) module, and the feature fusion (FF) module.

We use the three modules separately and in random combinations and compute the overall accuracy of each experiment to evaluate whether the model improves. The quantitative results are given in Table 3, and Figure 13 clearly shows that the accuracy of the model improves substantially. Without any of the three modules, the accuracy of the model is only 82.1%. With the hierarchical fusion of segmentation and classification features, the overall accuracy rises to 85.6%, an increase of 3.5%. With the gated convolution module, accuracy increases by 2.8%, and with the feature fusion of vertebral bodies at different levels, it increases by 3.3%. Combining any two of the modules increases the overall accuracy by 4%, 8.1%, and 10.5%, respectively. These seven additional experiments prove the feasibility and effectiveness of the proposed modules in improving classification accuracy.

5. Conclusion

Machine learning can help a great deal in accurately identifying osteoporosis from CT images. In this study, we propose a joint learning framework for bone state detection that integrates positioning, segmentation, and classification into an end-to-end multitask framework. The framework processes data from the original input to the final output, increasing the overall fit of the model. Classification accuracy is improved by modular task fusion, global feature association, and the fusion of features from different vertebrae. We evaluated this method on a CT image database containing three categories of vertebrae. Extensive experiments confirm that this method improves the overall accuracy from 82.1% to 93.3%, which shows the effectiveness of joint learning in bone state image classification and contributes to solving the problem of clinical diagnosis of osteoporosis.

Data Availability

The data used to support the study are included within the article.

Ethical Approval

This retrospective study was approved by the Ethics Committee of Nantong First People's Hospital (No. 2021KT028), which waived the need for informed consent. The study protocol followed the Good Clinical Practice guidelines defined by the Declaration of Helsinki and the International Conference on Harmonisation (ICH).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by Nantong Basic Science Research and Social People’s Livelihood Science and Technology Program (MSZ2022009 and MS22021032), Postgraduate Research & Practice Innovation Program of Jiangsu Province (SJCX21_1449), and Joint Project of Industry-University-Research of Jiangsu Province (BY2022224).