Prompt and correct diagnosis of benign and malignant thyroid nodules has always been a core issue in the clinical practice of thyroid nodules. Ultrasound imaging is one of the most common visualizing tools used by radiologists to identify the nature of thyroid nodules. However, visual assessment of nodules is difficult and often affected by inter- and intraobserver variabilities. This paper proposes a novel hybrid approach based on machine learning and information fusion to discriminate the nature of thyroid nodules. Statistical features are extracted from the B-mode ultrasound image while deep features are extracted from the shear-wave elastography image. Classifiers including logistic regression, Naive Bayes, and support vector machine are adopted to train classification models with statistical features and deep features, respectively, for comparison. A voting system with certain criteria is used to combine two classification results to obtain a better performance. Experimental and comparison results demonstrate that the proposed method classifies the thyroid nodules correctly and efficiently.

1. Introduction

Thyroid nodules are exceedingly common, with a reported prevalence on high-resolution ultrasound of more than twenty percent of men and fifty percent of women over 50 years old [1]. Some of these nodules are benign and others are malignant, although pathological analysis shows that most are benign. Studies have reported that the prevalence of thyroid cancer is increasing at a rate of 3% per year [2]. Therefore, doctors must determine the nature of thyroid nodules in order to make correct clinical decisions. Benign nodules can usually be managed by follow-up, whereas malignant nodules may require immediate surgical treatment to achieve a definitive diagnosis. Currently, fine-needle aspiration (FNA) is the most effective and practical test to determine whether a nodule is malignant or may require surgery [3]. However, most nodules are benign, and even malignant nodules, especially those smaller than 1 cm, frequently exhibit indolent or nonaggressive behavior [4]. Therefore, not all detected nodules require FNA or surgery.

Several imaging techniques, including CT, magnetic resonance imaging (MRI), and ultrasound, have been used to discriminate the nature of thyroid nodules in the clinic. Ultrasound is the most commonly used because it is expedient, efficient, inexpensive, noninvasive, and nonradioactive [5]. As shown in Figure 1, ultrasound has two modalities: B-mode ultrasound (B-US) and shear-wave elastography ultrasound (SWE-US). Many studies have shown that B-US can describe the size, shape, location, and texture of nodules so as to distinguish malignant nodules from benign ones [6]. Compared with B-US, SWE-US is a novel ultrasound technique that quantitatively represents the stiffness of tissues by assessing the deformability of the thyroid nodule, which can assist diagnosis [7, 8]. Nevertheless, the perception and understanding of ultrasound images by expert clinicians are usually subjective. Clinicians interpret and diagnose ultrasound images based on their own experience and knowledge, so different clinicians may reach inconsistent diagnoses for the same case. This seriously affects the accuracy of sonography.

Tremendous advances in medical imaging and artificial intelligence technologies have made computer-assisted diagnostics (CAD) increasingly widespread. By applying objective criteria, CAD can help resolve the subjectivity of diagnosis, which traditionally depends on the experience of radiologists. CAD on ultrasound imaging commonly uses statistical features (SFs), also called radiomics data, of medical images, including morphological parameters, intensity statistics, and texture features quantifying heterogeneity [9, 10]. However, the shortcomings of ultrasound images, such as image noise and low contrast, together with variations in the shape, size, and traces of thyroid nodules, prevent these statistical features from working well in classification tasks.

Recently, deep learning models, especially convolutional neural networks (CNNs), have received great attention in image classification and target recognition [11]. CNNs can also extract image features, which can be regarded as deep hierarchical representations of the inputs, so that implicit features within the image can be well captured. These advanced deep features (DFs) are exactly what we need to complement the primary features because CNNs are designed to capture the intrinsic features [12].

In our study, we propose a hybrid approach that combines models trained with traditional features extracted from B-US images and deep features extracted from SWE-US images for the thyroid nodule classification task. Firstly, we employ a CNN model pretrained on ImageNet as a feature extractor to draw deep features from the SWE-US image dataset. To obtain better performance, we compare the classifiers trained with features extracted from each layer of the CNN to find the most discriminative classifier for the classification task. Then, traditional features are extracted from the corresponding B-US image dataset. Different classifiers are trained with SFs and DFs for comparison. A voting system including pessimistic, optimistic, and compromise criteria is designed to combine the predictive results from different classifiers and obtain a better classification performance.

The main contributions of this work are as follows:
(1) We propose a novel hybrid framework combining multimodality features for thyroid nodule classification.
(2) The classifiers trained with features extracted from each layer of the CNN are compared to find the most discriminative classifier for the nodule classification task.
(3) The performance of different decision-making strategies on the classification results is compared and analyzed, and reasonable suggestions are put forward.

The remainder of this paper is organized as follows. Background knowledge is summarized and related literature is reviewed in Section 2. Section 3 describes the framework of the proposed method and introduces the materials and methods used in our model. The experimental results and model parameters are presented in Section 4. Discussions are drawn in Section 5. Finally, the paper concludes with some comments in Section 6.

2.1. Ultrasound Application in Thyroid Nodule Diagnosis

Ultrasound is a combination of acoustics, medicine, optics, and electronics. It covers a wide range of applications, including ultrasound diagnosis, ultrasound therapy, and biomedical ultrasound engineering, and is of great value in the prevention, diagnosis, and treatment of diseases. Ultrasound imaging uses an ultrasound beam to scan the human body and receives and processes the reflected signal to obtain an image of the internal organs. Ultrasound commonly used in medical imaging diagnosis includes B-mode ultrasound (B-US, Figure 1, bottom) and shear-wave elastography ultrasound (SWE-US, Figure 1, top).

B-US can describe the size, shape, location, and texture of nodules so as to distinguish malignant nodules from benign nodules. In the literature [10], texture features of the B-US image are extracted, and the classification model is trained using SVM to divide the ultrasound image into two categories, benign and malignant. On the basis of extracting the texture features, Raghavendra et al. extracted higher order spectral (HOS) entropy features as supplements, and the particle swarm algorithm and SVM are combined in the model training process to improve the accuracy of classification [13]. Statistical features such as texture and contour are fundamental features, which are greatly affected by image quality and noise. These defects limit the development of intelligent diagnostic algorithms based on these statistical features.

Buda et al. have trained a multitask deep convolutional neural network and compared it with a consensus of three ACR TI-RADS committee experts and nine other radiologists, and the results show the performance of deep learning algorithm is similar to the diagnosis of experts [14]. Mei et al. extracted deep features of convolutional autoencoders and fundamental features including local binary patterns (LBP) as well as histogram of oriented gradients (HOG) descriptors in association with medical professional thyroid image characterization from B-US and trained the classifiers using these features to improve negative predictive value of thyroid nodule evaluation [15]. Comparison has been done between radiomics-based and deep learning-based approaches, and the results demonstrate that the deep learning-based method achieves a better performance [16, 17]. Deep learning in conjunction with B-US image characterization could improve nodule characterization and reduce benign biopsies. These advanced semantic features extracted with deep models are exactly a complement to fundamental features, since deep models are not intended to carry out classification task, but to learn to capture the intrinsic characteristics of ultrasound images.

SWE-US is a new technology based on the elasticity or hardness of biological tissue, with the advantages that the measurement results are not affected by the operator and are highly repeatable. Elastography can significantly improve the differential diagnosis of benign and malignant nodules of the thyroid [18, 19]. Zhang et al. built a two-layer deep learning architecture for automated extraction of features from shear-wave elastography and evaluated the architecture in differentiating between benign and malignant breast tumors [20]. Liu et al. extracted features from multimodality images including B-US and SWE-US and trained an SVM model to discriminate thyroid tumors with LN metastasis [21]. Image features from SWE-US, including fundamental features and advanced features, can also be extracted to train the classification model for intelligent diagnosis. Comprehensive use of B-US and SWE-US multimodality images, as in this paper, can effectively improve the accuracy of model results.

2.2. CNNs and Transfer Learning

CNNs are the most commonly used deep learning models, from which high-level characteristics can be extracted. They can be used for tasks such as object detection, classification, and feature extraction. A CNN is a kind of feed-forward neural network, a multilayered perceptron inspired by biological thinking. It has different layers, and the working methods and functions of each layer are also different [12]. CNNs have been shown to have numerous uses in medicine, from segmentation of symptoms to tumor diagnosis [22]. However, their use may be unfeasible in many situations since they require very large training sets (from thousands up to several million images). To overcome this difficulty, the common approach in the literature is to apply transfer learning.

Transfer learning means the ability of a system to recognize and apply knowledge learned in previous tasks to a novel task [23]. The definition of transfer learning is given in terms of domain and task. The domain $\mathcal{D}$ consists of a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, where $X = \{x_1, \ldots, x_n\} \in \mathcal{X}$. Given a specific domain, $\mathcal{D} = \{\mathcal{X}, P(X)\}$, a task consists of two components: a label space $\mathcal{Y}$ and an objective predictive function $f(\cdot)$ (denoted by $\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$), which is learned from the training data consisting of pairs $\{x_i, y_i\}$, where $x_i \in X$ and $y_i \in \mathcal{Y}$. The function $f(\cdot)$ can be used to predict the corresponding label, $f(x)$, of a new instance $x$. It is natural to use transfer learning to apply the knowledge gained while solving the problem of natural image recognition to the different problem of medical image classification.

Consider the following two facts: firstly, the size of the ultrasound image dataset (hundreds or thousands of images) is much smaller than that of the natural image dataset (more than millions); secondly, the two datasets consist of images from completely different domains. That is, the data distributions of these two datasets are inconsistent. There are two typical uses of transfer learning in the field of medical image classification: one is to remove the last fully connected layer on the top of the pretrained deep model and treat the rest of the network as a fixed feature extractor for the current dataset; the other is to fix most of the earlier layers to preserve generic information and retrain from scratch only the last fully connected layer of the pretrained deep model to capture domain-specific features.

3. Material and Methods

3.1. Overview

In this paper, we propose to evaluate the hybrid approach with multiple modalities of ultrasound imaging in discriminating the nature of thyroid nodules. The algorithm deals with the two modalities separately. For deep features, we compare the classifiers trained with features extracted from each layer of the CNN to find the most discriminative one for the task. For statistical features, the process generally contains 3 steps: image preprocessing, feature extraction, and feature selection. Then, different classifiers are adopted to train classification models with statistical features and deep features for comparison. In the end, the two classification models are combined through a voting system that employs three kinds of decision criteria. The hybrid model is observed to obtain a better performance. The overall framework of this research is shown in Figure 2. Details of each process are as follows.

3.2. B-US Processing and Statistical Features

A B-US image is a grayscale image (Figure 1, bottom) that displays the position and shape of the thyroid nodule. Clinically acquired thyroid ultrasound images have low quality, which is mainly reflected in severe speckle noise, blurred nodule edges, discontinuous boundaries, and low contrast. This paper uses a nonlinear filter for noise reduction of the original ultrasound image. The nonlinear filter combines the spatial proximity and pixel value similarity of the image while comprehensively considering the spatial information and gray similarity. It has the following advantages: first, it can keep the output signal sequence unchanged; second, it can reduce the interference of random noise and impulse noise; in addition, it retains the edge information well. Therefore, a nonlinear filter is very suitable for ultrasound image preprocessing, which has been shown in previous research [10]. It can not only eliminate image noise and blur caused by uneven ultrasonic echo but also retain the edge information of nodules.

After the ultrasound images are denoised by the median filter, the regions of interest (ROIs) are manually segmented along the nodule contour on each transverse section using an open-source imaging platform named ITK-SNAP. To reduce variability between segmentations, the segmentation is carried out by a radiologist with more than 5 years of experience within one continuous period. Figures 3(a) and 3(b) show denoised nodule images, whereas Figures 3(c) and 3(d) show the ROIs segmented on sonograms.

A Python radiomics package named "PyRadiomics" is used to automatically extract the statistical features from the nodule region outlined by the radiologists. A total of 104 statistical features are obtained, including first-order statistics features, shape features, gray level co-occurrence matrix- (GLCM-) based features, gray level run length matrix- (GLRLM-) based features, gray level size zone matrix- (GLSZM-) based features, neighboring gray tone difference matrix- (NGTDM-) based features, and gray level dependence matrix- (GLDM-) based features. The dimensions of all kinds of features are shown in Table 1. Refer to the supplementary material for a detailed description of the segmentation and extraction tools and features.

The purpose of feature selection is to select the smallest subset of features from the original dataset by removing the irrelevant and redundant attributes, which may otherwise increase the complexity of classification and even cause the performance of the classifier to decrease [24]. Therefore, features with good discrimination are retained as the result of feature selection. We apply two methods for feature selection of SFs extracted from ultrasound images and compare their results: one is principal component analysis (PCA) and the other is the t-test method.

When using PCA, the number of retained components is gradually increased at an interval of 10 to determine the optimal number, so that the dimension is reduced while the structural information of the data is retained to the greatest extent, as elaborated in Algorithm 1.

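(i) Input: $d$-dimensional dataset $X = \{x_1, x_2, \ldots, x_m\}$ and target dimension $k$
(ii) Output: dataset $Y$ after dimensionality reduction
(iii) Centralize all samples: $x_i \leftarrow x_i - \frac{1}{m}\sum_{j=1}^{m} x_j$
(iv) Calculate the sample covariance matrix $C = \frac{1}{m-1}\sum_{i=1}^{m} x_i x_i^{T}$
(v) Eigenvalue decomposition of the matrix $C$
(vi) Get the eigenvectors $w_1, \ldots, w_k$ corresponding to the $k$ largest eigenvalues
(vii) Normalize all eigenvectors to form an eigenvector matrix $W = (w_1, \ldots, w_k)$
(viii) Convert samples $x_i$ to new samples $y_i = W^{T} x_i$
(ix) Get the output sample set $Y = \{y_1, y_2, \ldots, y_m\}$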
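The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names are hypothetical, and power iteration with deflation is swapped in for a full eigenvalue decomposition so that the sketch needs no external libraries.

```python
import math

def center(samples):
    """Subtract the per-feature mean from every sample."""
    n, d = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in samples]

def covariance(centered):
    """Sample covariance matrix of already centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(matrix, vec):
    return [sum(a * v for a, v in zip(row, vec)) for row in matrix]

def top_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and unit eigenvector via power iteration
    (assumption: stands in for a full eigenvalue decomposition)."""
    d = len(matrix)
    vec = [1.0 / (i + 1) for i in range(d)]      # arbitrary non-zero start
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm < 1e-12:                         # nothing left to extract
            break
        vec = [v / norm for v in nxt]
    value = sum(v * w for v, w in zip(vec, mat_vec(matrix, vec)))
    return value, vec

def pca(samples, n_components):
    """Project samples onto the leading principal components."""
    centered = center(samples)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        value, vec = top_eigenpair(cov)
        components.append(vec)
        # Deflation: remove the component just found from the matrix.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    projected = [[sum(x * w for x, w in zip(row, comp)) for comp in components]
                 for row in centered]
    return projected, components
```

In practice a library routine for symmetric eigenvalue decomposition would be preferable for the 104-dimensional feature vectors; the sketch only makes the order of the steps explicit.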

When using the t-test, a threshold is set on the $p$ value, and features with $p$ values smaller than the threshold are selected and input to the classifier. The feature selection method based on the t-test is as follows:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_{\bar{X}_1 - \bar{X}_2}}, \tag{1}$$

where $X_1$ and $X_2$ represent the malignant and benign samples of the same feature, respectively, $\bar{X}_1$ and $\bar{X}_2$ are the means of the samples, and $S_{\bar{X}_1 - \bar{X}_2}$ is the standard error of the difference.
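As an illustration of the t-test-based selection, the sketch below computes a two-sample t statistic per feature and keeps the features whose statistic is large in magnitude. Several points are assumptions made for the example: the unequal-variance (Welch) form of the standard error is used, the cutoff is applied to $|t|$ instead of to the $p$ value (which would require the t distribution, for example through `scipy.stats.ttest_ind`), and the function names are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(malignant, benign):
    """t statistic: difference of means over the standard error of the difference."""
    se = math.sqrt(variance(malignant) / len(malignant)
                   + variance(benign) / len(benign))
    if se == 0.0:          # constant feature in both groups: no evidence
        return 0.0
    return (mean(malignant) - mean(benign)) / se

def select_features(malignant_rows, benign_rows, t_cutoff):
    """Return indices of features whose |t| reaches the cutoff.

    Assumption: a cutoff on |t| stands in for the p-value threshold.
    """
    n_features = len(malignant_rows[0])
    kept = []
    for j in range(n_features):
        t = welch_t([row[j] for row in malignant_rows],
                    [row[j] for row in benign_rows])
        if abs(t) >= t_cutoff:
            kept.append(j)
    return kept
```

A larger $|t|$ corresponds to a smaller $p$ value for fixed sample sizes, so thresholding either quantity yields the same kind of ranking-based selection.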

3.3. SWE-US Processing and Deep Features

A SWE-US image is considered to be a composite color image (Figure 1, top) superimposed on the corresponding ultrasound grayscale image (Figure 1, bottom). By subtracting the ultrasound image from the synthetic color image, a pure SWE-US image can be obtained. The SWE-US dataset on full-field digital ultrasound images is composed of labeled elastic ultrasound images. A region of interest (ROI) about each nodule has been extracted within each image. In order to extract the ROI from the elastic ultrasound image, the radiologist labeled the center of each nodule, and then a 448 × 448 box was automatically cropped around the nodule center with a pixel size of 0.1 mm. The ROIs were marked as either benign or malignant according to the pathological analysis reports. As a result, ROIs were converted from the image matrix to a pixel vector, which could be directly used as the input of the CNNs.

As mentioned above, the transfer learning approach used in our research for medical image classification is feature extraction, which removes the last fully connected layer of the pretrained deep model because the output of this layer is the class score of multiclass classification tasks like ImageNet. The rest of the network acts as a feature extractor for the given dataset. A variety of pretrained models such as ResNet-50, Inception-V3, and VGGNet-16 have been used for transfer learning [25, 26]. In this work, we employ the VGGNet-16 model trained on ImageNet to extract features, which are used to train a two-class classifier (e.g., SVM). VGGNet-16 is a model proposed by Oxford University in 2014, and the network structure is shown in Figure 4 [27]. It consists of 13 convolutional layers, 5 max pooling layers, 3 fully connected layers, and 1 softmax function. After removing the last fully connected layer and everything after it, the output of each remaining layer can be regarded as the input for training a model.

However, the features output from the second fully connected layer may be the result of various functional combinations. Therefore, in order to obtain the best performance on the nodule classification task, we compare the features extracted from the Pool1–Pool5, FC6, and FC7 layers of the VGG network. Features from each layer are used as sets of input for a classifier after zero-variance removal. It is worth noting that the ROI of SWE-US has three channels, which matches the original architecture of VGGNet-16, designed for color images. In addition, the ROIs are downsampled by half to 224 × 224 to match the input size of the network.
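The ROI preparation described in this section (a 448 × 448 box around the labeled nodule center, then halving to 224 × 224) can be sketched as below for a single channel. The function names are hypothetical, the image is represented as a list of rows, and 2 × 2 block averaging is an assumption, since the interpolation used for downsampling is not stated; a three-channel ROI would be handled channel by channel.

```python
def crop_roi(image, center_row, center_col, size=448):
    """Crop a size x size box centered on the labeled nodule center."""
    half = size // 2
    top, left = center_row - half, center_col - half
    if (top < 0 or left < 0
            or top + size > len(image) or left + size > len(image[0])):
        raise ValueError("ROI box extends beyond the image")
    return [row[left:left + size] for row in image[top:top + size]]

def downsample_by_two(roi):
    """Halve the resolution by averaging each 2 x 2 block (assumed method)."""
    out = []
    for r in range(0, len(roi) - 1, 2):
        out_row = []
        for c in range(0, len(roi[0]) - 1, 2):
            block = (roi[r][c] + roi[r][c + 1]
                     + roi[r + 1][c] + roi[r + 1][c + 1])
            out_row.append(block / 4.0)
        out.append(out_row)
    return out
```

For example, `downsample_by_two(crop_roi(image, row, col))` would turn a sufficiently large single-channel image into a 224 × 224 input.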

The results are shown in Figure 5 to explain how an image responds to a certain convolution layer. It can be seen from the figure that the results derived from the lower layers can better extract shallow features, including edge, direction, and intensity features. However, in the images output from the last few layers, various features will appear mixed together.

3.4. Classifiers

The diagnosis of benign and malignant thyroid nodules in this paper is a typical two-class problem. Many classifiers, including logistic regression (LR), k-nearest neighbor (KNN), random forest (RF), and support vector machine (SVM), have been used to discriminate the nature of nodules based on features extracted from ultrasound or CT images. We adopt the LR, Naive Bayes, and SVM algorithms for comparison in the classification process because they can output probability values, which are used as the input of the voting system for decision fusion. Suppose $x$ is the input vector and $y$ is the label. The general description and mathematical formula of the classification algorithms are as follows.

Logistic regression is a machine learning method used to solve binary classification problems by estimating the probability of a class. The hypothesis function of logistic regression is as follows:

$$h_{\theta}(x) = \frac{1}{1 + e^{-\theta^{T} x}}, \tag{2}$$

where $\theta$ is the weight vector, which is learned from the training data.
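The logistic (sigmoid) hypothesis can be evaluated as in the short sketch below, which maps a weight vector and a feature vector to a probability. The function name is hypothetical and the weights are assumed to have been learned elsewhere; the two branches are only a numerically stable way of computing the same sigmoid.

```python
import math

def malignant_probability(theta, x):
    """Sigmoid of the linear score theta . x, computed in a stable way."""
    z = sum(t * v for t, v in zip(theta, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)                # z < 0, so exp(z) cannot overflow
    return e / (1.0 + e)
```

With a threshold of 0.5 on this probability, the sample is assigned to the positive class exactly when the linear score is non-negative.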

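The Naive Bayes method is a set of supervised learning algorithms based on Bayes' theorem, under the assumption that every pair of features is conditionally independent given the class. The principle is as follows:

$$P(y \mid x_1, \ldots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \ldots, x_n)}, \tag{3}$$

where $x_1, \ldots, x_n$ are the components of the input vector $x$.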

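Training an SVM amounts to solving a convex quadratic programming problem, which can be expressed by the following mathematical formula:

$$\min_{w, b, \xi} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i \left( w^{T} x_i + b \right) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \ldots, N, \tag{4}$$

where $w$ is the weight vector, $b$ is the bias, $\xi_i$ is the slack variable, and $C$ is the penalty parameter. Besides, $x_i$ is the feature and $y_i$ is the label of $x_i$.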

3.5. Voting System

The voting system receives the probability of benign and malignant computed by two classifiers. The combination of the two outputs is responsible for increasing the final accuracy for each modality. The voting system proposed in our research is an adaptation of the uncertainty decision theory [28], where we use a series of combination rules called pessimistic criteria, optimistic criteria, and compromise criteria. It can be represented by the following mathematical expressions.

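For a thyroid nodule, let $P_{DF} = (p^{b}_{DF}, p^{m}_{DF})$ represent the probabilities of benign and malignant computed by the transfer learning method, and let the probabilities given by the model trained with statistical features be $P_{SF} = (p^{b}_{SF}, p^{m}_{SF})$. Define the label of benign and malignant as $L \in \{0, 1\}$, where 0 denotes benign and 1 denotes malignant. Note that $0 \le p^{b}_{DF} \le 1$, $0 \le p^{m}_{DF} \le 1$, and $p^{b}_{DF} + p^{m}_{DF} = 1$. If $p^{b}_{DF} > p^{m}_{DF}$, $L_{DF} = 0$; otherwise, $L_{DF} = 1$. Similarly, $0 \le p^{b}_{SF} \le 1$, $0 \le p^{m}_{SF} \le 1$, and $p^{b}_{SF} + p^{m}_{SF} = 1$. If $p^{b}_{SF} > p^{m}_{SF}$, $L_{SF} = 0$; otherwise, $L_{SF} = 1$.
(1) Define pessimistic criteria $p^{b}_{pes} = \min(p^{b}_{DF}, p^{b}_{SF})$, $p^{m}_{pes} = \max(p^{m}_{DF}, p^{m}_{SF})$. Note that $0 \le p^{b}_{pes} \le 1$, $0 \le p^{m}_{pes} \le 1$, and $p^{b}_{pes} + p^{m}_{pes} = 1$. If $p^{b}_{pes} > p^{m}_{pes}$, $L_{pes} = 0$; otherwise, $L_{pes} = 1$. That is, if the prediction result of at least one classifier is malignant, then the output of the voting system is malignant. Only if the prediction results of both classifiers are benign is the output of the voting system benign.
(2) Define optimistic criteria $p^{b}_{opt} = \max(p^{b}_{DF}, p^{b}_{SF})$, $p^{m}_{opt} = \min(p^{m}_{DF}, p^{m}_{SF})$. Note that $0 \le p^{b}_{opt} \le 1$, $0 \le p^{m}_{opt} \le 1$, and $p^{b}_{opt} + p^{m}_{opt} = 1$. If $p^{b}_{opt} > p^{m}_{opt}$, $L_{opt} = 0$; otherwise, $L_{opt} = 1$. According to the optimistic decision criterion, the voting result is malignant only if both classifiers consider the nodule to be malignant; otherwise, it is considered benign.
(3) Define compromise criteria $p^{b}_{com} = (p^{b}_{DF} + p^{b}_{SF})/2$, $p^{m}_{com} = (p^{m}_{DF} + p^{m}_{SF})/2$. Note that $0 \le p^{b}_{com} \le 1$, $0 \le p^{m}_{com} \le 1$, and $p^{b}_{com} + p^{m}_{com} = 1$. If $p^{b}_{com} > p^{m}_{com}$, $L_{com} = 0$; otherwise, $L_{com} = 1$. The weighted average of the two classifier predictions is considered, where we take the weights of the two classifiers to be equal. After averaging, if the benign probability is greater than the malignant probability, the output of the voting system is benign; otherwise, it is malignant.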
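The three decision rules can be sketched compactly as follows. This is one possible realization under stated assumptions: the function names and the 0/1 label coding (0 for benign, 1 for malignant) are chosen for the example, and taking the maximum or minimum of the two malignant probabilities is used because, with the benign probability equal to one minus the malignant probability, it reproduces the rules "malignant if at least one classifier says malignant" and "malignant only if both say malignant".

```python
def fuse(p_mal_deep, p_mal_stat, criterion):
    """Combine two malignant probabilities into (p_benign, p_malignant)."""
    if criterion == "pessimistic":
        p_mal = max(p_mal_deep, p_mal_stat)    # one malignant vote suffices
    elif criterion == "optimistic":
        p_mal = min(p_mal_deep, p_mal_stat)    # both must vote malignant
    elif criterion == "compromise":
        p_mal = (p_mal_deep + p_mal_stat) / 2  # equal-weight average
    else:
        raise ValueError("unknown criterion: " + criterion)
    return 1.0 - p_mal, p_mal

def vote(p_mal_deep, p_mal_stat, criterion):
    """Return 0 (benign) if the fused benign probability is larger, else 1."""
    p_ben, p_mal = fuse(p_mal_deep, p_mal_stat, criterion)
    return 0 if p_ben > p_mal else 1
```

For instance, when one model gives a malignant probability of 0.7 and the other gives 0.4, the pessimistic rule outputs malignant, the optimistic rule outputs benign, and the compromise rule outputs malignant because the average of 0.55 exceeds one half.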

4. Experiments and Results

4.1. Experimental Setup
4.1.1. Preparation of Dataset

Herein, the experimental data are obtained from the Department of Ultrasound, First Affiliated Hospital of Nanjing Medical University. The study population is composed of 245 patients. Both B-US and SWE-US examinations are performed by experienced radiologists. Images are acquired and stored in the DICOM standard. The gold standard for the type of thyroid nodule is pathological analysis, including excisional biopsy, core needle biopsy, or FNA biopsy. When a patient undergoes multiple biopsies, the gold standard for the final diagnosis is determined according to the following priorities: excisional biopsy, core needle biopsy, and FNA biopsy. There are 490 images in total (B-US and SWE-US each account for half), consisting, for each modality, of 145 images of benign nodules and 100 images of malignant nodules. This retrospective study was approved by the institutional review board, and informed consent was obtained from all patients.

4.1.2. Cross Validation

In order to assess the generalization ability of the model on the dataset reliably, five-fold cross validation is executed during the training and testing process of the model. The original data are evenly divided into 5 groups; each subset of the data is used once as the validation set, and the remaining 4 subsets are used as the training set. Repeating this 5 times yields 5 models. The average of the classification accuracy of the five models is used as the performance index of the classifier. Table 2 lists, for each subset, the number of cases, the number of malignant nodules, and the nodule radius.
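A minimal sketch of the five-fold splitting is given below. The function name, the shuffling, and the fixed seed are assumptions for illustration; whether the original split was stratified by class is not stated, so no stratification is attempted here.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (train_indices, validation_indices) pairs.

    Each sample index appears in exactly one validation fold.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # assumed: random assignment
    folds = [indices[i::k] for i in range(k)]   # k nearly equal groups
    splits = []
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits
```

With 245 cases and `k=5`, every fold holds 49 validation cases and 196 training cases, and the reported accuracy would be the mean over the five validation folds.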

4.1.3. Evaluation Criteria

Herein, quantitative evaluation indexes such as accuracy, sensitivity, and specificity, which are commonly used in medical diagnosis, are adopted to evaluate the classification quality. Accuracy is computed by equation (5) and measures how many of the given thyroid nodules are classified into the right type. Sensitivity is calculated by equation (6) and measures the proportion of malignant nodules that are correctly identified among all malignant nodules. Specificity is calculated by equation (7) and measures the proportion of benign nodules that are correctly identified among all benign nodules.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{5}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \tag{6}$$

$$\text{Specificity} = \frac{TN}{TN + FP}, \tag{7}$$

where $TP$ means true positive while $TN$ means true negative; $TP$ and $TN$ represent the numbers of positive and negative samples that are correctly classified. Conversely, $FP$ is false positive while $FN$ is false negative; they represent the numbers of negative and positive samples that are falsely classified. In this research, positive indicates that the nodule is malignant and negative that it is benign. Sensitivity describes the ability to predict the malignant nodules, while specificity describes the ability to predict the benign nodules.
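The three indexes can be computed from predicted and true labels as in the following sketch, where the function name is hypothetical and label 1 is assumed to denote malignant (positive) and 0 benign (negative).

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)     # fraction of malignant nodules found
    specificity = tn / (tn + fp)     # fraction of benign nodules found
    return accuracy, sensitivity, specificity
```

The sketch assumes that both classes occur in `y_true`; otherwise one of the denominators would be zero.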

4.2. Comparison Results
4.2.1. Comparison Results of CNNs

Features extracted from different layers of CNNs are compared to find the most discriminable one for the nodule classification task in our research. The results of feature extraction and processing are shown in Figure 6. The left column is outputs of images after convolution and pooling. The center column indicates the dimensionality obtained after flattening the output of each layer. The right column represents the length of the feature vector per ROI used as an input for the classifier after zero-variance removal.
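The zero-variance removal applied to the flattened layer outputs can be illustrated as follows. The function name is hypothetical; a feature is treated as having zero variance when it takes the same value for every ROI.

```python
def remove_zero_variance(feature_rows):
    """Drop features that are constant across all samples.

    Returns the reduced rows and the indices of the retained features.
    """
    n_features = len(feature_rows[0])
    kept = [j for j in range(n_features)
            if len({row[j] for row in feature_rows}) > 1]
    reduced = [[row[j] for j in kept] for row in feature_rows]
    return reduced, kept
```

In a cross-validation setting, the retained indices would normally be determined on the training folds and then applied unchanged to the validation fold.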

Considering that the features are high-dimensional vectors and training samples are limited, we choose a linear kernel SVM as the classifier. In total, 7 SVM classification models are trained. Table 3 shows the performance of the 7 classifiers trained on CNN features, which are extracted from the experimental dataset through all pooling layers and fully connected layers of the VGGNet-16 network. Pool1–Pool5 represent deep features extracted from the pooling layers. Similarly, FC6 and FC7 represent deep features extracted from the fully connected layers. The parameters of the SVM are optimized by grid search. A malignancy probability is derived from the SVM, and a threshold of 0.5 is chosen to assign the samples to malignant or benign.

Based on the results in Table 3, the performance increases over the 5 pooling layers until it peaks at the Pool5 layer; after that, it begins to decrease slightly, and it drops sharply after the FC6 layer. Besides, the length of the features has a great influence on the training time. Training on high-dimensional features clearly takes more time, whereas the effect of feature length on the testing time is smaller. One possible reason is that a linear kernel is selected for the SVM: the time consumed by the optimization in the training process is related to the number of samples and the feature dimension, whereas these have a limited impact on the time consumed by the linear operations in the testing process.

As a result, considering the balance between high predictive performance and relatively low dimensionality, we choose the FC6 layer as the optimal layer to extract deep features. The dimensionality of the feature vector of the pooling layer is one or two orders of magnitude higher than that of the fully connected layer, which greatly increases the computational cost.

4.2.2. Comparison Results of Various Techniques

Finally, multiple sets of comparisons with various techniques are made in our work. First, different types of features are compared: the SFs from B-US and the DFs from SWE-US are input to the same classifiers for comparison. Second, different feature selection methods are compared. The results of direct training without feature selection are compared with the results of training after feature selection using the PCA and t-test methods, respectively. Third, different classifiers are compared. SVM is compared with the other two classic classifiers, Naive Bayes and logistic regression. The parameters of the SVM are optimized by grid search. A malignancy probability is derived from each classifier, and a threshold of 0.5 is chosen to assign the samples to malignant or benign. For the Naive Bayes method, it is assumed that each pair of features is independent of each other. In the logistic regression algorithm, we use the sigmoid function. The detailed experimental results are shown in Table 4.

Experimental results show that the best performance of models trained with DFs is better than that of models trained with statistical features. According to Table 4, the best classification result of SFs is from SF-PCA-SVM, whose accuracy is 0.804, while the best classification result of DFs is from FC6-SVM, whose accuracy is 0.812. In addition, the classification accuracy of LR and Naive Bayes is similar, while that of SVM is the highest. Especially for DFs, the performance of SVM is significantly better than LR and Naive Bayes indicating that SVM is better at high-dimensional data classification. Of course, SVM takes the longest training and testing time. Furthermore, experiments show that the results after feature selection are better than those without feature selection. In terms of feature selection, PCA performs better than t-test.

It should be noted that when using PCA or t-test for feature reduction of SFs, the classification results listed in Table 4 were obtained with the optimal numbers of reserved features. We find that 30 principal components are best for LR, 20 principal components are best for Naive Bayes, and 50 principal components are best for SVM. The criterion of is better than for feature selection using t-test.

4.3. Voting Results

As previously mentioned, classification models of the different modalities are trained separately to compare their performance. According to the experimental results, we choose the predictive results from FC6-SVM and SF-PCA-SVM as the inputs of the voting system, which is explained in Section 3.5. The voting system has three decision criteria (pessimistic, optimistic, and compromise), corresponding to three decision results. In this paper, the voting system is used to predict the nature of thyroid nodules (probability of benign or malignant). The performances before and after the voting system are shown in Table 5. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are shown in Figure 7.
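For reference, the AUC can be computed directly from predicted malignancy probabilities without drawing the ROC curve, using its interpretation as the probability that a randomly chosen malignant sample receives a higher score than a randomly chosen benign one (ties counted as one half). The sketch below is a generic illustration with a hypothetical function name, not the evaluation code behind Figure 7.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for malignant (positive), 0 for benign (negative).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute the AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a dataset of a few hundred nodules.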

The hybrid approach based on multiple modalities clearly outperforms the single-modality models FC6-SVM and SF-PCA-SVM. Performance improves substantially after voting, reaching an accuracy of 0.865 (compromise criterion), a sensitivity of 0.89 (pessimistic criterion), a specificity of 0.931 (optimistic criterion), and an AUC of 0.921 (compromise criterion). The accuracies under the pessimistic and optimistic criteria are close to each other and slightly better than those of FC6-SVM and SF-PCA-SVM. Under the compromise criterion, however, the accuracy increases by more than 5 percentage points and the AUC by about 6 percentage points.
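For reference, AUC can be computed directly from predicted scores as the probability that a randomly chosen malignant sample receives a higher score than a randomly chosen benign one, with ties counted as one half. The sketch below implements this pairwise definition; the function name `auc`, the label encoding (1 = malignant, 0 = benign), and the toy scores are assumptions for illustration, not data from the experiments.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: predicted malignancy scores (higher = more likely malignant).
    labels: true labels, 1 = malignant, 0 = benign (assumed encoding).
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy scores: one malignant sample (0.35) is ranked below one benign sample (0.6).
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
print(round(toy_auc, 3))
```

The pairwise form is quadratic in the number of samples, which is acceptable for small test sets; a rank-based formulation gives the same value more efficiently.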

5. Discussion

Differences in visual imaging can be quantified by algorithms such as statistical machine learning and deep learning to train a CAD system that automatically differentiates the nature of thyroid nodules. Many studies have used ultrasound images as datasets to train CAD systems for distinguishing benign from malignant nodules [14–19, 22, 25, 26]. These studies confirmed the usefulness of learning from B-US and SWE-US images for discriminating the nature of thyroid nodules. Building on this, we have presented detailed comparison results for different CAD models based on features from different ultrasound modalities.

The experimental results show that specificity is higher than sensitivity for every model. There may be two reasons for this. First, our dataset contains nearly 1.5 times as many benign samples as malignant samples. Second, as shown in the ROIs of Figure 3, the imaging appearance of malignant nodules is more diverse, whereas benign nodules have more typical characteristics and are easier to identify.

The deep features generated by CNNs can better represent the inherent features of an image, regardless of the field and type of the image. It is therefore feasible to transfer the pretrained VGGNet-16 model to the ultrasound domain, as our experiments confirm. In addition, models trained with features extracted from different CNN layers perform differently. In our experiments, performance improves with layer depth up to the fully connected layers, and features from the fully connected layers perform worse than those from the pooling layer. One possible explanation is that convolutional and pooling layers map the original data to the hidden-layer feature space. Higher layers can capture different kinds of common features, while lower layers compute only low-level features and cannot represent high-level semantics. The main role of the fully connected layers is to map the hidden-layer features to the sample label space; because the target domain and the source domain differ considerably, the fully connected layers perform relatively worse.

To the best of our knowledge, this is the first attempt to train separate models with statistical features from B-US and deep features from SWE-US and to fuse them through a voting system, which improves the accuracy of thyroid nodule diagnosis. Sun et al. [22] extracted deep and statistical features from B-US and fused them to train an SVM model but did not extract features from SWE-US. Liu et al. [21] extracted statistical features from B-US and SWE-US separately and merged them for feature selection and modeling. However, they did not consider deep features, which can better represent image characteristics. In fact, B-US and SWE-US have different characteristics: they help diagnose thyroid nodules from different perspectives and are complementary. B-US clearly characterizes nodule shape, contour, texture, and other features that are consistent with the diagnostic criteria of ACR TI-RADS [29], so we extract statistical features from B-US. SWE-US is a color image that indicates the change in elasticity or hardness between adjacent tissues, and it has high specificity and sensitivity for the differential diagnosis of malignant lesions, so we use pretrained CNNs to extract deep features from SWE-US. Finally, we train two classifier models, one on each kind of feature, and combine them with a voting system to predict the nature of thyroid nodules. The AUC of our model improves by about 6 percentage points, from 0.857 and 0.842 for the single-modality models to 0.921. The experimental results indicate that the proposed multimodality hybrid approach performs better than models trained on a single modality (B-US or SWE-US).

It should be emphasized that sensitivity is more important in clinical practice. Sensitivity is the proportion of malignant nodules that are correctly identified, while specificity is the proportion of benign nodules that are correctly identified. The higher the sensitivity, the lower the rate of missed diagnoses. In our research, a voting system is employed to combine the results of different classifiers and improve prediction accuracy. Although the compromise criterion yields higher accuracy than the pessimistic criterion, the pessimistic criterion yields higher sensitivity, reaching 89%, which is 7 percentage points higher than that of the compromise criterion. We strongly recommend using the pessimistic criterion in clinical practice because high sensitivity reduces missed diagnoses.
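The definitions of sensitivity and specificity above can be written out as a short computation over true and predicted labels. This is a minimal sketch with malignant encoded as 1 and benign as 0; the encoding, the function name `confusion_metrics`, and the toy labels are assumptions for illustration and are not taken from the study data.

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy for binary labels.

    Labels: 1 = malignant (positive), 0 = benign (negative); assumed encoding.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malignant, detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malignant, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign, cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, flagged
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


# Toy example: 4 malignant and 6 benign nodules.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = confusion_metrics(y_true, y_pred)
print(metrics)
```

In the toy example one malignant nodule is missed, so sensitivity is 3/4, which illustrates why a missed malignant case lowers sensitivity without affecting specificity.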

6. Conclusion

A large proportion (about 70%) of FNA results for thyroid nodules turn out to be benign [30]. It is therefore important to predict the nature of thyroid nodules before FNA. In this paper, we have proposed a novel hybrid approach using preoperative multimodality images. Multimodality ultrasound images provide different kinds of information about a nodule. Our hypothesis was that a fusion model based on the two modalities, B-US and SWE-US, may perform better than a model trained on either ultrasound modality alone. The results confirm this hypothesis.

Further research could address the limitations of this study. Deep models for intelligent medical diagnosis require large datasets, especially large multicenter datasets, to mine the hidden predictive information and avoid overfitting. Additionally, a new hybrid approach that combines feature-level fusion with classification-result fusion needs to be developed and applied to further improve the accuracy of the model.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Hongjun Sun and Feihong Yu contributed equally to this study.


This study was financially supported by the Research and Innovation Program for Graduate Education of Jiangsu Province (KYZZ15_0110) and the National Natural Science Foundation of China (NSFC) (71971115 and 71471087).

Supplementary Materials

ROI segmentation: introduces the software and methods used to segment the ROI of the ultrasound images. Statistical feature extraction: contains Python code for the statistical feature extraction described in the article. Statistical feature extraction methodology: describes the statistical features and their calculation methods in detail.