Abstract

This paper focuses on the problem of lung nodule image classification, which plays a key role in lung cancer early diagnosis. In this work, we propose a novel model for lung nodule image feature representation that incorporates both local and global characters. First, lung nodule images are divided into local patches with Superpixel. Then these patches are transformed into fixed-length local feature vectors using unsupervised deep autoencoder (DAE). The visual vocabulary is constructed based on the local features and bag of visual words (BOVW) is used to describe the global feature representation of lung nodule image. Finally, softmax algorithm is employed for lung nodule type classification, which can assemble the whole training process as an end-to-end mode. Comprehensive evaluations are conducted on the widely used public available ELCAP lung image database. Experimental results with regard to different parameter setting, data augmentation, model sparsity, classifier algorithms, and model ensemble validate the effectiveness of our proposed approach.

1. Introduction

Lung cancer is one of the most deadly diseases around the world, with about 20% among all cancers in 2016. The 5-year cure rate is only 18.2% in spite of great progress in recent diagnosis and treatment. It is noted that if the patient can be accurately diagnosed in the early stage and suitable treatment can be implemented, there will be a greater chance for their survival [1]. Therefore, it is of great significance to do research about early diagnosis of lung cancer. Computed Tomography (CT) is currently the most popular method among lung cancer screening technologies [2]. CT can generate high resolution data, which enable small/low-contrast lung nodules effectively detected compared with conventional radiography methods. According to the report of National Lung Screening, low-dose CT scan reduces lung cancer mortality by a rate of 20% [3]. Due to the fact that traditional lung cancer diagnosis only relies on professional experts, two main drawbacks will be caused: subjectivity, different doctors have different diagnostic results for the same CT scan image; huge workload, reading CT images consumes much time and effort. This makes the efficiency inevitably weakened. With the development of computer vision technology, some benefits are brought for medical image process and analysis. Its efficiency and stability provide auxiliary help for doctors with automatically or semiautomatically pattern.

During the last two decades, a number of researchers have been devoted to the development of medical image process and analysis with computer vision and machine learning technologies especially for lung disease diagnosis [4]. Among these studies, lung nodule image classification has attracted much attentions for it is a key step for lung cancer analysis. The lung nodule is characterized by its appearance and relation between surrounding regions. Usually, the lung nodule can be classified into 4 types [5], as shown in Figure 1. To be specific, Figures 1(a)1(d) demonstrate nodule types W, V, J, and P, respectively, whereW is well-circumscribed nodule located centrally in the lung without any connection to other structures;V is vascularized nodule that is also central in the lung but closely attached to the neighbouring vessels;J is juxtapleural nodule that has a large portion connected to the pleura;P is pleural-tail nodule that is near the pleural surface connected by a thin tail.

Lung nodule CT image classification includes two main steps. First, feature extraction and representation use segmentation, filter, and statistical method to describe feature of lung nodule based on shape and texture. Second, classifier design constructs classifier based on supervised or unsupervised machine learning method. However, these methods belong to the fields of traditional image processing and machine learning, which can only characterize the abstraction of lung nodule image in a shallow layer and make the research at low level. As a result, the complex structure of lung nodule makes the classification still a challenging problem. This paper proposes a novel model for lung nodule feature representation and classification. The model considers both local feature and global feature. Lung nodule CT images are first divided into local patches with Superpixel, and each patch is associated with a relatively intact tissue. Then local feature is extracted from each patch with deep autoencoder. Visual vocabulary is constructed with local features. Global representation is constructed by bag of visual word (BOVW) model and classifier is trained using softmax algorithm. The main contributions of our work are as follows: (i) a novel feature representation model for lung nodule image classification is proposed. Local and global features are constructed by unsupervised deep autoencoder and BOVW model, and (ii) comprehensive evaluations are conducted, and performance analyses are reported from multiple aspects.

The structure of this paper is organized as follows. Related works are introduced in Section 2. Section 3 gives the framework. Local feature representation and global feature representation are given in Sections 4 and 5. Section 6 presents the classifier model. Experimental evaluations are shown in Section 7. Section 8 concludes this paper.

Many studies have reported the classification of lung nodule in CT image. Some representative works are introduced in this section. Many researches designed feature based on texture, shape, and intensity of lung nodule image. A feature extraction method based on morphological and shape of lung nodule was designed in [7]. A subclass local constraint based method is proposed in [8]. Spectral clustering and approximate affine matrix were used to construct data subclass and each subclass was used as reference dictionary. The testing image was represented by sparse dictionary. Finally, two metrics based on approximation and distribution degree were merged. In [9], spectrum was sampled around center of lung nodule and feature was constructed by FFT. All features were used to construct the dictionary, and then BOVW mode was used to represent the feature of lung nodule. The Haralick texture feature based on spatial direction distribution was proposed in [10], and SVM was used as classifier finally. Ridge direction information was adopted in [11]. Local random comparison method was used to construct the feature vector, and then random forest was used as classifier. Reference [12] first labeled nodule as solid, part-solid, and nonsolid. Then shape based feature was extracted and kNN was used train the classifier. Reference [13] adopted smoothness and irregularity of lung nodule as feature representation. Texture, shape, statistics, and intensity were extracted as feature representation and ANN was used as classifier in [14]. An eigenvalue of Hessian matrix based feature extraction method is adopted in [15], and AdaBoost was used as classifier. Reference [16] used rotation-invariant second-order Markov-Gibbs random field to model the intensity distribution of lung nodule, and Gibbs energy was used to describe the feature vector. Finally, Bayes classifier was constructed. LDA and 3D lattice were used to construct the mapping between lung nodule image and feature representation in [17]. Reference [18] used topology histogram to represent feature vector of lung nodule, and discriminant and -means were used as classifier. These methods represent the lung nodule image feature in relatively low level, and they lack sophisticated extraction. On the other hand, these methods need heavy participation of professional expert and they have less generality.

Some well-engineered feature extraction and representation methods widely used in computer vision domain were adopted in lung nodule image classification. Reference [22] proposed a method based on texture and context of lung nodule. Lung images are divided into nodule level and context level; then SIFT, LBP, and HOG features were extracted. Reference [19, 23] divided lung nodule as foreground and background with graph model and conditional random field. Then SIFT was used to extract feature and SVM was used as classifier. In [24], SIFT feature was first extracted. Then PCA and LDA were used for dimension reduction. Finally, complex Gabor response was used for representation. In [25], a supervised method was used for initial classification with 128-length SIFT descriptor and weighted Clique was constructed using 4-length probability vector against the 4 nodule types. The overlap that lung nodule belongs to different types was used for optimizing the final classification result. These methods adopt general designed features. They obtain higher performance compared with traditional low-level features, while such methods are considered as mid-level abstraction of lung nodule and with less flexibility.

Several methods were concerned with other aspects. An ensemble based method was applied in [26] for lung nodule classification. Lung nodule image patch was used as input, and six large scale artificial neural networks were trained for classification. Data imbalance problem was discussed in [27]. It used downsampling and SMOTE algorithms to train lung nodule classifier.

Due to its breakthrough in the field of image processing and speech recognition, deep learning has become one of the most hottest topics in machine learning research and application [20, 2830]. High-level abstraction of image object can be described using deep learning model. Meanwhile, feature extraction and representation are more efficient and effective. In [28], curvature, hu-moment, morphology, and shape features were used to detect nodule candidate region. Then convolutional neural network (CNN) was used to extract feature for candidate region and multiple classifiers were merged for final result. Some changes were made in [29, 30]. OverFeat was used for CNN parameter initialization. In [20], a deep feature extraction with one hidden layer autoencoder was adopted, and a binary decision tree was used as classifier for lung cancer detection. This paper proposes a lung nodule image classification method combining both local and global feature representation. Our proposed work is close but has essential difference from the work of [20]. Method in [20] just applied one hidden layer autoencoder to lung nodule image. Our proposed method uses Superpixel to generate intact patches and deep autoencoder to extract local feature. Moreover, method BOVW is incorporated for lung nodule global feature representation and method in [20] has no consideration.

3. Framework

The procedure of proposed lung nodule classification method is shown in Figure 2. It contains training and testing stages. In training stage, lung nodule image samples are used as input and the output is a trained classifier model. Collected training image samples are first divided into local patches with Superpixel. Local patches are assigned with no class label and constitute local patch set. With the local patch set, local features are extracted by unsupervised learning model, deep autoencoder. Next, visual vocabulary is constructed based on clustering all local feature vectors. A lung nodule image can therefore be described by a global feature representation with bag-of-visual-words model. Finally, classifier is trained by supervised learning with nodule type labels. In testing stage, the input is a lung nodule image with unknown type, and the output is its predicted type label. Similar to training stage, a test image is divided into multiple patches. Each local patch is transformed into local feature and assigned with a visual word. Finally, global feature representation of test image is used for classification by the trained model. Details of the proposed method will be introduced in the following sections.

4. Local Feature Representation

Local feature representation is proposed in this section. The process consists of two steps: local patch generation and local feature extraction and representation.

4.1. Local Patch Generation

Decomposing a lung image into small patches is useful and practical and for important tissues can be picked up and unrelated ones can be get rid of. As shown in (1), a lung nodule image can be composed of a group of image patches , where denotes the number of local patches:

The location and scale of local patches are determined through generation [22, 24]. Useless part will be contained for large size patch, while small part may not cover enough intact tissue. Superpixel is a popular method that can partition the image into small similar regions with better representativeness and integrity [31]. So it is adopted in this work.

Figure 3 illustrates the process of the proposed local patch generation method. For a lung nodule image (Figure 3(a)), it is first segmented into local patches using Superpixel and a Superpixel map is obtained (Figure 3(b)). Local patches essentially indicate the uniform regions. Figure 3(c) is an individual patch sample. However, the region that Figure 3(c) gives is an irregular shape, and it is inconvenient for local feature extraction and representation. So we expand local patch with its minimum enclosing rectangle, as shown in Figure 3(d). Finally, a lung nodule image is decomposed into a set of local patches, as shown in Figure 3(e).

Besides, there are some additional criterions to determine whether an image patch is qualified for local feature extraction:(i)Let be a local patch; it is removed when the area of is larger than or smaller than .(ii)Let and be two local patches; if the ratio between their intersection and their union is larger than , then the smaller one is removed., , and are predefined thresholds.

4.2. Local Feature Extraction and Representation

With the rapid development of unsupervised learning in recent years, using unlabeled data to extract feature with autoencoder has become an appropriate way. Autoencoder model is essentially a multilayered neural networks. Its original version is a forward network with one hidden layer. Let be the input data, be the activation of unit in layer , and be the matrix of weights controlling function mapping from layer to layer . If layer has units and layer has units, then will be a matrix with size of . The activation can be formulated as (2), where is the 1st unit in the 2nd layer and are 4 input features:

The main difference between ordinary forward neural network and autoencoder is that an autoencoder’s output is always the same as or similar to its input. The basic formula can be expressed as follows:

An autoencoder can be seen as a combination of encoder and decoder. The encoder includes an input layer and a hidden layer, which converts an input image into feature vector . The decoder includes a hidden layer and an output layer that transform feature to output feature . and are weight matrices of encoder and decoder, respectively. Functions and can be either sigmoid or tanh activation functions, which is used to activate the unit in each layer. When approximates , it is considered that the input feature can be reconstructed from an abstract and compressed output feature vector . The cost function can be generally defined as follows:

A deep autoencoder can be constructed by stacking more hidden layers. As shown in Figure 4, there are 5 layers in the model (including 3 hidden layers). to are encoding layers, and to are decoding layers. is used as the input of the layer , and the weights can be gained based on (3). There are 2 stacked autoencoders. The activation of 1st hidden layer is the input of the 2nd stacked autoencoder. The network can be trained in a fine-tuning stage by minimizing the equation (4). and are trained through the encoding and decoding weights of the 1st stacked autoencoder, and and are trained through the encoding and decoding weights of the 2nd stacked autoencoder. Finally, the whole network can be constructed layer by layer in a stacked way. Moreover, Figure 4 just shows an example of symmetric encoding and decoding structures, and other variational structures can also be adopted.

Therefore, each local patch of a lung nodule image can be represented by a fixed-length feature vector with deep autoencoder model. Then (1) is transformed as follows:

5. Global Feature Representation

For BOVW model, visual vocabulary is first constructed based on clustering all local patch descriptors (local feature representation) generated by a set of training images. Then each lung nodule image can be represented globally by a histogram of visual words. Distance between histograms of visual words can be treated as similarity between lung nodule image samples.

Recall that a lung nodule image is decomposed into a group of local patches and each patch is represented with a feature vector based on deep autoencoder. Assume there are local patches generated from all lung nodule training images and each local patch is represented with -dimensional feature vector; then all local feature vectors can be assembled into a feature space with size of . Clustering is performed with features, and -means clustering method is adopted since it has relatively low time and storage complexity, irrelevant to data process ordering. Each cluster center represents a visual word , and cluster centers constitute the visual vocabulary. A lung nodule image sample can be represented by the encoded local patches as a bag, which is the occurrence frequency of visual word in vocabulary. To get the histogram representation of an image , all local patch feature vectors of are mapped onto the cluster center of the visual vocabulary, and each local feature is assigned with the label of its closest cluster center using Euclidean distance in feature space. Then a -bins histogram is obtained by counting all the label of local patches generated by image , as shown in (6). Figure 5 exhibits the procedure of global representation of lung nodule image.

6. Classifier Model

With global representation of lung nodule image, softmax algorithm is used to train nodule type classifier. Let denote training data set. denotes the lung nodule image sample and denotes nodule type label.

For an input image sample , we want to compute   . The output, a 4-dimensional vector, is estimated to represent the probability of each type that belongs to. The hypothesis function can be expressed as follows:where is model parameter set. This equation normalizes the result and makes the sum to 1. For training procedure, the loss function is given as follows:where is an indicative function, and stochastic gradient descent (SGD) is used for function optimization and the corresponding derivative functions are given as follows:

7. Experimental Evaluations

7.1. Dataset and Program Implementation

In order to evaluate the performance of the proposed lung nodule image representation and classification method, a widely used public available lung nodule image dataset, ELCAP, is used for testing [6]. The dataset contains 379 lung CT images, which are collected from 50 distinct low-dose CT lung scans. The center position of lung nodule is marked in an extra .csv file.

Figure 6 demonstrates the lung nodule CT scan images, which are sampled from different slices. Table 1 shows the format of a .cvs file. Each row denotes a lung nodule. The 4th column indicates the slice number where the lung nodule exists. The 2nd and 3rd columns give the positions that the lung nodule is located in. In this section, lung nodule images are cropped from the raw CT images based on the - and -coordinates of nodule center given in Table 1. The raw lung CT scan image is fixed with pixels, and the cropped nodule images are too small to implement the algorithm. Therefore we further resize the cropped lung nodule image into pixels with bicubic method. The lung nodule images are labeled with one of four types according to the guidance by an expert. Programs are implemented with Matlab 2016a programming language and tested on a Pentium i7 CPU, 8 G RAM, NVIDIA GTX 960 GPU, Windows OS PC.

The experiments include the following aspects: parameter setting; classification rate with different parameters; classification rate with data augmentation; classification rate with model sparsity; classification rate with different classifier algorithms; comparing with other methods; classification rate with model ensemble. The performance of lung nodule image classification is computed with overall classification rate, as shown in the following:where is the number of correctly labeled images and is the number of all testing images. Cross validation mode is adopted. The dataset is divided into 8 groups: 7 randomly chosen groups are used for training and the left group is used for testing. This process is repeated 7 times and the result is computed by averaging 7 independent tests.

7.2. Parameter Setting

The parameters are needed to be set in local patch generation, local feature representation, and global feature representation. For local patch generation, we need to set the number of superpixels that each lung nodule image generates. For local feature representation, the number of hidden layers and nodes that each layer contains should be set. For global feature representation, the size of visual vocabulary should be set.

As shown in Table 2, the number of patches that each lung nodule image generates is set with 15, 20, 25, and 30. The number of hidden layers in deep autoencoder is set with 1, 2, and 3. The number of nodes in deep autoencoder is set with 50, 75, 100, 125, and 150. The size of visual vocabulary is set with 200, 300, 400, and 500. The classification rate is evaluated on the combination of these parameters. For convenience, parameters are expressed with , , , and , respectively.

7.3. Classification Rate with Different Parameters

The size of local patch is set with pixels in our experiment. Table 3 gives the average performance of lung nodule image classification based on combination of parameters , , , and . It can be seen that classification model with , , , and gets the optimal result, reaching 89.5%. We can also see that different parameter settings have great impact on the classification results.

7.4. Classification Rate with Data Augmentation

Overfitting is common in machine learning, and it is influenced by both model complexity and the size of training data. Data augmentation scheme is usually adopted to lessen this problem [32]. In this section, data augmentation is used to enlarge the size of training data. Random rotation, random cropping, and random perturbation (brightness, saturation, hue, and contrast) are used as basic augment techniques.

For original lung nodule image, it is sampled with possibility of 0.5 for data augmentation. The new created examples are set with same labels as original. As shown in Table 4, data augmentation can increase classification rate with 3%. This shows that adding more augmented data for training can improve the compatibility and generalization of the classification model.

7.5. Classification Rate with Model Sparsity

In this subsection, a sparsity constraint is imposed on the hidden layer. Sparsity is a recently proposed technique to improve the generalization of the model [33]. A sparsity regularization term is added to (4), and the new objective functions are given as follows:

The sparsity regularization term is regulated by Kullback-Leibler divergence . is the average activation of th layer of deep autoencoder and is the target activation. with small value can reduce the mean activation of the model. is a trade-off parameter. Table 5 gives the result of classification performance with different (values from 0.1–0.9). It can be seen that set around 0.3–0.4 leads to the superior performance.

7.6. Classification Rate with Different Classifier Algorithms

In this subsection, we evaluate the performances of 4 commonly used classifier algorithms. Softmax (which is used in this paper), SVM, kNN, and decision tree are used. The same feature representation is adopted. Table 6 shows that softmax slightly outperforms SVM, kNN, and decision tree. The results demonstrate that, compared with classifier algorithm, the feature representation is the key problem. Meanwhile, it is easy to combine the softmax algorithm and the proposed feature representation method into an end-to-end structure, which can make model training more convenient.

7.7. Comparing with Other Methods

In order to evaluate the classification rate of different methods, 5 related algorithms are used for testing. Reference [19] studies the same problem as ours. Reference [20] adopts the primitive autoencoder method. References [7, 21] use non-deep-learning methods for classification. Reference [9] employs the BOVW model. The compared methods are reimplemented and are tested with diverse parameters. Table 7 gives the testing result. Among all testing methods, the proposed one demonstrates the best performance. Comparing with non-deep-learning method, our method can construct better feature representation, while, comparing with primitive autoencoder method, the Superpixel and DAE used in our method can catch more detailed information.

7.8. Classification Rate with Model Ensemble

Model ensemble can improve the classification performance by aggregating multiple individual classifiers [34]. We evaluate model ensemble based on Majority Rule in this subsection. In Majority Rule, the class label is assigned with the one that most classifier votes. The function to evaluate the class label for image is given as follows:where is a testing image, is a class label, denotes number of selected models, and means th classifier. , if classifies as . The label with maximal value of is determined as the final result. If multiple labels have the same votes, the arithmetic average of class probabilities predicted by individual model is used as classification result.

With different parameters combination, models with top performances are retained for ensemble. Table 8 gives the testing result. The 1st row denotes the single model. The 2nd to 4th rows denote model ensemble with 5, 6, and 7 individual models, respectively. The result demonstrates that model ensemble can complement individual ones and the performance is improved with about 1.5%.

8. Conclusion and Future Works

In this paper, a novel feature representation method is proposed for lung nodule image classification. Superpixel is first used to divide lung nodule image into local patches. Then local feature is extracted and represented from local patches with deep autoencoder. Bag-of-visual-words model is used as global feature representation with visual vocabulary constructed by local feature representation. Finally, an end-to-end training is implemented with a softmax classifier. The proposed method is evaluated from many aspects, including parameter setting, data augmentation, model sparsity, comparison among different algorithms, and model ensemble. We draw a conclusion that the proposed method achieves superior performance. The merits of our method are the combination of local and global feature representation, and better model generalization can be gained by incorporating unsupervised deep learning model.

Our future works will focus on two aspects: study new classification framework and method according to up-to-date convolutional neural network and analysis of our method in large data set for making further improvement and optimization.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by Liaoning Doctoral Research Foundation of China (no. 20170520238), National Natural Science Foundation of China (no. 61772125 and no. 61402097). The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of GPU used for this research.