Abstract

A quarter of all cancer deaths are due to lung cancer. Studies show that early diagnosis and treatment of this disease are the most effective ways to increase patient life expectancy. In this paper, an automatic and optimized computer-aided detection method is proposed for lung cancer. The method first applies a preprocessing step to normalize and denoise the input images. Afterward, Kapur entropy maximization is performed along with mathematical morphology for lung area segmentation. Then, 19 GLCM features are extracted from the segmented images for the final evaluations. The higher-priority features are then selected to decrease the system complexity. The feature selection is based on a new optimization design, called Improved Thermal Exchange Optimization (ITEO), which is designed to improve accuracy and convergence. The images are finally classified into healthy or cancerous cases by an artificial neural network optimized by ITEO. Simulations are compared with some well-known approaches, and the results show the superiority of the suggested method, which achieves the highest accuracy (92.27%) among the compared methods.

1. Introduction

The proliferation of lung diseases in today's industrialized societies has redoubled the need for modern methods of accurate and early diagnosis. Among lung diseases, lung cancer is still recognized as one of the most dangerous cancers. Cancer means the abnormal growth, and sometimes proliferation, of cells in the body. All cancers have an uncontrolled growth pattern and a tendency to detach from the source and metastasize [1]. A normal lung cell may become a lung cancer cell for no apparent reason, but in most cases, the transformation is the result of repeated exposure to carcinogens such as alcohol and tobacco. The appearance and function of cancer cells differ from those of normal cells. A mutation or change occurs in the DNA or genetic material of the cell [2]. DNA is responsible for controlling the appearance and function of cells. When a cell's DNA changes, that cell differentiates from the healthy cells next to it and no longer behaves like the body's normal cells. This altered cell separates from its neighboring cells and no longer knows when to stop growing and die [3]. In other words, the altered cell does not follow the internal commands and signals that keep other cells under control, and it acts arbitrarily instead of coordinating with them [4]. About a quarter of all cancer deaths are due to lung cancer. Even in the best conditions, about 80% of patients have at most five years left after being diagnosed with this type of cancer [5]. Based on a survey by the American Cancer Society (ACS), lung cancer is the second most prevalent cancer in the United States in both men and women [6]. Based on ACS figures, approximately 228,820 new cases and approximately 135,720 lung cancer deaths will occur [5]. Figure 1 shows the statistics for the leading cancers and their death counts in 2019 based on American Cancer Society lung cancer screening data [7].

Air pollution due to the industrialization of cities, tobacco use, and genetic factors are the main causes of this disease [8]. Early diagnosis of lung disease has a major impact on the possibility of definitive treatment. Major diagnostic methods for lung cancer include radiographic imaging and computed tomography (CT), biopsy, bronchoscopy, and examination of cells in the sputum. Among these, CT imaging is widely used as a superior diagnostic method. In this diagnostic method, the doctor examines the images for possible nodules. A pulmonary nodule is a small, round, opaque mass that forms inside the lung tissue [9, 10]. In other words, nodules are spherical radiographic opacities less than three centimeters in diameter.

Formerly, lung diseases were diagnosed by experts' visual inspection, without the aid of computer science. Recently, however, imaging techniques based on computer science and artificial intelligence have made diagnosis more precise. In most of these methods, after images are captured from the patient, various image processing methods are applied for tumor diagnosis [11].

2. Image Preprocessing

The images analyzed for validation in this study are collected from the Lung CT-Diagnosis database, which has been provided by the Cancer Imaging Archive [12]. This database is part of a collection of publicly available medical images for different cancers [13] and contains contrast-enhanced CT scan images stored in the DICOM format. The database was collected from 61 patients, from whom 4682 different images were acquired.

After image acquisition, min-max normalization is applied to scale the acquired images between 0 and 1; the images are also resized to 250 × 250 pixels for this purpose. Considering a grayscale image $I$ whose intensities are bounded by $I_{\min} \le I(x, y) \le I_{\max}$, the normalized image $I_N$ can be achieved by the following formulation:

$$I_N(x, y) = \frac{\left(I(x, y) - I_{\min}\right)\left(\mathit{new}_{\max} - \mathit{new}_{\min}\right)}{I_{\max} - I_{\min}} + \mathit{new}_{\min},$$

where $I_{\min}$ and $I_{\max}$ describe the extreme intensity values in the grayscale image, and $\mathit{new}_{\min}$ and $\mathit{new}_{\max}$ describe the target intensity limits for the normalized image (here, 0 and 1) [14].

Since all CT scan images contain some visual noise, a denoising filter is needed to resolve this problem. A CT scan image is affected by different factors and can include different noise types, such as Gaussian noise, shot noise, Poisson noise, speckle noise, and salt-and-pepper noise [15]. Noise hides fine details of the CT scan image, so a noise-removal tool is required before processing the CT scan images.

One of the popular noise reduction techniques for CT scan images is mean filtering. The mean (average) filter replaces each pixel with the average of its neighborhood [16]: it sums all the pixels in the filter window, divides by the total number of pixels [4], and then replaces the value of the center pixel with the calculated average. For an $m \times n$ window $W_{xy}$ centered at pixel $(x, y)$, the resulting value for each indexed pixel is determined as follows:

$$\hat{I}(x, y) = \frac{1}{mn} \sum_{(s, t) \in W_{xy}} I(s, t).$$

3. Image Segmentation

One of the most important issues in image processing is the identification and separation of an image into its main components. Image segmentation determines the success or eventual failure of image analysis methods, and owing to its wide range of applications, it remains an active research field. In fields such as medicine, segmentation accuracy is critical for preserving and protecting human life. Thresholding is one of the most convenient methods for image segmentation [17]. By applying a thresholding method to a grayscale image, a binary image is obtained that delineates the boundaries of the objects in the image with appropriate accuracy.

The lower the threshold, the more errors are detected and the more sensitive the results are to noise and unrelated image features [18]. On the other hand, a high threshold may miss weak errors or parts of errors. Since the main purpose of a fault detection system is to find all possible faults, the threshold must be chosen to achieve this goal [19]. There are several methods for this practice in image processing science, and some separation methods are designed for specific images [20]. One way to select the appropriate threshold is the trial-and-error method, in which different threshold values are tried and the image resulting from each is judged by the viewer. The simplest type of separation is called global separation, which is based on the image histogram. The input of this function is a gray image or a color image [21], and its output is a black-and-white (binary) image. In the local variant, the threshold value at any point in the image is defined based on the local properties of the image in the neighborhood of that pixel. In this paper, Kapur thresholding has been used. Assume that an image with $N$ pixels has $L$ gray levels in the range $\{0, 1, \ldots, L-1\}$, with normalized histogram $p_i = h_i / N$, where $h_i$ is the number of pixels at gray level $i$. The Kapur method obtains the threshold based on Kapur entropy maximization over the information in the gray-level histogram, which is defined as follows in the case of two-level segmentation (thresholding):

$$f(t) = H_0(t) + H_1(t),$$

such that

$$H_0(t) = -\sum_{i=0}^{t-1} \frac{p_i}{\omega_0} \ln \frac{p_i}{\omega_0}, \quad H_1(t) = -\sum_{i=t}^{L-1} \frac{p_i}{\omega_1} \ln \frac{p_i}{\omega_1}, \quad \omega_0 = \sum_{i=0}^{t-1} p_i, \quad \omega_1 = \sum_{i=t}^{L-1} p_i.$$

The optimal threshold is the gray level that maximizes this function. For a correct diagnosis of tumors, the cancerous region should be detected with high precision. As mentioned before, image thresholding has been used for identifying the cancerous region. However, mathematical morphology is needed to better filter the areas detected by thresholding [22]. Mathematical morphology is a technique for processing and analyzing signals and images whose basic idea is to analyze geometric information by probing an image with a small geometric pattern called a structuring element. This study uses three popular techniques: opening, closing, and hole filling. The first operator, region filling, is based on complementation, intersection, and dilation operators and is achieved by the following equation:

$$X_k = (X_{k-1} \oplus B) \cap A^c, \quad k = 1, 2, 3, \ldots,$$

where $A$ indicates the set of boundary pixels, $A^c$ is its complement, and $B$ defines the structuring element. This operator terminates if $X_k = X_{k-1}$.

The mathematical opening is the second utilized operator. The opening of set $A$ by structuring element $B$ is obtained by the erosion of $A$ by $B$, followed by a dilation of the resulting image by $B$, that is,

$$A \circ B = (A \ominus B) \oplus B.$$

The key goal of the mathematical opening is to remove minor blemishes in the area that could otherwise interfere with the diagnosis of lung cancer.

Finally, the mathematical closing operator is used to smooth the contours, fuse narrow breaks, eliminate small holes, and fill gaps in the contour and long thin gulfs. This operation is formulated as follows:

$$A \bullet B = (A \oplus B) \ominus B.$$

4. Features Extraction

The purpose of feature extraction is to make raw data more usable for subsequent statistical processing. Feature extraction is a very common step in different types of data processing, such as image processing and audio processing. It means selecting features that can describe the image with little information. These features must be such that their set uniquely describes each image: if the feature set is identical for two samples, no classifier will be able to distinguish the two samples in the classification stage.

The main reasons for extracting features from images are image simplification, reduced processing time and memory, and increased accuracy and efficiency. Feature extraction is thus a process in which data in a high-dimensional space is mapped to a lower-dimensional space. This mapping can be linear (such as principal component analysis) or nonlinear. Selecting these features requires examining the data properties, and extracting them requires applying preprocessing operations and various filters to turn the image into the desired information. In this study, GLCM features have been used for extracting the information of the lung cancer images.

The gray-level co-occurrence matrix (GLCM) method is one of the most efficient techniques for extracting texture from medical images. The GLCM is a square matrix of dimensions $L \times L$, where $L$ is the number of gray levels in the image. Each element of this matrix represents the number of pixel pairs that have the corresponding pair of gray levels and are separated by a certain distance in a certain direction. After calculating the matrix, different parameters of the image texture can be extracted from it. In this study, this technique was used to extract texture from lung tumor images. The utilized features are explained in the following.

4.1. Contrast

The contrast measures the intensity difference between a pixel and its neighbor over the whole image. This feature is achieved by the following equation:

$$\text{Contrast} = \sum_{i=0}^{L-1} \sum_{j=0}^{L-1} (i - j)^2 \, p(i, j),$$

where $p(i, j)$ is the normalized GLCM entry.

4.2. Correlation

The correlation feature describes the spatial dependency among the pixels. This feature can be mathematically given as follows:

$$\text{Correlation} = \sum_{i=0}^{L-1} \sum_{j=0}^{L-1} \frac{(i - \mu_i)(j - \mu_j) \, p(i, j)}{\sigma_i \sigma_j},$$

where $\mu_i, \mu_j$ and $\sigma_i, \sigma_j$ are the means and standard deviations of the row and column marginals of $p$.

4.3. Homogeneity

Homogeneity is a local uniformity feature that distinguishes textured from nontextured regions over single or multiple intervals. This feature is achieved by the following:

$$\text{Homogeneity} = \sum_{i=0}^{L-1} \sum_{j=0}^{L-1} \frac{p(i, j)}{1 + |i - j|}.$$

4.4. Energy

The energy feature measures the number of repeated pixel pairs. This feature is mathematically obtained by the following equation:

$$\text{Energy} = \sum_{i=0}^{L-1} \sum_{j=0}^{L-1} p(i, j)^2.$$

4.5. Entropy

The entropy is a feature that indicates the randomness (disorder) of the image texture, based on the following equation:

$$\text{Entropy} = -\sum_{i=0}^{L-1} \sum_{j=0}^{L-1} p(i, j) \ln p(i, j).$$

5. Features Selection

Some of the extracted characteristics of the images are important and crucial for classification. Meanwhile, features that contain little notable information on their own can become highly informative when combined with other features, while others may carry no useful data at all. This shortcoming has been addressed in different works. In this paper, an optimization-based methodology is proposed for this purpose. The optimization objective for the feature selection can be achieved by the following equation:

$$\text{Fitness} = \frac{TP + TN}{TP + TN + FN + FP},$$

where $TP$ signifies the true positives, $TN$ describes the true negatives, $FN$ represents the false negatives, and $FP$ defines the false positives.

6. Improved Thermal Exchange Optimization Algorithm

In Thermal Exchange Optimization (TEO), the temperature of each object indicates an individual's position; the objects are grouped into pairs, and heat starts to be exchanged between them. The new temperatures therefore indicate the updated positions [23].

6.1. Newton’s Law of Cooling

In the seventeenth century, the English scientist Isaac Newton studied the cooling of objects. His experiments showed that the cooling rate was approximately proportional to the temperature difference between the heated object and the environment. This fact is written as a differential relation:

$$\frac{dQ}{dt} = hA\,(T_b - T_a),$$

where $Q$ describes the heat, $A$ signifies the area of the body surface that transmits heat, $T_b$ defines the body temperature, $T_a$ represents the ambient temperature, and $h$ determines the heat transfer coefficient, which depends on the geometry of the object, the surface state, the heat transfer mode, and other factors.

The heat lost in the time interval $dt$ is $hA\,(T_b - T_a)\,dt$, which equals the change in stored heat as the temperature falls by $dT_b$, i.e.,

$$hA\,(T_b - T_a)\,dt = -\rho c V\, dT_b,$$

where $V$ defines the volume ($\mathrm{m}^3$), $\rho$ describes the density ($\mathrm{kg/m^3}$), and $c$ signifies the specific heat ($\mathrm{J/(kg\,K)}$).

Therefore,

$$\ln \frac{T_b - T_a}{T_{b0} - T_a} = -\frac{hA}{\rho c V}\, t,$$

where $T_{b0}$ is the starting high temperature.

The above relation is valid when $h$ is not a function of $T_b$, i.e.,

$$b = \frac{hA}{\rho c V},$$

where $b$ is constant. Therefore,

$$\frac{T_b - T_a}{T_{b0} - T_a} = e^{-bt}.$$

And finally, the equation can be rewritten as follows:

$$T_b = T_a + (T_{b0} - T_a)\, e^{-bt}.$$

6.2. Inspiration

In the TEO algorithm, some individuals are considered cooling objects and the remaining individuals are considered the environment; the roles are then reversed. The simulation steps of the TEO algorithm are given in the following.

The first step is initialization. The initial temperatures of all objects are defined in an m-dimensional solution space as follows:

$$T_i^0 = T_{\min} + \mathrm{rand} \cdot (T_{\max} - T_{\min}), \quad i = 1, 2, \ldots, n,$$

where $T_i^0$ signifies the initial solution vector of object number $i$, $\mathrm{rand}$ describes a random vector with components in the range $[0, 1]$, and $T_{\min}$ and $T_{\max}$ describe the minimum and maximum limits of the decision variables.

After initializing the objects, the value of the objective function is evaluated for all individuals. During the process, some of the historically best vectors are stored in a memory called the Thermal Memory (TM) so that their positions can be used to improve the algorithm's efficiency with no extra computational cost. The best values kept in the TM are added to the population, and the same number of worst individuals is eliminated. The individuals are then divided into two equal groups, as shown in Figure 1: for example, $T_{n/2+1}$ defines an environment object for cooling object $T_1$, and vice versa.

During the process, if an object has a lower $\beta$, its temperature is exchanged more slowly. The value of $\beta$ for each object is established based on the following equation:

$$\beta = \frac{\mathrm{cost(object)}}{\mathrm{cost(worst\ object)}}.$$

Time is another important parameter in this algorithm. It is related to the iteration number and can be formulated by the following equation:

$$t = \frac{\text{iteration}}{\text{maximum iteration}}.$$

Generally, an important ability of metaheuristics is to escape from local optima. To this end, the environmental temperature is altered by the following equation:

$$T_i^{\mathrm{env}} = \left(1 - (c_1 + c_2\,(1 - t)) \cdot \mathrm{rand}\right) \cdot T_i^{\mathrm{env,old}},$$

where $c_1$ and $c_2$ represent the control variables and $T_i^{\mathrm{env,old}}$ describes the object's earlier temperature, which is modified to $T_i^{\mathrm{env}}$.

Based on the previous models, the new temperatures of the objects are updated by the following equation:

$$T_i^{\mathrm{new}} = T_i^{\mathrm{env}} + \left(T_i^{\mathrm{old}} - T_i^{\mathrm{env}}\right) e^{-\beta t}.$$

Another parameter of the algorithm is $\mathrm{Pro}$, a value in $(0, 1)$ that states whether a component of a cooling object should be changed or not.

Each agent is compared with $\mathrm{Pro}$ via $\mathrm{rand}$, a uniformly distributed value in the range $[0, 1]$. If $\mathrm{rand} < \mathrm{Pro}$, one dimension $j$ of agent number $i$ is randomly chosen and its value is regenerated by the following:

$$T_{i,j} = T_{j,\min} + \mathrm{rand} \cdot \left(T_{j,\max} - T_{j,\min}\right),$$

where $T_{i,j}$ signifies the $j$th variable of the $i$th agent, and $T_{j,\min}$ and $T_{j,\max}$ represent the lower and upper bounds of the $j$th variable, respectively. To keep the structure of each agent largely unchanged, only one dimension is altered.

Finally, the stopping criteria are checked, and the algorithm terminates once they are met.

6.3. Improved Thermal Exchange Optimization (ITEO)

The basic Thermal Exchange Optimization algorithm suffers from some disadvantages, such as instability and premature convergence. This motivates us to design a modified version of the TEO algorithm that remedies these drawbacks as far as possible.

The first modification is to use Lévy flight (LF) as a proper mechanism. This mechanism has been commonly employed in metaheuristic algorithms to resolve premature convergence shortcomings [24]. In this mechanism, a random walk policy is utilized to properly adjust the local search; it is mathematically represented as follows:

$$LF = \alpha \times \frac{u \times \sigma}{|v|^{1/\beta}}, \quad \sigma = \left(\frac{\Gamma(1+\beta)\,\sin(\pi\beta/2)}{\Gamma\!\left(\frac{1+\beta}{2}\right)\, \beta\, 2^{(\beta-1)/2}}\right)^{1/\beta},$$

where $u, v \sim N(0, 1)$, $\beta$ signifies the Lévy index, which is located in the range $[0, 2]$ (here, $\beta = 1.5$ [25]), $\Gamma$ represents the Gamma function, and $\alpha$ defines the step size.

By assuming the above equations, the updating formulation for the TEO algorithm becomes:

$$T_i^{\mathrm{new}} = T_i^{\mathrm{env}} + \left(T_i^{\mathrm{old}} - T_i^{\mathrm{env}}\right) e^{-\beta t} + LF.$$

The second modification is to use a chaos mechanism to improve the convergence speed. Here, the Singer map is used for the chaotic modification [26, 27]. By considering this mechanism, the chaotic sequence $x_k$ is updated as follows:

$$x_{k+1} = \mu \left(7.86\,x_k - 23.31\,x_k^2 + 28.75\,x_k^3 - 13.302875\,x_k^4\right),$$

where $\mu = 1.07$.

6.4. Algorithm Authentication

The simulations are run on a Core(TM) i7-4720HQ 1.60 GHz machine with 8 GB of RAM in the MATLAB 2017b environment. To prove the effectiveness of the suggested ITEO algorithm, it is evaluated on several benchmark functions, i.e., Ackley, Rastrigin, Sphere, and Rosenbrock, and the results are compared with some well-known and new optimization techniques, i.e., the Locust Swarm Optimization Algorithm (LSO) [28], the Crow Search Algorithm (CSA) [29], the Multi-Verse Optimizer (MVO) [30], the Search and Rescue (SAR) algorithm [31], and the basic Thermal Exchange Optimization (TEO) [10]. Table 1 tabulates the parameter settings of the studied algorithms.

To clarify the studied benchmark functions, their equations and boundaries are given in Table 2. During the simulation, each algorithm is run 40 times independently on each benchmark function, and the maximum number of iterations for all algorithms is set to 500.

Table 3 tabulates the results of the analysis of the algorithms on the benchmark functions based on four measurement indicators including minimum, maximum, mean, and standard deviation (std).

As can be observed from the results, the suggested ITEO attains the minimum value on all four indicators. This indicates that the suggested algorithm has the minimum error for all metrics. The minimum value of the "minimum" indicator shows that the proposed ITEO algorithm has the highest precision among the compared algorithms. In addition, the minimum value of the "std" for the ITEO algorithm shows its higher consistency compared with the other methods.

7. Classification

For the final diagnosis based on the features, a classifier is needed. In this study, a newly optimized version of the Artificial Neural Network (ANN) has been used [33]. The ANN is a relatively new methodology with high efficiency in different classification applications. The ANN establishes a proper relationship between input and output data and has low sensitivity to errors [34]. With correct training, this method can accurately classify medical images without explicit mathematical modeling [35, 36]. A popular and simple type of ANN is the multilayer perceptron (MLP) neural network. An MLP is a mathematical model of the natural brain [34]: a model with a number of weights and biases that are connected to mimic the brain's performance. The popular method for error minimization in MLPs is backpropagation (BP). In BP, the values of the weights and biases are adjusted to minimize the error between the output value and the desired value [37]. This method uses the Gradient Descent (GD) algorithm for minimization. One significant problem of the GD algorithm is that it is easily trapped in local minima.

The preactivation of each hidden neuron in the network is as follows:

$$s_j = \sum_{i=1}^{n} w_{ij} x_i + b_j,$$

where $x_i$ indicates the input variable, $b_j$ signifies the bias of neuron $j$, and $w_{ij}$ describes the connection weight between $x_i$ and the $j$th hidden neuron.

After that, an activation function is applied to trigger the output of the neurons. This research employs the sigmoid function for this purpose:

$$f(s_j) = \frac{1}{1 + e^{-s_j}}.$$

And the output layer gives

$$y_k = \sum_{j} w_{jk}\, f(s_j) + b_k.$$

The ANN calculates the mean square error (MSE) between the desired and the observed outputs, i.e.,

$$MSE = \frac{1}{K} \sum_{k=1}^{K} \left(y_k - d_k\right)^2,$$

where $K$ describes the number of samples in the training data collection, and $y_k$ and $d_k$ represent the observed value and the desired value, respectively.

As mentioned earlier, the GD algorithm has problems when minimizing the MSE. Therefore, the suggested Improved Thermal Exchange Optimization is used here to minimize the MSE and optimize the classifier. Figure 2 illustrates this idea.

8. Results and Discussion

As mentioned before, the main idea of this study is to propose a pipeline methodology for lung cancer diagnosis. The method starts with a preprocessing step that enhances the quality of the original image. After image preprocessing, simple Kapur-based image thresholding is applied to extract the lung areas. Afterward, optimal features are selected from the GLCM features based on an improved version of the Thermal Exchange Optimization algorithm. Finally, an MLP classifier is optimized by the introduced Improved Thermal Exchange Optimization algorithm. The method has been validated on the Lung CT-Diagnosis dataset collected by the Cancer Imaging Archive [12]. Figure 3 shows an example of the image segmentation.

8.1. Dataset Description

In this study, the Lung CT-Diagnosis database has been utilized. This database contains several images in DICOM format acquired from 61 patients; the total number of images is 4682. The method has been simulated in the MATLAB 2017b environment and applied to the database on the following configuration: a Core i7 laptop with 16 GB of RAM.

8.2. Simulation Results

Table 4 indicates the GLCM results for the first 20 images of the Lung CT-Diagnosis database. As can be observed from Table 4, nineteen GLCM features are employed for the feature extraction (Table 5).

After feature selection based on the suggested ITEO methodology, the selected optimum features are shown in Table 6.

The optimum threshold achieved by ITEO has been used to select the features; the best cost-function value earned is 0.75. To test the final efficacy of the suggested technique, three measurement indicators, namely sensitivity, precision, and accuracy, have been analyzed. The mathematical formulations of the indicators are given below:

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \quad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where $TN$ is the true negatives, $TP$ is the true positives, $FN$ is the false negatives, and $FP$ is the false positives.

To verify the higher efficiency of the proposed method, a comparison analysis has been carried out against some state-of-the-art algorithms, including Kavitha's [38], Kumar's [39], and Lin's [40], all applied to the Lung CT-Diagnosis database. The comparison results are illustrated in a bar chart in Figure 4.

As can be observed from Figure 4, the proposed method has the best precision on the Lung CT-Diagnosis database, with Lin's, Kumar's, and Kavitha's methods in the next ranks. The results thus confirm the superiority of the proposed method.

9. Conclusions

The main purpose of this study was to propose an optimal pipeline for precise lung cancer diagnosis based on different approaches. The method started by applying a preprocessing stage based on min-max normalization of the input data and an average filter for denoising the input image. Afterward, Kapur entropy maximization along with mathematical morphology was used for segmentation of the lung area. Then, 19 GLCM features were extracted from the segmented images, and the features with higher priority were selected based on a new optimization design. The new design, called the Improved Thermal Exchange Optimization (ITEO) algorithm, was employed to optimize the feature selection step with improved accuracy and convergence ability, as shown in the validation stage. Finally, the images were classified into healthy or cancerous cases using an artificial neural network optimized by ITEO. The simulation was compared with several state-of-the-art methods, including Lin's, Kumar's, and Kavitha's, and the results showed that the proposed method, with 92.27% accuracy, 96.4% sensitivity, and 97.61% specificity, has the highest efficiency among the compared state-of-the-art methods. In future work, we will work on using convolutional features of the lung cancer images to achieve higher accuracy.

Data Availability

The data that support the findings are available in the Lung CT-Diagnosis database of lung cancer images at https://doi.org/10.7937/K9/TCIA.2015.A6V7JIWX.

Conflicts of Interest

The authors declare no conflicts of interest.