[Retracted] Diagnosis of Melanoma Using Deep Learning

Alazzam, Malik Bader; Alassery, Fawaz; Almulihi, Ahmed

doi:https://doi.org/10.1155/2021/1423605

Mathematical Problems in Engineering


This article has been retracted.

Special Issue

Advanced Aspects of Computational Intelligence and Applications of Fuzzy Logic and Soft Computing

Research Article | Open Access

Volume 2021 | Article ID 1423605 | https://doi.org/10.1155/2021/1423605

Malik Bader Alazzam,¹ Fawaz Alassery,² and Ahmed Almulihi³

Academic Editor: Naeem Jan

Received 15 Oct 2021

Revised 31 Oct 2021

Accepted 12 Nov 2021

Published 01 Dec 2021

Abstract

Melanoma is the deadliest type of skin cancer, but patients who are diagnosed early have a better prognosis. To provide a supplementary opinion to experts, researchers have investigated various methods for automatic melanoma recognition and diagnosis. Building models from the available data has proven difficult because of the imbalance between classes. This study evaluates machine learning algorithms combined with class-balancing techniques on the melanoma diagnosis task. Features of skin lesions were extracted from 200 dermoscopic images using the VGG16, VGG19, Inception, and ResNet convolutional neural network architectures together with the ABCD rule. After attribute selection with Greedy Stepwise (GS) and balancing of the training data with the Synthetic Minority Oversampling Technique (SMOTE) and the Edited Nearest Neighbor (ENN) rule, the random forest classifier reached a sensitivity of nearly 93% and a kappa index (κ) of 78%.

1. Introduction

Because early diagnosis of skin cancer is so important, digital image processing technologies are being developed to augment existing diagnostic tools. The objective is to create a classification model for skin cancer diagnosis that minimizes the number of misdiagnosed patients and increases the likelihood of prompt treatment. Alencar [1] proposed a classification technique that segments skin lesions from the PH² database using Otsu’s adaptive thresholding over the blue channel of the RGB scheme. A multilayer perceptron (MLP) neural network was then used to categorize the images as benign nevus or melanoma. Safran and colleagues [2] evaluated several machine learning algorithms (such as MLP, SVM, and random forest) for classifying dermoscopic exams. Other work compared its results with those of at least 21 dermatologists who were tested on three crucial diagnostic tasks: the classification of keratinocyte carcinoma, the classification of melanomas, and the classification of dermoscopy-based melanomas. None of these proposals deals with the problem of class imbalance, where normal dermoscopic exams outnumber melanoma samples. The objective of this work is to propose a methodology for the diagnosis of melanoma through a supervised machine learning process, using deep learning fundamentals on dermoscopic images together with sample balancing algorithms.

2. Methodology

2.1. Image Acquisition and Preparation

The images available in the PH² database [3] were used for training and validation. The database contains 200 dermoscopic examinations divided into three categories: normal lesions (80 samples), atypical lesions (80 samples), and melanomas (40 samples). In this study, the normal and atypical lesion images were merged into a single non-melanoma class and kept separate from the melanomas. The resulting dataset therefore contains 160 images of benign skin lesions and 40 images of malignant ones.

As shown in Figure 1(b), each image is accompanied by a binary mask of the lesion site, created by experts. Before feature extraction, the parts of the image that are not relevant to the problem are removed. No information should be taken from the surrounding skin, only from the lesion’s region of interest (ROI). Applying the mask to the original exam therefore retains only the ROI.

Figure 1: panels (a)–(c); panel (b) shows the expert-drawn binary mask of the lesion.

After the ROI was extracted, a size standardization phase was used to make the images uniform in size across the dataset. This step involves adding vertical or horizontal borders to the lesion to make the heights and widths the same size. This is necessary because the architectures of the convolutional neural networks used accept only square images.
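The masking and size standardization steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `extract_square_roi` is hypothetical, cropping to the lesion bounding box before padding is an assumption, and so is the use of black (zero-valued) borders.

```python
import numpy as np

def extract_square_roi(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep only lesion pixels, crop to the lesion, and pad to a square.

    image: (H, W, 3) array; mask: (H, W) array, non-zero inside the lesion.
    """
    # Apply the binary mask so that skin outside the lesion becomes zero.
    roi = image * (mask > 0)[..., None]
    # Assumption: crop to the bounding box of the lesion before padding.
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    roi = roi[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    # Add vertical or horizontal borders so that height equals width.
    h, w = roi.shape[:2]
    side = max(h, w)
    top, left = (side - h) // 2, (side - w) // 2
    return np.pad(roi, ((top, side - h - top), (left, side - w - left), (0, 0)))
```

The square output can then be resized to whatever input resolution a given network expects; that resizing step is not covered here.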

2.2. Extraction and Selection of Features

In this study, the convolutional neural network architectures VGG16, VGG19, InceptionV3, and ResNet50, as implemented in the Keras package, were used to extract image features. Transfer learning is a technique used when there is not enough data to build a new model: knowledge learned from a large amount of previously seen data is applied to the new data. In addition to the features extracted with the CNNs, another descriptor was used, based on the ABCD rule for analyzing skin lesions. An implementation of this rule, adapted from Moura et al., returns 7 values: asymmetry (1 value), irregular edges (4 values), color (1 value), and diameter (1 value). For the distance d used when counting valleys and peaks along the edge, the value 15 was chosen. For the color variable, the pixels were divided into 50 intervals and the threshold used to disregard a given color was 100. These parameter values were chosen empirically. All features were concatenated, totaling 5,127 descriptors per image.

An attribute selection phase was then performed using the algorithms Ranker (R), RankSearch, PSO, and Greedy Stepwise (GS). The stepwise search stops when the addition or deletion of any remaining attribute reduces the performance of a base model. Selection was run under k-fold cross-validation: the dataset is divided into k mutually exclusive subsets of the same size, one subset is used for testing, and the remaining k − 1 are used for training. The test subset is rotated k times to complete the procedure. The k value employed in this investigation was 10.
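The total of 5,127 descriptors can be checked with a short calculation. The per-network sizes below are an assumption, not stated in the text: they are the channel counts of the last convolutional block of each architecture after global pooling, which is the usual way to obtain a flat feature vector from these networks.

```python
# Assumed feature sizes after global pooling of the last convolutional block.
CNN_FEATURES = {"VGG16": 512, "VGG19": 512, "InceptionV3": 2048, "ResNet50": 2048}

# ABCD descriptor sizes as listed in the text (7 values in total).
ABCD_FEATURES = {"asymmetry": 1, "edges": 4, "color": 1, "diameter": 1}

def total_descriptors() -> int:
    """Length of the concatenated feature vector for one image."""
    return sum(CNN_FEATURES.values()) + sum(ABCD_FEATURES.values())

print(total_descriptors())
```

Under these assumed sizes the sum is 5,120 CNN features plus 7 ABCD values, which matches the count reported in the text.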

The feature selection algorithms, run with the Weka tool (Waikato Environment for Knowledge Analysis) [4], return a list of all attributes together with the number of folds (out of 10, in the case of the RankSearch, PSO, and GS algorithms) in which each attribute was relevant. In the case of the R algorithm, each attribute receives a value between 0 and 1 according to its relevance to the model. A cutoff value (threshold) was determined for each selection algorithm. Attributes with relevance below the threshold were eliminated from the final feature vector. Table 1 shows the number of attributes that remained after applying each selector separately.

The threshold chosen for RankSearch, PSO, and GS was 1, representing the minimum number of folds in which a given attribute was relevant. Because selection was run with 10 folds, an attribute is kept if it was relevant in at least one fold. For R, the value of 0.1 was chosen empirically.
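The thresholding rule can be illustrated with a small sketch. The attribute names and relevance values below are made up for illustration; only the thresholds (1 fold for RankSearch, PSO, and GS, and 0.1 for R) come from the text.

```python
def select_by_threshold(relevance: dict, threshold: float) -> list:
    """Keep attributes whose relevance is not below the threshold."""
    return [name for name, value in relevance.items() if value >= threshold]

# Hypothetical fold counts (number of the 10 folds in which the attribute was relevant).
fold_counts = {"vgg16_17": 0, "abcd_asymmetry": 10, "resnet50_403": 3}
print(select_by_threshold(fold_counts, threshold=1))

# Hypothetical Ranker scores in the range [0, 1].
ranker_scores = {"vgg16_17": 0.05, "abcd_asymmetry": 0.336, "resnet50_403": 0.1}
print(select_by_threshold(ranker_scores, threshold=0.1))
```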

It is also important to quantify how many variables from each feature descriptor remained after each selection algorithm. Table 2 shows the number of remaining variables for each extractor, as well as the percentage of that extractor's total features in each case.

The initials next to the number of features for the ABCD descriptor indicate which variables remained after the attribute selection algorithm. In this case, for the GS selector, the asymmetry and color variables were more relevant than borders and diameter.

2.3. Experiments

The feature vector generated in the previous step was used by two machine learning algorithms: the support vector machine (SVM), which uses the principle of structural risk minimization (SRM) to find a decision rule with good generalizability and whose solution is unique and globally optimal, and random forest (RF), which makes a decision by aggregating the predictions of an ensemble of decision trees. Two types of training were carried out. The first used the unbalanced training set. The second used SMOTE to oversample the training data and then ENN to clean up noisy samples. Because instances far from the decision boundary between the two classes are not relevant for building the model, the support vectors (SVs) found by a separately trained support vector machine on the unbalanced data were chosen as the set of pivots for SMOTE in this study. The value of k was set to 3 for both the SMOTE and ENN algorithms, as higher values did not improve the performance of the models. In addition, a value of k that is too high for ENN tends to eliminate more instances from the dataset, increasing the probability of excluding real examples that could be essential for training the classifiers. The examples that remained in the dataset were then passed to the classifiers.
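The balancing step can be sketched in plain NumPy as below. This is a simplified illustration under stated assumptions rather than the authors' pipeline: every minority sample is used as a SMOTE pivot (the text restricts pivots to support vectors of a separately trained SVM), labels are assumed to be 0 for non-melanoma and 1 for melanoma, the minority class is oversampled until both classes have the same size, and ENN is applied to both classes. All function names are hypothetical, and the sketch needs more than k minority samples.

```python
import numpy as np

def _neighbours(X, k):
    """Indices of the k nearest other rows for each row of X (Euclidean)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return np.argsort(dist, axis=1)[:, 1:k + 1]   # column 0 is the row itself

def smote(X_min, n_new, k=3, rng=None):
    """Create n_new synthetic minority samples by interpolating between neighbours."""
    rng = np.random.default_rng(rng)
    neigh = _neighbours(X_min, k)
    pivots = rng.integers(0, len(X_min), size=n_new)
    chosen = neigh[pivots, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                  # position along the segment
    return X_min[pivots] + gap * (X_min[chosen] - X_min[pivots])

def enn(X, y, k=3):
    """Drop samples whose label disagrees with the majority of their k neighbours."""
    votes = y[_neighbours(X, k)]                  # (n_samples, k) neighbour labels
    majority = (votes.mean(axis=1) > 0.5).astype(y.dtype)
    keep = majority == y
    return X[keep], y[keep]

def smote_enn(X, y, k=3, rng=None):
    """Oversample class 1 up to the size of class 0, then clean with ENN."""
    X_min = X[y == 1]
    n_new = int((y == 0).sum()) - len(X_min)
    X_new = smote(X_min, n_new, k, rng)
    X_bal = np.vstack([X, X_new])
    y_bal = np.concatenate([y, np.ones(len(X_new), dtype=y.dtype)])
    return enn(X_bal, y_bal, k)
```

With k = 3 and two classes, the neighbour vote cannot tie, which keeps the ENN rule unambiguous. In practice the balancing would be applied to the training split only, so that synthetic samples never reach the test set.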

The experiments in this study were repeated ten (10) times. In each run, the dataset was divided into 70% for training and 30% for testing, and the parameters of each classifier were estimated automatically using the auto-sklearn library [5] to generate the best classification model.
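The repeated hold-out protocol can be outlined as follows. This is a sketch only: the `evaluate` callback is a hypothetical stand-in for training and scoring a classifier, the random shuffling and the seed are assumptions, and the automatic parameter search is not reproduced.

```python
import random
from statistics import mean, stdev

def repeated_holdout(n_samples, evaluate, runs=10, train_fraction=0.7, seed=0):
    """Run `evaluate(train_idx, test_idx)` on `runs` random splits.

    Returns the mean and standard deviation of the returned scores.
    """
    rng = random.Random(seed)
    cut = round(train_fraction * n_samples)
    scores = []
    for _ in range(runs):
        indices = list(range(n_samples))
        rng.shuffle(indices)                      # new random split for each run
        scores.append(evaluate(indices[:cut], indices[cut:]))
    return mean(scores), stdev(scores)
```

For the 200 images used here, each run would train on 140 examples and test on 60.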

2.4. Performance Metrics

To assess the classifiers in terms of their generalizability, the criteria of accuracy, precision, sensitivity, specificity, and score were analyzed for each solution presented.
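Four of these criteria follow directly from the counts of a binary confusion matrix, with melanoma taken as the positive class. The sketch below uses the standard definitions; the function name and the example counts are illustrative and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),      # fraction of predicted melanomas that are melanomas
        "sensitivity": tp / (tp + fn),    # fraction of melanomas that are detected
        "specificity": tn / (tn + fp),    # fraction of non-melanomas correctly identified
    }

# Hypothetical counts for a 60-image test split.
print(classification_metrics(tp=11, fp=4, tn=44, fn=1))
```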

2.4.1. Cohen’s Kappa Index

The kappa index is an alternative measure of classification performance that has been known for decades and that compensates for hits attributable to chance [3]. Its original purpose was to assess the degree of agreement or disagreement between two observers of the same phenomenon. Cohen’s kappa can be adapted for classification tasks and is recommended because, like the AUC measure, it accounts for successes that would occur at random.
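For a binary classifier, Cohen's kappa compares the observed agreement p_o with the agreement p_e expected by chance, as κ = (p_o − p_e) / (1 − p_e). A minimal sketch from confusion-matrix counts is given below; the function name and the example counts are illustrative only.

```python
def cohen_kappa(tp, fp, tn, fn):
    """Cohen's kappa between predicted and true labels of a binary task."""
    n = tp + fp + tn + fn
    p_o = (tp + tn) / n                               # observed agreement
    p_pos = ((tp + fp) / n) * ((tp + fn) / n)         # chance agreement on the positive class
    p_neg = ((tn + fn) / n) * ((tn + fp) / n)         # chance agreement on the negative class
    p_e = p_pos + p_neg
    return (p_o - p_e) / (1 - p_e)

# Illustrative counts: p_o = 0.7 and p_e = 0.5.
print(cohen_kappa(tp=20, fp=5, tn=15, fn=10))
```

A value of 1 means perfect agreement and a value of 0 means agreement no better than chance.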

According to [6, 7], there is still a very strong correlation between the kappa index and the ROC (receiver operating characteristic) curve, which describes a “trade-off” choice between TPs and FPs.

3. Results and Discussion

This section presents the results and the analyses carried out during the development of the work, organized by classifier and by training scenario. A good classification model should maximize the percentage of melanomas correctly identified (sensitivity) while keeping the percentage of normal cases correctly identified (specificity) as high as possible. Sensitivity is prioritized because detecting melanoma is preferable to missing it in a patient at risk, even if doing so results in more false positives. The results are presented in separate tables, by classifier and according to whether the class balancing step with SMOTE + ENN was used, and contain the means and standard deviations obtained over the ten runs performed for each type of experiment. Tables 4 and 5 show the results of the random forest algorithm for each selector without and with balancing.

When balancing techniques (SMOTE + ENN) were applied before training the machine learning algorithm, sensitivity increased regardless of which feature selector was used. For the RF classifier, there was an improvement of approximately 14 percentage points with GS as the attribute selector, and this combination produced the best model. The improvement occurs because the classifier no longer prioritizes the class with more examples: once the classes are balanced, classification errors carry the same weight for both classes.

It can also be noted that the RF with SMOTE and ENN achieved an overall average accuracy of 92.00%, similar to that obtained without balancing. Cohen’s kappa coefficient indicates that the model is very good, as the κ value is greater than 0.6 and less than 0.8. Tables 6 and 7 present the results of the SVM algorithm.

The SVM shows the same behavior as the RF algorithm. On average, the sensitivity values were higher after balancing. As with the random forest classifier, the best SVM model in terms of sensitivity used SMOTE + ENN together with the GS attribute selector. However, its sensitivity was lower, at 85.00%. The SVM achieved higher mean accuracies when trained with the features resulting from the RankSearch selection algorithm. Given that RankSearch retained far fewer features than the other selectors, it is plausible that higher dimensionality interfered negatively with the formation of the decision boundary between the two classes. The PSO algorithm, which retained the largest number of features (see Table 1), had the worst performance among the selectors in most tests.

The dispersion, represented by the amplitude of the graphs, indicates greater variation in the results when oversampling and undersampling are not used to balance the classes. With SMOTE + ENN, the minimum sensitivity is 75% or higher. Among the models trained on balanced data, the random forest obtained a higher median than the SVM.

Regarding the shape and color descriptors of the ABCD rule, the asymmetry variable proved relevant for all attribute selection algorithms, and the color variable was retained by R, RankSearch, and Greedy Stepwise. The edge and diameter descriptors remained in the feature set only for the R selector. This could be because the attribute selection threshold used was 0.1. The maximum and minimum relevance values given by the R selector were 0.336 and 0, respectively. If the maximum value is taken as 100% relevance, the value 0.1 corresponds to only 29.76% relevance. A threshold representing at least 50% relevance would be 0.168.
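The two figures quoted for the R selector follow from scaling by the maximum relevance of 0.336. The short check below reproduces the arithmetic; the function names are illustrative.

```python
MAX_RELEVANCE = 0.336   # maximum relevance reported for the R selector

def relative_relevance(threshold, max_relevance=MAX_RELEVANCE):
    """Threshold expressed as a fraction of the maximum relevance."""
    return threshold / max_relevance

def threshold_for(fraction, max_relevance=MAX_RELEVANCE):
    """Threshold that corresponds to a given fraction of the maximum relevance."""
    return fraction * max_relevance

print(f"{relative_relevance(0.1):.2%}")   # share of the maximum represented by 0.1
print(f"{threshold_for(0.5):.3f}")        # threshold at 50% of the maximum
```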

Convolutional neural networks extract features automatically. In a machine learning process, however, it is essential to understand why a given classifier errs on a given instance, and this requires observing and analyzing how the features were obtained.

This verification is required in order to identify potential failures and improve the overall process. References [7–11] proposed a method for producing “visual explanations” for decisions from a large class of CNN-based models, thereby making them more transparent. It uses the gradients of a target concept to produce a localization map that highlights the image regions most important for the classification. This technique is known as gradient-weighted class activation mapping (Grad-CAM). Figures 2–5 present the activation of the last convolution layers of the networks VGG16, VGG19, Inception, and ResNet, respectively, where Figures 2(a)–5(a) show the lesion’s region of interest, Figures 2(b)–5(b) show the activated image points, and Figures 2(c)–5(c) show heat maps indicating which areas contributed most to the network’s output. The first row presents an example of melanoma and the second an example of a non-melanoma image. The visualizations consider only the best feature selection method, which in this case was GS, with the best classifier, random forest.
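The core of the Grad-CAM computation can be sketched with NumPy, assuming the activations of the last convolutional layer and the gradients of the target score with respect to them have already been obtained from the network. Obtaining those gradients is framework specific and is not shown; the function name and the final normalization to [0, 1] are assumptions made for display purposes.

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Coarse localization map from last-layer activations and gradients.

    Both inputs have shape (H, W, K): K feature maps of size H x W.
    """
    # One weight per feature map: the spatial average of its gradients.
    weights = gradients.mean(axis=(0, 1))
    # Weighted sum of the feature maps, followed by ReLU to keep positive evidence.
    cam = np.maximum((activations * weights).sum(axis=2), 0.0)
    # Assumption: scale to [0, 1] so the map can be rendered as a heat map.
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

The resulting H x W map is usually upsampled to the input resolution and overlaid on the image to obtain heat maps like those described for Figures 2(c)–5(c).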

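Figures 2–5: Grad-CAM activations of the last convolution layers of VGG16, VGG19, Inception, and ResNet, respectively. (a) Lesion region of interest. (b) Activated image points. (c) Heat maps.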

In Figures 2(a)–2(c), the activation of the VGG16 architecture for the melanoma example falls over a region where the lesion appears more severe, while for the normal image the most important regions were the lighter pigments close to the edge [12]. The same behavior can be seen in Figures 5(a)–5(c), which show the maps for ResNet.

For the VGG19 (Figure 3) and Inception (Figure 4) topologies, the most relevant regions for the non-melanoma image (second row) were those around the edge and outside it. For the example of melanoma, the darker regions of the lesion were considered more relevant.

It should be noted that in the feature selection process, not all activation maps are considered important for classification and are therefore eliminated by the selector.

3.1. Comparison with Related Work

Many papers do not report the precision of their classifiers. Precision indicates how many of the examples classified as melanoma were in fact melanoma; the greater the number of false positives, the lower the precision. None of the studies that used the PH² database reports a precision value, making it impossible to compare them with the 74.33% achieved in this work. Among studies that used other databases, higher values are reported, such as the 90% of Masood [13, 14].

It is important to emphasize that the cost of classifying an at-risk patient’s exam as normal, when the patient actually has a malignant skin neoplasm, is much greater than the cost of reporting disease in a healthy individual. The overall accuracy of 92% remains promising and is higher than that of many other studies, while sensitivity is kept high. The kappa index of 0.7715, which rates the trained predictive model as “very good”, supports the claim that the model is promising [15, 16].

4. Conclusion

This work presented a methodology for the diagnosis of melanoma using features obtained from the main CNN models and the ABCD rule. The attribute sets were concatenated and passed through an attribute selection phase, and the resulting vector was used to train support vector machine and random forest classifiers in several different scenarios. The best result was a sensitivity of 92.5%, an improvement of approximately 14 percentage points after applying class balancing techniques. As a contribution, the pipeline applied to the training process, which balances the classes in the database by generating synthetic examples (SMOTE) and removing noisy samples (ENN), increased the sensitivity of the classifiers on the melanoma diagnosis task [17, 18]. The predictive model thus seeks to maintain overall accuracy while improving the classification of melanoma examples, to the detriment of the non-melanoma class. The consequences of misclassifying an at-risk case can be catastrophic, possibly leading to the death of the patient, whereas the cost of giving a positive diagnosis to a person who does not have the disease is much lower [19, 20].

Data Availability

The data used to support the findings of this study are included within the article.

Disclosure

This study was performed as a part of the employment of institutions.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Acknowledgments

We deeply acknowledge Taif University, Taif, Saudi Arabia, for supporting this study through Taif University Researchers Supporting Project (TURSP-2020/344).

References

[1] F. E. S. Alencar, Development of a System for Automatic Classification of Dermoscopic Images for Mobile Devices, State University of Rio Grande do Norte, Orlando Teixeira Central Library, BR-RN, 2015.
[2] T. Safran, A. Viezel-Mathieu, J. Corban, A. Kanevsky, S. Thibaudeau, and J. Kanevsky, “Machine learning and melanoma: the future of screening,” Journal of the American Academy of Dermatology, vol. 78, no. 3, pp. 620-621, 2018.
[3] A. Esteva, B. Kuprel, R. A. Novoa et al., “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, no. 7639, p. 115, 2017.
[4] A. A. Adegun and S. Viriri, “Deep learning-based system for automatic melanoma detection,” IEEE Access, vol. 8, pp. 7160–7172, 2020.
[5] A. A. Hamad and L. M. Thivagar, “Conforming dynamics in the metric spaces,” Journal of Information Science and Engineering, vol. 36, no. 2, 2020.
[6] T. Mendonça, P. Ferreira, A. Marçal et al., PH2: A Public Database for the Analysis of Dermoscopic Images, Taylor & Francis Group, Milton Park, UK, 2015.
[7] S. Jha, S. Ahmad, A. Hikmat, M. Abdeljaber, and M. B. Alazzam, “A post-COVID Machine Learning approach in Teaching and Learning methodology to alleviate drawbacks of the e-whiteboards,” Journal of Applied Science and Engineering, vol. 25, no. 2, pp. 285–294, 2022.
[8] G. Zhang, Z. Guo, Q. Cheng, I. Sanz, and A. A. Hamad, “Multi-level integrated health management model for empty nest elderly people’s to strengthen their lives,” Aggression and Violent Behavior, vol. 13, Article ID 101542, 2021.
[9] S. Sengan, O. I. Khalaf, S. Priyadarsini, D. K. Sharma, K. Amarendra, and A. A. Hamad, “Smart healthcare security device on medical IoT using raspberry pi,” International Journal of Reliable and Quality E-Healthcare, vol. 11, no. 3, pp. 1–11, 2022.
[10] S. Sengan, O. I. Khalaf, G. R. K. Rao, D. K. Sharma, K. Amarendra, and A. A. Hamad, “Security-aware routing on wireless communication for E-health records monitoring using machine learning,” International Journal of Reliable and Quality E-Healthcare, vol. 11, no. 3, pp. 1–10, 2022.
[11] S. Sengan, O. I. Khalaf, D. K. Vidya Sagar, D. K. Sharma, A. J. Prabhu, and A. A. Hamad, “Secured and privacy-based IDS for healthcare systems on E-medical data using machine learning approach,” International Journal of Reliable and Quality E-Healthcare, vol. 11, no. 3, pp. 1–11, 2022.
[12] V. Antonysamy, M. L. Thivagar, S. Jafari, and A. A. Hamad, “Neutrosophic sets in determining corona virus,” Materials Today: Proceedings, vol. 13, 2021.
[13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[14] P. Bethapudi, P. Poornima, R. Kumari, and M. Niharika, “Detection of melanoma in skin cancer using deep learning,” no. 11, pp. 47–52, 2021.
[15] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: visual explanations from deep networks via gradient-based localization,” in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626, IEEE, Venice, Italy, 2017.
[16] A. Masood, A. Al-Jumaily, and K. Anam, “Self-supervised learning model for skin cancer diagnosis,” in Proceedings of the IEEE. 2015 7th International IEEE/EMBS Conference on Neural Engineering (NER), pp. 1012–1015, Montpellier, France, 2015.
[17] M. Rastgo, G. Lemaitre, O. Morel et al., “Classification of melanoma lesions using sparse coded features and random forests,” in Proceedings of the International Society for Optics and Photonics. SPIE Medical Imaging, p. 97850C, California, USA, March 2016.
[18] R. K. Barik, S. S. Patra, P. Kumari, S. N. Mohanty, and A. A. Hamad, “A new energy aware task consolidation scheme for geospatial big data application in mist computing environment,” in Proceedings of the 2021 8th International Conference on Computing for Sustainable Global Development (INDIACom), pp. 48–52, New Delhi, India, March 2021.
[19] M. L. Thivagar, A. S. Al-Obeidi, B. Tamilarasan, and A. A. Hamad, “Dynamic Analysis and projective synchronization of a new 4D system,” IoT and Analytics for Sensor Networks, vol. 13, pp. 323–332, 2022.
[20] M. L. Thivagar, A. A. Hamad, B. Tamilarasan, and G. K. Antony, “A novel seven-dimensional hyperchaotic,” Proceedings of Second Doctoral Symposium on Computational Intelligence, vol. 13, pp. 329–340, 2022.

Copyright

Copyright © 2021 Malik Bader Alazzam et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
