Abstract

In this paper, we describe our method for skin lesion classification. The goal is to classify dermoscopic images of skin lesions into the diagnostic classes of the HAM (Human Against Machine) dataset: melanoma (MEL), melanocytic nevus (NV), basal cell carcinoma (BCC), actinic keratosis (AK), benign keratosis (BKL), dermatofibroma (DF), and vascular lesion (VASC). We propose a simplified solution that achieves better accuracy than previous methods while predicting with only a single model, which makes it practical for real-world scenarios. Our results show that a network that takes additional metadata as input achieves better classification performance. This metadata includes both patient information and the extra information generated during the data augmentation process. On the international skin imaging collaboration (ISIC) 2018 skin lesion classification challenge test set, our algorithm yields a balanced multiclass accuracy of 88.7% for a single model and 89.5% for the embedding solution, which makes it the currently first-ranked algorithm on the live leaderboard. To improve the inference accuracy, test-time augmentation (TTA) is applied. We also demonstrate how Grad-CAM can be applied within TTA: the two can be integrated in heat map generation, which can be very helpful in assisting clinicians with diagnosis.

1. Introduction

Skin cancer is the most common cancer around the world. Early detection and monitoring play a crucial role in decreasing the mortality rate of skin cancer. However, diagnosis is challenging: only 65%-80% of skin cancer cases are correctly diagnosed by clinical inspection from an experienced physician [1]. Perez et al. investigated the impact of 13 data augmentation scenarios, such as traditional color and geometric transforms, elastic transforms, random erasing, and lesion mixing, for melanoma classification [16]. The results confirmed that data augmentation can lead to larger performance gains than obtaining new images. Recently, the international skin imaging collaboration (ISIC) challenge Skin Lesion Analysis Towards Melanoma Detection produced numerous high-performing methods, mostly based on convolutional neural networks, that performed comparably to human experts in the evaluation of dermoscopic images [2]. Most of those methods obtain high classification accuracy by ensembling multiple models [3, 4]. For example, 18 convolutional neural network (CNN) architectures [4] and 7 multiresolution EfficientNet (B4) models [3] were explored, with extensive data augmentation. The final results were obtained from 90 submodels, which take 13.9 seconds to classify a single test image on a high-end Titan V graphics card [5]. However, in these ensembling methods, every image has to be sent through all the models for inference, and with data augmentation this scheme even runs multiple times on each model, which requires a significant amount of computation and would not be practical for a real-world scenario.

Here, we propose a simplified solution which has a better accuracy than previous methods and is also practical for a real-world scenario.

2. Method Details

2.1. Datasets

The ISIC 2019 skin lesion classification challenge dataset contains 25,331 dermoscopic images, with extra meta-information about each patient's age, anatomical site, and sex [6-8]. We also use 1,572 extra images in training, including 170 images from the MED-NODE dataset, 533 from the seven-point dataset, and 120 from the PH2 dataset; the remaining 749 are our own collected data.

2.2. Image Preprocessing

We resize all images so that the longer side is 1,024 pixels while preserving the aspect ratio. The shades of gray color constancy method proposed by Finlayson and Trezzi is applied beforehand as a preprocessing step, and its color gain in each RGB channel is recorded as extra meta-information input [3].
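For illustration, a minimal sketch of this step is given below, assuming the commonly used Minkowski norm p = 6; the function name and the gain normalization are our own illustrative choices, not taken from the original implementation.

```python
import numpy as np

def shades_of_gray(image: np.ndarray, p: int = 6):
    """Shades-of-gray color constancy for an RGB uint8 image.

    Returns the corrected image and the per-channel gains, which are
    recorded as extra metadata input for the classifier.
    """
    img = image.astype(np.float64)
    # Minkowski p-norm estimate of the illuminant, one value per channel
    illuminant = np.power(np.mean(np.power(img, p), axis=(0, 1)), 1.0 / p)
    # Per-channel gains that equalize the estimated illuminant
    gains = illuminant.mean() / illuminant
    corrected = np.clip(img * gains, 0, 255).astype(np.uint8)
    return corrected, gains
```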

2.3. Additional Patient Information Preprocessing

The additional patient information is mostly encoded by a one-hot encoding scheme. For example, six features are used for the anatomical sites with sufficient occurrence (>100 images). Sex is encoded as 1/-1/0 for male/female/missing, respectively. Age is encoded by 18 features using 18 thresholds from 5 to 90 with step 5, where 1/-1/0 represent larger/smaller/missing, respectively.
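A minimal sketch of this encoding is shown below; the particular list of six anatomical sites and the helper's interface are assumptions for illustration.

```python
import numpy as np

# Assumed set of six frequent anatomical sites (illustrative)
SITES = ["anterior torso", "head/neck", "lower extremity",
         "upper extremity", "posterior torso", "palms/soles"]

def encode_metadata(age, sex, site):
    # Sex: 1 / -1 / 0 for male / female / missing
    sex_feat = {"male": 1.0, "female": -1.0}.get(sex, 0.0)
    # Age: 18 thresholds 5, 10, ..., 90; 1 / -1 / 0 = larger / smaller / missing
    thresholds = np.arange(5, 95, 5)
    if age is None:
        age_feats = np.zeros(len(thresholds))
    else:
        age_feats = np.where(age >= thresholds, 1.0, -1.0)
    # Anatomical site: one-hot over the six frequent sites (all zeros if missing)
    site_feats = np.array([1.0 if site == s else 0.0 for s in SITES])
    return np.concatenate([[sex_feat], age_feats, site_feats])  # 25 features
```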

2.4. CNN Architectures

We use EfficientNet models that have been pretrained on the ImageNet dataset [9]. This model family contains 8 structurally similar models that follow certain scaling rules for adjustment to larger image sizes, from the smallest version B0 up to B7 (Figure 1). To incorporate additional patient information such as age, anatomical site, and sex, an additional dense neural network processes the metadata and its features are fused with the CNN features, as discussed in [4]. In our experiments, we report the performance of a single B4 model as well as an ensemble of B3 and B4. We use the default input sizes described in the EfficientNet paper, which are 300 × 300 for B3 and 380 × 380 for B4.
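The fusion of image and metadata features could be sketched as follows, assuming the efficientnet_pytorch package and illustrative layer sizes for the dense metadata branch; this is a sketch in the spirit of [4], not the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn
from efficientnet_pytorch import EfficientNet  # assumed implementation

class FusionNet(nn.Module):
    def __init__(self, n_classes: int, n_meta: int = 25):
        super().__init__()
        self.cnn = EfficientNet.from_pretrained("efficientnet-b4")
        cnn_dim = self.cnn._fc.in_features      # 1792 for B4
        self.cnn._fc = nn.Identity()            # keep pooled image features
        self.meta = nn.Sequential(              # dense branch for metadata
            nn.Linear(n_meta, 128), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.head = nn.Linear(cnn_dim + 128, n_classes)

    def forward(self, image, meta):
        # Concatenate pooled CNN features with metadata features, then classify
        feats = torch.cat([self.cnn(image), self.meta(meta)], dim=1)
        return self.head(feats)
```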

2.5. CNN Data Augmentation

We perform data augmentation in training, both geometric and pixelwise, including random brightness, contrast, hue, saturation, Gaussian noise, Gaussian blur, random crop, rotation, and flipping. Moreover, we record the scaling and shifting parameters of the geometric augmentations, which are later used as metafeatures during training. These augmentations are implemented using the albumentations library.
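For illustration, a pipeline of this kind can be built with albumentations' ReplayCompose, which records the parameters actually applied in each call (e.g., crop position and scale) so they can be fed back as metafeatures; the probabilities and limits below are illustrative, not our tuned settings.

```python
import albumentations as A

augment = A.ReplayCompose([
    A.RandomBrightnessContrast(p=0.5),
    A.HueSaturationValue(p=0.5),
    A.GaussNoise(p=0.3),
    A.GaussianBlur(p=0.3),
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2, rotate_limit=45, p=0.7),
    A.RandomCrop(height=380, width=380),   # B4 input size
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
])

out = augment(image=image)                         # `image`: HxWx3 numpy array
augmented, replay = out["image"], out["replay"]    # `replay` holds applied params
```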

2.6. CNN Training

We train the models for 60 epochs with batch size 16 using SGD with momentum. A one-cycle learning-rate scheduler was applied [10], implemented by the PyTorch OneCycleLR function with default parameters, and the learning rate was set as . A weighted cross-entropy loss function was used in which underrepresented classes receive a higher weight; the coefficient was calculated by the formula described by Lin et al. [11]. Focal loss with a was also tested; unfortunately, the testing accuracy was lower [12]. Training was performed on an NVIDIA GTX 1080 Ti. A sampling strategy was applied in the data loading procedure. In the metafile, a lesion ID is provided for each image; one lesion may have 1-30 images in the dataset, and images of the same lesion are highly similar. Therefore, a sampling weight coefficient equal to the inverse of the number of images for that lesion ID was added, as sketched below.
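The lesion-ID-based sampling and the class-weighted loss could look as follows; df is assumed to be a pandas DataFrame with lesion_id and label columns, and the inverse-frequency class weighting shown is a common choice rather than the exact formula of Lin et al. [11].

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Sampling weight = 1 / (number of images sharing the same lesion ID),
# so lesions with many near-duplicate images are not oversampled.
counts = df["lesion_id"].map(df["lesion_id"].value_counts())
sampler = WeightedRandomSampler(weights=(1.0 / counts).tolist(),
                                num_samples=len(df), replacement=True)

# Class-weighted cross-entropy: underrepresented classes get higher weight
# (illustrative inverse-frequency weighting, normalized to mean 1).
class_counts = torch.tensor(df["label"].value_counts().sort_index().values,
                            dtype=torch.float)
class_weights = class_counts.sum() / class_counts
criterion = torch.nn.CrossEntropyLoss(weight=class_weights / class_weights.mean())
```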

3. Discussion

Skin cancer is one of the most common malignancies, with increasing incidence rates on a global scale [13]. Early detection is an important factor in increasing the overall survival and cure rates for these patients [5]. The diagnosis of these diseases is usually carried out by dermatologists through visual examination of suspicious skin areas, but lesions are easily misdiagnosed due to the high similarity of some lesion types. Although supportive imaging techniques such as dermoscopy can improve diagnostic accuracy to some extent, accuracy varies greatly among clinicians with different levels of experience.

A large number of studies have been devoted to improving the accuracy of diagnosis and treatment. In recent years, more and more semiautomatic or fully automatic computer-aided diagnosis (CAD) systems, based on classical image processing techniques or advanced machine learning paradigms such as classical machine learning workflows and CNNs, have been introduced into the diagnosis and treatment of skin diseases as screening procedures or rapid diagnosis tools to assist dermatologists [1, 4, 14]. In addition, classification quality can be improved by adding clinical data (such as age, sex, race, skin type, and anatomical location) as input to the classifier, and this additional information also helps dermatologists make the right decisions [15]. Perez et al. confirmed that data augmentation can lead to larger performance gains than obtaining new images, based on their study of the impact of 13 data augmentation scenarios, such as traditional color and geometric transforms, elastic transforms, random erasing, and lesion mixing, for melanoma classification [16]. Although these methods improve diagnostic accuracy, they all require a significant amount of computation, and their practical application value still needs to be improved.

In this study, we propose a simplified solution and have evaluated our proposed method on both the ISIC 2018 and ISIC 2019 test sets (Figure 2). The final predicted probability is obtained by 10 rounds of TTA, which cost approximately 0.5 seconds on a GTX 1080 Ti; a sketch of this procedure is shown below. There is a live leaderboard recording the performance of submitted results [2]. The balanced multiclass accuracy (BMCA) is used as the primary metric, as shown in Tables 1 and 2. Our results show that a network with additional patient information as input achieves better classification performance. On the ISIC 2018 skin lesion classification challenge test set, our algorithm yields a balanced multiclass accuracy of 88.7% for a single model and 89.5% for the embedding solution, which makes it the currently first-ranked algorithm on the live leaderboard and highlights the excellent performance of our proposed solution on this very challenging task (Table 1). Similarly, we also observed the superiority of our proposed method on the ISIC 2019 test set (Table 2). We also observed that the weighted cross-entropy criterion achieves a higher BMCA than the focal loss or label smoothing loss criteria.
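A minimal sketch of the TTA prediction step, assuming a model that takes an image and a metadata tensor (as in Section 2.4) and an albumentations-style augment callable; names and preprocessing details are illustrative.

```python
import torch

@torch.no_grad()
def predict_tta(model, image, meta, augment, n_rounds: int = 10):
    """Average softmax probabilities over independently augmented copies."""
    model.eval()
    probs = 0.0
    for _ in range(n_rounds):
        aug = augment(image=image)["image"]     # random crop / flip / color jitter
        x = torch.from_numpy(aug).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        probs = probs + torch.softmax(model(x, meta), dim=1)
    return probs / n_rounds
```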

Grad-CAM is an efficient method to generate a heat map visualizing the hot zones for classification, which is quite helpful for assisting clinicians during diagnosis [17]. To improve the inference accuracy, test-time augmentation (TTA) is also applied.

In this paper, we proposed an integrated solution for Grad-CAM and TTA with multicrop. It is implemented by accumulating the heat maps over the multiple inference passes performed during TTA. As random crop is included in TTA, the final heat map is accumulated over different crop regions at different resolutions. Since TTA is an efficient way to improve the final prediction metrics, we believe the weighted heat map generated by this TTA-Grad-CAM operation will also benefit clinicians during diagnosis.
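A minimal sketch of this accumulation is shown below; grad_cam is assumed to be any off-the-shelf Grad-CAM routine returning a heat map in [0, 1] at the crop resolution, and the crop bookkeeping is simplified for illustration.

```python
import numpy as np
import cv2

def tta_grad_cam(model, image, crops, grad_cam, cls, size=380):
    """Accumulate Grad-CAM maps over the TTA crops of a full-resolution image."""
    h, w = image.shape[:2]
    heat = np.zeros((h, w), dtype=np.float32)
    hits = np.zeros((h, w), dtype=np.float32)
    for (y0, x0, ch, cw) in crops:                # crop regions used by TTA
        crop = cv2.resize(image[y0:y0 + ch, x0:x0 + cw], (size, size))
        cam = grad_cam(model, crop, cls)          # heat map in crop coordinates
        cam = cv2.resize(cam, (cw, ch))           # project back to original scale
        heat[y0:y0 + ch, x0:x0 + cw] += cam
        hits[y0:y0 + ch, x0:x0 + cw] += 1.0
    heat /= np.maximum(hits, 1.0)                 # average overlapping crops
    return heat / max(heat.max(), 1e-8)           # normalize to [0, 1]
```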

An example can be seen in Figure 3. The color overlay is the heat map generated by applying Grad-CAM with TTA using the trained model. The redder a region is, the more likely the neural network considers that area indicative of the diagnosed disease type. Therefore, TTA and Grad-CAM can be integrated in heat map generation, which can be very helpful in assisting clinicians with diagnosis.

We also tested the semisupervised scheme described in [18], which utilizes additional unlabeled test images during training. However, we found that the overall accuracy did not improve. We also tested other advanced augmentation methods such as Cutmix [19], as well as the attention-based method WS-DAN [20], which has achieved the best results on fine-grained image classification tasks, but the performance did not improve either.

4. Conclusion

In this study, we have proposed a single-model baseline for skin lesion classification which uses information from the data augmentation process as additional metadata. The metadata used in our work includes additional information generated during data augmentation, for example, the gains of the color normalization process and the random crop and image size properties. Our method achieved the best result on the ISIC live leaderboard, with a balanced multiclass accuracy of 88.7% for a single model and 89.5% for the embedding solution, making it the currently first-ranked algorithm. In addition, it is practical for real applications because of its low computational complexity.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare no competing interests.

Authors’ Contributions

All authors substantially contributed to the manuscript. Qilin Sun and Chao Huang designed the study, performed the literature review, extracted the data, drew the figures, and organized the tables. Minjie Chen helped to revise the manuscript. Hui Xu and Yali Yang reviewed and edited the manuscript. All authors read and approved the final manuscript. Qilin Sun and Chao Huang contributed equally to this study and share co-first authorship. Hui Xu and Yali Yang share co-corresponding authorship (ORCID: Yali Yang, 0000-0001-8301-9943).

Acknowledgments

This study was supported by the Natural Science Foundation of Shanghai Science and Technology Commission (20ZR1432100), the Industry Support Foundation of Huangpu District, Shanghai, Big Data Construction and Application of Artificial Intelligence “Skin Diagnosis and Treatment” (XK2020007), and the Science Popularization Project of Shanghai Science and Technology Commission (19DZ2305600).