Abstract

Because of the high mortality rate, rising medical costs, and the ongoing global growth in the incidence of this malignancy, early detection of skin cancer has become a top priority. Early detection and treatment of melanoma are critically important, since they dramatically raise the likelihood of a positive outcome. To address this issue, this work develops a prototype image analysis system based on deep learning that determines whether a lesion is malignant or benign from dermatoscopy image databases. Relatively simple pretrained convolutional networks were employed, both to understand their architectures better and to train on the given dataset more quickly. Using convolutional neural networks as the basis, this research develops a deep learning system capable of classifying images. The model is trained with the pretrained AlexNet, VGG, and ResNet networks using the transfer learning methodology, whose architectures are outlined so that they may subsequently be adapted to our data. Other networks have much more complex structures, or the same architectures can be used with many more layers; for future work, it is proposed to use, for example, ResNet-152, VGG-19, or other networks such as DenseNet or Inception.

1. Introduction

Today, deep learning is on the rise, to the point of being applicable to almost any imaginable field, including medicine, for example, for computer-aided diagnosis, for making predictions from datasets, for processing medical images, or for distinguishing between malignant and benign tumors [1]. The motivation of this work is precisely this type of application, since it makes it possible to improve diagnoses and thus facilitate the work of experts. Specifically, this work focuses on the early detection of different types of skin cancer, which is among the most common cancers today and is produced by the development of cancerous cells in any layer of the skin [2].

Deep learning refers to a set of artificial intelligence (AI) techniques, within machine learning, that teaches computers to do what comes naturally to people: learn through examples. Deep learning models are trained using large sets of labeled data and neural network (NN) architectures formed by different layers, obtaining results with a precision that can even surpass human performance [1]. These layers are organized in a hierarchy of increasing complexity and abstraction: the initial level of the network learns simple concepts; the next level takes this simple information and combines it into slightly more complex information, which is passed to the third level, and so on. Neural networks are inspired by biological nervous systems and are therefore made up of connected units called neurons or nodes [3]. Neurons receive signals (inputs) from other neurons and, depending on the signals received, a neuron in turn sends a signal to other neurons. We will focus on convolutional neural networks (CNN), a type of deep neural network [4].

In many cases, the different classes of skin cancer are amenable to early detection through visual inspection by experts. The challenge of this work is therefore to develop an automatic tool for the classification of skin cancer based on imaging. This classifier will be tested with different images of skin cancer, having been previously trained, and its operation will be verified using different parameters and neural networks for training [5]. In 2021, a group of researchers performed skin cancer detection through deep learning [5, 6]; since working with a large number of images calls for convolutional neural networks, they used the InceptionV3 architecture proposed by Google, trained the network, and evaluated its effectiveness on a small labeled dataset, obtaining 86.90% accuracy, 87.47% precision, 86.14% sensitivity, and 86.90% specificity. The precision achieved in that work stems from the use of proven, efficient architectures, at the cost of greater use of computational resources, unlike the architecture created in this work. Noortaz (2020) sought to develop a method for recognizing skin lesions and thus identifying malignant lesions in nondermatoscopic images; convolutional neural networks and autoencoders were used for classification. In their experiments, they were able to replicate results obtained with conventional procedures, reaching 90% specificity and 61% sensitivity. The objective of this work is the study and development of different deep learning techniques to build a solution capable of solving this problem. The ultimate goal is to implement a solution capable of classifying, as accurately as possible, the images provided by the competition into the seven different types of skin cancer.

2. Methodology

2.1. Transfer Learning

Transfer learning is a deep learning technique that allows us to take advantage of an existing convolutional neural network that has already been trained for a task similar to the one we will tackle, modifying it so that it can be trained with our data. Instead of training a network from scratch, the job is simplified to initializing the network with pretrained weights or fine-tuning only the last layer of the pretrained network to suit our problem. This technique is beneficial when we do not have enough data to train a neural network from scratch in a new domain (as in this case) and there is an extensive pre-existing dataset whose knowledge can be transferred to the problem. Two main transfer learning scenarios can be found [7]:

2.1.1. CNN Fine-Tuning

Instead of random initialization, the network is initialized with a pretrained network, and the rest of the training is done in the standard way (Figure 1).

2.1.2. CNN as a Fixed Feature Extractor

In this case, the weights of the entire network are frozen, except for the final fully connected layer, which is replaced by a new one with random weights; only this layer is trained (Figure 1).
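
A minimal PyTorch sketch of the two scenarios, assuming AlexNet as the base network and the 7 output classes used later in this work:

import torch.nn as nn
import torchvision.models as models

# Scenario 1: fine-tuning. Initialize with pretrained weights and train everything.
model_ft = models.alexnet(pretrained=True)
model_ft.classifier[6] = nn.Linear(4096, 7)   # adapt the output to 7 classes

# Scenario 2: fixed feature extractor. Freeze all pretrained weights...
model_fe = models.alexnet(pretrained=True)
for param in model_fe.parameters():
    param.requires_grad = False
# ...and replace the final fully connected layer; only this new layer is trained.
model_fe.classifier[6] = nn.Linear(4096, 7)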

2.1.3. PyTorch

PyTorch is designed to be integrated into Python and to use all of its libraries and packages. It also has support for running on the GPU, internally using CUDA, an API that connects the CPU with the GPU and accelerates traditionally lengthy processes such as model training. This framework has been chosen to develop the image classification work due to all the functionalities and advantages discussed above.
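
In practice, this GPU/CPU selection reduces to a standard PyTorch idiom:

import torch

# Use the GPU (through CUDA) when available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")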

2.2. Base Algorithm and Development

The algorithm implemented with PyTorch is based on transfer learning techniques for training on the data. First, the necessary libraries are imported (torch, torchvision, NumPy, etc.). Parameters and hyperparameters are declared, such as the number of epochs, the minibatch size, and the folder in which the data is located. Subsequently, the dataset available to train the model is loaded; in this case, we have 10,000 images, divided into a training set and a validation set. Once the data is loaded, transformations are applied, resizing the images to the input size required by the neural network to be used. Data augmentation transformations will also be applied to the training set to study their effect on the classification results. The algorithm is then configured to use the GPU if it is available; otherwise, the CPU will be used. A pretrained neural network is loaded and trained by applying one of the two transfer learning techniques. The training of the model is defined in a loop that trains epoch by epoch, calculating the loss and the accuracy in each of them, both for the training set and for the validation set, as well as the duration of the training, which is saved for later use. Once the training results have been obtained, the metrics of interest are represented graphically.
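
A condensed sketch of the epoch-by-epoch loop described above (the function layout and history structure are illustrative, not the exact code used):

import time
import torch

def train_model(model, criterion, optimizer, train_loader, val_loader, device, epochs=20):
    # Trains epoch by epoch, recording loss/accuracy for both sets and the total duration.
    history = {"train": [], "val": []}
    start = time.time()
    for epoch in range(epochs):
        for phase, loader in [("train", train_loader), ("val", val_loader)]:
            model.train() if phase == "train" else model.eval()
            total_loss, correct, seen = 0.0, 0, 0
            for images, labels in loader:
                images, labels = images.to(device), labels.to(device)
                with torch.set_grad_enabled(phase == "train"):
                    outputs = model(images)
                    loss = criterion(outputs, labels)
                    if phase == "train":
                        optimizer.zero_grad()
                        loss.backward()
                        optimizer.step()
                total_loss += loss.item() * images.size(0)
                correct += (outputs.argmax(1) == labels).sum().item()
                seen += images.size(0)
            history[phase].append((total_loss / seen, correct / seen))
    history["duration_s"] = time.time() - start
    return history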

2.3. Networks Used
2.3.1. AlexNet [8]

The initial layer is the one that receives the input images, which are normalized so that their dimension is H × W × D, corresponding to the height H, the width W, and the depth D, that is, the channels that represent the RGB values (red, green, and blue). The goal of the five convolutional layers is feature extraction. The first two include a pooling stage with overlap, which reduces the dimensions of the network output so that the depth of the next convolutional layer can be increased. More complex features can be captured this way, although part of the spatial information is lost.
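
For reference, the pretrained network and the layer structure just described can be inspected directly in torchvision (a small illustrative snippet):

import torchvision.models as models

# Load AlexNet with ImageNet-pretrained weights.
alexnet = models.alexnet(pretrained=True)

# `features` holds the five convolutional layers (with ReLU and max pooling);
# `classifier` holds the fully connected layers.
print(alexnet.features)
print(alexnet.classifier)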

2.3.2. VGG [9]

This network is characterized by its simplicity, since it uses only convolutional layers stacked one on top of the other in increasing depth, with volume size reduction achieved through max pooling. VGG-16 means that 16 weight layers are used in the network. In terms of architecture, the input to the first layer should be a 224 × 224 RGB still image. The image is passed through a series of convolutional layers with 3 × 3 filters; in one of the configurations, a 1 × 1 filter is also used, which can be seen as a linear transformation of the input channels. Blocks of these convolutional layers are followed by max pooling, so that the spatial dimensions of the output are progressively reduced, and all hidden layers use the ReLU activation function. The network ends with three fully connected layers, the first two with 4096 neurons and the third with 1000, equivalent to the classes for which it was pretrained. Finally, the softmax layer is in charge of assigning probabilities to the different classes.
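
As an illustration of this pattern of stacked 3 × 3 convolutions followed by max pooling, a hypothetical helper (vgg_block, with illustrative arguments) could be sketched as:

import torch.nn as nn

# One VGG-style block: n_convs stacked 3x3 convolutions (each followed by ReLU),
# then a max pooling layer that halves the spatial dimensions.
def vgg_block(in_ch, out_ch, n_convs):
    layers = []
    for _ in range(n_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)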

2.3.3. ResNet [10]

Regarding its architecture, the input image to the first layer must be RGB, and it is introduced in the first convolutional layer of the network (Conv-1). Next, there are four convolutional stages (Conv-2, Conv-3, Conv-4, and Conv-5) built from residual blocks, which can be seen in more detail in [11-14]. Subsequently, an average pooling layer reduces the output of the previous layer. Finally, a fully connected layer (FC-6) maps to the total number of classes to learn (in this case, 1000, as pretrained). A softmax activation function is applied to generate the final output probabilities and decide the classification.
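
A minimal sketch of adapting this network with torchvision (the 7-class output anticipates the adaptation described in Section 3.3):

import torch.nn as nn
import torchvision.models as models

resnet = models.resnet18(pretrained=True)
# The final fully connected layer originally maps to the 1000 ImageNet classes;
# here it is replaced to classify the 7 lesion types.
resnet.fc = nn.Linear(resnet.fc.in_features, 7)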

3. Results and Discussion

3.1. Assessment Framework
3.1.1. Dataset

The dataset used is the one provided by Human Against Machine with 10000 training images (HAM10000) [15-19], which contains 10,000 images of skin lesions corresponding to skin cancer, as well as a CSV file that includes the ground truth for each image according to the type of skin cancer (Figure 2). All images are in JPG format, with a size of 600 × 450 pixels, which will be adjusted to serve as input to each network used. On the other hand, since there is a single set of 10,000 images labeled with the corresponding classes to classify, it is divided into two parts: a training set (train) and a validation set (val). This division has been carried out by selecting the percentage corresponding to each required set; in this case, divisions have been made for the following sets (a minimal splitting sketch follows the list):
(i) 80% train and 20% validation
(ii) 70% train and 30% validation
(iii) 55% train and 45% validation
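
One way to produce these splits in PyTorch (the dataset path is a hypothetical placeholder):

import torch
from torchvision import datasets, transforms

# Hypothetical folder with the 10,000 labeled images, one subfolder per class.
full_set = datasets.ImageFolder("data/HAM10000", transform=transforms.ToTensor())

# Example: the 80% train / 20% validation division; the other splits are analogous.
n_train = int(0.8 * len(full_set))
train_set, val_set = torch.utils.data.random_split(
    full_set, [n_train, len(full_set) - n_train])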

In this way, the results obtained from different combinations of the train and validation sets can be studied, which will allow us to correctly establish our classification model and adjust the hyperparameters (Table 1).

3.1.2. Metrics

To evaluate the algorithms, both the accuracy and the loss function will be taken into account. Accuracy is a metric for evaluating classification models that measures the ratio of well-classified samples to total samples, giving an idea of the overall performance of the system being evaluated. It takes values between 0 and 1, with 1 being 100% correct, so the objective is to maximize it as much as possible. The loss function takes the input pair (output and target) and calculates a value that estimates how far the output is from the target. There are different loss functions, for example, the mean square error (MSE). The loss serves as a measure of the error made; therefore, the objective is to minimize its value. For this, optimization algorithms such as gradient descent are used.
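
A minimal PyTorch version of this accuracy definition:

import torch

def accuracy(outputs, labels):
    # Ratio of well-classified samples to total samples (between 0 and 1).
    preds = outputs.argmax(dim=1)
    return (preds == labels).float().mean().item()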

The algorithm used in this case is stochastic gradient descent (SGD), an iterative method for optimizing an objective function. It is called stochastic because it uses randomly selected (or shuffled) samples to evaluate the gradients. This is how the weights of the network are updated.
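
A sketch of the corresponding setup in PyTorch (the learning rate shown anticipates the value tuned in Section 3.2):

import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

model = models.alexnet(pretrained=True)
criterion = nn.CrossEntropyLoss()                    # loss function to minimize
optimizer = optim.SGD(model.parameters(), lr=0.001)  # stochastic gradient descent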

3.2. Hyperparameter Tuning

A study and evaluation of the number of epochs, the minibatch size, and the learning rate will be carried out, so that fixed values considered optimal are established for subsequent training with the different networks and datasets. All hyperparameter tuning tests will be carried out on the pretrained AlexNet network.

To establish the number of epochs, since there is no universally correct value and it will be different for each dataset, a study of the dataset with respect to our model must be carried out, so that enough epochs are used to ensure network convergence.

Different tests have been carried out to determine a fixed number of epochs to be used in the different trainings, and it has been found that after 15 or 20 epochs, both the accuracy and the loss converge. In Figure 3, we can see an example run in which, from 15 to 20 epochs, the curves flatten into a continuous line. In Table 2, it can also be seen that the accuracy does not vary from 15 epochs onwards, as with the loss, which remains constant from 20 epochs.

Observing Table 3, we see that the best result was given by experiment 14, with a value of 0.6960 (bold); the second best by experiment 8, with a value of 0.6955 (italic); and the third by experiment 20, with a value of 0.6855 (bold italic).

Therefore, we can conclude that the learning rate that gives the best results is 0.001, since it was used in all three of these configurations.

On the other hand, we can see from Table 4 that, in general, better results are obtained when training is applied only to the last layer. This may be because, when training a network with all the pretrained layers unfrozen, the weights they previously had may not adjust correctly to our data, so the training is not as effective as training the last layer from scratch. Regarding the minibatch size, we will keep 64, since it is one of the values that obtains the best results.

3.3. Results on Pretrained Networks

Once the minibatch size (64) and learning rate (0.001) have been set, as well as the transfer learning technique, which consists of freezing the weights of all the layers and training only the last one, we proceed to run the experiments on the different pretrained networks (AlexNet, VGG, and ResNet) and to apply data augmentation. Since the last layer in the architecture of these three networks is fully connected, this is the one that is trained while the rest are frozen. In addition, the outputs have been adapted to classify into 7 different classes instead of the 1000 for which they were initially trained. The data augmentation transformations chosen are horizontal flip, vertical flip, and changes to brightness, contrast, and saturation (color jitter), since these were the ones that stood out most in the previous study of the competition results.
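
A sketch of these transformations with torchvision (the jitter magnitudes are assumptions, since the exact values are not stated here):

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),   # input size expected by the networks
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # assumed values
    transforms.ToTensor(),
])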

The following experiments are carried out over 20 epochs since it was previously verified that from this point on they had already converged.

3.3.1. Results on AlexNet

Once different experiments have been carried out using AlexNet as the base network, we can see in Table 5 that, with the same data augmentation transformations, the accuracy on the validation data improves in all cases as the proportion of validation data relative to training data increases. Furthermore, when applying the vertical and horizontal flip transformations, we also see some improvement over the experiments in which no data augmentation was added. However, using color jitter, which changes the images' brightness, contrast, and saturation, gives quite similar, and even worse, results compared to the first tests in which no transformation was added.

Therefore, we can conclude that the best results using the AlexNet network are obtained with a dataset split of 55% training and 45% validation, either without applying data augmentation transformations or with only horizontal and vertical flips.

3.3.2. Results on VGG

Observing Table 6, with the tests carried out on the VGG-16 neural network, we reach conclusions similar to those for the AlexNet network regarding the dataset proportions, since validation accuracy improves when using 55% training and 45% validation compared to the other two proportions tested. Regarding data augmentation transformations, a slight increase in accuracy can be seen when using horizontal and vertical flips compared to tests in which no transformation is performed; however, in this case as well, applying color jitter worsens the result.

3.3.3. Results on ResNet

Finally, 12 further experiments have been done on the pretrained ResNet-18 network. The best accuracies are obtained with 55% training data and 45% validation data. The best result occurs in the experiment in which both horizontal and vertical flip transformations are applied, which is not too far from the second-best result, in which only horizontal flip is applied as data augmentation. Therefore, we can conclude that, in the case of ResNet-18, applying flip transformations does improve the results. On the other hand, when using color jitter, we get worse validation accuracy.

Comparing the results obtained in the experiments carried out on the different pretrained networks chosen (AlexNet, VGG-16, and ResNet-18), we can conclude several things:
(i) The model works best with the 55% train and 45% validation ratio. This may be because overfitting is taking place; that is, the model learns the training images "by memory" and therefore obtains worse results when its learning is applied to the validation set, since those images differ from what has been learned.
(ii) In general, the applied data augmentation transformations are not working well. It can be seen that applying color jitter, that is, changes to brightness, contrast, and saturation, worsens the results.
(iii) The best results have been obtained with the AlexNet network. As expected, ResNet achieved better results than VGG.
(iv) The best overall result (validation accuracy of 0.7411) was obtained with the AlexNet network, without applying data augmentation and with the dataset divided into 55% train and 45% validation.

4. Conclusions and Future Work

4.1. Conclusions

The transfer learning technique has made it possible to optimize the training of the model and thus study the behavior of the three selected pretrained networks: AlexNet, VGG, and ResNet. The results obtained reach a success rate of close to 75%. Only the training dataset labeled with its ground truth was available, which was divided into two sets, one for training and the other for validation, so the effective training set was even smaller. In this work, and in the competition in general, the percentages obtained seem insufficient for a problem as delicate as the determination of a specific type of cancer, where the diagnosis should be made with the maximum possible precision. Nevertheless, with this class of deep learning tools, a robust and functional system can be offered for the diagnosis of this type of disease, which will help specialist doctors when making final decisions, providing them with more precision and speed in their diagnoses.

4.2. Future Work

A possible improvement to the present work would be to introduce more layers at the end, before the softmax classification, so that the number of neurons is reduced gradually from layer to layer instead of going, for example, from 4096 to 7 in a single last layer, as happens with AlexNet, in addition to adding dropout in each layer to avoid overfitting. Another possible line of future work, to see how the data is behaving, would be to calculate the accuracy separately for each of the classes, in order to study which ones fail and which ones work correctly.
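
Both ideas can be sketched as follows (the intermediate layer sizes are purely illustrative choices, not tested configurations):

import torch
import torch.nn as nn

# A tapered classification head (4096 -> 512 -> 64 -> 7) with dropout, instead of a
# single 4096 -> 7 layer as in AlexNet's adapted last layer.
head = nn.Sequential(
    nn.Linear(4096, 512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, 64), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(64, 7),
)

def per_class_accuracy(preds, labels, n_classes=7):
    # Accuracy computed separately for each class, to see which classes fail.
    accs = []
    for c in range(n_classes):
        mask = labels == c
        accs.append((preds[mask] == c).float().mean().item() if mask.any() else float("nan"))
    return accs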

Data Availability

The data underlying the results presented in the study are available within the manuscript.

Conflicts of Interest

The authors declare that they have no conflicts of interest.