Abstract
Localization of the region of interest (ROI) is paramount in the analysis of medical images to assist in the identification and detection of diseases. In this research, we explore the application of a deep learning approach to the analysis of medical images. Traditional methods have been limited by the coarse and granulated appearance of most of these images. Recently, deep learning techniques have produced promising results in the segmentation of medical images for the diagnosis of diseases. This research experiments on medical images using a robust deep learning architecture based on the Fully Convolutional Network (FCN) UNET method for the segmentation of three types of medical images: skin lesion images, retinal images, and brain Magnetic Resonance Imaging (MRI) images. The proposed method can efficiently identify the ROI in these images to assist in the diagnosis of diseases such as skin cancer, eye defects and diabetes, and brain tumors. The system was evaluated on publicly available databases, including the International Symposium on Biomedical Imaging (ISBI) skin lesion images, retina images, and brain tumor datasets, achieving over 90% accuracy and dice coefficient.
1. Introduction
Segmentation is the key process of identifying the ROI of a diseased region to assist in the diagnosis of diseases. It is very important in medical imaging, where localization is paramount to the analysis of scans. Segmentation assigns each pixel to the part of the image it belongs to, producing an output label for every pixel. Recent advances in machine learning have led to the development of deep learning techniques in the field of medical image analysis for the diagnosis of various diseases [1]. Diseases such as brain tumors, diabetic retinopathy, skin cancer, and liver tumors have been successfully diagnosed through the analysis of MRI scans, retinal vessel images, skin lesion images, and liver tumor scans, respectively, using deep learning techniques [1]. Existing techniques for analyzing these images, such as handcrafted methods, have been limited by their time consumption and by the coarse and granulated appearance of most of these images [2].
The application of deep learning techniques to medical image analysis and segmentation has produced promising results in recent times. These approaches are, however, constrained by the scarcity of accessible labeled datasets for training deep learning models to effective performance [3]. This study proposes a robust and efficient deep learning framework for the segmentation of medical images towards disease discovery and prognosis with limited training data. In this work, three different medical image datasets, covering retina images, brain tumors, and skin lesions, have been explored to assess the performance of the deep learning framework.
Automated techniques based on traditional machine learning have been developed in the past for the imaging and analysis of medical images towards the diagnosis of diseases. These techniques have been limited in performance by the complex visual appearance of the images. For example, difficulties have been experienced in the analysis of the nerve fiber layer of the optic disc and the surrounding retina [4]. A swollen optic disc may indicate conditions such as malignant hypertension and diabetic retinopathy. The macula is a circular region about 5.5 mm in diameter, centered approximately 17 degrees (4.0 to 5.0 mm) temporal to and 0.53–0.8 mm below the middle of the optic disc [4]. Any variation in the location or size of the macula from its normal form can be identified by an automated system.
The proposed system efficiently analyzes retina images and identifies the optic disc ROI. The system also analyzes skin lesion images, identifying and differentiating melanoma ROIs from non-melanoma regions. Lastly, the proposed system analyzes brain MRI images to identify ROIs containing tumors and to separate images with tumors from images without tumors.
The system utilizes a Fully Convolutional Network trained in an end-to-end manner directly from images, using only pixels and disease labels as inputs. Datasets containing both the training images and the corresponding labels for the three categories of diseases examined in this research were employed to train the model. Each pixel is classified as either belonging to a diseased region or not. Finally, performance metrics such as the dice coefficient and accuracy were used to evaluate the model.
2. Review of Related Works
Semantic segmentation has been utilized for the pixel-by-pixel categorization of medical images, including brain MRI images, dental images, and breast and liver lesion images.
Khagi and Kwon [5] applied a deep neural network to the classification of MRI images by grouping pixels into particular classes and assigning a description to every pixel. According to them, MRI images reveal the actual substance of the brain, which comprises three key constituents: white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF). The developed system was capable of categorizing the brain into the WM, GM, and CSF classes. Yauney et al. [6] developed a dental categorizer based on two convolutional neural networks trained on dental image datasets for disease detection.
A pixel-wise deep learning approach employing variants of Fully Convolutional Networks (FCNs), such as FCN-AlexNet, FCN-32s, FCN-16s, and FCN-8s, has been used for the semantic segmentation of breast lesions [7]. The system used pretrained ImageNet-based models and transfer learning to address data scarcity and was able to classify lesions into two categories, benign and malignant.
Bellver et al. [8] suggested a strategy for segmenting the liver and its lesions from CT scans using Convolutional Neural Networks (CNNs). They trained a detector to localize the lesions so that the segmentation network operates only on positive detections, thereby eliminating false positives. The segmentation architecture was based on the Fully Convolutional Network (FCN) architecture of Deep Retinal Image Understanding (DRIU). The system operates on feature maps of diverse resolutions, allowing multiscale information to be processed and learned at different network stages. A UNET-based deep neural network called RIC-UNET was also proposed for the nuclei segmentation of cellular images. The system utilized a residual inception channel-attention UNET and was tested on The Cancer Genome Atlas (TCGA) dataset [9].
A deep class-specific learning approach was also proposed [10] for the automatic segmentation of skin lesions. The technique learns individually the essential visual features of each class of skin lesions (melanoma vs. non-melanoma). Probability-based, stepwise integration was used to combine the segmentation results derived from the distinct class-specific learned models. A segmentation technique using Full-Resolution Convolutional Networks (FrCN) was also applied to skin lesions [11]. The FrCN approach learns the full-resolution features of each individual pixel directly, without pre- or postprocessing procedures such as artifact removal or low-contrast adjustment, thereby improving the detection of skin lesion borders. The framework was tested on two skin lesion datasets, namely, the IEEE International Symposium on Biomedical Imaging (ISBI) 2017 Challenge and PH2 datasets [11].
Finally, a model was proposed for the segmentation of retinal vessel images based on a Deep Convolutional Encoder–Decoder Architecture [12]. The method consists of encoder and decoder units. The system takes a low-resolution retina image, which is processed by a series of convolution layers in the encoder section before being passed to the decoder section, which produces the final segmented output [12].
3. Deep Learning Methodology
The proposed method processes diagnostic images to examine and diagnose disease. It adopts a supervised learning approach that takes as input both the training datasets and the ground truth labels. The entire training phase is carried out pixel-wise, with each pixel of a training image paired with the corresponding pixel of the ground truth label. The preprocessing stage is the first part of the system. It performs image cropping, resizing, and resampling to guarantee that the training images and ground truth labels share the same resolution and size. The input images are then sent to a Fully Convolutional Network for end-to-end learning with a dice loss function. The FCN-UNET network adopts a multistage methodology. Figures 1 and 2 show the architectural representation of a deep convolutional network and the adopted structure, respectively. The entire system can be divided into the following components.


3.1. Data Preprocessing
The datasets of clinical images used in this study include images of retina vessels, skin lesions, and brain MRI. These images were first preprocessed to resolve differences in size, scale, and resolution. Tasks such as cropping, resizing, and resampling were performed on the images before they were sent to the FCN-UNET network. A small image dimension of 160 × 224 was used in this work, as this determines the dimensions of the input feature maps. The images were also normalized by computing the mean and standard deviation of their pixel intensity values. On-the-fly data augmentation was implemented to increase the number of training samples.
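As an illustration, a minimal preprocessing sketch in Python (using OpenCV and NumPy; the function name, mask binarization threshold, and interpolation choices are illustrative assumptions rather than details reported in the paper) might look as follows:

```python
import numpy as np
import cv2  # OpenCV for image I/O and resizing

TARGET_H, TARGET_W = 160, 224  # input dimensions used in this work

def preprocess(image_path, mask_path):
    """Resize an image/mask pair to a common resolution and normalize
    the image by its pixel-intensity mean and standard deviation."""
    image = cv2.imread(image_path, cv2.IMREAD_COLOR).astype(np.float32)
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)

    # Resampling: bilinear for the image, nearest-neighbor for the label
    # so that ground truth classes are not blended at the boundaries.
    image = cv2.resize(image, (TARGET_W, TARGET_H), interpolation=cv2.INTER_LINEAR)
    mask = cv2.resize(mask, (TARGET_W, TARGET_H), interpolation=cv2.INTER_NEAREST)

    # Normalization using the mean and standard deviation of intensities.
    image = (image - image.mean()) / (image.std() + 1e-8)

    # Binarize the mask into lesion / non-lesion classes (threshold assumed).
    mask = (mask > 127).astype(np.float32)
    return image, mask
```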
3.2. Network Architecture
The FCN-UNET network utilizes an encoder–decoder architecture for end-to-end training and learning from the clinical images and their respective ground truth labels [13]. In the initial stage, the network uses the encoder to learn the general visual characteristics of the clinical images pixel by pixel. In the later stage of the encoder–decoder architecture, the network learns spatial recovery details and captures the lesion border information of the images. The general architecture is explained and discussed below.
The first part of the network, the encoder, is made up of five blocks of layers, with each block comprising convolution layers, a ReLU activation function, and a pooling layer. The convolution layers perform feature extraction and generate feature maps from the input image. The ReLU activation function is a nonlinear function used to transform the feature maps.
It accepts the feature maps as input and transforms them so that the network can train and learn on them properly. The transformed output is then sent as input to the next level of convolution. The extracted feature maps are then classified pixel-wise for the final segmentation. This is illustrated in the equation below:
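In its standard form, the convolution layer computes each output feature map from the input maps, the learned kernels, and a bias term:

$$F_j^{(l)} = f\Big(\sum_{i} F_i^{(l-1)} * K_{i,j}^{(l)} + b_j^{(l)}\Big),$$

where $F_i^{(l-1)}$ are the input feature maps of layer $l-1$, $K_{i,j}^{(l)}$ are the convolution kernels, $b_j^{(l)}$ is the bias, $*$ denotes convolution, and $f$ is the activation function.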
The ReLU activation function uses the equation below:
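$$f(x) = \max(0, x),$$

which passes positive values unchanged and sets negative values to zero.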
The function of the pooling layer is to reduce the size and resolution of the extracted feature maps. This reduces complexity and the tendency to overfit and also decreases the processing time of the computation. The layer adopts the equation stated below:
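Assuming max pooling (the common choice in UNET-style encoders), the standard form is

$$y_{i,j} = \max_{(p,q)\,\in\,\mathcal{R}_{i,j}} x_{p,q},$$

where $\mathcal{R}_{i,j}$ is the pooling window associated with output position $(i, j)$.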
The second part of the FCN-UNET, the decoder, learns the spatial features of the data for recovery and boundary-positioning purposes. It restores the feature maps from the encoder stage to the original input size. The decoder section is also composed of five blocks of layers, with each block containing convolution layers, the ReLU activation function, and an upsampling layer. The upsampling layers perform spatial recovery and boundary positioning, while the convolution layers continue the extraction of features.
There is a short skip connection between the encoder section and the decoder section. The skip connection enables the output from the encoder section to be merged and concatenated with the output of the convolution layers in the decoder section. This helps the full restoration of the feature maps. The decoder's final output is sent to the Softmax classifier, which predicts the class of each pixel as illustrated in the equation below, where n represents the number of classes (here two) and the output is a two-channel probability image. The predicted segmentation therefore corresponds to the highest-probability category at each pixel.
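The standard pixel-wise softmax takes the form

$$p_k = \frac{e^{a_k}}{\sum_{j=1}^{n} e^{a_j}}, \quad k = 1, \ldots, n,$$

where $a_k$ is the network activation for class $k$ at a given pixel and $p_k$ is the resulting class probability.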
The network architecture is described in Figure 1.
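As a concrete illustration, the following Keras sketch builds an encoder–decoder of the kind described above, with five blocks per side, skip connections, and a pixel-wise softmax output. The filter counts, kernel sizes, and two-convolution blocks are assumptions for illustration only, since layer-level hyperparameters are not reported here.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    """Two 3x3 convolutions with ReLU activation (feature extraction)."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_fcn_unet(input_shape=(160, 224, 3), num_classes=2, base_filters=16):
    inputs = layers.Input(shape=input_shape)
    skips, x = [], inputs

    # Encoder: five blocks of convolution + ReLU + max pooling.
    for i in range(5):
        x = conv_block(x, base_filters * 2 ** i)
        skips.append(x)                    # saved for the skip connections
        x = layers.MaxPooling2D(2)(x)      # downsampling

    x = conv_block(x, base_filters * 2 ** 5)  # bottleneck

    # Decoder: five blocks of upsampling + skip concatenation + convolution.
    for i in reversed(range(5)):
        x = layers.UpSampling2D(2)(x)             # spatial recovery
        x = layers.Concatenate()([x, skips[i]])   # skip connection from encoder
        x = conv_block(x, base_filters * 2 ** i)

    # Pixel-wise softmax classification into num_classes channels.
    outputs = layers.Conv2D(num_classes, 1, activation="softmax")(x)
    return Model(inputs, outputs)
```

Note that the 160 × 224 input is divisible by 32, so five pooling/upsampling stages restore the original resolution exactly.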
3.3. Model Implementation and Training
The system generally consists of two major parts to achieve segmentation via pixel-wise classification, as shown in the general layout in Figure 2.
In the first section, the model is trained using the skin lesion training dataset. The Deep Convolutional Encoder–Decoder Network learns from the images pixel-wise in an end-to-end manner. The first convolution layer in the encoder section extracts feature maps and learns from them. Downsampling is performed by the pooling layers on the extracted feature maps to reduce their size and resolution. The result is then sent to the decoder in the second section through the skip connections, where the downsampled feature maps are restored by the upsampling layers to the original size and resolution. In the encoder section, the visual appearance details of the lesion are captured and learned, while the location information of the lesion borders is learned in the decoder section.
The downsampling and upsampling in the encoder and decoder sections efficiently carry out the process of feature learning and extraction. Finally, the feature maps are sent to a Softmax classifier for pixel-wise classification. The Softmax module employs equation (4) to perform the segmentation by classifying each pixel of the feature maps.
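A hedged sketch of how training might be wired up, reusing build_fcn_unet from the architecture sketch above; the optimizer, batch size, epoch count, and the randomly generated arrays are placeholders, not settings reported in this work:

```python
import numpy as np
import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1.0):
    """Dice loss = 1 - dice coefficient, computed over the batch (Section 5.3)."""
    intersection = tf.reduce_sum(y_true * y_pred)
    dsc = (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)
    return 1.0 - dsc

model = build_fcn_unet()  # from the architecture sketch above
model.compile(optimizer="adam", loss=dice_loss)

# Illustrative shapes only: images (N, 160, 224, 3), one-hot masks (N, 160, 224, 2).
x = np.random.rand(4, 160, 224, 3).astype("float32")
y = np.random.randint(0, 2, (4, 160, 224, 1))
y = np.concatenate([1 - y, y], axis=-1).astype("float32")
model.fit(x, y, batch_size=2, epochs=1)
```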
Figure 3 presents a flowchart that explains the architectural diagram.

4. Experiment and Analysis
4.1. Datasets
Three different medical image datasets are employed for the evaluation of the proposed system. These are described below.
4.1.1. Experiments on Skin Lesion Images
The ISBI 2018 dataset includes 2000 training images with expert-annotated ground truth. The images have a maximum resolution of 1022 × 767 pixels. The dataset was provided by the ISIC Dermoscopic Archive [15]. It also includes 600 test images with corresponding ground truth images. The input dataset consists of JPEG-format skin lesion images, while the ground truth consists of PNG-format mask images.
The ground truth labels, together with the performance evaluation metrics, are used to train the model and to evaluate it on the validation and test data.
4.1.2. Experiments on Retinal Images
The retina image dataset contains 87 training images with corresponding ground truth labels, together with 40 test images and their corresponding ground truth images [16]. Data augmentation was applied to increase the volume of the dataset.
4.1.3. Experiments on Brain MRI Images
The images used in this work were taken from the Brain MRI Images for Brain Tumor Detection dataset [17].
5. Evaluation Metrics
The dice similarity coefficient (DSC), accuracy, and dice loss function are the most common evaluation metrics used for segmentation performance. These metrics were used for model evaluation and are described below.
5.1. Dice Similarity Coefficient (Dice)
It calculates the degree of similarity (overlap) between the ground truth and the automatic segmentation. It is specified as shown in the following equation [18]:
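$$DSC = \frac{2\,|A \cap B|}{|A| + |B|} = \frac{2\,TP}{2\,TP + FP + FN},$$

where $A$ is the predicted segmentation mask and $B$ is the ground truth mask.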
5.2. Accuracy (Acc)
This calculates the proportion of true results (both true positives and true negatives) among the total number of cases examined, as shown in the following equation [19]:
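$$Acc = \frac{TP + TN}{TP + TN + FP + FN}.$$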
5.3. Loss Function (Dice Loss)
It uses the equation below [20]:
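$$L_{dice} = 1 - DSC = 1 - \frac{2\,|A \cap B|}{|A| + |B|}.$$

For completeness, a minimal NumPy implementation of these two metrics (illustrative only, not the authors' code):

```python
import numpy as np

def dice_coefficient(pred, truth, smooth=1e-8):
    """DSC = 2|A ∩ B| / (|A| + |B|) between binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return (2.0 * intersection + smooth) / (pred.sum() + truth.sum() + smooth)

def pixel_accuracy(pred, truth):
    """(TP + TN) / total: proportion of correctly classified pixels."""
    return float((pred == truth).mean())
```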
6. Results and Discussion
The deep learning system was experimented on datasets containing three sets of medical images. The results obtained are discussed below.
6.1. Skin Lesion Analysis Results
Figure 4 illustrates the pixel-wise analysis of a skin lesion image. The proposed method performs segmentation of skin lesion images via pixel-wise classification. The result in Figure 4 shows how the pixels of a sample skin lesion image are grouped into categories by the proposed method. Column 4 of the figure displays each image's confusion matrix. For the image shown, 4060 pixels were correctly categorized as malignant, 25135 pixels were correctly categorized as nonmalignant, 7 malignant pixels were categorized as nonmalignant, and 2639 nonmalignant pixels were classified as malignant. This gives more than 90 percent accuracy.
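These counts can be checked directly against the accuracy definition in Section 5.2:

$$Acc = \frac{4060 + 25135}{4060 + 25135 + 7 + 2639} = \frac{29195}{31841} \approx 0.917,$$

i.e., about 91.7% pixel accuracy, consistent with the reported figure.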

The final segmentation output is shown in Figure 5, where the segmented output produced by the proposed system is compared with the ground truth label.
In Figure 5, some sample sets of original images and the corresponding ground truth labels from the testing skin lesion image dataset are employed for the experimentation process. The segmentation result shows a very close similarity with the expected output in the ground truth labels.

Figure 6 shows the dice coefficient of the system on the skin lesion dataset, with a coefficient score of over 90 percent, as well as the training loss curves. The curves show that the loss decreases as the dice coefficient increases significantly. This clearly reflects the dice loss function adopted by the system.

Overall performance shows that a dice coefficient of more than 90 percent and a loss of less than 10 percent are achieved. The dice coefficient curve shows that the segmented output and the expected outcome, also known as the ground truth, are very close. It can also be inferred that the system works efficiently.
6.2. Retina Image Analysis Results
The deep learning approach correctly identifies and segments the optic disc in each retina image. The size and location of the optic disc are established for a suitable diagnosis. Figure 7 displays the predicted outcomes of the system for some sample original images and the equivalent ground truth labels from the retina image dataset. The results show a very close similarity to the expected output in the ground truth labels.

6.3. Brain MRI Analysis Results
The deep learning approach correctly identifies and segments the region of interest containing the brain tumor in each brain MRI image. The size and location of the ROI are established for a suitable diagnosis of brain cancer.
Figure 8 displays the predicted outcomes of the system for some sample original images and the equivalent ground truth labels from the brain MRI dataset. The results show a very close similarity to the expected output from the ground truth labels.

Figure 9 shows the classification output of the system for sample original images with and without tumors in the brain MRI dataset.

7. Comparison of the System Performance with the Existing Systems
Table 1 shows the segmentation and analysis of some medical images using deep learning approaches.
Table 1 shows that the proposed system, with 93% accuracy and a 90% dice coefficient, performs better than previous research that used deep learning methods on medical images. It also shows that the proposed model was tested on skin lesion, brain MRI, and retina images.
8. Conclusion
This research investigated the application of a deep learning approach to medical images. An enhanced FCN-UNET method has been proposed for medical image analysis. To diagnose diseases such as skin cancer, brain tumors, and retina-related diseases, the regions of interest of the diseased areas were first segmented and identified. The proposed system was tested on publicly available datasets. The performance was evaluated using metrics such as the dice coefficient and accuracy. Overall performance produced promising results, with more than 90% accuracy and dice coefficient scores. It can be inferred that the system works efficiently. In future work, it is recommended that images be further preprocessed using probabilistic and fuzzy approaches [25, 26]. This would further improve the general performance of the proposed model.
Data Availability
The Brain MRI Images for Brain Tumor Detection dataset used to support the findings of this study is available at https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection/.
Consent
Informed consent was obtained from all individual participants included in the study.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.