Abstract

In diagnosing kidney stone disease, clinical specialists often rely on medical imaging techniques such as computed tomography (CT) and ultrasound (US). Among imaging techniques, Direct Urinary System (DUSX) imaging is frequently chosen as the primary examination method in emergency services because of its low cost, accessibility, and low radiation levels. However, interpreting these images can be challenging for inexperienced specialists because of the low image quality and the presence of noise. In this study, we propose a computer-aided diagnosis system based on deep neural networks to assist clinical specialists in detecting kidney stones in DUSX images. First, in consultation with clinical specialists, we created a new dataset of 630 DUSX images and made it publicly available. We also defined preprocessing steps that incorporate image enhancement techniques, namely Gaussian filtering (GF), Laplacian of Gaussian filtering (LoG), bilateral filtering (BF), histogram equalization (HE), contrast-limited adaptive histogram equalization (CLAHE), and the combination of BF and CLAHE (CBC), so that deep neural networks can perceive the images more clearly. With these techniques, we reduced the noise in the DUSX images and improved their poor quality, especially in terms of contrast. For each preprocessing step, we trained kidney stone detection models using the YOLOv4 and Mask R-CNN architectures, which are common CNN-based object detectors, and examined the effect of the preprocessing step on these models. To the best of our knowledge, the combination of BF and CLAHE, called CBC in this study, has not previously been applied in the literature to enhance DUSX images. In addition, this study is the first in its field in which the YOLOv4 and Mask R-CNN architectures have been used for the detection of kidney stones. The experimental results show that the most accurate method is the YOLOv4 model with the CBC preprocessing step. On the test set, this model achieved an accuracy, precision, recall, and F1-score of 96.1%, 99.3%, 96.5%, and 97.9%, respectively. Based on these performance metrics, we expect that the proposed model will help to reduce the unnecessary radiation exposure and associated medical costs that come with CT scans.

1. Introduction

The human excretory system consists of the kidneys, ureters, and bladder, and it plays a crucial role in human health. The kidneys filter toxic materials, particularly urea, from the blood, and the system ensures that they are eliminated from the body via the bladder [1]. Crystallized structures called “kidney stones” may form in the kidneys while they perform this filtering function. Kidney stones are one of the most common ailments affecting the kidney and urinary system and arise from complications in the kidney’s internal mechanism. Figure 1 shows an example of kidney stones in the urinary tract and kidneys.

Small kidney stones are passed in the urine without having any impact on the body. In contrast, as the diameter of the stones grows, they cause symptoms such as bloody urine, nausea, painful urination, and severe lower abdominal or back pain. Patients suffer unbearable pain when kidney stones leave the kidney and enter the urinary tract. The longer these stones go undetected, the more the quality of life worsens, kidney function deteriorates, and human life is endangered. Therefore, diagnosing kidney stones at an early stage is important for the treatment process [3, 4]. Many patients with kidney stone disease present to hospitals with various clinical manifestations such as fever, severe pain in the lower back and sides, and blood in the urine [5]. In some cases, the disease is confused with conditions such as appendicitis, cholecystitis, ovarian torsion, and mesenteric ischemia [6]. Because of this wide differential diagnosis and the heavy workload of physicians in emergency evaluation, physicians may misdiagnose kidney stones or overlook them in patients presenting with milder symptoms. Therefore, physicians request additional imaging, such as computed tomography (CT), which involves intense radiation, to reach a diagnosis [7]. The analysis and interpretation of these medical images are performed manually and subjectively by physicians. Working under time pressure, physicians may misinterpret medical images because of fatigue and the poor contrast of the images. According to statistics, the human-induced misdiagnosis rate can reach 10–30% in medical image analysis [8]. To minimize the misdiagnosis problem, computer-aided diagnostic systems have been proposed as practical approaches that can help physicians make a diagnosis. Hence, numerous artificial intelligence approaches, including machine learning and deep learning models, have been widely used to increase diagnostic accuracy in medical image analysis [9, 10]. Deep learning models, particularly convolutional neural networks (CNNs), have recently become popular in medical image processing because they can extract high-level features from objects once the training phase is completed [11].

In this study, a computer-aided diagnosis system is proposed to help physicians by automating kidney stone detection in DUSX images using CNN architectures. DUSX images were used because of their widespread use, their low radiation dose compared to other imaging techniques, and the availability of the imaging devices even in the simplest medical clinics. Despite the widespread use of this imaging technique, we did not find a DUSX dataset in the literature. Our dataset was approved by the Ataturk University Faculty of Medicine Clinical Research Ethics Committee and is publicly available for scientific studies. We thereby contribute to the literature by publishing a new DUSX dataset collected at Ataturk University Research Hospital in Erzurum, Turkey. Moreover, we investigated the effect of six image enhancement techniques (Gaussian Filtering (GF), Laplacian of Gaussian Filtering (LoG), Bilateral Filtering (BF), Histogram Equalization (HE), Contrast-Limited Adaptive Histogram Equalization (CLAHE), and Combination of BF and CLAHE (CBC)) on the accuracy of CNN models for automated kidney stone detection in DUSX images. After the preprocessing step, models were created using the CNN-based object detectors YOLOv4 and Mask R-CNN to automatically detect kidney stones in the images. Among the evaluated models, the YOLOv4 model using the CBC technique as a preprocessing step performs best, with 96.1% accuracy on the test set. The developed computer-aided diagnosis system is ready for clinical application.

The main contributions of this study are summarized as follows:
(i) A new dataset was generated using unique DUSX images obtained from the hospital. We consider that this public dataset will pave the way for further research.
(ii) Based on our investigation, the CNN-based object detectors YOLOv4 and Mask R-CNN are applied for the first time to DUSX images for kidney stone detection.
(iii) The presence of noise in DUSX images and their poor quality, especially in contrast, make it difficult to notice some details, and the lack of these details reduces the accuracy of CNN models. To address this issue, the effect of various image enhancement techniques on kidney stone detection was investigated.
(iv) Among many image enhancement techniques, a hybrid filtering approach named CBC + YOLOv4 was proposed to detect kidney stones. This technique outperformed the other preprocessing methods.

The rest of the paper is organized as follows. Section 2 discusses the related kidney stone detection methods in the literature. Section 3 describes the dataset and image labeling process, the preprocessing steps, and the CNN models used in this study. Section 4 presents the experimental results for different parameters and preprocessing techniques, and finally, Section 5 concludes the study.

2. Related Work

In recent years, many studies have been conducted to detect kidney stones in medical images. Medical imaging techniques such as ultrasound, DUSX, MRI, CT, and color Doppler are used to diagnose kidney stone disease [12]. Most studies focus on whether or not a stone is present in the image, and they do not show the boundary of the stones in their visual results as we do in our study. Based on our investigation, CNN-based object detectors have not been used to detect kidney stones in DUSX images. Moreover, as far as we investigated, no open-source dataset of DUSX images exists in the literature. In this section, studies related to kidney stone detection are discussed according to the imaging type. The previous studies and their general features are summarized in Table 1.

Viswanath and Gunasundari [30] detected kidney stones in ultrasound images. They used Gabor filtering and histogram equalization as preprocessing steps to eliminate speckle noise and improve image clarity. The preprocessed ultrasound image was then segmented using level-set segmentation. After segmentation, wave sub-bands were used to compute energy levels in the areas extracted from the kidney. Since the energy level in the region of a stone differs from the threshold value, the energy levels helped to predict the location of the stone. Using these energy levels, they trained a network built from a multilayer perceptron and a backpropagation ANN. The authors reported an accuracy of 98.8%. In another study, Verma et al. [13] applied a median filter, a Gaussian filter, and unsharp masking to clarify the stones in ultrasound images. Entropy-based segmentation, together with morphological operations such as erosion and dilation, was performed to find the stone area. They compared classification techniques such as KNN and SVM; according to the experimental results, KNN outperformed SVM with an accuracy of 89%. Selvarani and Rajendran [15] proposed a metaheuristic SVM-based method to detect kidney stones in ultrasound images. They proposed an adaptive mean median filter to remove speckle noise from the images. Segmentation was performed using the K-means clustering algorithm, and GLCM features were extracted for classification. According to the experimental results, the metaheuristic SVM classified images with 98.8% accuracy. Eskandari et al. [16] proposed an expectation–maximization segmentation algorithm to detect kidney stones (renal calculi) in renal ultrasound images. Noise was first removed from the images using the wavelet thresholding technique, and the expectation–maximization algorithm was then used to segment the stones. They achieved 99.96% accuracy and 82.38% precision. However, the authors reported that the computation time (58.02 s) was longer than that of traditional algorithms. Khan et al. [14] proposed a speckle reduction approach to detect kidney stones in practical ultrasound (US) images using median filters and image segmentation techniques. A median filter was used to smooth the images and reduce noise, and a thresholding technique was used to make the segmentation process more robust. The approach has an accuracy of 96.82% and a sensitivity of 92.16% on fifty test cases.

Längkvist et al. [9] developed a CNN-based method for the detection of kidney stones in CT. As a preprocessing step, a Gaussian filter was applied for smoothing. This method operates on raw pixels of 3D volumes instead of extracted features. The authors obtained 100% sensitivity and 2.68 false positives per patient in stone detection. Chak et al. [17] classified CT images as with stone and without stone using a support vector machine- (SVM-) based linear classifier. Before the classification phase, they used a neural network-based feature extraction method and applied a preprocessing step to filter the speckle noise in the CT images. They stated that their system obtained 95%–99% accuracy. Parakh et al. [18] used a cascading CNN architecture for kidney stone detection in CT images. They created two CNNs, one identifying the urinary tract and one detecting the presence of stones. The authors obtained 95% accuracy, 94% sensitivity, and 96% specificity. Soni and Rai [12] also used CT images to detect kidney stones. Histogram equalization was used as a preprocessing step, and embossing was applied to calculate the differences in colors according to direction. The SVM classification method was applied to divide the vector space into two separate regions, stone-affected and healthy kidneys. The experimental results show that the test model has an accuracy of 98.71%. Cui et al. [19] developed a deep learning and thresholding-based model for kidney stone detection. In addition, they focused on scoring the detection on noncontrast CT images. They used the 3D U-Net architecture as the deep learning method. The proposed model achieved a sensitivity of 95.9% in detecting stones larger than 2 mm in diameter. Yildirim et al. [20] performed kidney stone detection on CT images taken from 433 subjects. The dataset consists of 1799 images, obtained by taking different cross-sectional CT images for each subject. A model was created using the xResNet-50 (cross-residual network) architecture, and the experimental results show an accuracy of 96.82% on the test dataset. In another study that used CT images, Baygın et al. [22] aimed to classify patients according to whether or not they have kidney stones. They proposed a new classification network called ExDark19. They used the KNN algorithm as the classifier, which achieved an accuracy of 99.22% on the test data. Islam et al. [21] detected three main kidney diseases (kidney stones, cysts, and tumors) on 12,446 CT images. They used vision transformers (EANet, CCT, and Swin transformers) and deep learning models (ResNet, VGG16, and Inception v3) to detect kidney diseases. The authors stated that the most accurate method was the Swin transformer, which achieved an accuracy of 99.30% on the test images in detecting the three types of kidney disease. Sabuncu et al. [24] used the Inception v3 model as a reference to detect kidney stones in CT images and achieved a test accuracy of 98.52%. Patro et al. [23] aimed to reduce redundancy in feature maps without convolution overlap by using a Kronecker-product-based convolution method instead of the traditional CNN-based deep learning network. The authors highlighted that the proposed method made the network efficient by extracting abstract and in-depth features from the input images. The method was validated using 10-fold cross-validation, and the experiments show an accuracy of 98.56% in detecting kidney stones from CT images.

Akshaya et al. [25] used the discrete wavelet transform (DWT) as a preprocessing step to detect kidney stones in MRI images. Key features were extracted using the gray level co-occurrence matrix (GLCM), which describes the texture of an image by counting how often pairs of pixels with certain values occur in a specified spatial relationship. A dataset of 20 test samples containing normal and abnormal kidney MRI images was classified using the backpropagation method. Kobayashi et al. [28] proposed a deep learning-based computer-aided diagnosis (CAD) system to detect kidney stones on plain images. They used a 17-layer ResNet architecture for patchwise training. According to the experimental results, their CAD system showed 87.2% sensitivity and 66.2% positive predictive value (PPV). Preedanan et al. [29] proposed a two-stage pipeline for detecting kidney stones using segmentation. First, the location of the urinary organs in the images was identified using U-Net. Then, the segmented images were expanded using data augmentation methods and passed through a second-stage U-Net to reduce class imbalance in the resulting map. The experimental results show that the U-Net using the two-stage pipeline reaches an accuracy of 80% in urinary stone classification.

3. Material and Method

3.1. Dataset and Image Labeling Process

The generated dataset consists of 630 DUSX images obtained from patients admitted to Ataturk University’s Urology Department because of kidney stone disease of the urinary system. In the dataset, 558 images contain one or more kidney stones of various sizes in different regions. The remaining 72 images do not include a stone and are labeled as healthy kidneys. The dataset is split into 80% for training and 20% for testing. The training portion is further divided into 80% training and 20% validation to improve training performance. The hierarchy of the DUSX images used for training, validation, and testing, which include 844 stones in total, is represented in Figure 2. The entire dataset can be accessed through this study’s “Data Availability” section.
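The nested 80/20 split described above can be sketched as follows. This is only an illustration: the function name `split_dataset`, the file stems, the random seed, and the rounding rule are assumptions, and the authoritative per-split counts are those reported in Figure 2.

```python
import random


def split_dataset(items, test_fraction=0.2, val_fraction=0.2, seed=0):
    """Hold out a test set first, then carve a validation set out of the rest."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed so the split is repeatable
    n_test = round(len(items) * test_fraction)
    test, rest = items[:n_test], items[n_test:]
    n_val = round(len(rest) * val_fraction)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


# Hypothetical file stems standing in for the 630 DUSX images.
image_ids = [f"dusx_{i:03d}" for i in range(630)]
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))
```

Under the rounding used here, 630 images yield 403 training, 101 validation, and 126 test images; a different rounding or a stratified split by stone presence would shift these numbers slightly.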

Labeling is a critical step in applying supervised learning methods to images with artificial neural networks. In this study, the boundaries of the kidney stones in each image were identified by a specialist doctor who works in the Urology Department. Then LabelImg (for YOLOv4) and VGG Image Annotator [31] (for Mask R-CNN) were used to draw the boundaries of the kidney stones in each image. In the YOLO tagging format, a .txt file with the same name is created for each image file. Each .txt file contains the object class, the center coordinates (x, y) of the stone, and its width and height as annotations for the corresponding image file. At the end of the labeling process, all labeled images were saved together with their .txt tag files in YOLOv4 format. In the Mask R-CNN tagging format, the annotations of all image files are stored in a single .json file. This JSON file contains the file names of the tagged images and the x, y coordinates of the points forming the polygons of the tagged objects in each image.
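As an illustration of the YOLO tag files, the sketch below converts a pixel-space bounding box into one annotation line. It assumes the Darknet convention in which the center coordinates, width, and height are normalized by the image dimensions; the function name `to_yolo_line`, the class index 0 for "kidney stone", and the example numbers are hypothetical.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box to 'class x_center y_center width height'.

    All four geometric values are divided by the image size, so they lie in [0, 1].
    """
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical stone box in a 1000 x 800 pixel image.
line = to_yolo_line(0, 100, 200, 140, 230, img_w=1000, img_h=800)
print(line)
```

Each labeled stone contributes one such line to the .txt file that shares its name with the image.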

3.2. Preprocessing

CNN models have become popular in medical image processing because of their high performance in the detection and classification of many diseases. These models are also widely used in general image processing, and they learn high-level properties of the objects in images by processing pixels in successive layers. Thanks to these high-level features, classification and detection can be performed successfully on the images. The accuracy of CNN models can be further increased by applying image enhancement techniques to the images before classification and detection.

In this study, object detection models were generated for automated kidney stone detection on DUSX images using CNN models. However, the presence of noise in DUSX images and the poor quality, especially in the form of contrast, make it difficult to notice some details and reduce the accuracy of the models. The images should be perceived more clearly by the models to increase the accuracy rates in the detection process. Therefore, various image enhancement techniques (GF, LoG, BF, HE, CLAHE, and CBC) were applied before generating CNN models and the effect of these techniques was investigated according to the experimental results. The coefficient values of the filters used in the preprocessing step are given in Table 2. Figure 3 demonstrates the original images and corresponding enhanced images applied in the preprocessing step. In Figure 3(a), kidney stones identified by the urologist were shown in red circles. The preprocessing phases performed in this study are described in the following subsections.

3.2.1. Gaussian Filtering (GF)

In the Gaussian filter, the kernel is a discrete approximation of the normal distribution. Because the Gaussian function has infinite support, it is approximated with a finite scanning window in the spatial domain. When an image is convolved with a Gaussian kernel, each pixel is replaced by a weighted average of its neighborhood, which reduces the difference in value between neighboring pixels. Hereby, noise is also reduced by smoothing the image. The Gaussian filter is generally used for operations such as noise removal, smoothing, and pre-smoothing before edge detection. The Gaussian filter for two-dimensional images is given in equation (1) [32]:

$$G(x, y) = \frac{1}{2\pi\sigma^{2}} \, e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}} \tag{1}$$

In equation (1), $x$ and $y$ are the horizontal and vertical distances from the center, $\sigma$ is the standard deviation of the Gaussian distribution, and $e$ is the base of the natural logarithm.

In this study, the window size of the Gaussian filter was chosen as 5 × 5 and the standard deviation $\sigma$ was set to 1, and the DUSX images were smoothed with these settings. A large standard deviation produces a wider kernel, which makes the image more blurred. Therefore, the value of $\sigma$ was chosen as small as possible. Figure 3(b) shows the GF-applied images.
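A minimal sketch of this step is given below, using only the Python standard library so that it stays self-contained. It builds the 5 × 5, $\sigma$ = 1 kernel from the two-dimensional Gaussian and applies it with replicated borders. The helper names `gaussian_kernel` and `filter2d`, the border handling, and the renormalization of the truncated kernel are assumptions for illustration, not the authors' implementation.

```python
import math


def gaussian_kernel(size=5, sigma=1.0):
    """Sample the 2D Gaussian on a size x size grid and normalize it to sum to 1."""
    half = size // 2
    kernel = [[math.exp(-(x * x + y * y) / (2 * sigma * sigma)) / (2 * math.pi * sigma * sigma)
               for x in range(-half, half + 1)]
              for y in range(-half, half + 1)]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


def filter2d(image, kernel):
    """Same-size filtering with replicated borders (kernel is symmetric here)."""
    h, w = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    ii = min(max(i + di, 0), h - 1)  # clamp row index at the border
                    jj = min(max(j + dj, 0), w - 1)  # clamp column index at the border
                    acc += kernel[di + half][dj + half] * image[ii][jj]
            out[i][j] = acc
    return out


kernel = gaussian_kernel(5, 1.0)
# A single bright pixel (impulse noise) on a dark background.
impulse = [[1.0 if (r, c) == (3, 3) else 0.0 for c in range(7)] for r in range(7)]
smoothed = filter2d(impulse, kernel)
```

Smoothing the impulse spreads its energy over the 5 × 5 neighborhood, which is the noise-reduction effect described above.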

3.2.2. Laplacian of Gaussian Filtering (LoG)

The Laplacian is a linear second-order derivative operator. It is used to identify edge transitions and contours in images. Laplacian-based methods are sensitive to noise, and when they are applied directly, the resulting images contain many unwanted edge points and noise. To handle this issue, the image is first smoothed using Gaussian low-pass filtering in the LoG method [33, 34].

In this study, the LoG filter was applied to smooth the images, sharpen the edge contours of the kidney stones, and reduce the noise in the DUSX images. LoG pixel values were calculated as shown in the following equation:

$$\mathrm{LoG}(x, y) = -\frac{1}{\pi\sigma^{4}} \left[ 1 - \frac{x^{2} + y^{2}}{2\sigma^{2}} \right] e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}} \tag{2}$$

In equation (2), $(x, y)$ represents the pixels of the input image, $\sigma$ is the standard deviation, and $\mathrm{LoG}(x, y)$ represents the pixel values of the filtered image. The window size of the Gaussian filter was chosen as 5 × 5, $\sigma$ was set to 1, and the input image was smoothed with these settings. A mask containing the edge pixel information of the image was obtained by applying the Laplacian operator to the Gaussian-filtered image. The output image was obtained by adding the original image and the mask containing the edge information. The original images, the mask, and the sum of the original image and the mask are shown in Figure 3(c).
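The three steps described above (Gaussian smoothing, Laplacian mask, addition to the original) can be sketched as follows. This is a hedged illustration rather than the authors' code: the 3 × 3 Laplacian kernel with a positive center is an assumption chosen so that adding the mask sharpens edges, intensities are assumed to lie in [0, 1], and the names `log_sharpen` and `filter2d` are hypothetical.

```python
import math


def gaussian_kernel(size=5, sigma=1.0):
    """Normalized size x size Gaussian kernel."""
    half = size // 2
    k = [[math.exp(-(x * x + y * y) / (2 * sigma * sigma))
          for x in range(-half, half + 1)] for y in range(-half, half + 1)]
    total = sum(sum(row) for row in k)
    return [[v / total for v in row] for row in k]


# Assumed Laplacian kernel (negated second derivative, positive center).
LAPLACIAN = [[0.0, -1.0, 0.0],
             [-1.0, 4.0, -1.0],
             [0.0, -1.0, 0.0]]


def filter2d(image, kernel):
    """Same-size filtering with replicated borders."""
    h, w = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    ii = min(max(i + di, 0), h - 1)
                    jj = min(max(j + dj, 0), w - 1)
                    acc += kernel[di + half][dj + half] * image[ii][jj]
            out[i][j] = acc
    return out


def log_sharpen(image, size=5, sigma=1.0):
    """Smooth, take the Laplacian as an edge mask, and add the mask to the original."""
    smoothed = filter2d(image, gaussian_kernel(size, sigma))
    mask = filter2d(smoothed, LAPLACIAN)
    return [[min(max(p + m, 0.0), 1.0) for p, m in zip(prow, mrow)]
            for prow, mrow in zip(image, mask)]


# A vertical step edge: dark on the left, bright on the right.
step = [[0.2] * 5 + [0.8] * 5 for _ in range(6)]
sharpened = log_sharpen(step)
```

On the step edge, the bright side next to the boundary becomes slightly brighter and the dark side slightly darker, while flat regions are left unchanged; this overshoot is what makes contours stand out.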

3.2.3. Bilateral Filtering (BF)

The bilateral filter is a smoothing filter that aims to preserve edge information while smoothing images. Bilateral filters are frequently used when noise must be reduced without losing edges. BF combines two different Gaussian kernels, a spatial kernel and a spectral kernel. The spatial kernel weights each neighboring pixel according to its spatial proximity to the central pixel. When there is a large brightness difference between two pixels, the spectral kernel reduces the filter coefficient of the neighboring pixel according to this difference so that the sharp transition is preserved. In this way, the edge information of the image is preserved better than with standard smoothing filters. The BF process is expressed by the following equation:

$$BF[I]_{p} = \frac{1}{W_{p}} \sum_{q \in S} G_{\sigma_{s}}(\lVert p - q \rVert) \, G_{\sigma_{r}}(\lvert I_{p} - I_{q} \rvert) \, I_{q} \tag{3}$$

In equation (3), $I$ indicates the input image, $p$ represents the currently filtered pixel position of the image, and $q$ represents the neighboring pixels of $p$ that are in the neighborhood $S$. The normalization parameter $W_{p}$ is expressed by the following equation:

$$W_{p} = \sum_{q \in S} G_{\sigma_{s}}(\lVert p - q \rVert) \, G_{\sigma_{r}}(\lvert I_{p} - I_{q} \rvert) \tag{4}$$

The Gaussian kernel is expressed by the following equation:

$$G_{\sigma}(x) = \frac{1}{2\pi\sigma^{2}} \, e^{-\frac{x^{2}}{2\sigma^{2}}} \tag{5}$$

where $\sigma_{s}$ and $\sigma_{r}$ indicate the standard deviation of the spatial smoothing function and the spectral effect function, respectively. The spatial kernel enhances the effect of nearby pixels, and the spectral kernel increases the effect of neighbors with closer pixel values. $S$ is the area containing the neighborhood centered on pixel $p$ in the image. The standard deviations of the Gaussian kernels are the most influential factor for filter performance. A high spectral kernel value causes the filter to behave like a typical Gaussian filter, because as this value increases, the difference between pixel values is no longer taken into account. Moreover, increasing the spatial parameter smooths larger features [35–37].

In the experiments conducted within the scope of this study, the best results were obtained with the following bilateral filter parameters: a window size of 5 × 5, a spatial standard deviation $\sigma_{s}$ = 2, and a spectral standard deviation $\sigma_{r}$ = 0.1. Since these values preserve the edge information, the kidney stones in the DUSX images can be distinguished better. Figure 3(d) shows the images after bilateral filtering.
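The weighting scheme of the bilateral filter can be sketched directly from its definition: each neighbor is weighted by the product of a spatial Gaussian and a spectral (range) Gaussian, and the result is divided by the sum of the weights. The sketch assumes intensities scaled to [0, 1] (so that a spectral standard deviation of 0.1 is meaningful), replicated borders, and the hypothetical function name `bilateral_filter`; the constant factors of the Gaussians are omitted because they cancel in the normalization.

```python
import math


def bilateral_filter(image, size=5, sigma_s=2.0, sigma_r=0.1):
    """Edge-preserving smoothing; image is a list of rows with values in [0, 1]."""
    h, w = len(image), len(image[0])
    half = size // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            weighted_sum = 0.0
            norm = 0.0  # normalization term W_p
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    ii = min(max(i + di, 0), h - 1)
                    jj = min(max(j + dj, 0), w - 1)
                    # Spatial kernel: depends on the distance between p and q.
                    spatial = math.exp(-(di * di + dj * dj) / (2 * sigma_s ** 2))
                    # Spectral kernel: depends on the intensity difference.
                    diff = image[ii][jj] - image[i][j]
                    spectral = math.exp(-(diff * diff) / (2 * sigma_r ** 2))
                    weight = spatial * spectral
                    weighted_sum += weight * image[ii][jj]
                    norm += weight
            out[i][j] = weighted_sum / norm
    return out


step = [[0.2] * 5 + [0.8] * 5 for _ in range(6)]
filtered = bilateral_filter(step)
```

Across the 0.2 to 0.8 step, the spectral weight is exp(-18), so pixels on the other side of the edge contribute almost nothing and the edge stays sharp, whereas small fluctuations within a region are averaged out.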

3.2.4. Histogram Equalization (HE)

Many contrast enhancement methods for medical images can be found in the literature. Today, histogram equalization is one of the most preferred methods for improving the contrast of radiographic images. Contrast enhancement methods are classified in two ways in the literature. In the first, they are divided into two classes according to whether they operate in the frequency domain or the spatial domain [38]. In the second, they are classified as global and local methods. Global methods use the histogram of the entire image for contrast enhancement, which can neglect local details. Local methods were developed as an alternative to solve this problem. In local methods, the histogram of each subsection of the image is used for contrast enhancement instead of the histogram of the entire image [39]. In histogram equalization, the brightness distribution of the image is normalized to improve the global contrast of the image. An output image with an approximately uniform intensity distribution is then obtained. This operation is represented in the following equation [40]:

$$s_{k} = (L - 1) \sum_{j=0}^{k} \frac{n_{j}}{n} \tag{6}$$

In equation (6), $n_{j}$ is the number of pixels at gray level $j$, $s_{k}$ is the output level assigned to input gray level $k$, $L$ is the desired number of gray levels (256 for 8 bits), and $n$ is the total number of pixels. Figure 3(e) shows the images after histogram equalization.
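Global histogram equalization can be sketched as a cumulative-histogram lookup: count the pixels at each gray level, accumulate the counts, and scale the cumulative fraction to the output range. The function name `equalize_histogram` and the tiny test image are hypothetical, and the rounding of the mapped levels is an implementation choice.

```python
def equalize_histogram(image, levels=256):
    """Map each gray level k to round((levels - 1) * cumulative_count(k) / n)."""
    n = sum(len(row) for row in image)  # total number of pixels
    hist = [0] * levels
    for row in image:
        for value in row:
            hist[value] += 1  # number of pixels at each gray level
    mapping = []
    running = 0
    for count in hist:
        running += count  # cumulative count up to this level
        mapping.append(round((levels - 1) * running / n))
    return [[mapping[value] for value in row] for row in image]


# A low-contrast 8-bit patch whose values span only four adjacent gray levels.
patch = [[100, 101],
         [102, 103]]
equalized = equalize_histogram(patch)
```

The four adjacent input levels are spread across the full 8-bit range while their order is preserved, which is the global contrast stretch described above.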

3.2.5. Contrast-Limited Adaptive Histogram Equalization (CLAHE)

The traditional histogram equalization method uses a global density distribution; for this reason, some important features can be suppressed by unimportant features such as background or noise [41, 42]. To solve this problem, the adaptive histogram equalization (AHE) method [43] was proposed in the literature.

In adaptive histogram equalization, the histogram of each sub-block of the image is used instead of the global histogram. Each pixel in each sub-block of the image is arranged in intensity proportional to the pixels in the surrounding region. In other words, the image is divided into sub-blocks in the form of a grid, and standard histogram equalization is applied to each sub-block. Then the sub-blocks are combined to obtain an enhanced image. Since the AHE method works in the local area, distortions called the blocking effect may occur in the border parts of the sub-blocks during the merging process. The blocking effect is a discontinuity problem. To solve this problem, the bilinear interpolation method is used for combining sub-blocks. Noise problems arise in local areas when contrast enhancement is performed with the AHE method. Especially noise increases in homogeneous regions. For this reason, the contrast-limited adaptive histogram equalization (CLAHE) method was developed to avoid noise by limiting the contrast enhancement [39, 43]. The CLAHE method helps to improve contrast in medical images without increasing the effect of noise [41].

In this study, we applied the CLAHE method to improve the contrast of the DUSX images. In the CLAHE method, each image is divided into sub-blocks and the histogram of each sub-block is calculated. Each histogram is then clipped so that it does not exceed the clipping limit (clipping limit = 3). Limiting the amount of contrast enhancement in this way prevents noise from being amplified. The clipped pixels are redistributed evenly over the histogram, and histogram equalization is then applied to each clipped histogram. In addition, bilinear interpolation is used to eliminate the blocking effect that may occur when the sub-blocks are joined. In this way, the CLAHE method enhances local contrast without increasing the amount of noise in the image. Figure 3(f) shows the CLAHE-applied images.
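The clipping and redistribution step for a single sub-block can be sketched as below. Several details are assumptions: the clipping limit is interpreted as a multiple of the average bin count (a common convention), the excess is redistributed in a single pass, and the bilinear interpolation between neighboring sub-blocks is omitted. The names `clip_histogram` and `tile_mapping` and the 8-bin example histogram are hypothetical.

```python
def clip_histogram(hist, clip_limit=3.0):
    """Clip each bin at clip_limit * (average bin count) and spread the excess evenly."""
    n_bins = len(hist)
    limit = clip_limit * sum(hist) / n_bins
    excess = sum(max(count - limit, 0.0) for count in hist)
    clipped = [min(float(count), limit) for count in hist]
    bonus = excess / n_bins  # equal share of the clipped pixels for every bin
    return [count + bonus for count in clipped]


def tile_mapping(hist, clip_limit=3.0):
    """Histogram-equalization lookup table built from the clipped histogram."""
    clipped = clip_histogram(hist, clip_limit)
    total = sum(clipped)
    mapping, running = [], 0.0
    for count in clipped:
        running += count
        mapping.append(round((len(hist) - 1) * running / total))
    return mapping


# Toy 8-bin histogram of one sub-block dominated by a single gray level.
hist = [40, 8, 8, 8, 0, 0, 0, 0]
clipped = clip_histogram(hist)
mapping = tile_mapping(hist)
```

Clipping lowers the dominant bin before the cumulative sum is taken, so the lookup table rises less steeply at that level than plain histogram equalization would; this is how the contrast gain, and with it the noise amplification, is limited.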

3.2.6. CBC

In this technique, noise is first removed while preserving the edge transitions by applying BF to the images. Then, CLAHE is applied to improve the contrast of the image so that local details can be identified easily. The results show that combining the two methods is more beneficial than applying a single image enhancement technique. The general block diagram of the CBC method used in this study is given in Figure 4. In this method, the parameters were set as follows: the window size of the bilateral filter is 5 × 5, with a spatial standard deviation $\sigma_{s}$ = 2 and a spectral standard deviation $\sigma_{r}$ = 0.1. The sub-block size of the CLAHE method is 16 × 16, and the clipping limit is 3. Figure 3(g) shows CBC-applied images.

3.3. CNN Models

Determining the presence or absence of a disease in medical images is a wide-ranging application field, and it can be defined as a classification problem. In this field, deep learning methods, especially CNN algorithms present remarkable accuracy. In this study, YOLOv4 and Mask R-CNN models were used for kidney stone detection in DUSX images. The training of the created CNN-based models was carried out on DUSX images. The general architectural structure used to compare different models and preprocessing methods applied in this study is shown in Figure 5. In the following subsections, we explained the architectural structures of YOLOv4 and Mask R-CNN models and the network configuration details for our dataset.

3.3.1. YOLOv4

In this study, the YOLOv4 model was used for the automatic detection of kidney stones in DUSX images. This section describes the YOLOv4 which is one of the most popular CNN-based object detectors. The network configurations for this model are explained in the following sections.

YOLO (You Only Look Once) [44, 45] is a CNN-based algorithm that can detect multiple objects in a single step with high accuracy and speed in real time with multibox structures. The YOLOv4 algorithm was proposed by Bochkovskiy et al. [46] in 2020 as the fourth version. The YOLOv4 version is an algorithm that can be trained quickly on a single graphics processing unit (GPU) and generates more accurate results than other versions. In the YOLOv4 model, the CSPDarknet53 [47] neural network is used as the feature descriptor. CSPDarknet53 performs splitting and merging operations on the feature map to provide more gradient flow from the Darknet-53 CNN. Darknet-53 is a convolutional network trained on ImageNet and it consists of 53 consecutive 1 × 1 and 3 × 3 convolutional layers followed by residual layers. Darknet-53 uses GPU efficiently due to the high number of floating-point operations it performs per second and makes the evaluation more efficient and faster than other feature extractors such as ResNet101 or ResNet152 [48, 49].

To extract features in the YOLOv4 architecture, SAM (spatial attention module), PAN (path aggregation network), and SPP (spatial pyramid pooling) structures [46] are used, and features are extracted at three different scales to recognize objects of various sizes. When the input image is given to the network, the third scale divides the image into 52 × 52 cells, enabling the detection of small objects. The second scale divides the image into 26 × 26 cells, allowing medium-sized objects to be detected. The first scale divides the image into 13 × 13 cells, allowing large objects to be detected. The output size of each scale is N × N × [3 × (4 + 1 + C)], where N is the number of cells per side, 3 is the number of bounding boxes predicted for each cell, 4 is the number of offset values ($t_{x}$, $t_{y}$, $t_{w}$, $t_{h}$) of each bounding box, 1 corresponds to the objectness score, and C is the number of classes. Finally, the output of the network gives the coordinates of the bounding box of the estimated object, the objectness score, and the class information of the object [45, 48].
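The output size formula can be checked with a few lines of arithmetic. The values used below are assumptions for illustration: a 416 × 416 input (consistent with 13, 26, and 52 cells at strides 32, 16, and 8) and a single "kidney stone" class; the function name `yolo_output_shapes` is hypothetical.

```python
def yolo_output_shapes(input_size=416, num_classes=1, boxes_per_cell=3, strides=(32, 16, 8)):
    """Return (N, N, channels) for each detection scale.

    channels = boxes_per_cell * (4 box offsets + 1 objectness score + num_classes).
    """
    channels = boxes_per_cell * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, channels) for s in strides]


shapes = yolo_output_shapes()
for grid_h, grid_w, channels in shapes:
    print(f"{grid_h} x {grid_w} x {channels}")
```

With one class, each cell predicts 3 × (4 + 1 + 1) = 18 values, so the three scales produce 13 × 13 × 18, 26 × 26 × 18, and 52 × 52 × 18 outputs under these assumptions.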

(1) Bounding Box Prediction. Anchor boxes are used to estimate the bounding boxes, as shown in Figure 6. The best anchor boxes are calculated by applying the K-means clustering algorithm, with the Intersection over Union (IoU) score used as the distance measure instead of the Euclidean distance. If several anchors overlap, an anchor is selected according to its IoU value. The anchor box sizes obtained by K-means are assigned to the appropriate scales. The network predicts four values for each bounding box ($t_x$, $t_y$, $t_w$, $t_h$). Using the sigmoid function $\sigma$, the center coordinate values ($t_x$, $t_y$) are reduced to the range 0-1. Using the equations in Figure 6, the center point of the box is expressed by its distances $\sigma(t_x)$ and $\sigma(t_y)$ from the upper left corner of the grid cell. By adding the distances ($c_x$, $c_y$) of that grid cell from the upper left corner of the image to these values, the center point coordinates of the bounding box ($b_x$, $b_y$) are found. Finally, the width and height of the bounding box ($b_w$, $b_h$) are calculated from the anchor box dimensions ($p_w$, $p_h$) obtained using K-means. The exponential operations $e^{t_w}$ and $e^{t_h}$ keep the results positive even when negative $t_w$ and $t_h$ values are encountered. The YOLO network estimates an objectness score for each anchor using logistic regression; this score indicates the probability that the anchor contains an object. It also estimates the class probability using an independent logistic classifier, which performs object classification.
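The decoding step described above can be sketched as follows. The symbol names follow the usual YOLO notation and are assumptions here, since the equations themselves appear only in Figure 6: (t_x, t_y, t_w, t_h) are the raw network outputs, (c_x, c_y) is the offset of the grid cell from the upper left corner of the image, and (p_w, p_h) are the anchor box dimensions. All quantities are expressed in grid-cell units, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Convert raw predictions into a box center and size.

    The sigmoid keeps the center inside its grid cell; the exponential
    keeps width and height positive even for negative t_w or t_h.
    """
    b_x = sigmoid(t_x) + c_x
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


# Zero offsets place the center in the middle of cell (3, 4)
# and leave the anchor size unchanged.
print(decode_box(0.0, 0.0, 0.0, 0.0, c_x=3, c_y=4, p_w=2.0, p_h=5.0))
```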

3.3.2. Mask R-CNN

In this study, the Mask R-CNN method, which evolved from region-based convolutional object detection algorithms, was used to detect kidney stones in DUSX images. This section briefly introduces the Mask R-CNN architecture; the network configuration details for our dataset are explained in the following sections. Mask R-CNN is a neural network model for object segmentation developed by the Facebook Artificial Intelligence Research (FAIR) team as an extension of the Faster R-CNN algorithm. Instead of semantic segmentation, which labels every pixel in the image, Mask R-CNN performs instance segmentation: it segments only the pixels where the target objects are located and places a mask on them. The output of the network consists of the object class and bounding box, as produced by the Faster R-CNN algorithm, together with a mask defining the object outline at the pixel level. Region-based convolutional object detection algorithms generally consist of two parts: (1) identifying regions that could potentially be objects in the processed image and (2) delimiting the detected object with a bounding box [50, 51].

(1) Region Proposal Network (RPN). The algorithm passes the input image to the backbone (ResNet101), a standard convolutional neural network that acts as a feature extractor. A convolutional feature map is generated by passing the image through the feature pyramid network in the backbone. An n × n window is slid over the feature map, and each window is mapped to a lower-dimensional feature vector. At each sliding-window location, the RPN proposes anchor boxes with different aspect ratios as candidate object regions. Each proposed anchor box is associated with an objectness score and the four coordinates of the bounding box. If anchors overlap heavily, as measured by their intersection over union (IoU), nonmaximum suppression (NMS) keeps the highest-scoring anchor and discards the others. The regions that remain after NMS are the regions of interest (RoIs) [52].
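To make the suppression step concrete, the following is a minimal sketch of greedy nonmaximum suppression in plain Python. It is not the implementation used in the study: the box format (x1, y1, x2, y2), the function names, and the IoU threshold of 0.5 are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that survive suppression.

    Boxes are visited from highest to lowest score; a box is kept only if
    it does not overlap an already kept box by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Toy proposals (hypothetical values): the second box overlaps the first heavily.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))
```

In this toy case the second proposal is dropped because it overlaps the higher-scoring first proposal, while the distant third proposal is kept.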

(2) Region of Interest (RoI). Classifiers are capable of processing fixed-size input images better than variable-size input images. However, the RoI regions have different sizes due to the different aspect ratio bounding boxes in the RPN. Therefore, a process of resizing region proposals named RoI pooling is required to make regions fixed-sized. The RoI Align method was developed for the pooling process in the Mask R-CNN method [50].

In the RoI Align method, the region proposed by the RPN is divided into a grid. Because each grid cell is expected to cover the same area, cell boundaries may fall on fractional pixel positions. Each cell of the grid is therefore sampled by dividing it into four subcells, and bilinear interpolation is performed to represent each subcell with a single value. In the last step, maximum pooling is performed on the bilinearly interpolated values to obtain the fixed-size output. The fixed-size RoIs are sent to the fully connected layers to generate classification and bounding box information, and then to the mask branch to generate mask information.
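The RoI Align steps described above can be illustrated with a small pure-Python sketch on a single-channel feature map. This is a simplified illustration, not the framework's implementation: the function names, the (y1, x1, y2, x2) RoI format, the 2 × 2 output size, and the choice of sampling each subcell at its center are assumptions.

```python
import math


def bilinear(feature, y, x):
    """Sample a 2D feature map (list of rows) at a fractional position."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_size=2):
    """Pool a RoI (y1, x1, y2, x2, possibly fractional) to out_size x out_size."""
    y1, x1, y2, x2 = roi
    cell_h = (y2 - y1) / out_size
    cell_w = (x2 - x1) / out_size
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            samples = []
            # Four sample points, one at the center of each subcell.
            for sy in (0.25, 0.75):
                for sx in (0.25, 0.75):
                    yy = y1 + (i + sy) * cell_h
                    xx = x1 + (j + sx) * cell_w
                    samples.append(bilinear(feature, yy, xx))
            # Maximum pooling over the interpolated values of this cell.
            row.append(max(samples))
        output.append(row)
    return output


# Toy feature map whose value equals the column index.
ramp = [[float(x) for x in range(4)] for _ in range(4)]
print(roi_align(ramp, (0.0, 0.0, 2.0, 2.0)))
```

Because no coordinate is rounded to an integer pixel, the pooled values vary smoothly with the RoI position, which is the property that distinguishes RoI Align from quantized RoI pooling.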

(3) RoI Classifier and Bounding Box Regressor. In the RPN, binary foreground/background classification is performed for the RoIs. Unlike the RPN, the RoI classifier has a deeper network and performs multiclass classification for each RoI. An initial bounding box is created for each RoI in the RPN; the bounding box regressor then refines it to fully cover the object.

(4) Mask Branch. The mask branch is a CNN that takes the RoIs obtained after the RoI Align procedure and creates masks for them. The created masks have a low resolution of 28 × 28 pixels. The small mask size keeps the computational load low and the mask branch lightweight. During the training phase, the ground truth masks are scaled down to 28 × 28 to calculate the value of the loss function. During inference, the predicted masks are scaled up to the dimensions of the RoI bounding box, producing one final mask for each detected object [53].

(5) Loss Function. The loss (error) function of the Mask R-CNN method is formalized in equation (7). The loss function consists of three subcomponents: the classification loss $L_{cls}$, the regression loss used to determine the bounding box $L_{box}$, and the segmentation mask loss $L_{mask}$. This function is minimized iteratively using the gradient descent algorithm [50]:

$$L = L_{cls} + L_{box} + L_{mask} \qquad (7)$$

3.3.3. Network Configurations and Model Training Phases

In the YOLOv4 and Mask R-CNN architectures, the network configuration files need to be adjusted for model training. These files contain many parameters, such as the architectural structure, number of layers, activation functions, learning rate, and input image settings, for the network training and testing phases. In this study, all training was carried out on a computer with an Intel(R) Core(TM) i5-7400 CPU 3.00 GHz processor and an Nvidia GeForce RTX 2080 8 GB GPU. To run the training on the graphics card, version 10 of Nvidia's CUDA library for parallel computing was used. In addition, the open-source OpenCV library was used for image operations, and the Keras and TensorFlow libraries were used to train the Mask R-CNN model. Predictions were performed on the images reserved for validation, and the models were saved in the backup folder specified in the configuration file. k-fold cross-validation (k = 5) was performed in the training phase to randomize the data used for each model and to avoid overfitting. The average of the five resulting models was taken as the result model, and the evaluation was performed on this model. A separate training process was carried out for each preprocessing step applied to the images.
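As an illustration of the 5-fold cross-validation mentioned above, the sketch below generates the train/validation index splits in plain Python. It shows only how the folds are formed; the function name, the fixed random seed, and the sample count of 100 are assumptions for illustration and do not reflect the split sizes used in the study.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) for each of the k folds.

    Indices are shuffled once, then dealt into k folds; each fold serves
    as the validation set exactly once while the rest form the training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


# Hypothetical sample count, used only to show the fold sizes.
for fold_no, (train, val) in enumerate(k_fold_indices(100, k=5), start=1):
    print(f"fold {fold_no}: {len(train)} training, {len(val)} validation")
```

Each image appears in a validation fold exactly once, so every model is validated on data it was not trained on.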

(1) Configuration and Model Training for YOLOv4. The parameters required for YOLOv4 model training are located in the “YOLOv4-custom.cfg” configuration file in the root directory of the Darknet framework. The parameters shown in Table 3 were set, and the configuration file was prepared for training. By default, the initial weight values are assigned randomly at the start of training, which can make training time-consuming. To shorten the training time through transfer learning, the Darknet-53 convolutional weights (“darknet53conv.74”), previously trained on ImageNet, were used as the initial weights of the YOLOv4 network. Each training process ran for 6000 iterations and took 15 hours. The evaluation of the models obtained from the training phases is detailed in Section 4.

(2) Configuration and Model Training for Mask R-CNN. The Matterport [54] implementation of Mask R-CNN, one of the popular frameworks, was chosen to train the Mask R-CNN model. The parameters required for model training were configured in the “config.py” file, which is located in the root directory. The network configuration and training parameters for the Mask R-CNN model are listed in Table 4. We used the “mask_rcnn_coco.h5” file, which was previously trained on the Microsoft COCO [55] dataset, as the initial weight values for model training. Mask R-CNN takes a parameter named “layers” during the training phase, with two options, “heads” and “all.” With the “heads” option, transfer learning changes only the weights of the last layers of the model, and the existing weights of the backbone and RPN (region proposal network) are not changed. With the “all” option, all layers, including the backbone and RPN, are trained on the new dataset. We observed that choosing the “all” option yielded a more successful model. Each training process ran for 400 epochs and took 10 hours. The evaluation of the models obtained from the training phases is detailed in Section 4.

4. Experimental Results

In this study, the evaluation was carried out with the IoU (Intersection over Union) metric. The IoU measures the similarity between the ground truth bounding box (the labeled bounding box in the test dataset) and the predicted bounding box in order to evaluate the robustness of the detection. It is defined as the area of the intersection of the predicted and ground truth bounding boxes divided by the area of their union. The IoU score ranges from 0 to 1; the more closely the two boxes overlap, the higher the IoU score. A threshold value of 0.5 is commonly accepted for converting each detection into a classification. If the kidney stone is detected with IoU ≥ 0.5, the object is classified as a true positive (TP). When a label is present in the image and the model fails to detect the kidney stone, the object is classified as a false negative (FN). If the model reports a detection where the image has no label, so that the IoU is below 0.5, it is a false detection and is classified as a false positive (FP). Using these parameters (TP, FP, TN, and FN), the accuracy rate, precision, recall (sensitivity), F1-score, and specificity were calculated to compare the performances of the models. These evaluation metrics are given by the following equations, respectively:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

The performance of the models obtained after the training phase was tested on 120 test images containing 142 stones. The models were first tested on images without any preprocessing and then on images to which each of the six preprocessing steps had been applied. Test images of the YOLOv4 and Mask R-CNN models are given in Figures 7 and 8, respectively. When the test images are examined, it is seen that the YOLOv4 model using the CBC preprocessing step was able to detect all stones in the figure.

A confusion matrix is a table of predicted and actual values used to evaluate the performance of classification models in machine learning. Confusion matrices with four outcomes (TP, FP, FN, and TN) are created using the predicted and actual values. Figures 9 and 10 represent the confusion matrices for the YOLOv4 and Mask R-CNN models, respectively.
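As a hedged illustration of how detections might be turned into confusion-matrix counts under the IoU ≥ 0.5 criterion, the sketch below matches each predicted box to at most one ground truth box. The greedy one-to-one matching strategy, the (x1, y1, x2, y2) box format, and the function names are assumptions; the exact matching procedure used in the study is not described in the text.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_detections(predictions, ground_truths, iou_threshold=0.5):
    """Return (tp, fp, fn) for one image using greedy one-to-one matching."""
    matched = set()
    tp = fp = 0
    for pred in predictions:
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue
            score = iou(pred, gt)
            if score > best_iou:
                best_iou, best_idx = score, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)  # detection agrees with a labeled stone
            tp += 1
        else:
            fp += 1  # detection with no sufficiently overlapping label
    fn = len(ground_truths) - len(matched)  # labeled stones left undetected
    return tp, fp, fn


# Toy example with hypothetical coordinates: one hit, one false alarm, one miss.
labels = [(0, 0, 10, 10), (50, 50, 60, 60)]
detections = [(1, 1, 10, 10), (100, 100, 110, 110)]
print(count_detections(detections, labels))
```

Summing these counts over all test images gives the TP, FP, and FN entries of a confusion matrix.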

When the confusion matrices are examined, it is seen that, without any preprocessing step, the YOLOv4 model detects 118 of the 142 stones and the Mask R-CNN model detects 115 of them. Applying the preprocessing steps increases the performance of the models in kidney stone detection. In particular, the YOLOv4 model combined with the CBC method was able to detect 137 of the 142 stones. The accuracy rate, precision, recall, F1-score, and specificity of the models were computed from the confusion matrices to compare their performances, and the results are shown in Table 5. According to these metrics, the BF, GF, LoG, CLAHE, and CBC preprocessing steps increased the accuracy of the CNN models, whereas HE did not significantly increase it.
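The evaluation metrics can be computed from confusion-matrix counts as in the following sketch. The function names are hypothetical. The recall examples use the detection counts quoted above (118, 115, and 137 of 142 stones), assuming each undetected stone counts as one false negative; the remaining counts in the last example are toy values for illustration only and are not taken from the paper's confusion matrices.

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    return tn / (tn + fp) if (tn + fp) else 0.0


def accuracy(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def f1_score(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Recall from the counts quoted in the text (142 stones in the test set).
total_stones = 142
for name, detected in [("YOLOv4, no preprocessing", 118),
                       ("Mask R-CNN, no preprocessing", 115),
                       ("YOLOv4 with CBC", 137)]:
    r = recall(detected, total_stones - detected)
    print(f"{name}: recall = {r:.3f}")

# Toy counts (hypothetical) showing the remaining metrics.
tp, fp, tn, fn = 8, 2, 9, 1
p, r = precision(tp, fp), recall(tp, fn)
print(accuracy(tp, fp, tn, fn), p, r, f1_score(p, r), specificity(tn, fp))
```

For the YOLOv4 model with CBC, 137 detected stones out of 142 give a recall of about 96.5%, which agrees with the recall reported for the result model.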

Kidney stone classification was also performed on DUSX images using transfer learning with the EfficientNet [56], DenseNet [57], ResNet101 [58], and MobileNet [59] deep learning architectures. For each transfer learning method, the best 1000 features were selected with the Relief algorithm, and the selected features were classified by an SVM using 5-fold cross-validation. The accuracy rate, precision, recall, and F1-score values of the models are given in Table 6. When Table 6 is examined, it can be seen that DenseNet201 [57] is the most successful of the transfer learning methods, with an accuracy of 89.3%.

The receiver operating characteristic (ROC) curve is widely used to evaluate the performance of classification models in machine learning. It is especially useful for unbalanced datasets, and it shows how well the model separates the classes. The ROC curve plots the false positive rate (FPR) on the x-axis against the true positive rate (TPR) on the y-axis, which facilitates comparing the accuracy of different models trained on the same dataset. Figure 11(a) shows the ROC curves of the YOLOv4 models, and Figure 11(b) shows the ROC curves of the Mask R-CNN models used in this study. The area under the curve (AUC) refers to the area under the ROC curve and can be considered a summary of model performance: the larger the area, the more accurate the model's predictions, and the ideal value is 1. When Figure 11 is examined, it is seen that the YOLOv4 model with the CBC preprocessing step has the highest AUC (0.94). This means that the proposed model ranks a randomly chosen positive sample above a randomly chosen negative sample with a probability of 94%.
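To show how an ROC curve and its AUC are obtained from model scores, the sketch below sweeps a decision threshold over a small set of scored samples and integrates the curve with the trapezoidal rule. The labels and scores are hypothetical toy values, the function names are assumptions, and the input must contain at least one positive and one negative sample.

```python
def roc_curve(labels, scores):
    """Return (fpr, tpr) lists as the threshold sweeps from high to low.

    labels are 1 (positive) or 0 (negative); samples with equal scores
    are processed together so that ties yield a single curve point.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))


# Toy data (hypothetical): one negative sample outranks one positive sample.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_curve(labels, scores)
print(round(auc(fpr, tpr), 3))
```

In this toy case, 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9, which illustrates the pairwise-ranking interpretation of the AUC.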

5. Discussion and Conclusion

In this study, a CNN-based computer-aided diagnostic system was proposed to automatically detect kidney stones in DUSX images. For this purpose, a new dataset of 630 DUSX images from patients of Ataturk University’s Urology Department was contributed to the literature. We believe that the proposed dataset paves the way for further investigation of kidney stone detection systems.

The presence of noise in DUSX images and their poor quality, especially in terms of contrast, reduce the success of CNN-based models in kidney stone detection. Six image enhancement techniques (GF, LoG, BF, HE, CLAHE, and CBC) were evaluated to improve the quality of DUSX images, and we investigated the effect of these techniques on automatic kidney stone detection. The experimental results show that the YOLOv4 model using the CBC technique as a preprocessing step has the best performance, with an accuracy rate of 96.1%, and this model is proposed as the result model. The success of the proposed result model was clinically evaluated and accepted by a specialist urologist. The model can help urologists and radiologists to accurately detect kidney stone cases and reduce their workload. Additionally, we expect that the use of our proposed model will help to reduce the unnecessary radiation exposure and associated medical costs that come with CT scans.

We encountered some challenges during the training phase of the proposed system. Since the YOLOv4 and Mask R-CNN architectures perform operations such as feature extraction and size reduction directly on the training images, training took a long time. Moreover, labeling the images one by one before the training phase, especially the polygon tagging required by the Mask R-CNN architecture, adds a serious workload to the operational processes. When the YOLOv4 and Mask R-CNN architectures are compared in terms of training time, operational workload, and detection performance, YOLOv4 outperformed Mask R-CNN in terms of workload and training time. Another challenge is that patient or device movements can cause blurry or distorted images, making it difficult to detect kidney stones; such movements also lead to an increase in false positive results. Moreover, the fact that all DUSX images were obtained from a single hospital may limit the generalizability of the model. To address this, we aim to expand our dataset with DUSX data from different hospitals in order to generalize the performance of our model and make it more robust.

In future studies, the dataset will be expanded and a balanced data distribution will be established to enhance the accuracy and precision of kidney stone detection from images. In addition, we plan to detect smaller kidney stones with higher accuracy and speed by mapping the locations of the stones using segmentation methods. Moreover, we will evaluate the ability of our model to detect other pathological conditions, such as tumors and cysts, and optimize the model for such situations.

Data Availability

The dataset used for this study is available at https://github.com/ugrkilc/DUSX-Dataset. The code that supports the findings of this study is available from the first author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

All researchers made significant contributions to this article. Ugur Kilic had the main responsibility for the study, from the implementation of the methods to the validation of the system. Isil Karabey Aksakalli acted as an intermediary in obtaining the medical data and contributed to data curation, writing, reviewing and editing, and visualization. Gulsah Tumuklu Ozyer and Baris Ozyer are experienced in deep learning and contributed to the combined image enhancement methods. Tugay Aksakalli and Senol Adanur extracted the DUSX images from the Ataturk University Research Hospital and analysed and tagged the images.

Acknowledgments

The authors would like to thank the Ataturk University Faculty of Medicine Clinical Research Ethics Committee for sharing DUSX images of patients anonymously (approval number B.30.2.ATA.0.01.00).