Deep learning (DL) has proven to be one of the most effective approaches to classification tasks in recent years. In particular, DL methods have been applied to the segmentation, classification, and prediction of retinal blood vessels, which was previously unattainable, and the U-Net architecture has been hailed as one of the most significant advances among them. In the proposed work, improved segmentation of retinal images is achieved with BiDCU-Net, which combines U-Net, bidirectional ConvLSTM (BiConvLSTM), and densely connected convolutional layers. The proposed technique was evaluated on three well-known datasets: the DRIVE, STARE, and CHASE_DB1 databases. It achieved accuracy, F1 score, sensitivity, and specificity of 97.32%, 83.85%, 82.56%, and 98.68% on DRIVE; 97.44%, 81.94%, 83.92%, and 98.45% on CHASE_DB1; and 97.33%, 82.3%, 82.12%, and 98.57% on STARE, respectively. Furthermore, we show that the strategy outperforms three other comparable approaches.

1. Introduction

Deep learning is now used for image segmentation, classification, captioning, and prediction, among other applications. Deep convolutional neural network (DCNN) variants such as ResNet, AlexNet, VGG, DenseNet, and CapsuleNet have made significant strides in recent years [1, 2]. For many applications, a DL-based (in particular, a CNN-based) solution provides state-of-the-art performance on segmentation and classification tasks. Several factors contribute to this. First, modern activation functions overcome shortcomings of earlier DL models. Second, dropout makes the networks more robust. Third, training CNN models [3] requires specific augmentation procedures, which are explained further below. Classification models are commonly validated and evaluated on massive datasets such as ImageNet, where the outputs are class probability values and the tasks are single-label. Additionally, compact architectural variants of these models [4] are employed for semantic image segmentation tasks.

Vessel segmentation, for example, may be used to identify distinct disorders that involve the blood vessels in different ways. A variety of disorders can cause changes in the width and curvature of the retinal vessels. Detecting many of these pathologies, including glaucoma, hypertension, and diabetic retinopathy, at an early stage is therefore essential [5]. Glaucoma, hypertension, and diabetic retinopathy are among the pathologies that cause vision to deteriorate in working-age people. Segmentation of skin lesions, likewise, supports detecting and investigating cancerous skin growth in its initial stages. Melanoma, the deadliest type of skin cancer, is caused by melanocytes that grow uncontrollably. Dermoscopy is a noninvasive imaging method that records the skin's surface using light amplification and immersion fluids. Dermoscopic pictures of melanoma can nevertheless appear ambiguous or misleading to dermatologists. A relative five-year survival rate of 92 percent is achievable if melanoma is discovered in its early stages [6, 7].

Deep learning networks produce outstanding results and are being used in clinical imaging to improve upon previously developed hand-crafted techniques. Because of the large number of network parameters, such networks require a considerable amount of data to train and to generalize well. Large (and well-annotated) datasets are difficult to obtain, which presents a significant challenge in medical image segmentation [8, 9]. Clinical image segmentation requires labelling individual pixels in clinical images rather than assigning a single label to the whole image. The fully convolutional network (FCN) [10] has been shown to be one of the most powerful deep networks available for image segmentation.

Extending this CNN idea to a U-Net makes it possible to achieve strong results without training on a large quantity of data. This framework is composed of encoding and decoding paths. The encoding path progressively reduces the feature maps to smaller spatial dimensions [11]. The decoding path then applies up-convolutions of the same scale as the input to construct segmentation maps from this compressed representation. A number of U-Net extensions have been proposed so far [12, 13]. The most significant disadvantage is that each set of feature maps produced by the encoding procedure is processed independently.

From an architectural standpoint, a CNN for classification tasks requires an encoder and outputs a class probability. Convolution operations are interleaved with activation functions and subsampling layers that decrease the spatial dimensionality of the feature maps. As input samples travel through the network layers, the number of feature maps grows while their spatial size shrinks [14]. This is depicted in the first half of the model in Figure 1 (in blue). The number of feature maps typically increases at the deeper tiers of the network hierarchy. A SoftMax layer then produces the target class probabilities.
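As a hedged illustration of the subsampling step described above (not the authors' code; the function name and shapes are hypothetical), a 2×2 max-pooling operation halves each spatial dimension of a feature map while leaving the channel count untouched:

```python
import numpy as np

def max_pool2d(x, k=2):
    """k x k max pooling with stride k on a (C, H, W) feature tensor."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).max(axis=(2, 4))

# A toy 16-channel 32x32 feature map: pooling halves each spatial dimension.
fmap = np.random.rand(16, 32, 32)
pooled = max_pool2d(fmap)
print(pooled.shape)  # (16, 16, 16)
```

In an encoder, a subsequent convolution typically doubles the channel count, so the representation trades spatial resolution for semantic depth.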

Segmentation tasks require both convolutional encoding and decoding units in the design, rather than a classifier head. The encoding unit converts the input images into a relatively small set of feature maps. The decoding unit then performs up-convolution (deconvolution) operations to construct segmentation feature maps with dimensions comparable to the original image. As a result, a segmentation architecture often requires roughly twice as many network parameters as a classification design.

This paper presents BiDCU-Net, an improved U-Net variant that augments the skip connections with BiConvLSTM and reuses features through densely connected convolutional layers [15–43]. The features of the corresponding encoding layer have a higher resolution, whereas the feature maps obtained from the preceding up-convolutional layers carry more semantic information. Combining these two types of feature maps with nonlinear functions is preferable to merely concatenating them, since it results in more dependable segmentation performance. We therefore add BiConvLSTM to the skip connections of the U-Net architecture described in this article. Within each block, the features that have been obtained are carried over to the next block. As a result of this strategy, the network can learn a broad variety of attributes that depend on previously computed features while avoiding redundant re-learning of the same functions.

2. Background

Semantic segmentation, aided by multiple medical image datasets and computer vision, is a very important field of study in which deep convolutional neural networks (DCNNs) are used to classify each individual pixel in an image. Prior to the deep learning revolution, traditional machine learning approaches depended mostly on hand-crafted features to classify pixels individually [18, 19]. Several models have shown in recent years that deeper networks are the most successful for tasks such as detection and segmentation. However, the vanishing gradient problem makes very deep models difficult to train; it is mitigated by modern activation functions such as Exponential Linear Units (ELUs) and Rectified Linear Units (ReLUs). Another alternative is a deep residual model [20–22], which uses identity mappings to speed up the training process.

Furthermore, when it comes to the segmentation of natural images, FCN-based CNN segmentation approaches outperform their counterparts. Patch-based implementations are extremely computationally costly because of the large number of network parameters they contain (about 134.5 million). Their most important constraint is the considerable overlap between neighbouring patches, which means the same convolutions are computed repeatedly. With the introduction of recurrent neural networks (RNNs), the efficiency of FCNs has increased, and they are being tuned on increasingly large datasets [23]. SegNet is made up of two primary components: an encoder network and a decoder network [24–26]. The encoder is the 13-layer VGG16 network, and the corresponding decoder ends in a pixel-wise classification layer. The key contribution of that work is how the decoder upsamples its lower-resolution input feature maps. In 2015, a new SegNet variant, known as Bayesian SegNet [27], was proposed as an upgrade to the original. These architectures are studied from a machine vision standpoint. Numerous deep learning structures have also been expressly proposed for medical image segmentation, as they must contend with scarce data and class imbalance in medical images.

"U-Net" was one of the earliest and most widely used methods for semantic segmentation of medical images when it was introduced. The U-Net model's essential design is shown in Figure 1. The network is composed of two essential components: a convolutional encoder and a decoder [28]. Simple convolutional operations, followed by ReLU activations, are carried out in both parts of the network. For downsampling, the encoding unit performs max-pooling operations. During decoding, up-convolution is carried out to upsample the feature maps. In the original U-Net, feature maps are cropped and copied from the encoder to the decoder. The U-Net design offers various advantages for segmentation tasks. First, such a model can exploit global location and context at the same time. Second, it works with only a small number of training samples while still performing well on segmentation tasks [29]. Third, the complete picture is processed end to end in the forward pass, and segmentation maps are created directly from the image data. Because of this, U-Net effectively covers the whole context of the input images, in contrast to patch-based segmentation [30, 31].

In addition, U-Net is now widely used in a broad range of applications. Many different U-Net variants have since been developed, including one specialized for CNN-based medical imaging data segmentation [32, 33]. That model modifies the original U-Net architecture in two ways: multiple segmentation maps are blended, and forward feature maps are summed and merged across the whole network. Feature maps are gathered and summed from many levels of the encoding and decoding systems. The authors report performance gains during training, but no effect was shown when summed features were incorporated at test time [34], in contrast to U-Net. Their analysis nevertheless showed that summed features affect network performance. Residual networks combined with U-Net have been used for biomedical image segmentation tasks [35], making it possible to objectively analyze the effects of skip connections.

This study proposes BiDCU-Net as an upgrade to U-Net that delivers significantly better output on current segmentation tasks. Furthermore, the convergence rate of a network of this type is significantly improved by BN. To attain our objectives, we experiment with several retinal imaging datasets.

The primary contribution is the proposal of a novel U-shaped deep learning architecture that makes use of lightweight convolution blocks [44], with the goal of maintaining a high level of segmentation performance while minimizing computational complexity. The second key contribution is preprocessing and data augmentation techniques tailored to the properties of retinal images and blood vessels.

The contribution of this work is the proposed BiDCU-Net, which combines U-Net, BiConvLSTM, and densely connected convolutions, exploiting the capabilities of each: the skip connections apply BiConvLSTM to the states from both directions, and the final encoding stage uses densely connected convolutions. Encoding, decoding, batch normalization (BN), and BiConvLSTM are the network's main components.

3. Proposed Framework: An Efficient Retinal Segmentation-Based Deep Learning Framework for Disease Prediction (EDLFDPRS)

As shown in Figure 2, the encoding (contracting) path of BiDCU-Net is separated into four phases. Each phase uses two convolutional filters, a ReLU, and a max-pooling operation, and the number of feature maps doubles at each level. The contracting path extracts image representations layer by layer while increasing their channel dimension, and ends in a multidimensional image representation that contains important semantic information. At the end of the encoding procedure, the original U-Net has a series of convolutional layers. Feeding the input through a series of convolutional layers lets the network identify distinct features; however, the network may acquire duplicate features as the convolutions continue. Densely connected convolutions [36] are introduced as a solution to this problem. They allow the network to reuse feature maps efficiently through a form of "collective knowledge": the feature maps of all preceding convolutional layers are concatenated with the feature map of the current layer and used as the input to the subsequent convolution. This concept has a number of advantages over ordinary convolutions. It lets the network use a variety of feature maps instead of duplicates, and by allowing information to flow through the network and reusing features, it improves the network's representational capacity. Densely connected convolutions also help prevent exploding or vanishing gradients [37], and the gradients propagate back through the network faster. The proposed network adopts this idea, so a single block contains two successive convolutions.
The last convolutional layers of the encoding path form a succession of densely connected blocks, as seen in Figure 3.

Consider $x_\ell$ to be the output of the $\ell$-th convolutional block and $H_\ell(\cdot)$ its convolution function. With dense connectivity, the input of block $\ell$ is the concatenation of the feature maps of all previous convolutional blocks,

$$[x_0, x_1, \ldots, x_{\ell-1}], \qquad (1)$$

and the block's output is

$$x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}]). \qquad (2)$$

That is, we use $x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}])$ rather than $x_\ell = H_\ell(x_{\ell-1})$.
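As a hedged sketch of this dense connectivity pattern (not the authors' code; the function names and shapes are hypothetical, and a 1×1 convolution stands in for the real 3×3 convolutions of $H_\ell$ to keep the example short):

```python
import numpy as np

def conv_block(x, out_channels, rng):
    """Stand-in for H_l: a 1x1 convolution (a per-pixel linear map) plus
    ReLU. Real dense blocks use 3x3 kernels; 1x1 keeps the sketch short."""
    w = rng.standard_normal((out_channels, x.shape[0]))
    return np.maximum(0.0, np.einsum('oc,chw->ohw', w, x))

rng = np.random.default_rng(0)
growth = 8                                        # channels added per block
features = [rng.standard_normal((growth, 16, 16))]  # x_0
input_channels = []
for _ in range(3):                                # three densely connected blocks
    block_in = np.concatenate(features, axis=0)   # [x_0, x_1, ..., x_{l-1}]
    input_channels.append(block_in.shape[0])
    features.append(conv_block(block_in, growth, rng))  # x_l = H_l([...])

print(input_channels)  # [8, 16, 24] -- each block sees all previous maps
```

The growing channel count of each block's input is exactly the concatenation of equation (1); the earlier feature maps are reused rather than recomputed.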

In the decoding part, each stage begins by upsampling the output of the previous layer. In the standard U-Net, the corresponding feature maps from the encoding part are cropped and copied to the decoder, where they are concatenated with the result of the upsampling operation. In BiDCU-Net, a BiConvLSTM is used instead to analyze these two kinds of feature maps in greater depth. Let

$$X_e = \{x_e^1, x_e^2, \ldots, x_e^{F_\ell}\} \qquad (3)$$

be the set of feature maps copied from the encoding part, and let

$$X_d = \{x_d^1, x_d^2, \ldots, x_d^{F_{\ell+1}}\} \qquad (4)$$

be the set of feature maps produced by the preceding convolutional layer of the decoder, where $F_\ell$ is the number of feature maps at level $\ell$ and $x^i$ is the $i$-th feature map. As shown in Figure 3, $X_d$ is first passed through an up-convolutional layer, in which an upsampling operation doubles the size of each feature map while a convolution halves the number of feature channels, resulting in $X_d^{up}$. In other words, the expanding path enlarges the feature maps layer by layer to recover the original image size at the last layer.
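A hedged NumPy sketch of this up-convolution step (illustrative only; nearest-neighbour upsampling and a hypothetical 1×1 convolution stand in for a learned transposed convolution):

```python
import numpy as np

def up_conv(x, rng):
    """Up-convolution sketch: nearest-neighbour upsampling doubles H and W,
    then a 1x1 convolution halves the number of channels."""
    c, h, w = x.shape
    up = x.repeat(2, axis=1).repeat(2, axis=2)   # (C, 2H, 2W)
    wgt = rng.standard_normal((c // 2, c))       # project C -> C/2 channels
    return np.einsum('oc,chw->ohw', wgt, up)

rng = np.random.default_rng(0)
xd = rng.standard_normal((64, 14, 14))           # X_d from the deeper layer
xd_up = up_conv(xd, rng)                         # X_d^{up}
print(xd_up.shape)  # (32, 28, 28)
```

After this step, $X_d^{up}$ has the same spatial size and channel count as the encoder maps $X_e$ at that level, so the two can be processed together.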

The output of the up-convolutional layer, $X_d^{up}$, then undergoes a BN step. One difficulty encountered during training is that the distribution of activations varies across the hidden units. Because each layer must adapt to a different distribution in each training step, training takes longer. BN [38] increases the stability of a neural network by normalizing the input of a layer: it subtracts the batch mean and divides by the batch standard deviation before feeding the standardized input to the next layer. BN accelerates the training of neural networks, and the model's accuracy is typically improved by its small regularizing effect [39].
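The normalization just described can be sketched in NumPy (a minimal inference-style sketch with scalar scale and shift; a real BN layer learns per-channel gamma and beta and tracks running statistics):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each channel over the batch and spatial axes of an
    (N, C, H, W) tensor, then apply scale (gamma) and shift (beta)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Activations drawn far from zero mean / unit variance...
batch = np.random.default_rng(1).normal(5.0, 3.0, size=(4, 8, 16, 16))
out = batch_norm(batch)
# ...come out standardized:
print(round(abs(out.mean()), 6), round(out.std(), 3))  # 0.0 1.0
```

Standardizing each layer's input this way keeps the distributions seen by later layers stable between training steps, which is the source of the speed-up mentioned above.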

The output of the BN step is then fed to a BConvLSTM layer. The standard LSTM has a significant shortcoming: it uses full connections for the input-to-state and state-to-state transitions, so no spatial correlation is encoded. ConvLSTM, which replaces these operations with convolutions in the input-to-state and state-to-state transitions, was proposed as a solution to this problem [40]. It consists of four components: an input gate $i_t$, an output gate $o_t$, a forget gate $f_t$, and a memory cell $C_t$. The input, output, and forget gates control access to, updates of, and clearing of the memory cell, respectively. ConvLSTM can be formulated as

$$i_t = \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i),$$
$$f_t = \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f),$$
$$C_t = f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c),$$
$$o_t = \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o),$$
$$H_t = o_t \circ \tanh(C_t), \qquad (5)$$

where $*$ and $\circ$ denote the convolution and Hadamard product, respectively, $X_t$ is the input tensor, $H_t$ is the hidden state tensor, $C_t$ is the memory cell tensor, $W_{x\ast}$ and $W_{h\ast}$ are the input and hidden 2D convolution kernels, and $b_i$, $b_f$, $b_c$, and $b_o$ are the bias terms.
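A hedged NumPy sketch of one ConvLSTM step, following the gate equations above (illustrative only: 1×1 kernels are used so that each convolution reduces to a per-pixel linear map, and all parameter names are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, p):
    """One ConvLSTM step on (C, H, W) tensors. '*' in the paper is a
    convolution (here 1x1, i.e. a per-pixel linear map) and 'o' is the
    Hadamard (element-wise) product used for the peephole terms."""
    lin = lambda w, t: np.einsum('oc,chw->ohw', w, t)
    i = sigmoid(lin(p['wxi'], x) + lin(p['whi'], h) + p['wci'] * c + p['bi'])
    f = sigmoid(lin(p['wxf'], x) + lin(p['whf'], h) + p['wcf'] * c + p['bf'])
    c_new = f * c + i * np.tanh(lin(p['wxc'], x) + lin(p['whc'], h) + p['bc'])
    o = sigmoid(lin(p['wxo'], x) + lin(p['who'], h) + p['wco'] * c_new + p['bo'])
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
cin, ch, hgt, wid = 3, 4, 8, 8
p = {k: 0.1 * rng.standard_normal((ch, cin)) for k in ('wxi', 'wxf', 'wxc', 'wxo')}
p.update({k: 0.1 * rng.standard_normal((ch, ch)) for k in ('whi', 'whf', 'whc', 'who')})
p.update({k: 0.1 * rng.standard_normal((ch, hgt, wid))
          for k in ('wci', 'wcf', 'wco', 'bi', 'bf', 'bc', 'bo')})
h = c = np.zeros((ch, hgt, wid))
h, c = convlstm_step(rng.standard_normal((cin, hgt, wid)), h, c, p)
print(h.shape, c.shape)  # (4, 8, 8) (4, 8, 8)
```

Because the gates are produced by (1×1) convolutions over spatial maps rather than dense matrices, the hidden state retains the spatial layout of the feature maps, which is the point of ConvLSTM.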

BConvLSTM is used to encode and propagate information in both directions within this network. BConvLSTM analyzes the input data in both the forward and the backward direction using two ConvLSTMs, and then decides on the current input by addressing the data dependencies in both directions. A traditional ConvLSTM only takes forward dependencies into account; taking the backward dependencies into account as well has been shown to improve statistical performance [41]. Each direction can be thought of as a standard ConvLSTM, so there are two sets of parameters: one for the forward and one for the backward state. The BConvLSTM output is computed as

$$Y_t = \tanh\big(W_y^{\overrightarrow{H}} * \overrightarrow{H}_t + W_y^{\overleftarrow{H}} * \overleftarrow{H}_t + b\big), \qquad (6)$$

where $\overrightarrow{H}_t$ and $\overleftarrow{H}_t$ denote the hidden state tensors of the forward and backward states, respectively, $b$ denotes the bias term, and $Y_t$ is the final output, which accounts for bidirectional spatiotemporal information. The hyperbolic tangent ($\tanh$) combines the forward and backward outputs in a nonlinear fashion. To train the network, we employ the same energy (loss) function as the original U-Net.
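A hedged sketch of the bidirectional combination in equation (6) (illustrative only: a toy element-wise recurrence stands in for the full gated ConvLSTM step, and all names and scalar weights are hypothetical):

```python
import numpy as np

def run_direction(xs, a=0.5, b=0.3):
    """Toy ConvLSTM stand-in: an element-wise recurrence over the sequence.
    (A real BConvLSTM would use the full gated ConvLSTM step here.)"""
    h = np.zeros_like(xs[0])
    out = []
    for x in xs:
        h = np.tanh(a * x + b * h)
        out.append(h)
    return out

def bconvlstm(xs, wf=0.7, wb=0.7, bias=0.0):
    """Combine forward and backward hidden states as in equation (6):
    Y_t = tanh(Wf o Hf_t + Wb o Hb_t + b)."""
    hf = run_direction(xs)                 # forward pass
    hb = run_direction(xs[::-1])[::-1]     # backward pass, re-aligned to t
    return [np.tanh(wf * f + wb * bk + bias) for f, bk in zip(hf, hb)]

# In BiDCU-Net the "sequence" has two steps: the encoder feature maps X_e
# and the upsampled decoder feature maps X_d^{up}.
rng = np.random.default_rng(2)
seq = [rng.standard_normal((4, 8, 8)) for _ in range(2)]
ys = bconvlstm(seq)
print(len(ys), ys[0].shape)  # 2 (4, 8, 8)
```

Each output $Y_t$ thus depends on the whole sequence in both directions, not just on the steps before $t$.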

4. Experimental Results

Three prominent datasets, DRIVE, STARE, and CHASE_DB1, were used to investigate the segmentation of retinal blood vessels. The DRIVE dataset is organized as follows: it contains a total of 40 color retinal images, with 20 samples used for training and the other 20 used for testing. The original images are 565 × 584 pixels. To create a square data collection, the images are clipped to contain only the data between positions 9 and 574 along the longer axis, resulting in square frames. We randomly picked 171,000 patches for training from a total of 190,000 patches extracted from the 20 training images of the DRIVE dataset.

Patches of the same size are extracted from each of the three datasets, as shown in Figure 4. STARE consists of 20 color images, each 700 × 605 pixels. The small sample size necessitates one of two commonly used methodologies for training and testing on this dataset. In the first, training examples are chosen at random from the 20 images. In the second, known as the "leave-one-out" procedure, one sample is evaluated and the remaining 19 are used for training, so there is never any overlap between the training and testing samples. The "leave-one-out" method is used for the STARE dataset in this implementation. CHASE_DB1 is a dataset consisting of 28 images, each with a resolution of 999 × 960 pixels. The dataset is divided into two groups, with samples chosen at random for each: 20 samples are used for training and the remaining 8 for testing.
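The random patch extraction described above can be sketched as follows (a hedged illustration; the 565-pixel image size and 48-pixel patch size are assumptions for the example, not values stated in this section):

```python
import numpy as np

def random_patches(image, patch_size, n, rng):
    """Sample n random square patches from an (H, W, C) image."""
    h, w = image.shape[:2]
    ys = rng.integers(0, h - patch_size + 1, size=n)
    xs = rng.integers(0, w - patch_size + 1, size=n)
    return np.stack([image[y:y + patch_size, x:x + patch_size]
                     for y, x in zip(ys, xs)])

rng = np.random.default_rng(0)
fundus = rng.random((565, 565, 3))       # a square retinal image (assumed size)
patches = random_patches(fundus, 48, 100, rng)
print(patches.shape)  # (100, 48, 48, 3)
```

Sampling many small overlapping patches per image is what turns 20 training images into hundreds of thousands of training samples.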

The framework was validated on test data from the DRIVE, STARE, and CHASE_DB1 databases, following Tamim et al. [42]. Tables 1–3 as well as Figures 5–7 show the F1 score, accuracy, sensitivity, specificity, and AUC. As the tables indicate, the efficiency of BiDCU-Net is higher than that of the other recent models. The average F1 score of BiDCU-Net for DRIVE, STARE, and CHASE_DB1 is 83.85, 82.3, and 81.94, respectively.

The area under the curve (AUC) is the measure of the capacity of a classifier to discriminate between classes and is used as a summary of the ROC curve. The greater the AUC, the better the performance of the model in distinguishing between the positive and negative groups.
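The AUC can be computed directly from its rank-statistic interpretation; a minimal NumPy sketch (illustrative only, ties between scores are not handled):

```python
import numpy as np

def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney U) formulation: the probability
    that a randomly chosen positive pixel scores higher than a randomly
    chosen negative one. Ties are not handled in this sketch."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

labels = np.array([0, 0, 1, 1])
perfect = np.array([0.1, 0.2, 0.8, 0.9])   # separates the classes completely
random_ = np.array([0.8, 0.2, 0.9, 0.1])   # no useful separation
print(auc(perfect, labels), auc(random_, labels))  # 1.0 0.5
```

An AUC of 1.0 means every vessel pixel outscores every background pixel, while 0.5 is chance-level discrimination.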

Table 2 compares each technique with a variety of additional state-of-the-art methodologies, as indicated in the headers, on the DRIVE, STARE, and CHASE datasets. Based on BiDCU-Net's highest F1 score and highest AUC, we find that the background and blood vessels can be segmented more reliably than with the other methods. The F1 score indicates high recall and precision, and thus a successful implementation of our strategy. BiDCU-Net also has the highest specificity and sensitivity, implying that more vessel pixels are labelled correctly. Vascular pixels make up a modest percentage of all image pixels most of the time; because of this class imbalance, segmenting the retinal blood vessels is challenging. As a result, our technique's high sensitivity is critical: a computer-aided diagnosis device could detect blood vessels without producing false positives. BiDCU-Net also displays maximal accuracy over a similar period of time.
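The pixel-wise metrics reported in these tables follow directly from the confusion matrix of a binary vessel mask; a minimal NumPy sketch (the function name and the tiny masks are illustrative):

```python
import numpy as np

def vessel_metrics(pred, truth):
    """Pixel-wise accuracy, sensitivity, specificity, and F1 score for
    binary vessel masks (1 = vessel, 0 = background)."""
    tp = np.sum((pred == 1) & (truth == 1))
    tn = np.sum((pred == 0) & (truth == 0))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    sens = tp / (tp + fn)                  # recall on vessel pixels
    spec = tn / (tn + fp)                  # recall on background pixels
    prec = tp / (tp + fp)
    return {'accuracy': (tp + tn) / pred.size,
            'sensitivity': sens,
            'specificity': spec,
            'f1': 2 * prec * sens / (prec + sens)}

truth = np.array([[1, 1, 0, 0], [0, 0, 0, 0]])
pred  = np.array([[1, 0, 0, 0], [0, 0, 1, 0]])
m = vessel_metrics(pred, truth)
print(m['accuracy'], m['sensitivity'])  # 0.75 0.5
```

Because vessel pixels are rare, accuracy alone is misleading here: a mask that predicts all background scores 0.75 accuracy on this example but zero sensitivity, which is why the tables report sensitivity, specificity, and F1 as well.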

Table 4 and Figure 8 show comparison results of the proposed method against existing methods on various performance metrics. Except for the F1 score, all remaining metrics of the EDLFDPRS method are better than those of the existing methods on the DRIVE dataset. On the CHASE dataset, the proposed method improves the F1 score over the existing methods. Finally, the proposed method proves that it has better accuracy when compared with the existing methods.

5. Conclusion

To construct an effective framework, this research work combines a bidirectional ConvLSTM network, U-Net, and densely connected convolutions. The network includes a densely connected convolutional block to make more discriminative information available, leading to more reliable segmentation results. Introducing BN after the up-convolutional layer also sped up the network's convergence by roughly a factor of six. The proposed method achieves accuracies of 97.32%, 97.33%, and 97.44% on the DRIVE, STARE, and CHASE datasets, respectively. Compared with the existing method, the proposed method's accuracy is improved by 2.24%. The proposed work is also compared with different architectures, such as the U-Net, RU-Net, and Dense U-Net models, in segmentation tasks on all three datasets using the same number of network parameters.

Data Availability

The data are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


Acknowledgments

This work was funded by the Centre for System Design, Chennai Institute of Technology, Chennai (funding number CIT/CSD/2022/006).