Iris segmentation is a critical step in the entire iris recognition procedure. Most state-of-the-art iris segmentation algorithms are based on edge information. However, in images with specular reflections or other obstructions, a conventional edge detector finds many noisy edge points, which can mislead the localization of the pupillary and limbus boundaries. In this paper, we present an iris segmentation method that combines learning-based and edge-based algorithms. A Faster R-CNN with only six layers is designed to locate and classify the eye. Within the bounding box found by Faster R-CNN, the pupillary region is located using a Gaussian mixture model. Then, a circle is fitted to the pupillary boundary using five key boundary points. A boundary point selection algorithm is used to find boundary points of the limbus, and the circular limbus boundary is constructed from these points. Experimental results show that the proposed iris segmentation method achieves 95.49% accuracy on the challenging CASIA-Iris-Thousand database.

1. Introduction

In the 21st century, people use electronic devices (personal computers, laptops, smartphones, smart watches, etc.) to browse web-based social platforms, store personal images and videos, chat with other people by text or video, and so on. The amount of personal information stored on these devices grows every day. Biometric authentication is therefore needed to prevent unauthorized users from stealing this information from personal devices. Biometric authentication is also used in access control systems to identify unauthorized persons and prevent them from entering private buildings [1].

Among all biometric modalities, iris recognition offers the highest performance in terms of false acceptance rate (FAR) and false rejection rate (FRR) [2, 3]. As a biometric trait, the iris provides a large amount of complex texture information for identification. This paper focuses on an iris recognition system that uses iris texture for biometric identification.

A typical iris recognition system consists of six elementary steps: iris image acquisition, image preprocessing, iris boundary segmentation, iris image normalization, feature extraction, and feature matching [4, 5]. Iris boundary segmentation is a critical step in the entire system. In an iris image, most of the iris texture is concentrated in the region close to the pupillary boundary, so if the pupillary boundary is not accurately located, much of this texture will be missed in the feature extraction step. In most cases, the limbus boundary is obscured by eyelashes, eyelids, and specular reflections; if it is not accurately located during segmentation, many noisy features will be extracted in the feature extraction step. These features degrade the performance of the entire iris recognition system [5].

In this paper, we present a novel algorithm for iris boundary segmentation. The proposed algorithm divides iris segmentation into two tasks: locating the eye and segmenting the iris region. Determining whether a target exists in an image and locating it are the two major challenges of object detection. First, a purpose-designed Faster R-CNN model [6] is used to detect and locate the eye. Once candidate bounding boxes of the eye are obtained, a pretrained Gaussian mixture model (GMM) [7, 8] is used to fit the pupillary region. Second, an improved limbus boundary localization algorithm [9] is applied to find the limbus boundary points. Third, the iris region is located from the pupillary and limbus boundaries. Fourth, we evaluate the accuracy of our algorithm with a newly proposed evaluation method in Section 4.3. Finally, we conclude by discussing the results and the possibility of implementing the method on a mobile device.

2. Literature Review

2.1. Background of Object Detection

Object detection is the task of finding objects in an image and classifying them. In 2014, Girshick et al. [10] achieved dramatically better performance on the PASCAL VOC object detection challenge [11] using regions with CNN features (R-CNN). Their method achieved a mean average precision (mAP) of 54%, compared with 33% for the HOG-based deformable part model (DPM) [12]. Although R-CNN is accurate, it is slow: each image has around 2,000 region proposals that must each be propagated through the CNN, and three different models must be trained: the CNN that extracts image features, a support vector machine (SVM) classifier that predicts the class, and a linear regression model that tightens the bounding box to the true extent of the object.

To produce feature vectors of the same dimension for prediction, a traditional CNN [13] can only accept fixed-size (e.g., 224 × 224) input images. SPP-net [14] removes this requirement with a new pooling strategy, spatial pyramid pooling. It computes the feature maps from the entire image only once and then pools the features in subregions to generate fixed-length representations. In 2015, Girshick, the first author of R-CNN, applied the ideas of SPP-net to develop an enhanced version of R-CNN called Fast R-CNN [15]. A region of interest (RoI) pooling layer in the CNN lets all subregions of an image share a single forward pass. In Fast R-CNN, the CNN is trained jointly with the classifier and the bounding box regressor in a single model.
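To make the RoI pooling idea concrete, the following is a minimal NumPy sketch of max pooling one region of interest into a fixed grid, in the spirit of Fast R-CNN [15]. The function name `roi_max_pool`, the inclusive integer RoI coordinates, the bin-splitting rule, and the 4 × 4 output grid are illustrative assumptions, not the exact layer used in [15] or in this paper.

```python
import numpy as np


def roi_max_pool(feature_map, roi, output_size=(4, 4)):
    """Max-pool one region of interest (RoI) into a fixed out_h x out_w grid.

    feature_map: array of shape (C, H, W), e.g. the last conv layer's output.
    roi: (x0, y0, x1, y1), inclusive integer feature-map coordinates.
    Returns an array of shape (C, out_h, out_w) whatever the RoI size.
    """
    x0, y0, x1, y1 = roi
    out_h, out_w = output_size
    region = feature_map[:, y0:y1 + 1, x0:x1 + 1]
    rh, rw = region.shape[1:]
    # Split the RoI into out_h x out_w roughly equal bins.
    ys = np.linspace(0, rh, out_h + 1)
    xs = np.linspace(0, rw, out_w + 1)
    pooled = np.empty((feature_map.shape[0], out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            ya, yb = int(np.floor(ys[i])), int(np.ceil(ys[i + 1]))
            xa, xb = int(np.floor(xs[j])), int(np.ceil(xs[j + 1]))
            # Ensure every bin covers at least one cell, even for tiny RoIs.
            yb, xb = max(yb, ya + 1), max(xb, xa + 1)
            pooled[:, i, j] = region[:, ya:yb, xa:xb].max(axis=(1, 2))
    return pooled


# With 64 channels and a 4 x 4 grid, the flattened output has 64 * 4 * 4 = 1024 values.
features = roi_max_pool(np.random.rand(64, 30, 40), (5, 3, 24, 18)).reshape(-1)
print(features.shape)  # (1024,)
```

Because the output shape depends only on `output_size`, RoIs of any size yield feature vectors of the same length, which is what allows a single fully connected classifier to follow the pooling layer.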

In the R-CNN, SPP-net, and Fast R-CNN models, the candidate region proposals used to locate objects are created using selective search [16], which is a fairly slow process. This slow proposal step becomes the bottleneck of the overall pipeline. Zitnick and Dollár [17] used edge information to generate object bounding box proposals. Szegedy et al. [18, 19] developed a learning-based proposal method called multiscale convolutional MultiBox (MSC-MultiBox). Redmon et al. [20] presented another solution that predicts the bounding boxes and class probabilities directly from full images in one evaluation.

In 2016, Ren et al. [6] proposed generating region proposals automatically with a region proposal network (RPN) that shares convolutional weights with the detection CNN. This method is called Faster R-CNN and consists of two modules. The first module is a deep fully convolutional neural network (FCNN) that proposes regions of interest for object detection. The second module is a Fast R-CNN detector that uses the regions proposed by the first module to detect objects. Therefore, in Faster R-CNN only one CNN has to be trained, and its features are used for both region proposal and classification. A summary of the aforementioned object detection methods is shown in Table 1.

We chose Faster R-CNN because its model size is small compared with other deep learning models for object detection, which makes real-time iris recognition on a mobile device (e.g., a smartphone or smart glasses) feasible.

2.2. Background of Iris Segmentation

The two classic iris segmentation algorithms were proposed by Daugman, using integrodifferential operators [4], and by Wildes, using Hough transforms [21]. These methods find edge points in an iris image and then fit them with circular or elliptical models. Along these lines, Tan et al. [22] presented a method combining region clustering, semantic refinements, and well-designed integrodifferential operators. Betancourt and Silvente [23] obtained circular boundaries using QMA-OWA operators [24]. Ghodrati et al. [25] used a set of morphological operators, the Canny edge detector [26], and Hough transforms. Wang and Xiao [27] constructed a radial difference operator. Other groups used region-growing algorithms instead of edge-based ones, gradually merging highly correlated blocks in an image to obtain the iris region. Liu et al. [28] used K-means clustering for pupil detection. Yan et al. [29] applied the watershed transform [30] and region merging to structured eye images. Abate et al. [31] combined the watershed transform, region merging, and color quantization. The edge-based and region-growing algorithms estimate the iris region well, but they are not well suited to images captured under varying lighting conditions.

The active contour model [32] is another widely used approach to iris segmentation. Jarjes et al. [33] used an angular integral projection function (AIPF) [34] and an active contour model. Bastos et al. [35] combined the pulling and pushing algorithm [36] with the active contour model. Boddeti et al. [37] built on the seminal work on active contours without edges [38]. Krichen [39] used the Viterbi algorithm [40] to find a contour that maximizes the gradient along a connected contour. These algorithms use dynamic programming, combining the solutions of multiple subproblems; therefore, they need many iterations and considerably long processing times to achieve good accuracy.

Deep learning is a powerful machine learning tool that has recently shown outstanding performance in many fields, and many learning-based methods have been applied to iris segmentation. Tang and Weng [41] used an intensity operator to find the inner iris border and an SVM classifier to recognize the outer iris border. Li et al. [42] built edge detectors on a set of features including intensity, gradient, texture, and structure information to characterize edge points, and learned six class-specific boundary detectors with AdaBoost [43] to localize the pupillary and limbic boundaries. Benboudjema et al. [44] presented an implementation of triplet Markov fields (TMF) [45] for segmentation. Happold [46] trained fast structured random forests [47] to learn generalized edge detectors. Learning-based algorithms are more computationally demanding than the other algorithms, so the device needs sufficient computational resources and storage space during training.

More iris segmentation algorithms are reviewed in [48, 49]. Each method has its advantages and disadvantages. In this paper, we propose a method that combines edge-based and learning-based algorithms. The methodology is described in Section 3.

3. Proposed Method

The algorithm proposed in this paper consists of three key steps: eye detection, pupillary boundary estimation, and limbus boundary estimation. We used a Faster R-CNN model to detect the location of the eye in an image. Then, the pupillary and limbus boundaries were found using a GMM, maximization of the intensity gradient along the radial emitting path (MIGREP), and a boundary point selection algorithm. Together, these steps accurately locate the iris region.

3.1. Eye Detection

The first step in segmenting the iris region is to find (detect and locate) the eye in the image. Because distinguishing only two classes, eye and background, is a simple task, the CNN in Faster R-CNN does not need very deep convolutional layers. In this study, the original CNNs used in [6], the Zeiler and Fergus (ZF) model [50] and the Simonyan and Zisserman model (VGG-16) [51], were replaced with a newly designed network. As depicted in Figure 1, the network contained only six layers. The first convolutional layer filtered the grayscale input image with 64 kernels of size 5 × 5 × 1 with a stride of one pixel. It was followed by a rectified linear unit (ReLU) [52] layer and a local response normalization (LRN) [13] layer, which normalized over five adjacent kernel maps at the same spatial position. A max-pooling layer with 2 × 2 pooling units and a stride of two pixels followed the normalization layer. The second, third, and fourth convolutional layers each had 64 kernels of size 3 × 3 × 64, and each was followed by a batch normalization (BN) [53] layer and a ReLU layer. We use batch normalization and ReLU because they reach the same training error rate faster than saturating activation functions such as tanh, so we can train the network faster and explore more models with different parameters. ReLU is also cheaper to compute than the traditionally used activation functions. Interested readers can find a more detailed explanation in [13].

The RoI pooling layer extracted a 1024-dimensional feature vector from the output feature maps of the final convolutional layer. The fully connected layer had 128 neurons; its output, after passing through a ReLU layer, was fed to a softmax layer to produce a probability distribution over the two class labels.
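As a rough check on the model size of the six-layer network, the following Python sketch counts the trainable parameters of the layers described above. It assumes a bias in every convolutional and fully connected layer and two learnable parameters (scale and shift) per channel in each BN layer, and it ignores the RPN head and any bounding-box regression layer, which are not described here. Treat the total as an estimate for this assumed configuration, not a figure reported in the paper.

```python
def conv_params(k_h, k_w, c_in, c_out):
    """Weights plus one bias per output channel."""
    return k_h * k_w * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weight matrix plus bias vector of a fully connected layer."""
    return n_in * n_out + n_out


layers = {
    "conv1 5x5x1 -> 64": conv_params(5, 5, 1, 64),
    "conv2 3x3x64 -> 64": conv_params(3, 3, 64, 64),
    "conv3 3x3x64 -> 64": conv_params(3, 3, 64, 64),
    "conv4 3x3x64 -> 64": conv_params(3, 3, 64, 64),
    "BN after conv2-4 (scale, shift)": 3 * 2 * 64,
    "fc 1024 -> 128": fc_params(1024, 128),
    "softmax 128 -> 2": fc_params(128, 2),
}
# ReLU, LRN, max pooling and RoI pooling have no trainable parameters.
for name, n in layers.items():
    print(f"{name:34s}{n:>9,d}")
print(f"{'total':34s}{sum(layers.values()):>9,d}")  # total 244,290

# A 1024-dim RoI feature over 64 channels implies a 4 x 4 pooling grid.
grid_side = int((1024 // 64) ** 0.5)
```

Under these assumptions the fully connected layer accounts for more than half of the parameters, which is consistent with keeping it small (128 neurons). The last line only shows that a 1024-dimensional RoI feature over 64 channels would correspond to a 4 × 4 pooling grid; the grid size itself is not stated in the text.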

3.2. Gaussian Mixture Model

After Faster R-CNN generated the candidate eye regions, only the bounding box with the maximum eye-class score and an appropriate aspect ratio was selected for fitting the pupillary region. We originally planned to use a second Faster R-CNN model trained specifically to detect the pupillary region. However, its results were not as accurate as those of the eye-region model, and running two Faster R-CNN models was not fast enough for a real-time iris recognition system. Hence, we chose the Gaussian mixture model as our pupil detection method.

The GMM was built using the expectation maximization (EM) algorithm [7] on a set of features including the normalized pixel coordinates, pixel values filtered by a 5 × 5 local median filter, and pixel values filtered using Gabor filters (see Figure 2). A GMM is parameterized by the mixture component weights, the component means, and the covariance matrices. For a GMM with $K$ components, the $k$-th component has mean $\mu_k$ and covariance matrix $\Sigma_k$. The probability density of the GMM can be expressed using the following equations:

$$p(x \mid \theta) = \sum_{k=1}^{K} \phi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),$$

$$\mathcal{N}(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{D/2} \lvert \Sigma_k \rvert^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k)\right),$$

$$\sum_{k=1}^{K} \phi_k = 1,$$

where $\theta = \{\phi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$ is the parameter set, $D$ is the dimension of the feature vector $x$, and $\phi_k$ is the mixture weight of component $k$; the weights sum to one. $\mu_k$ and $\Sigma_k$ are the mean and covariance matrix of component $k$, for $k = 1, \dots, K$. In the training stage, the model was trained using the EM algorithm, a maximum likelihood estimation technique. EM for a GMM consists of two steps. The first, the expectation (E) step, computes the expectation (responsibility) of component $k$ for each datum $x_i$, given the current model parameters $\phi_k$, $\mu_k$, and $\Sigma_k$. The second, the maximization (M) step, maximizes the expectations computed in the E step with respect to the model parameters and updates $\phi_k$, $\mu_k$, and $\Sigma_k$. The two steps are repeated until the algorithm converges to a maximum likelihood estimate. Because the number of components $K$ is not known a priori in this task, the method of [8] is used to adjust it automatically during training.
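The EM iteration described above can be sketched in NumPy as follows. This is a generic full-covariance EM under stated assumptions: the number of components `K` is fixed (the automatic selection of [8] is not reproduced), the farthest-point initialization and the small ridge `reg` added to each covariance are illustrative choices, and `fit_gmm` is a hypothetical name rather than the authors' MATLAB code.

```python
import numpy as np


def log_gaussian(X, mean, cov):
    """log N(x | mean, cov) for every row x of X."""
    d = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    mahal = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    return -0.5 * (mahal + logdet + d * np.log(2.0 * np.pi))


def fit_gmm(X, K, n_iter=200, tol=1e-6, reg=1e-6, seed=0):
    """Fit a K-component full-covariance GMM to X (n x d, d >= 2) with EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Farthest-point initialisation of the means (an assumption, not from [7]).
    idx = [int(rng.integers(n))]
    for _ in range(1, K):
        dist2 = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(axis=2).min(axis=1)
        idx.append(int(np.argmax(dist2)))
    mu = X[idx].astype(float)
    sigma = np.array([np.cov(X, rowvar=False) + reg * np.eye(d) for _ in range(K)])
    phi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E step: responsibilities gamma[i, k] = p(component k | x_i), in log space.
        log_p = np.stack([np.log(phi[k]) + log_gaussian(X, mu[k], sigma[k])
                          for k in range(K)], axis=1)
        m = log_p.max(axis=1, keepdims=True)
        log_norm = m + np.log(np.exp(log_p - m).sum(axis=1, keepdims=True))
        gamma = np.exp(log_p - log_norm)
        ll = float(log_norm.sum())
        # M step: re-estimate weights, means, and covariances.
        nk = gamma.sum(axis=0)
        phi = nk / n
        mu = (gamma.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / nk[k] + reg * np.eye(d)
        if ll - prev_ll < tol:  # stop once the log-likelihood stops improving
            break
        prev_ll = ll
    return phi, mu, sigma
```

Working in log space with the log-sum-exp trick avoids numerical underflow in the E step when a feature vector lies far from every component.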

3.3. Pupillary Boundary Estimation

A well-trained GMM can fit the pupillary region inside the region proposal. Usually, the result is a single candidate pupillary region in each image. However, in some cases the GMM fits multiple regions, including the pupillary region, eyelashes, eyelids, specular reflections, and noisy points. We used a three-step process with three image processing operations (grouping, filling, and morphological opening) to discard the noisy regions and keep only one candidate pupillary region. In Figure 3, each row presents one test eye image: the left column shows the region proposals produced by Faster R-CNN, the middle column shows the candidate pupillary regions predicted by the GMM, and the right column shows the final smoothed region after the image processing operations.

For each pixel in the image, the GMM calculated probability scores for the pupillary and background classes. Based on these scores, candidate pupillary pixels were obtained, and the candidate region was then smoothed and cleaned of noise. The first step groups the candidate pixels predicted by the GMM into regions using an eight-connected neighborhood. Each subregion was then checked for two conditions: it contained more than 250 pixels, and its major axis was less than 1.15 times its minor axis. The largest subregion meeting both conditions was taken as the pupillary region; if no region met them, the largest region was selected instead. The second step fills the holes inside the region. Finally, a morphological opening with a 4 × 4 square structuring element was applied to smooth the region. In mathematical morphology, opening (an erosion followed by a dilation) removes objects smaller than the structuring element and then restores the shape of the remaining region. When holes occurred at the edge of the region, as shown in the bottom row of Figure 3, the filling step prevented the opening operator from creating new cracks in the region. More importantly, the opening operator not only smoothed the pupillary region but also eliminated noisy points, as shown in the top row of Figure 3.
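The grouping and filling steps above can be sketched in plain NumPy as follows. The helper names are hypothetical, the input is assumed to be a boolean mask of candidate pupil pixels, and the axis ratio is computed from the eigenvalues of the region's coordinate covariance, which approximates the usual major/minor axis definition but is not guaranteed to match the one used by the authors.

```python
from collections import deque

import numpy as np

N8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def flood(mask, start, visited, neighbors):
    """Breadth-first search: pixels of `mask` connected to `start`."""
    h, w = mask.shape
    queue, pixels = deque([start]), []
    visited[start] = True
    while queue:
        y, x = queue.popleft()
        pixels.append((y, x))
        for dy, dx in neighbors:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not visited[ny, nx]:
                visited[ny, nx] = True
                queue.append((ny, nx))
    return pixels


def axis_ratio(pixels):
    """Major/minor axis ratio of the ellipse with the region's second moments."""
    pts = np.asarray(pixels, dtype=float)
    if len(pts) < 3:
        return np.inf
    evals = np.linalg.eigvalsh(np.cov(pts, rowvar=False))  # ascending order
    return float(np.sqrt(evals[-1] / evals[0])) if evals[0] > 0 else np.inf


def select_pupil(mask, min_pixels=250, max_ratio=1.15):
    """Step 1: 8-connected grouping, then keep the most pupil-like region."""
    visited = np.zeros(mask.shape, dtype=bool)
    regions = []
    for y, x in zip(*np.nonzero(mask)):
        if not visited[y, x]:
            regions.append(flood(mask, (y, x), visited, N8))
    out = np.zeros(mask.shape, dtype=bool)
    if not regions:
        return out
    ok = [r for r in regions if len(r) > min_pixels and axis_ratio(r) < max_ratio]
    best = max(ok or regions, key=len)  # fall back to the largest region
    ys, xs = zip(*best)
    out[list(ys), list(xs)] = True
    return out


def fill_holes(region):
    """Step 2: fill background pixels that are not 4-connected to the border."""
    h, w = region.shape
    background = ~region
    reached = np.zeros(region.shape, dtype=bool)
    border = [(y, x) for y in range(h) for x in (0, w - 1)]
    border += [(y, x) for x in range(w) for y in (0, h - 1)]
    for p in border:
        if background[p] and not reached[p]:
            flood(background, p, reached, N4)
    return region | (background & ~reached)
```

The opening step is omitted here; if SciPy is available, `scipy.ndimage.binary_opening(region, structure=np.ones((4, 4), dtype=bool))` performs a morphological opening with a 4 × 4 square structuring element.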

Once the pupillary region was determined, the coordinates of its center point were easily obtained. To recover the pupillary boundary precisely, pixel scans along the column and the row through the center point were performed to find the lower, left, and right end points. Because the top of the pupil might be obscured by the upper eyelid, the top end point found by the column scan could differ from the actual pupillary boundary point. Instead, an additional row scan was performed at a fixed distance above the center point, and the two end points of this scan were collected in place of the top end point. In total, five key boundary points were obtained through these pixel scans; the full procedure is shown in Figure 4. The five points were denoted by their coordinates $(x_i, y_i)$, $i = 1, \dots, 5$, and the parameters of an approximating circle were computed from them using Equation (4). The circle with these parameters accurately located the pupillary boundary, as shown in Figure 5.
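Equation (4) is not reproduced here. As one standard way to fit a circle to a handful of boundary points, the sketch below uses the algebraic least-squares (Kåsa) fit; it is an assumption that this matches Equation (4), and the five example points are hypothetical.

```python
import numpy as np


def fit_circle(points):
    """Least-squares circle through (x, y) points via x^2 + y^2 + D x + E y + F = 0."""
    pts = np.asarray(points, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    A = np.column_stack([x, y, np.ones_like(x)])
    b = -(x ** 2 + y ** 2)
    (D, E, F), *_ = np.linalg.lstsq(A, b, rcond=None)
    xc, yc = -D / 2.0, -E / 2.0
    r = np.sqrt(xc ** 2 + yc ** 2 - F)
    return float(xc), float(yc), float(r)


# Five hypothetical key points (image coordinates, y grows downwards) on a pupil
# centred at (50, 40) with radius 12: lower, left, right, and two points on a row
# scanned 8 px above the centre.
dx = np.sqrt(12 ** 2 - 8 ** 2)
points = [(50, 52), (38, 40), (62, 40), (50 - dx, 32), (50 + dx, 32)]
print(fit_circle(points))  # approximately (50.0, 40.0, 12.0)
```

Three non-collinear points already determine a circle exactly; using five points lets the least-squares solution average out small localization errors from the pixel scans.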

3.4. Limbus Boundary Estimation

After the pupillary region and its boundary were located, the limbus boundary was estimated. An enhanced version of MIGREP [9] was applied to estimate a coarse limbus boundary. MIGREP casts a set of radially emitting paths outward from the pupillary center, which requires two distances to be defined in advance: $r_{\text{start}}$, the distance from the pupillary center to the starting points of the emitting paths, and $r_{\text{end}}$, the distance from the pupillary center to their end points. In [9], these two parameters were predefined and could not adapt to different input images at runtime. In this work, both were adjusted dynamically according to the size of the bounding box found by Faster R-CNN. We compared the distances from the edge of the pupillary region to the left and right sides of the bounding box and selected the shorter one as the basic length $L$, as shown in Figure 6. Then, $r_{\text{start}}$ and $r_{\text{end}}$ were set to the pupillary radius $r_p$ plus 0.4 and 1.2 times the basic length, respectively, that is, $r_{\text{start}} = r_p + 0.4L$ and $r_{\text{end}} = r_p + 1.2L$. Because the bounding box was located by a learning-based algorithm, the basic length derived from it was robust and adapted automatically to each image at runtime. Thus, most emitting paths were expected to start inside the iris region and end in the sclera region.
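The adaptive choice of the two path distances amounts to a few lines of arithmetic. In the following sketch, `bbox` is assumed to be `(x_min, y_min, x_max, y_max)` in the same pixel coordinates as the pupil center, and the function name is hypothetical; the factors 0.4 and 1.2 are the ones given above.

```python
def emitting_path_radii(pupil_cx, pupil_radius, bbox, start_factor=0.4, end_factor=1.2):
    """Return (r_start, r_end) for the radial emitting paths.

    The basic length is the shorter horizontal gap between the pupil edge
    and the left or right side of the eye bounding box.
    """
    x_min, _, x_max, _ = bbox
    left_gap = (pupil_cx - pupil_radius) - x_min
    right_gap = x_max - (pupil_cx + pupil_radius)
    basic_length = min(left_gap, right_gap)
    return (pupil_radius + start_factor * basic_length,
            pupil_radius + end_factor * basic_length)


# Hypothetical example: pupil centred at x = 100 with radius 20, box from x = 30 to x = 190.
r_start, r_end = emitting_path_radii(100, 20, (30, 20, 190, 140))
```

In this example the left gap (50 px) is shorter than the right gap (70 px), so the paths run from 40 px to 80 px away from the pupil center.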

By recording the pixel intensity values along each emitting path, the position with the maximal variation in intensity was located; this position was expected to be the intersection of the path with the limbus boundary. With multiple emitting paths, multiple boundary points were estimated. However, depending on the path angle $\theta$ and the shapes of the eyelids and eyelashes, the position with the maximal intensity gradient might not lie on the limbus boundary. To solve this problem, we considered the set of candidate points where local gradient maxima occurred, rather than only the single point where the global maximum occurred. As depicted in Figure 7, the gradient at the red point can be higher than that at the blue point, which would lead to an incorrect boundary point estimate. Therefore, we considered a set of candidate points containing both red and blue points and then selected the most likely point from this set.
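Collecting the candidate points along one emitting path can be sketched as follows. The sketch assumes nearest-neighbour sampling with a one-pixel step and a dark iris against a brighter sclera, so only positive outward intensity differences are kept; the function names are hypothetical, and maxima at the very first or last sample are ignored.

```python
import numpy as np


def sample_path(image, center, angle, r_start, r_end, step=1.0):
    """Nearest-neighbour intensities along one radial emitting path."""
    radii = np.arange(r_start, r_end, step)
    xs = np.rint(center[0] + radii * np.cos(angle)).astype(int)
    ys = np.rint(center[1] + radii * np.sin(angle)).astype(int)
    h, w = image.shape
    inside = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    return radii[inside], image[ys[inside], xs[inside]].astype(float)


def gradient_candidates(radii, profile):
    """All positive interior local maxima of the outward intensity difference.

    Returns (radius, gradient) pairs; radius is the midpoint between samples.
    """
    grad = np.diff(profile)
    mid = (radii[:-1] + radii[1:]) / 2.0
    return [(float(mid[i]), float(grad[i]))
            for i in range(1, len(grad) - 1)
            if grad[i] > 0 and grad[i] > grad[i - 1] and grad[i] >= grad[i + 1]]
```

For example, a profile with a smaller intensity step at radius 20 and a larger one at radius 30 yields both as candidates, whereas keeping only the global maximum would discard the inner one; this mirrors the red-versus-blue ambiguity described for Figure 7.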

A more sophisticated boundary point selection algorithm was used to address this problem, as illustrated in Figure 8. First, 11 emitting paths were drawn over an initial range of angles $\theta$. For paths at these angles, the maximal gradient was highly likely to occur on the limbus boundary, as shown in Figure 8(a), so the median of the distances from these maximal-gradient points to the pupillary center was recorded as a reference value. Second, a new emitting path was drawn at a new angle, for which an incorrect boundary point might have the maximal gradient, as shown in Figures 8(b) to 8(d). In this case, the distances from the pupillary center to all points where a local gradient maximum occurred were recorded. Among the points whose distance was within $\delta = 2$ of the reference value, the one with the largest local maximal gradient was selected. Taking Figure 8(b) as an example, given the reference value in this run, the blue point on the new path is selected on the basis of Equation (5) instead of the red point.

Third, after the best candidate point was selected, the reference value was updated to that point's distance from the pupillary center, which served as the new approximate radius for neighboring boundary points. By repeating this selection and updating on each subsequent emitting path over the remaining range of angles, we gradually refined the coarse limbus boundary points to more precise locations, as shown in Figure 8(e). Occasionally, a specular reflection occurred on the limbus boundary and could cause the iteration to drift away from the boundary. Therefore, the reference value was not updated at points whose pixel values exceeded the 95% bound of a normal distribution fitted to the pixel values of the entire image.
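The three-stage selection and updating procedure can be sketched as follows. Each emitting path is assumed to be summarized by its list of `(radius, gradient, intensity)` candidates, for instance from a local-maximum search along the path. The window `delta = 2.0` follows the text; the one-sided 95% bound `mean + 1.645 * std` is our reading of the reflection rule, and leaving a path unresolved when no candidate falls inside the window is a hypothetical policy not stated in the paper.

```python
import statistics


def reflection_threshold(pixel_values, z=1.645):
    """Upper one-sided 95% bound of a normal fit to the image's pixel values (assumption)."""
    return statistics.mean(pixel_values) + z * statistics.pstdev(pixel_values)


def select_boundary_points(initial_paths, next_paths, threshold, delta=2.0):
    """Pick one limbus boundary radius per path.

    initial_paths: candidate lists for the initial (e.g. 11) emitting paths.
    next_paths: candidate lists for the following paths, in angular order.
    Each candidate is a (radius, gradient, intensity) tuple.
    Returns the chosen radius per path in next_paths (None if nothing is close enough).
    """
    # Reference radius: median distance of each initial path's global-maximum point.
    ref = statistics.median(max(c, key=lambda t: t[1])[0] for c in initial_paths if c)
    chosen = []
    for candidates in next_paths:
        near = [c for c in candidates if abs(c[0] - ref) <= delta]
        if not near:
            chosen.append(None)  # hypothetical policy: leave this path unresolved
            continue
        radius, _, intensity = max(near, key=lambda t: t[1])  # largest local maximum
        chosen.append(radius)
        if intensity <= threshold:  # reflections must not drive the reference update
            ref = radius
    return chosen
```

Updating the reference value path by path lets the accepted radius follow a limbus that is not perfectly concentric with the pupil, while the window `delta` keeps a strong eyelid or eyelash edge from capturing the estimate.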

4. Experimental Results and Discussion

4.1. Database

Faster R-CNN and the GMM were trained on the CASIA-Iris-Thousand database [54]. This database contains 20,000 iris images from 1,000 subjects, captured with an IKEMB-100 camera. Because many subjects wore glasses during image capture, many images contain eyeglass frames and specular reflections, which are obstacles to iris segmentation.

4.2. Detection Model Training

Faster R-CNN and the GMM used the full CASIA-Iris-Thousand database for training and testing. The training set contained 6,000 right-eye and 6,000 left-eye images, and the test set contained 4,000 right-eye and 4,000 left-eye images. The iris region in each image was labeled manually, as shown in Figure 9. To deploy the proposed algorithm on a mobile device or an embedded system, the model had to occupy little storage space and have low computational complexity. The model was therefore trained on training images that had been downscaled to a specified size, and in the test stage the test images were resized at runtime before being passed through the model. The detection results were mapped back onto the original test images so that the bounding boxes contained sufficient iris texture for the subsequent iris recognition steps.

To share the convolutional weights between the CNN and the RPN in Faster R-CNN, the model was trained in four steps. The first step trained the region proposal network. For a convolutional feature map of size W × H output by the fourth convolutional layer of the proposed model, the RPN produced W × H × k candidate regions, where k is the number of anchors per feature-map position. Using the last convolutional layer as the feature map is standard practice in convolutional object detectors such as R-CNN and Faster R-CNN and has proven very effective. Interested readers can find more details on why the last layer is used in [6, 10, 15].
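The W × H × k candidate regions come from placing k anchor boxes at every feature-map position, as in [6]. The following sketch generates such anchors. The stride of 2 (from the single 2 × 2 max-pooling layer), the scales, and the aspect ratios in the example are hypothetical values, since the anchor settings used here are not listed in this section.

```python
import numpy as np


def generate_anchors(feat_w, feat_h, stride, scales, ratios):
    """All W x H x k anchors as (x0, y0, x1, y1) rows, k = len(scales) * len(ratios).

    ratios are height / width; each anchor keeps the area scale ** 2.
    """
    base = []
    for s in scales:
        for r in ratios:
            w, h = s / np.sqrt(r), s * np.sqrt(r)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.asarray(base)                        # (k, 4), centred at the origin
    cx = (np.arange(feat_w) + 0.5) * stride        # anchor centres in image pixels
    cy = (np.arange(feat_h) + 0.5) * stride
    gx, gy = np.meshgrid(cx, cy)
    shifts = np.stack([gx.ravel(), gy.ravel(), gx.ravel(), gy.ravel()], axis=1)
    return (shifts[:, None, :] + base[None, :, :]).reshape(-1, 4)


anchors = generate_anchors(feat_w=10, feat_h=8, stride=2,
                           scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0))
print(anchors.shape)  # (720, 4), i.e. 10 x 8 positions x 9 anchors
```

The RPN then scores each anchor as eye or background and regresses its offsets; only the anchor layout is shown here.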

Only the 2,000 regions with the highest intersection over union (IoU) values were assigned as positive samples for training the CNN. In the second step, a separate detection network (Fast R-CNN) was trained using the region proposals generated by the RPN built in Step 1. At this stage, the two networks did not yet share convolutional weights. In the third step, the detection network was used to initialize RPN training: the weights of the shared convolutional layers were frozen, and only the layers unique to the RPN were fine-tuned. The final step, with the shared layers still frozen, fine-tuned the layers unique to the detection CNN. As a result, the two networks shared the same convolutional layers and were merged into a single network.

To find the best architecture for the RPN and CNN, we trained multiple models with different fine-tuned parameter sets on the right-eye images of the CASIA-Iris-Thousand database, as shown in Tables 2 and 3. The new CNN architecture was designed on the basis of VGG-16; because the detection task in this study is simple, we reduced the number of convolutional layers of VGG-16. Precision and recall were used to measure detector performance. Precision is the fraction of retrieved objects that are relevant, and recall is the fraction of relevant objects that are successfully retrieved. A detection was counted as effective only if its IoU with the ground truth was at least 0.8, which is a strict condition.
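Precision and recall at the strict IoU = 0.8 threshold can be computed as in the following sketch. Boxes are assumed to be `(x0, y0, x1, y1)` with continuous coordinates, detections are assumed to be sorted by descending score, and each ground-truth box can be matched at most once; these conventions are assumptions, not the paper's evaluation code.

```python
def box_area(b):
    """Area of an (x0, y0, x1, y1) box."""
    return (b[2] - b[0]) * (b[3] - b[1])


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(detections, ground_truths, thr=0.8):
    """Greedy one-to-one matching of score-sorted detections to ground truths."""
    matched, tp = set(), 0
    for det in detections:
        best, best_iou = None, thr
        for j, gt in enumerate(ground_truths):
            if j in matched:
                continue
            v = iou(det, gt)
            if v >= best_iou:
                best, best_iou = j, v
        if best is not None:  # true positive: IoU >= thr with an unmatched ground truth
            matched.add(best)
            tp += 1
    precision = tp / len(detections) if detections else 0.0
    recall = tp / len(ground_truths) if ground_truths else 0.0
    return precision, recall
```

With `thr=0.8`, a box shifted by only 10% of its width relative to the ground truth (IoU ≈ 0.82) still counts, but a shift of 20% (IoU ≈ 0.67) does not, which illustrates how strict this threshold is.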

The initial versions of the new architecture, Model A and Model B, had only six and five convolutional layers, respectively. Model B performed considerably worse than Model A, even when the number of neurons in the fully connected layer was increased. Next, we replaced the first three layers of the network with a single convolutional layer with a larger kernel, which resulted in Models C and D; using multiple kernel sizes in a network helps it extract more diverse image features. The two models differed in their pooling strategies: max pooling for Model C and average pooling for Model D. Regardless of the pooling strategy, both achieved almost 100% precision and recall. However, they had a large number of parameters and therefore needed approximately 0.3 s per detection. We therefore downscaled the training images by factors of 2, 4, and 8 to obtain Models E, I, and J, respectively. The smaller the training images, the shorter the training and test times, and the lower the detection accuracy. Model J performed the worst of all the models trained with different image sizes, probably because images shrunk that much retain too few features for detection. We finally used the architecture of Model I to implement the proposed algorithm. Models F, G, and H were parameter-adjusted variants of Model I; compared with them, Model I showed better performance with sufficiently low detection time.

The GMM was trained on images labeled with the pupillary region and was used to fit the candidate pupillary region inside the bounding box found by Faster R-CNN. Each pixel was represented by a nine-dimensional feature vector for training and testing, consisting of the normalized pixel coordinates, the pixel value filtered by a 5 × 5 local median filter, and pixel values filtered using Gabor filters. The Gabor filters were of size 5 × 13 and were parameterized as follows: , , , , and . In the training stage, the pixels inside the pupillary region were taken as positive samples, and a normal distribution built from the pixel values of the entire region was used to remove positive samples located at reflection points. The same number of pixels outside the pupillary region was sampled to form the negative samples. We also tried an SVM instead of the GMM to predict the candidate pupillary region. However, it did not perform as well as the GMM: its training took more than three days, compared with only 5 min for the GMM, and its region prediction accuracy was poor, as shown in Figure 10.
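For reference, the real part of a 2-D Gabor kernel of the stated 5 × 13 size can be generated as follows. Because the parameter values are not given in this text, `sigma`, `theta`, `lambd`, `gamma`, and `psi` below are hypothetical defaults, the 5 × 13 size is interpreted as 5 rows by 13 columns, and the four-orientation bank is an illustrative choice only.

```python
import numpy as np


def gabor_kernel(size=(5, 13), sigma=2.0, theta=0.0, lambd=6.0, gamma=0.5, psi=0.0):
    """Real part of a 2-D Gabor kernel with odd (rows, cols) size.

    sigma: Gaussian envelope width, theta: orientation (radians),
    lambd: wavelength of the cosine carrier, gamma: spatial aspect ratio,
    psi: phase offset. All default values are hypothetical.
    """
    rows, cols = size
    y, x = np.mgrid[-(rows // 2):rows // 2 + 1, -(cols // 2):cols // 2 + 1].astype(float)
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot ** 2 + (gamma * y_rot) ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(2.0 * np.pi * x_rot / lambd + psi)


# A small bank of orientations (hypothetical choice).
bank = [gabor_kernel(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
```

Filtering an image with each kernel then requires a 2-D convolution, for example `scipy.ndimage.convolve(image, kernel)`, and the per-pixel responses become entries of the feature vector.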

We implemented our algorithm in MATLAB R2018a and ran it on a personal computer with 3.4-GHz CPUs and a GTX 1080 GPU. The average iris segmentation time per eye was approximately 0.06 s, indicating that the proposed algorithm is a fast iris segmentation algorithm.

4.3. Performance Evaluation for Iris Segmentation

Traditionally, most researchers have evaluated iris segmentation results subjectively, for example, by inspecting the segmentation results plotted on the image and judging them manually [9, 27, 29, 33, 35, 36, 39, 41]. To quantitatively evaluate the localization of the pupillary and limbus boundaries, we propose a new method based on the integration of the radial difference. For each image, we used the manually labeled iris region to generate two separate binary maps containing the pupillary region and the iris region, respectively. Let a segmentation $S$ be parameterized by the coordinates of the circle's center and its radius, i.e., the triple $(x_c, y_c, r)$. We then created a dilated version $S_d$ and an eroded version $S_e$ of $S$, that is, circles with the same center and a larger and a smaller radius, respectively. Thus, every point of $S$ has corresponding points on $S_d$ and $S_e$. By collecting $N$ pairs of corresponding points on $S_d$ and $S_e$, denoted as $\{(p_i^{d}, p_i^{e})\}_{i=1}^{N}$, we evaluated the performance of $S$ using the value computed by Equation (6). Figure 11 illustrates the evaluation procedure.

We compared our algorithm with [9], which has been shown to be very robust and efficient for iris images captured with wearable devices. The proposed performance evaluation method was used with parameters and for evaluating the pupillary (limbus) boundary localization. These values ensure that the accepted segmentation results have an IoU of at least 0.5 with the ground truth. With these parameters, the proposed algorithm is fast enough for a real-time iris recognition system (above 15 frames per second) while maintaining segmentation accuracy. A threshold on this value was set to select effective segmentations, and the segmentation accuracy was computed with this threshold. Figure 12 shows the histogram of the evaluation values on the full CASIA-Iris-Thousand database, and the segmentation accuracy is listed in Table 4. As Table 4 shows, the proposed algorithm increased the accuracy dramatically, from 47.84% for the method of [9] to 95.49%. This improvement can be attributed to locating the eye with a learning-based algorithm: the parameters used to find the boundary points of the pupillary region and the iris are then adjusted robustly and automatically at runtime for each input image.
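Because both the segmented and the ground-truth boundaries are circles, their overlap can be measured with the IoU of two discs. The following sketch computes it from the standard circle-circle intersection (lens) area; it is offered only to illustrate the 0.5 IoU condition above and is not Equation (6).

```python
import math


def circle_iou(c1, c2):
    """IoU of two discs given as (x, y, r)."""
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    d = math.hypot(x2 - x1, y2 - y1)
    a1, a2 = math.pi * r1 ** 2, math.pi * r2 ** 2
    if d >= r1 + r2:                      # disjoint discs
        inter = 0.0
    elif d <= abs(r1 - r2):               # one disc inside the other
        inter = math.pi * min(r1, r2) ** 2
    else:                                 # partial overlap: lens-shaped intersection
        alpha = math.acos((d * d + r1 * r1 - r2 * r2) / (2 * d * r1))
        beta = math.acos((d * d + r2 * r2 - r1 * r1) / (2 * d * r2))
        inter = (r1 * r1 * alpha + r2 * r2 * beta
                 - 0.5 * math.sqrt((-d + r1 + r2) * (d + r1 - r2)
                                   * (d - r1 + r2) * (d + r1 + r2)))
    return inter / (a1 + a2 - inter)
```

For two concentric circles the IoU is simply the squared ratio of the smaller to the larger radius, so an IoU of at least 0.5 corresponds to radii within a factor of √2 of each other.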

4.4. Difference between the Proposed Method and Other Published Methods

Many iris segmentation methods are based on deep neural networks. In this section, we discuss how the proposed method differs from two state-of-the-art methods: IrisDenseNet [55] and the model proposed by He et al. [56].

IrisDenseNet uses a VGG-16-based network [51] with 13 convolutional layers as its core to detect the true iris area, excluding regions such as the eyelids and eyelashes. However, it only segments the iris area and provides no method for normalizing it. As shown in [24], iris normalization is a key stage for high-performance iris recognition; without it, there is no guarantee that a recognition system built on this segmentation will reach the desired accuracy. In addition, because of its depth, training and running IrisDenseNet are far more computationally expensive than the proposed method.

The model proposed in [56] also employs a modified VGG-16 network. Its execution time is 0.112 s per image on a 2.6 GHz CPU with a GTX 970M GPU, which, again, is not fast enough for a real-time iris recognition system on an embedded system. In contrast, the proposed method performs iris localization within 0.06 s, which is about 1.87 times faster.
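The speed comparison reduces to simple arithmetic on the two reported per-image times. The worked check below assumes that both timings were measured under comparable conditions, and relates them to the 15 frames-per-second real-time criterion stated earlier:

$$
\frac{t_{[56]}}{t_{\text{proposed}}} = \frac{0.112\ \text{s}}{0.06\ \text{s}} \approx 1.87,
\qquad
\frac{1}{0.06\ \text{s}} \approx 16.7\ \text{fps} > 15\ \text{fps},
\qquad
\frac{1}{0.112\ \text{s}} \approx 8.9\ \text{fps} < 15\ \text{fps}.
$$

Under that criterion, the 0.06 s per image of the proposed method meets the real-time requirement, whereas 0.112 s per image does not.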

5. Conclusion

In this paper, we presented a robust and fast iris segmentation algorithm based on Faster R-CNN. We redesigned the CNN architecture of Faster R-CNN, and the resulting model, with only six layers, generates precisely located region proposals of the eye in the images. We then extracted feature vectors of specific dimensions to train a Gaussian mixture model (GMM) for fitting the potential pupillary regions. Next, the pupillary boundary was recovered from five key boundary points found by scanning the pixel rows and columns. An enhanced version of MIGREP and the boundary point selection algorithm were used to find boundary points of the limbus region, and the limbus boundary was located from these points. To evaluate the performance of iris segmentation, we developed an evaluation method based on the integration of the radial difference. Experimental results on the challenging CASIA-Iris-Thousand database demonstrated the effectiveness and efficiency of the proposed method: its segmentation accuracy was 95.49%, compared with 47.84% achieved in the previous work, and the iris segmentation procedure took only approximately 0.06 s. These results show that the proposed method is a fast and accurate iris segmentation algorithm.
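The text does not state how the circle is fit to the five key boundary points. One common option, shown here as a swapped-in technique rather than the authors' method, is the algebraic least-squares (Kåsa) fit: it solves for D, E, F in x^2 + y^2 + Dx + Ey + F = 0 and then recovers the center and radius. The function name `fit_circle` and the example points are hypothetical; the same routine accepts more than five points, as would be the case for the limbus boundary points.

```python
import math


def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_circle(points):
    """Algebraic least-squares (Kasa) circle fit; returns (cx, cy, r)."""
    if len(points) < 3:
        raise ValueError("need at least three boundary points")
    # Shift to the centroid for better numerical conditioning.
    mx = sum(x for x, _ in points) / len(points)
    my = sum(y for _, y in points) / len(points)
    pts = [(x - mx, y - my) for x, y in points]
    # Each point gives one row of A p = b with p = (D, E, F) in
    # x^2 + y^2 + D x + E y + F = 0.
    rows = [(x, y, 1.0) for x, y in pts]
    rhs = [-(x * x + y * y) for x, y in pts]
    # Normal equations (A^T A) p = A^T b, solved with Cramer's rule.
    ata = [[sum(a[i] * a[j] for a in rows) for j in range(3)] for i in range(3)]
    atb = [sum(a[i] * b for a, b in zip(rows, rhs)) for i in range(3)]
    det = _det3(ata)
    if abs(det) < 1e-12:
        raise ValueError("boundary points are collinear or degenerate")
    sol = []
    for k in range(3):
        mk = [row[:] for row in ata]
        for i in range(3):
            mk[i][k] = atb[i]
        sol.append(_det3(mk) / det)
    d, e, f = sol
    cx, cy = -d / 2.0, -e / 2.0
    r = math.sqrt(max(cx * cx + cy * cy - f, 0.0))
    # Undo the centroid shift.
    return cx + mx, cy + my, r


# Usage: hypothetical leftmost, rightmost, top, bottom, and one extra boundary point.
print(fit_circle([(28, 35), (52, 35), (40, 23), (40, 47), (31, 27)]))  # roughly (40, 35, 12)
```

The algebraic fit is linear and therefore fast and non-iterative, which suits a real-time pipeline; its trade-off is a slight bias when the points cover only a short arc of the boundary.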

The main advantage of the proposed algorithm over most state-of-the-art neural-network-based iris segmentation algorithms, such as IrisDenseNet [55] and the model proposed by He et al. [56], is its smaller model size, which makes it faster at segmenting iris images. This is crucial for a real-time iris recognition system and for deployment on mobile devices.

In future work, we plan to further improve the speed of the algorithm by creating heterogeneous models that combine the power of CNNs with the speed of traditional computer vision methods. Another direction is to combine semantic segmentation with the proposed algorithm. Semantic segmentation is highly sensitive in predicting reflection points in the iris region, which could improve the overall accuracy of the algorithm. Once the algorithm is improved, we will attempt to deploy it on mobile devices by using more compact deep learning models, such as XNOR-Net [57]. The ultimate goal is to implement a fast and accurate portable iris recognition system.

Data Availability

The CASIA database [54] used to support this research is provided by the Institute of Automation, Chinese Academy of Sciences.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Ministry of Science and Technology, Taiwan (grant number 106-2221-E-008-102-).