Abstract

Building a human-like robot that could be involved in our daily lives is a dream of many scientists. Achieving a sophisticated robot vision system, which can enhance the robot's ability to interact with humans in real time, is one of the main keys toward realizing such an autonomous robot. In this work, we propose a bioinspired vision system that helps to develop advanced human-robot interaction in an autonomous humanoid robot. First, we enhance the robot's vision accuracy online by applying a novel dynamic edge detection algorithm abstracted from the roles that the horizontal cells play in the mammalian retina. Second, in order to support the first algorithm, we improve the robot's tracking ability by designing a variant photoreceptor distribution corresponding to what exists in the human vision system. The experimental results verified the validity of the model. The robot could obtain a clear view in real time and build a mental map that assisted it in being aware of the frontal users and developing a positive interaction with them.

1. Introduction

Building a human-like robot controller that is inspired by the principles of neuroscience and can resemble living organism behaviors in certain specific characteristics is currently one of the main challenges faced by robotics researchers [1]. The difficulty of such a system can be summed up in three main points, as diagrammatically shown in Figure 1: (A) a mechanism for human-robot interaction, which mainly relies on the robot's vision, speech recognition, sensor-motor interaction, and so forth; (B) a mechanism for learning and memory, which gives the robot the ability to learn and/or teach; and (C) a mechanism for homeostasis, which gives the robot a degree of internal stability. In this study, we highlight the issue of enhancing the robot's vision toward advanced human-robot interaction. More specifically, we introduce a novel dynamic edge detection algorithm that is inspired by the biological concept of the "retina" and supported by a variant photoreceptor distribution and the robot's eye movements.

Edge detection is classified as a fundamental step in many machine vision systems and image processing applications [2-4]. The degree of its importance depends on the level of autonomy required in the image processing system [5]. It is mainly responsible for extracting accurate edges from the image, which prepares the image for any further processing, such as object recognition, feature extraction, and 3D environment construction [6].

So far, much work has been done to develop a unique algorithm that can guarantee, to some degree, high-quality edge detection with less noise and less computational time [7]. Most of these works rely on designing a static mask that moves sequentially through the pixels of the image to extract edges [8, 9]. Despite the success of these works, the idea of predesigning the mask for specific tasks could limit the performance of these models, especially when dealing with the complexity of real-world applications.

In recent years, researchers have investigated the biological concept of the "retina" to try to overcome the above problem, since it is now widely accepted that biologically inspired technology is a powerful source for achieving an accurate model with a simple structure and less computational time [1].

A cognitive vision research group at the Hungarian Academy of Sciences [10], for instance, designed a model for edge detection based on the center-surround structure of the receptive fields present in the retina. They simulated eye tremors and drifts to enhance the output image. The filter, however, was static and could not distinguish between noise and edges.

Becerikli and Engin [11], on the other hand, tried to solve the problem by building a neural network and training it with backpropagation to ignore noise. Their work, however, does not guarantee the ability to detect different edge formations. Metta [12] worked on an attention system for a humanoid robot based on robot eye movements and space-variant vision for motor control. This work, however, missed the varied nature of the photoreceptor distribution in the human retina.

At Gifu University in Japan, researchers tried to imitate natural eye movements by recording the eye and head movements of real humans and reproducing these movements on the humanoid robot "YAMATO". Although they succeeded in simulating human eye movement, the robot missed the real advantage behind these behaviors [13].

Along this line of research, we propose here a bioinspired vision system that helps to develop advanced human-robot interaction in an autonomous humanoid robot. First, we enhance the robot's vision accuracy in real time by applying a novel dynamic edge detection algorithm abstracted from the roles that the horizontal cells play in the mammalian retina. Second, we improve the robot's tracking ability, so that it supports the first algorithm, by designing a variant photoreceptor distribution similar to what exists in the human vision system. The proposed model is constructed from artificial neural networks and applied in parallel to the robot's view. The experimental results proved the validity of the model. The robot was able to achieve accurate edge detection in real time and could build a mental map that helped it to be aware of the frontal users and to achieve a positive interaction with them.

This paper is organized as follows. The following section highlights the biological concepts of the human vision system. Section 3 describes the proposed algorithm in detail. Section 4 presents the experimental setup and results. Finally, Section 5 concludes the work and gives directions for future work.

2. Vision System: Biological Review

The retina and the eye movements (saccades and pursuit) work together in the human vision system in a way that allows us to obtain better vision and awareness of the surrounding environment. The retina, which is a part of the brain, is responsible for performing the first stage of image processing, for example, edge and motion detection, before passing its signals to the brain for any further processing [14]. Saccades and pursuit, in contrast, can be considered supporters of retinal vision. They are responsible for various kinds of voluntary and involuntary movements of the eyes, which help to track objects and/or to direct attention [15].

2.1. Retina's Neural Connection

The neural connections between ganglion cells and bipolar cells in the retina are responsible for edge detection in cold-blooded vertebrates, by performing a process called center-surround representation. However, they are not the only source of edge detection in mammalian vision, since the horizontal cells also play an important role in enhancing this mechanism [16].

As can be seen in Figure 2, the horizontal cells lie between the photoreceptors (cones) and the bipolar cells. When light is absent, the horizontal cell releases the neurotransmitter gamma-aminobutyric acid (GABA), which acts on GABA receptors [17]. This has an inhibitory effect on the photoreceptors. Therefore, when light shines onto a photoreceptor, the photoreceptor hyperpolarizes and reduces the release of glutamic acid (glutamate), which causes the horizontal cell to reduce its release of GABA [17]. This reduction of inhibition leads to a depolarization of the photoreceptors. Such a complex process, however, is still a subject of hot debate in the community of retina scientists [18].

The functionality of the horizontal cells can be summarized by two main points: (1) a single ganglion cell constructs a center-surround representation from the output of many bipolar cells and transfers this representation to the brain; (2) horizontal cells improve this center-surround representation by changing the output of the bipolar cells according to the input pattern [19]. According to Verweij et al. [17], the reduction of GABA varies with the brightness of the light that shines onto the photoreceptors and the time that this light is present. This gives the center-surround representation accurate information about the edges [17].

From the above phenomenon, we derived the idea of designing a dynamic edge detection technique that adapts itself over time based on the given input pattern.

2.2. Human Eye Movement

Saccades and pursuit are two types of eye movement. They work jointly to support the retina by constantly placing the image of the object of interest on the center of the fovea.

Saccades, on one hand, rotate both eyes in the same direction so that the desired image always falls on the fovea. Since vision is poor during a saccade, it operates at high speed (up to 500 degrees per second). Pursuit, on the other hand, operates at low speed to smoothly follow a moving object [21].

From the biological point of view, the neural pathway responsible for generating saccadic eye movements is illustrated in Figure 3. In the figure, the eye projects signals to both the visual cortex and the superior colliculus (SC). When an object moves in the peripheral area of vision, the SC elicits a saccade via the paramedian pontine reticular formation (PPRF), so that the image of the object can be placed on the fovea and examined by the visual cortex. The visual cortex, in consequence, instructs the frontal eye fields (FEFs) to remember the location of the object [22]. Therefore, the SC directs short-latency involuntary saccades to an unexpected movement, and the frontal eye fields direct long-latency voluntary saccades to the remembered target.

During this process, the low spatial frequencies are attenuated, while the higher spatial frequencies, which would otherwise be blurred out by the eye movement, remain unaffected. This phenomenon is known as saccadic masking [23].

The first step towards the initiation of pursuit is to follow a moving target. Signals from the retina activate neurons in the visual cortex, which respond selectively to directions of movement. The processing of motion in this area is necessary for smooth pursuit responses [24]. It is likely involved in providing the signal to initiate pursuit, as well as in selecting a target to track [25].

Our proposed model is designed by drawing on the above functionality of both the retina and the eye movements (saccades and pursuit).

3. The Proposed Model

3.1. Dynamic Edge Detection

In the retina, there are three layers of cells: photoreceptors, bipolar cells, and ganglion cells. Horizontal cells interconnect the photoreceptors and the bipolar cells (Figure 4).

To simulate the effect of the horizontal cells, we consider two functional layers: an input layer of photoreceptors and an output layer of ganglion cells. We treat the bipolar cells as static weights that perform the center-surround calculation, and the horizontal cells as dynamic weights (1), responsible for adjusting the synaptic weights between the photoreceptors and the bipolar cells. The proposed mask is one of multiple masks that are applied in parallel across the robot's vision (Figure 5).

From the mathematical point of view, the mapping between the input layer (In) and the output layer (Out) can be expressed as

Out(x, y) = \sum_{i, j} B_{i j} \, W_{i j}(t) \, In(x + i, y + j),   (2)

where W_{i j}(t) represents the synaptic weight (GABA variation) of the connection between the photoreceptors and the bipolar cells, B_{i j} denotes the constant weights of the bipolar cells' connections with the ganglion cell, which are responsible for the center-surround representation, i and j are the location of the input neuron within the mask, and x and y are the address of a particular mask in the robot's vision (Figure 5).

The synaptic weights in each mask are updated in a manner similar to the glutamate and GABA effects [17], where the weights are redistributed on the basis of the contribution C_{i j}(t) of each input pixel to the output at (x, y), so that inputs with a high/low contribution value gradually increase/decrease their related weights over time:

W_{i j}(t + 1) = (1 - \alpha) \, W_{i j}(t) + C_{i j}(t),   (3)

where \alpha is a fixed decay ratio. From this equation we can notice that the weights in the mask are decreased by a fraction \alpha of their previous value (this ratio is selected for simplicity, since the mask is constructed from 9 inputs and one output, and it is close to the real decay in the horizontal cell contribution [17]); by adding the contribution to the result, we obtain the next distribution of the weights. This adaptation mechanism is similar to the horizontal cells' contribution to the center-surround representation [17].

It is important to keep the overall summation of the weights in each mask constant at any given time and equal to the summation of the contributions:

\sum_{i, j} W_{i j}(t) = \sum_{i, j} C_{i j}(t) = \text{const}.   (4)

This is important for the filter to produce a similar result under different illumination of the input.

We believe that, with the above-proposed model, each mask can gradually adapt its synaptic weight connections over time to reflect the pattern of edges in its associated area.
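To make the mechanism above concrete, the following Python sketch implements a single 3x3 mask with a fixed center-surround kernel standing in for the bipolar-to-ganglion weights and a set of dynamic weights that decay and are redistributed according to the contribution of each input pixel. The kernel values, the decay constant, and the normalization of the contribution are our own assumptions for illustration, not the exact implementation used on the robot.

```python
import numpy as np

# Minimal sketch of one dynamic 3x3 mask (assumed names and constants).
# B: static center-surround weights (bipolar -> ganglion); W: dynamic
# weights (horizontal-cell effect), updated from each frame's input.

B = np.array([[-1, -1, -1],
              [-1,  8, -1],
              [-1, -1, -1]], dtype=float)   # classic center-surround kernel

DECAY = 1.0 / 9.0                            # assumed decay ratio (9 inputs, 1 output)

def init_weights():
    """Start with uniform dynamic weights that sum to 1."""
    return np.full((3, 3), 1.0 / 9.0)

def update_mask(W, patch):
    """One adaptation step for a single 3x3 image patch (values in [0, 1])."""
    # Contribution of each input pixel, normalized so it sums to the same
    # constant as the weights (here 1), mirroring the normalization in (4).
    total = patch.sum()
    contribution = patch / total if total > 0 else np.full((3, 3), 1.0 / 9.0)
    # Decay the previous weights and add the new contribution (cf. (3)).
    W_new = (1.0 - DECAY) * W + DECAY * contribution
    # Output of the mask: center-surround response modulated by W (cf. (2)).
    out = float(np.sum(B * W_new * patch))
    return W_new, out
```

A full image would be covered by many such masks running in parallel (Figure 5), each keeping its own dynamic weights across frames.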

Even though in this work we used only the luminance information of the view (gray images), the work can be extended in the future to include color detection. It is widely known that the receptive fields of the color channels in the retina have a center-surround organization, that is, red-green and blue-yellow receptive fields. By introducing the proposed dynamic weight adaptation mechanism, we should be able to improve the representation of colors in the robot's view.

3.2. Human-Like Eye Movement

Since the proposed mask relies mainly on time to approach the best resolution, it is important for the robot to move its eyes (two color cameras mounted in the robot's head with 2 degrees of freedom each) to trace the object of interest and always try to place it in the center of its vision (fovea), similar to the behavior found in the human eye (see Section 2 for details).

3.2.1. Variant Photoreceptor Distribution

To achieve such a phenomenon, a variant distribution equivalent to that of the actual photoreceptors in the retina (cones and rods) is designed into the robot's vision (Figure 6). The cones are dense in the center of the retina (fovea) and are responsible for edge detection and object recognition. The rods are absent in the fovea but dense in the peripheral area; they therefore have lower resolution and are responsible for motion detection.

Here, we apply the edge detection algorithm to the images captured from the left camera, while the right camera is responsible for motion detection, since the left camera images are six times higher in resolution than the right camera images (Figure 7). We use (5) to combine the two images into one single image, where \theta represents the angle in radians from the center of the view to the periphery (Figure 6), I_L is the source image from the left camera, and I_R is the source image from the right camera (Figure 8).
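Because the exact form of (5) is not reproduced above, the sketch below only illustrates the general idea under our own assumptions: pixels near the image center are taken from the high-resolution left-camera (foveal) image, pixels far from the center from the upsampled right-camera (peripheral) image, with a smooth radial blend in between. The function name and the fovea_radius parameter are hypothetical.

```python
import numpy as np
import cv2  # used only to resize the low-resolution periphery image

def combine_fovea_periphery(img_left, img_right, fovea_radius=0.3):
    """Blend a high-resolution foveal image with a low-resolution peripheral one.

    img_left:  grayscale image from the left (foveal) camera.
    img_right: grayscale image from the right (peripheral) camera, lower resolution.
    fovea_radius: fraction of the normalized radius treated as pure fovea (assumed value).
    """
    h, w = img_left.shape
    periphery = cv2.resize(img_right, (w, h), interpolation=cv2.INTER_LINEAR)

    # Radial distance of every pixel from the image center, normalized to [0, 1].
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
    r /= r.max()

    # Weight 1 inside the fovea, falling linearly to 0 toward the periphery.
    w_fovea = np.clip((1.0 - r) / (1.0 - fovea_radius), 0.0, 1.0)
    w_fovea[r <= fovea_radius] = 1.0

    combined = w_fovea * img_left.astype(float) + (1.0 - w_fovea) * periphery.astype(float)
    return combined.astype(np.uint8)
```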

3.2.2. Eye Movement Control System

The above representation of the image is used to support the robot's eye movements (Figure 9). The image in the SC is the foveal image from the retina; it has two parts, fovea and periphery. When the user approaches the robot and starts a conversation, the robot first locates the user in its fovea and sends an inhibitory signal to the PPRF. This allows the robot to give full attention to the target user by ignoring most of the movement surrounding him. The location of the user is temporarily memorized in the FEF. When the user interacts with the robot and at the same time starts to move, the FEF provides the PPRF with the user's direction and velocity, so that the PPRF can send signals to rotate both eyes smoothly and keep the user always in the fovea.

In the PPRF stage, the robot decides whether to keep looking at the user or to shift attention to an object moving in the periphery, depending on the conversation state and the characteristics of the object itself. If a moving object in the periphery attracts the robot's attention, the PPRF sends signals to the motoneurons to generate a saccade and move the eyes to the spot of the movement. Concurrently, the PPH changes the location of the SC in the mental map so that the new input does not overlap with the previous information.

When the robot moves its eyes to the new location, it has the chance to examine that location in its fovea for another user. If another user exists, the FEF stores the location of the new user in the mental map. If the robot notices that the second user is interested in starting an interaction with it, the robot tries to interact with both users and updates their locations continuously. Thus, the robot keeps a memory of the current users' locations even when they are out of its field of view.
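The control flow described in this subsection can be summarized by a small sketch. The class below is our own simplified reading of Figure 9, with hypothetical names (MentalMap, GazeController, and so on), rather than the controller actually running on the robot.

```python
from dataclasses import dataclass, field

@dataclass
class MentalMap:
    """Remembered user locations (FEF role), keyed by user id."""
    users: dict = field(default_factory=dict)   # user_id -> (pan, tilt)

    def remember(self, user_id, pan, tilt):
        self.users[user_id] = (pan, tilt)

class GazeController:
    """Very simplified SC/FEF/PPRF loop: pursue the engaged user,
    saccade to salient peripheral motion, and keep a mental map."""

    def __init__(self):
        self.map = MentalMap()
        self.engaged_user = None

    def on_user_detected(self, user_id, pan, tilt):
        # Center the user in the fovea and memorize the location (FEF role).
        self.engaged_user = user_id
        self.map.remember(user_id, pan, tilt)
        return ("pursuit", pan, tilt)            # smooth pursuit command

    def on_peripheral_motion(self, pan, tilt, conversation_active):
        # Ignore peripheral motion while the conversation holds attention
        # (inhibitory signal to the PPRF); otherwise issue a saccade.
        if conversation_active:
            return None
        return ("saccade", pan, tilt)
```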

4. Experimental Results

We conducted three experiments: the first examines the validity of the proposed dynamic edge detection algorithm; the second examines the validity of the proposed robot eye movement techniques in supporting the dynamic algorithm; and the third shows a complete scenario of a robot equipped with the above two models, to examine the ability of a human-like robot to achieve multiuser interaction in an office-like environment. All the following experiments were conducted on a physical human-like robot, "Robovie-R2", with two CCD cameras, each of which can rotate horizontally and vertically to simulate human eye movement (ATR-Robotics [26]).

4.1. Extracting Edges in a Real Time

In this platform, we examine the validity of the proposed dynamic edge detection in extracting clear edges in real time from a user standing in front of the robot (Figure 10). Note that at this stage, the robot's eye movement techniques for tracing the user's movement were not activated. Therefore, when the user changed his location, the learned edges were lost and the network needed to be retrained to adapt to the change. From the figure, it can be seen that after 40 frames (2 seconds), the network converges to detect accurate edges from the user's face.

We compare the proposed algorithm with the well-known Canny edge detection algorithm, using an edge-detector evaluation similar to the work done by Boaventura and Gonzaga [27]. To do this comparison, we have to test these algorithms against ground truth images that represent the edges in the scene.

It is well known that edge detection algorithms perform better with higher-resolution images. Therefore, to generate the ground truth set of images describing the locations of the corresponding edges, we first used a sequence of higher-resolution images to produce the ground truth images, and we then tested the edge detection algorithms with lower-resolution images.

The performance of an edge detection algorithm can be obtained through a set of direct measurements, such as the number of correctly detected edge pixels, called true positives (TP); the number of pixels erroneously classified as edge pixels, called false positives (FP); and the number of edge pixels that were not classified as edge pixels, called false negatives (FN). From these measures, the following statistical indices can be obtained.

The percentage of pixels that were correctly detected (P_{co}):

P_{co} = TP / \max(N_I, N_B),   (6)

where N_I represents the number of edge points of the ground truth image and N_B the number of edge points detected.

The percentage of pixels that were not detected (P_{nd}):

P_{nd} = FN / \max(N_I, N_B).   (7)

The percentage of pixels that were erroneously detected as edge pixels, that is, the percentage of false alarms (P_{fa}):

P_{fa} = FP / \max(N_I, N_B).   (8)

The values of the statistical indices given by (6), (7), and (8) range between 0 and 1 and reach their ideal values at 1 for P_{co} and 0 for the indices P_{nd} and P_{fa}.

The distance to the ideal edge detector,

D = \sqrt{(1 - P_{co})^2 + P_{nd}^2 + P_{fa}^2},   (9)

varies between 0 and \sqrt{3}, where the value 0 represents the perfect fit for this measure; that is, the best edge detector among several detectors will minimize this distance.
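As a concrete illustration of how these indices and the distance in (9) can be computed from binary edge maps, the following Python sketch implements them under our assumptions (in particular, normalizing by the larger of the two edge counts); it is not the authors' evaluation code.

```python
import numpy as np

def edge_detection_scores(ground_truth, detected):
    """Compute P_co, P_nd, P_fa and the distance to the ideal detector.

    ground_truth, detected: boolean arrays of the same shape, True at edge pixels.
    """
    tp = np.logical_and(ground_truth, detected).sum()    # correctly detected edges
    fn = np.logical_and(ground_truth, ~detected).sum()   # missed edges
    fp = np.logical_and(~ground_truth, detected).sum()   # false alarms

    n_gt = ground_truth.sum()    # edge points in the ground truth image (N_I)
    n_det = detected.sum()       # edge points detected (N_B)
    denom = max(n_gt, n_det, 1)  # assumed normalization, avoids division by zero

    p_co = tp / denom
    p_nd = fn / denom
    p_fa = fp / denom

    # Euclidean distance to the ideal detector (P_co = 1, P_nd = P_fa = 0).
    distance = np.sqrt((1.0 - p_co) ** 2 + p_nd ** 2 + p_fa ** 2)
    return p_co, p_nd, p_fa, distance
```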

The overall performance of the edge detection algorithm can then be calculated with (10) (Figure 11).

From the figure, we can observe that our proposed model detects clearer edges than the Canny edge detector. In addition, the dynamic feature of the proposed model gives each mask the ability to adapt itself to the input pattern and thus overcome noise; we used (11) to compare the edge detection algorithms in the presence of noise, as shown in Figure 12.

4.2. Robot Eye Movement

As Figure 10 shows, whenever the user changes his location, the network needs time to readapt to the new input, which can be considered a drawback of the system. To overcome this problem, the robot uses its eye movement techniques (saccade and pursuit) to always trace the user's face and try to center it on its fovea.

We borrow a face detection algorithm from the OpenCV library (Face Recognition OpenCV [28]), which allows the robot to center the user in its fovea, as can be seen in Figure 14. After applying the eye movement technique, the convergence rate remained approximately stable and the robot succeeded in keeping the user in the center of its vision (fovea), even though the user changed his location twice (Figure 13).
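As an illustration of this step, the snippet below uses a standard OpenCV Haar cascade (not necessarily the exact detector referenced as [28]) to locate the largest face and measure its offset from the image center; this offset would then drive the pan/tilt commands that keep the user in the fovea.

```python
import cv2

# Standard OpenCV Haar cascade; the exact detector used on the robot may differ.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_offset_from_center(frame_gray):
    """Return the (dx, dy) pixel offset of the largest detected face
    from the image center, or None if no face is found."""
    faces = cascade.detectMultiScale(frame_gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Pick the largest face, assumed to be the closest user.
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    face_cx, face_cy = x + w / 2.0, y + h / 2.0
    img_cy, img_cx = frame_gray.shape[0] / 2.0, frame_gray.shape[1] / 2.0
    return face_cx - img_cx, face_cy - img_cy
```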

4.3. Advanced Human-Robot Interaction

In this stage, we applied the two proposed models together on the robot. The robot was set in an office-like environment with a number of students moving around. As mentioned earlier, in addition to the models presented in this study, the robot also ran a simple face recognition program (Face Recognition OpenCV [28]) and a speech recognition program from Microsoft Corporation (Microsoft-speech [29]).

At the initial time, the robot looked around randomly, giving priority to moving objects. When a user (user-1) approached the robot and started a conversation, the robot gave attention to the user by centering his face in its fovea and started responding to the conversation. During the conversation, the robot always attempted to keep the user in its fovea, training its masks, as long as the user was giving it attention. The robot also ignored the other users who were moving around, even though it was aware of them, owing to the proposed control system (Figure 9).

After some time, another user (user-2) approached the robot and joined the conversation. The robot could successfully interact with both users at the same time in a natural way. While giving attention to user-2, the robot kept memorizing the location of user-1 in its mental map so that it could return to him and continue the conversation. After a period of time, user-2 left the scene, and the robot gave all its attention back to user-1 (Figure 14).

As we can see from this scenario, the robot can maintain the user's face in a specific location of its vision (fovea), which allows our dynamic edge detection algorithm to converge and produce accurate edges from the user's face; these edges can then be used for facial expression recognition and user identification as a higher level of human-robot interaction.

5. Conclusion

This work is part of a series of studies that aim to develop a human-like controller capable of resembling living organism behaviors in certain specific characteristics (Figure 1). More precisely, this study is concerned with enhancing the level of human-robot interaction by developing a bioinspired vision system.

In the first stage, we enhanced the robot's vision accuracy in real time by applying a novel dynamic edge detection algorithm abstracted from the roles that the horizontal cells play in the mammalian retina. The algorithm was constructed from multiple masks, each of which was represented by a two-layer artificial neural network and applied in parallel on the robot's vision. The synaptic weights in each mask were updated gradually over time based on the edges found in the image.

In order to support the first algorithm, in the second stage we improved the robot's tracking ability, in an attempt to keep the subject in a certain area of the fovea so that the weights in the masks maintain their stability, by designing a variant photoreceptor distribution similar to that in the human vision system. In this distribution, edge detection processes lie more densely in the central region, called the fovea, and become gradually sparser toward the periphery, and vice versa for motion detection. The distributed nature of the network would also allow a parallel implementation, making real-time frame-rate processing a definite possibility, which helps in implementing the robot's eye movements.

The experimental results focused on examining the validity of the proposed model in achieving edge detection in an efficient manner with less noise than a static model. We believe that our proposed method would be efficient for any dynamic application where moving edges must be detected continuously. The robot could give more attention to the main subject and at the same time stay aware of the potential targets moving around it. The robot was able to build a mental map that helped it to be aware of the frontal users and to build a positive interaction with them.

We believe that, beyond building a model inspired by a biological concept as a new solution for robotics, the more important goal of this study is to gain better insight into how the brains of living systems operate to solve this sort of problem.

Acknowledgments

This work was supported by grants to K. Murase from the Japan Society for the Promotion of Science and from the University of Fukui.