Abstract

Humans make about three saccades per second, with the eyeball reaching speeds of 700 deg/sec, to reposition the high-acuity fovea on the targets of interest and build up an understanding of a scene. The brain's visuosaccadic circuitry uses the oculomotor command of each impending saccade to shift receptive fields (RFs) to cortical locations before the eyes take them there, giving a continuous and stable view of the world. We have developed a model for image representation based on the projective Fourier transform (PFT), intended for robotic vision, which may efficiently process visual information during the motion of a camera with a silicon retina that resembles saccadic eye movements. Here, the related neuroscience background is presented, the effectiveness of the conformal camera's non-Euclidean geometry in intermediate-level vision is discussed, and the algorithmic steps in modeling perisaccadic perception with PFT are proposed. Our modeling utilizes two basic properties of PFT. First, PFT is computable by FFT in complex logarithmic coordinates that also approximate the retinotopy. Second, the shift of RFs in retinotopic (logarithmic) coordinates is modeled by the shift property of the discrete Fourier transform. The perisaccadic mislocalization observed by human subjects in laboratory experiments is then a consequence of the fact that the RF shifts take place in logarithmic coordinates.

1. Introduction

In this article, we demonstrate that a mathematical data model we have developed for image representation in biologically mediated machine vision systems [16] may be useful for processing visual information during the motion of a camera with a silicon retina that resembles saccadic eye movements. Our data model is based on the projective Fourier analysis that we have constructed in the framework of the representation theory of the Lie group SL(2, C) by restricting the group representations to the image plane of the conformal camera—the camera with image projective transformations given by the action of SL(2, C) [4, 5]. The analysis provides an efficient image representation that is well adapted to (a) the projective transformations of retinal images and (b) the retinotopic mappings of the brain's oculomotor and visual pathways. This latter assertion stems from the fact that the projective Fourier transform (PFT) is computable by a fast Fourier transform (FFT) algorithm in coordinates given by a complex logarithm that transforms PFT into the standard Fourier integral while simultaneously approximating the local retinotopy [7]. Consequently, PFT of the conformal camera integrates the head, eyes, and retinotopic mapping of the visual pathways into a single computational binocular system [6]. As already suggested in [8], this integrated system may efficiently model visuomotor processes during saccadic eye movements, which reposition the high-acuity fovea—the retinal region whose central angle subtends about a US quarter held at arm's length—on the targets of interest to build up an understanding of a scene.

Humans make about three saccades per second, with the eyeball's maximum speed reaching 700 deg/sec, producing a vast number of saccades per day [9]. Since we are not aware of these fast-moving retinal images, the brain, under normal circumstances, suppresses visual sensitivity during saccadic eye movements and compensates for these incessant interruptions. This visual stability is maintained by the brain's widespread neural network [10]. Converging evidence from psychophysics, functional neuroimaging, and primate neurophysiology indicates that the most attractive neural basis for visual stability is the mechanism causing visual cells in various visual and visuomotor cortical areas to respond to stimuli that will fall in their receptive fields (RFs) before the eyes move them there, commonly referred to as the shifting RF mechanism [11–15]. The identification of the visuosaccadic pathways (see references in [10]) supports the idea that the brain uses a copy of the oculomotor command of the impending saccade, referred to as efference copy or corollary discharge (see [16] for a review), to transiently shift the RFs of stimuli. This shift of RFs, starting 50 ms before the saccade onset and ending 50 ms after the saccade landing, is hypothesized to update (or remap) the retinotopic maps in anticipation of each upcoming saccade. In fact, in a recent experiment [17], when human subjects shifted fixation to a clock, the time they reported was earlier than the actual time on the clock by about 40 ms.

Although interruptions caused by saccades remain unnoticed in daily life, in laboratory experiments it becomes possible to probe the unexpected consequences of saccadic eye movements. Specifically, laboratory experiments in lit environments have shown that probes flashed briefly around the saccade's onset are perceived as compressed toward the saccadic target [18–21], while, in total darkness, the probes' localizations are characterized by a uniform shift in the direction of the saccade [22–24]. The experimental studies [25–27] investigating the influence of the saccade's parameters on perisaccadic mislocalization showed that perisaccadic visual compression and the unidirectional shift are probably governed by different neural processes. Although the perisaccadic shift can be largely explained by delays in the processing of visual information [24], the mechanism of perisaccadic compression, commonly related to the neural processes of the RF shift [28], remains relatively elusive.

In this article, we argue that the conformal camera's complex projective geometry and the related harmonic analysis (projective Fourier analysis) may be useful in modeling perisaccadic perception. In particular, the image representation in terms of PFT may efficiently model the RF shift that remaps cortical retinotopy in anticipation of each saccade and the related phenomenon of perisaccadic compression of perceptual space. During fixations, the brain acquires visual information, resolving the inconsistencies of the brief compression resulting from remapping. The computational significance of this remapping, when incorporated into the neural engineering design of a foveate visual system, stems from the fact that it may integrate visual information from an object across saccades, eliminating the need to start visual information processing anew at each fixation, three times per second, and speeding up the costly process of visual information acquisition [29]. The transfer of object features across saccadic eye movements [30–33], which is believed to maintain the visual stability of trans-saccadic perception, is not considered here because, at present, not much is known about the whole process [13].

This paper is organized as follows. We outline the neural processes of the visuosaccadic system involved in the preparation, execution, and control of saccadic eye movements in Section 2 and continue this discussion in Section 6. In Sections 3 and 4, we lay out the background that explains the mathematical tools we use in modeling the human visual system. To this end, in Section 3, we introduce the conformal camera, discuss its conformal geometry, and evaluate the effectiveness of this geometry in the computational aspects of early- and intermediate-level vision for natural scene understanding. Then, in Section 4, we show that the conformal camera possesses its own harmonic analysis—projective Fourier analysis—which provides an image representation given in terms of the discrete PFT (DPFT) that is fast computable by FFT in coordinates given by a complex logarithm. Section 5 deals with the implementation of the DPFT in a retinocortical image representation that efficiently integrates the head, eyes, and retinotopic maps into one computational system. We also mention hardware setups that could be supported by DPFT-based software and compare conformal camera-based modeling to other approaches in foveate vision. Using this integrated visual system, in Section 6, we model perisaccadic perception, including the perisaccadic compression observed in psychophysical laboratory experiments. Finally, in Section 7, we compare our model with other numerical approaches to perisaccadic perception and discuss directions for advancing our modeling. The paper is summarized in the last section.

2. The Visuosaccadic System

One of the most important functions of any nervous system is sensing the external environment and responding in a way that maximizes immediate survival chances. For this reason, perception and action have evolved in mammals by supporting each other's functions. This functional link between visual perception and oculomotor action is well demonstrated in primates when they execute eye-scanning movements (saccades) to overcome the eye's acuity limitation in building up scene understanding.

In fact, humans can only see clearly the central part of the visual field, of about a 2 deg central angle. This region is projected onto the central fovea, where its image is sampled by the hexagonal mosaic of photoreceptors consisting mainly of cone cells, the color-selective type of photoreceptors for sharp daylight vision. Visual acuity decreases rapidly away from the fovea because the distance between cones increases with eccentricity as they are outnumbered by rod cells, the photoreceptors for low-acuity, black-and-white night vision. Moreover, there is a gradual loss of hexagonal regularity in the photoreceptor mosaic and an increased convergence of photoreceptors on the ganglion cells whose axons carry visual information from the eye to the brain. For example, at the eccentricity bounding the most visually useful region of the retina, acuity has already dropped markedly. In Figure 1, panel (b) shows a progressively blurred version of the image in (a), simulating the progressive loss of acuity with eccentricity.

With three saccades per second, the saccadic eye movement is the most common bodily movement. The eyes remain relatively still between consecutive saccades for about 180–320 ms, depending on the task performed. During this period, the image is processed by the retinal circuitry and sent, mainly, to the visual cortex (starting with the primary visual cortex, or V1, and reaching higher cortical areas, including the cognitive areas), with a minor part sent to the oculomotor midbrain areas. During the saccadic eye movement, visual sensitivity is markedly reduced, although some modulations at low spatial frequencies (contrast and brightness) are well preserved or even enhanced [35]. This phenomenon is known as saccadic suppression. The sequence of saccades, fixations, and, often, smooth-pursuit eye movements for tracking a slowly moving small object in the scene is called the scanpath and was first studied in [36]. In Figure 1(c) we depict the scanpath that the eyes might actually take to build up an understanding of the scene.

Although they are the simplest of bodily movements, saccades are controlled by a widespread neural network that involves nearly every level of the brain. In Figure 2 we show the diagram of well-established pathways of the primate visuosaccadic system. They include, most prominently, the superior colliculus (SC) of the midbrain for representing possible saccade targets, and the parietal eye field (PEF) and the frontal eye field (FEF) in the parietal and frontal lobes of the neocortex (which obtain inputs from many visual cortical areas) for assisting the SC in the control of involuntary (PEF) and voluntary (FEF) saccades. These areas also project to the simple neural circuits in the brainstem reticular formation (pons and midbrain) that ensure the saccade's outstanding speed and precision. The course of events in the visuosaccadic system, which is based on [10], is outlined in the caption of Figure 2.

Although many of the neural processes involved in saccade generation and control are amenable to precise quantitative studies [37], some neural processes of the visuosaccadic system remain virtually unknown. Saccadic suppression, the fact that we do not see the moving retinal images, is barely understood. There is accumulating evidence that viewers integrate information poorly across saccades during tasks such as reading, visual search, and scene perception [38]. This means that, three times per second at each fixation, there are large instantaneous changes in the retinal images, with almost no information consciously carried between fixations. Furthermore, because the target selection for the next voluntary saccade takes place in higher cortical areas involving cognitive processes [39], the time needed for the oculomotor system to plan and execute the saccadic eye movement could be as long as 150 ms. Therefore, it is critical that visual information is efficiently acquired during each fixation without repeating much of the whole process, since that would require excessive computational resources. However, visual constancy, the fact that we are not aware of any discontinuity in scene perception when executing the scanpath, is not perfect. About 50 ms before the onset of the saccade, during the saccadic movement (about 30 ms), and for about 50 ms after the saccade, salient stimuli are not perceived at veridical locations. In particular, a transient compression around the saccade target, called perisaccadic mislocalization, is observed in lit laboratory experiments; see Section 1. We continue this discussion in Section 6, where we present the algorithm for modeling some of the neural processes underlying perisaccadic perception.

3. The Conformal Camera, Geometry, and Perception

We model the human eye's imaging functions with the conformal camera; the origin of the name will be explained later. In the conformal camera, shown in Figure 3, the retina is represented by the image plane x_2 = 1 with complex coordinates z = x_3 + i x_1, on which a 3D scene is projected under the mapping

j(x_1, x_2, x_3) = (x_3 + i x_1)/x_2.    (1)

The implicit assumption x_2 > 0 will be removed later.

The image projective transformations are generated by the two basic transformations shown in Figure 3. Both transformations have the form of linear-fractional transformations z -> (az + b)/(cz + d) with ad − bc = 1. Therefore, all finite iterations of these mappings give the group SL(2, C) acting on the image plane of the conformal camera by linear-fractional, or Möbius, transformations

σ_A(z) = (az + b)/(cz + d),  A = [[a, b], [c, d]] ∈ SL(2, C).    (3)

Because ±A have the same action, we need to identify matrices in SL(2, C) that differ in sign. The result is the quotient group PSL(2, C) = SL(2, C)/{±I}, where I is the identity matrix, and the action (3) establishes a group isomorphism between the linear-fractional, or Möbius, mappings and PSL(2, C). Thus, f(σ_{A^{-1}}(z)) gives the image projective transformations of the intensity function f(z).
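To make the group action concrete, the following minimal Python/NumPy sketch (with an arbitrarily chosen SL(2, C) matrix; all values are illustrative and not from the original text) applies a linear-fractional mapping to image-plane points and checks numerically that A and −A act identically, which is the reason for passing to the quotient PSL(2, C):

```python
import numpy as np

def mobius(A, z):
    # The action in (3): sigma_A(z) = (a z + b) / (c z + d).
    a, b, c, d = A.ravel()
    return (a * z + b) / (c * z + d)

# An example SL(2, C) matrix: det A = (1+1j)*(1-1j) - 1*1 = 1.
A = np.array([[1 + 1j, 1], [1, 1 - 1j]])

z = np.array([0.3 + 0.2j, -1.0 + 0.5j, 2.0 - 1.0j])  # image-plane points
print(np.allclose(mobius(A, z), mobius(-A, z)))       # True: A and -A give the same
                                                      # mapping, hence PSL(2, C) = SL(2, C)/{±I}
```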

3.1. Geometry of the Conformal Camera

In the homogeneous coordinate framework of projective geometry [40], the conformal camera is embedded into the complex plane C², with the image plane identified with the complex lines {c(z, 1) : c ∈ C}.

In this embedding, the "slopes" of the complex lines {c(z, 1) : c ∈ C} can be numerically identified with the points z of the extended image plane Ĉ = C ∪ {∞}, where ∞ corresponds to the line {c(1, 0) : c ∈ C}. In fact, for the line through (z_1, z_2) = (x_3 + i x_1, x_2), the slope z = z_1/z_2 is the point at which the ray (line) in C² that passes through the origin intersects the image plane of the conformal camera, in agreement with (1). Then, the standard action of the group SL(2, C) on nonzero column vectors (z_1, z_2)^T implies that the slope z is mapped to the slope σ_A(z) = (az + b)/(cz + d), agreeing with the mappings in (3). However, the action (3) must be extended to include the line of "slope" ∞ as follows:

σ_A(∞) = a/c,  σ_A(−d/c) = ∞  (with σ_A(∞) = ∞ if c = 0).    (4)

The stereographic projection (with the image plane as in (1)) establishes the isomorphism Ĉ ≅ S² and gives a concrete meaning to the point ∞, so that it can be treated as any other point. The set Ĉ is referred to as the Riemann sphere, and the group acting on it by (3) and (4) consists of the bijective meromorphic mappings of Ĉ [41]. Thus, it is the group of holomorphic automorphisms of the Riemann sphere that preserve the intrinsic geometry imposed by the complex structure, known as Möbius geometry [42] or inversive geometry [43].

The mappings in (4) are conformal; that is, they preserve the oriented angles between two tangent vectors intersecting at a given point [41]. Because of this property, the camera is called "conformal". Although the conformal part of an image projective transformation can be removed at almost no computational cost, leaving only a perspective transformation of the image (see [4, 5]), conformality provides an advantage in imaging: conformal mappings rotate and dilate the image's infinitesimal neighborhoods and, therefore, locally preserve the image pixels.

The image plane of the conformal camera does not admit a distance that is invariant under the image projective (linear-fractional, or Möbius) transformations. Therefore, the geometry of the conformal camera does not possess a Riemannian metric; for instance, there is no curvature measure. It is customary in complex projective (Möbius, or inversive) geometry to consider a line as a circle passing through the point ∞. Then, the fundamental property of this geometry can be expressed as follows: linear-fractional mappings map circles to circles [41]. Thus, circles can play the role of geodesics.
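The circle-preservation property is easy to verify numerically. In the following sketch (reusing the same hypothetical matrix as above), points sampled on a circle are mapped by a Möbius transformation, and the images are checked to be concyclic by fitting the circle equation x² + y² + Dx + Ey + F = 0 in the least-squares sense:

```python
import numpy as np

def mobius(A, z):
    a, b, c, d = A.ravel()
    return (a * z + b) / (c * z + d)

A = np.array([[1 + 1j, 1], [1, 1 - 1j]])            # det A = 1

t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
z = 2 + 1j + 1.5 * np.exp(1j * t)                   # circle: center 2+i, radius 1.5
w = mobius(A, z)                                    # its Moebius image

# Fit x^2 + y^2 + D x + E y + F = 0 through the image points; tiny residuals
# mean the image points again lie on a circle.
x, y = w.real, w.imag
B = np.column_stack([x, y, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(B, -(x**2 + y**2), rcond=None)
print(np.max(np.abs(x**2 + y**2 + B @ coef)))       # ~1e-12
```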

3.2. The Conformal Camera and the Intermediate-Level Vision

As discussed above, circles play a crucial role in the geometry of the conformal camera. If this camera is to be relevant to modeling primate visual perception, this role should be reflected in the psychological and computational aspects of natural scene understanding.

Neurophysiological experiments demonstrate that the retina filters impinging images, extracting local contrast spatially and temporally. For instance, center-surround cells at the retinal processing stage are triggered by local spatial changes in intensity, referred to as edges or contours. This filtering is enhanced in the primary visual cortex, the first cortical area receiving the retinal output. This area is itself a case study in the dense packing of overlapping visual submodalities: motion, orientation, frequency (color), and ocular dominance (depth). In psychological tests, humans easily detect a significant change in spatial intensity (low-level vision), and effortlessly and unambiguously group this usually fragmented visual information (contours of occluded objects, for example) into coherent, global shapes (intermediate-level vision). Considering its computational complexity, this grouping is one of the most difficult problems that the primate visual system has to solve [44].

The Gestalt phenomenology and quantitative psychological measurements established the rules, summarized in the ideas of good continuation [45, 46] and the association field [47], that determine interactions between fragmented edges such that they extend along continuous contours, joining the edges in the way they will normally be grouped together to faithfully convey a scene's meaning. Evidence accumulated in psychological and physiological studies suggests that the human visual system utilizes a local grouping process (association field) with two very simple rules: collinearity (receptive fields aligned along a line) and cocircularity (receptive fields aligned along a circle, with the preferred orientation orthogonal to the tangents of the circle [48]), with underlying scale-invariant statistics for both geometric arrangements in natural scenes. These rules were confirmed in [49, 50] by statistical analysis of natural scenes. Two basic intermediate-level descriptors that the brain employs in grouping elements into global objects are the medial axis transformation [51], or symmetry structure [52, 53], and the curvature extrema [54, 55]. In fact, the medial axis, which the visual system extracts as a skeletal (intermediate-level) representation of objects [56], can be defined as the set of the centers of maximal circles inscribed inside the contour, and the curvatures at the corresponding points of the contour are given by the inverse radii of these circles.

This discussion shows that the conformal camera should effectively model the eye's imaging functions related to low- and intermediate-level vision of natural scenes.

4. Projective Fourier Analysis

Projective Fourier analysis has been constructed by restricting geometric Fourier analysis of SL(2, C)—a direction in the representation theory of the semisimple Lie groups [57]—to the image plane of the conformal camera [5]. The resulting projective Fourier transform (PFT) of a given image intensity function f is the following:

\hat f(s, k) = ∫∫ f(z) (z/|z|)^{−k} |z|^{−is−1} dz,    (5)

where (s, k) ∈ R × Z and, if z = x + iy, then dz = dx dy is the Euclidean measure on the image plane.

In log-polar coordinates (u, θ) given by z = e^{u + iθ}, (5) takes on the form of the standard Fourier integral

\hat f(s, k) = ∫∫ e^u f(e^{u + iθ}) e^{−i(us + kθ)} du dθ.    (10)

Inverting it, we obtain the representation of the image intensity function in the (u, θ)-coordinates:

e^u f(e^{u + iθ}) = (1/(2π)²) Σ_k ∫ \hat f(s, k) e^{i(us + kθ)} ds,    (11)

where the sum extends over all integers k. We stress that although f(z) and f(e^{u + iθ}) are numerically equal, they are given on different spaces.

4.1. Discrete Projective Fourier Transform

In spite of the logarithmic singularity of log-polar coordinates, the PFT of any function f integrable on the image plane is finite:

|\hat f(s, k)| ≤ ∫∫ |f(re^{iθ})| dr dθ < ∞.    (12)

This observation is crucial in constructing the discrete PFT as follows. By removing a disk of radius r_0 around the origin, we can regularize f such that the support of f is contained within the annulus r_0 ≤ |z| ≤ R, and approximate the integral in (10) by a double Riemann sum with partition points

(u_k, θ_l) = (ln r_0 + kδ, 2πl/N),  0 ≤ k ≤ M − 1,  0 ≤ l ≤ N − 1,  δ = (1/M) ln(R/r_0).    (14)

Then, introducing f_{k,l} = f(e^{u_k + iθ_l}) and defining \hat f_{m,n} by this Riemann sum, we obtain

\hat f_{m,n} = (2πδ/N) Σ_{k=0}^{M−1} Σ_{l=0}^{N−1} e^{u_k} f_{k,l} e^{−i2πmk/M} e^{−i2πnl/N},    (16)

f_{k,l} = (1/(2πδM)) e^{−u_k} Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} \hat f_{m,n} e^{i2πmk/M} e^{i2πnl/N}.    (17)

Both expressions (16) and (17) can be computed efficiently by FFT algorithms.
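As a sanity check of this computational pattern, the following Python/NumPy sketch (the grid sizes, disc radii, and test image are our own illustrative choices) computes the DPFT as a standard 2D FFT of the exponentially weighted log-polar samples e^{u_k} f_{k,l} and recovers the samples with the inverse FFT, mirroring (16) and (17) up to the constant normalizing factor:

```python
import numpy as np

# Log-polar grid (14): u_k = ln r0 + k*delta, theta_l = 2*pi*l/N.
r0, R, M, N = 0.1, 1.0, 64, 128
delta = np.log(R / r0) / M
u = np.log(r0) + delta * np.arange(M)
theta = 2 * np.pi * np.arange(N) / N
z = np.exp(u[:, None] + 1j * theta[None, :])    # grid points z_{k,l} = e^(u_k + i*theta_l)

f = np.exp(-np.abs(z - 0.4)**2 / 0.05)          # test intensity: a Gaussian blob at z = 0.4

g = np.exp(u)[:, None] * f                      # exponential weight from the measure
f_hat = np.fft.fft2(g)                          # DPFT (16), up to the factor 2*pi*delta/N

g_back = np.fft.ifft2(f_hat).real               # IDPFT (17) recovers the weighted samples
print(np.allclose(g, g_back))                   # True
```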

On introducing the complex coordinates z_{k,l} = e^{u_k + iθ_l} into (16) and (17), these expressions are referred to as the discrete projective Fourier transform (DPFT) and its inverse (IDPFT), respectively [4, 5]. When the "pixel" locations z_{k,l} are transformed by the conformal camera's action of SL(2, C), so that the function f undergoes the projective transformation f(z) -> f(σ_{A^{-1}}(z)), its representation in (17) is given in terms of the transformed pixels and weights (instead of z_{k,l} and e^{u_k}) but with the coefficients \hat f_{m,n} unchanged. We refer to these representation transformations as the projectively adapted characteristics of the projective Fourier transforms [4, 5]. The projective transformations are not given explicitly here since they are not used in this work.

5. DPFT in Computational Vision

We discussed above the relevance of the conformal camera to the intermediate-level vision task of grouping image elements into individual objects in natural scenes. Here we discuss the relevance of the DPFT-based data model of image representation to image processing in biologically mediated machine vision systems.

5.1. Modeling the Retinotopy with DPFT

The mapping w = ln(z ± a), where a > 0 removes the logarithmic singularity and the two signs indicate the left or right brain hemisphere, is the accepted approximation of the retinotopic structure of primate visual cortical areas and the midbrain SC [7, 58]. However, the DPFT that provides our data model for image representation can be efficiently computed by FFT only in log-polar coordinates given by the complex logarithm w = ln z. This mapping has distinctive rotational and zoom symmetries important in image identification: rotations and dilations in the retinal space correspond to translations in the cortical (log-polar) space. Thus, we see that the Schwartz model of the retina comes with a drastic consequence: it destroys the rotation and zoom symmetries.

The following facts support our modeling of the retinotopy with DPFT. First, for small |z|, ln(z ± a) is approximately linear, while, for large |z|, it is dominated by ln z. Secondly, to construct the discrete sampling for DPFT, the image is regularized by removing a disc representing the fovea, which is possible because PFT in log-polar coordinates does not have a singularity at the origin; see (12). Thirdly, there is accumulated evidence pointing to the fact that the fovea and the periphery have different functional roles in vision [59, 60] and very likely involve different underlying principles of image processing. Finally, by the split theory of hemispherical image representation, the foveal region has a discontinuity along the vertical meridian, with each half processed in a different brain hemisphere [61].
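The symmetry argument above can be checked in a few lines. In the following sketch (the value of the Schwartz parameter a and the test points are illustrative), a rotation of the retinal image produces a constant cortical translation under w = ln z, but not under the Schwartz mapping w = ln(z + a):

```python
import numpy as np

a = 0.3                                    # illustrative Schwartz foveal parameter
rot = np.exp(1j * np.deg2rad(30))          # rotate the retinal image by 30 degrees

for z in (0.5 * np.exp(0.3j), 1.5 * np.exp(1.2j)):
    shift_logpolar = np.log(rot * z) - np.log(z)          # always ~0 + 0.5236j
    shift_schwartz = np.log(rot * z + a) - np.log(z + a)  # varies with z
    print(shift_logpolar, shift_schwartz)

# Under ln z the rotation is a pure translation in the theta direction, the same
# for every z; under ln(z + a) the displacement depends on z, so the rotation
# symmetry is destroyed.
```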

We conclude this discussion with the following remark. Both models discussed above, as well as all other similar models, are, in fact, fovea-less models [62]. However, because the fovea is explicitly removed in our model, we plan to complement it in the future with a foveal image representation.

5.2. Image Sampling for DPFT

The DPFT approximation was obtained using the rectangular sampling grid (u_k, θ_l) in (14), corresponding, under the mapping z = e^{u + iθ}, to a nonuniform sampling grid with equal sectors but with exponentially increasing radii

r_k = r_0 e^{kδ},  k = 0, 1, ..., M − 1,    (21)

where δ is the spacing and r_0 is the radius of the disc that has been removed to regularize the logarithmic singularity. This sampling interface is shown in Figure 4.

Let us assume that we have been given a picture of size A × B in pixel units, displayed with α dots per unit length (dpl). Then the physical dimensions, in the chosen unit of length, of the pixel and the picture are 1/α and A/α × B/α, respectively. Also, we assume that the origin of the retinal coordinates (the fixation) is at the picture's center.

The central disc of radius r_0 represents the fovea with a uniform distribution of grid points, the number of foveal pixels being determined by the physical pixel size 1/α. This means that the fovea cannot increase the resolution, which is related to the distance of the picture from the eye. The number of sectors N is obtained from the condition that the arc spacing at the foveal boundary equals the pixel size, 2πr_0/N = 1/α; here N is taken as the closest integer to 2παr_0. To get the number of rings M, we assume that the radial (log) spacing equals the angular spacing, δ = 2π/N, and that the grid extends to the maximal eccentricity R = A/2α. Thus, M = ⌈ln(R/r_0)/δ⌉.

Example 1. For a concrete choice of the picture size, display resolution, and foveal radius, the formulas above determine the spacing δ, the number of sectors N, and the number of rings M, and hence the sampling grid of points in polar coordinates (r_k, θ_l); a worked instance with assumed values is sketched below.
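Since the numerical values of the original example are not reproduced here, the following Python sketch works out one hypothetical instance; the picture size, resolution, foveal radius, and the specific design rule tying the number of sectors to the pixel size are all our own assumptions:

```python
import numpy as np

# Hypothetical picture: 512 x 512 pixels at 10 dots per mm.
A_px, alpha = 512, 10.0       # pixels per side, dots per mm
d = 1.0 / alpha               # physical pixel size: 0.1 mm
r0 = 1.0                      # assumed foveal radius, mm (the removed disc)
R = A_px / (2 * alpha)        # maximal eccentricity covered: 25.6 mm

# Assumed design rule: the innermost arc length matches the pixel size,
# so the log-polar grid is locally square at the foveal boundary.
N = int(round(2 * np.pi * r0 / d))        # number of sectors: 63
delta = 2 * np.pi / N                     # log-radial spacing equal to angular spacing
M = int(np.ceil(np.log(R / r0) / delta))  # number of rings: 33

r = r0 * np.exp(delta * np.arange(M))     # radii (21): r_k = r0 * e^(k*delta)
print(N, M, N * M, A_px**2)               # 63 33 2079 262144: a large pixel reduction
```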

5.3. Imaging with the Conformal Camera

In the example from the previous section, the combined foveal and peripheral representations of the image contain only a small fraction of the pixels of the original image. However, this reduction in the number of pixels comes at a price: the small central region, the only one with the resolution required for clear vision, has to be removed to regularize the logarithmic singularity. Therefore, the conformal camera with DPFT-based image processing can, in its present state of development, support only the peripheral imaging functions of the visual system.

The eye's most basic and frequent imaging functions are connected with the incessant saccadic eye movements (about three per second). The neural mechanisms of the RF shifts and of perisaccadic mislocalization, hypothesized to be involved in maintaining visual stability, are supported mainly by peripheral visual processing. We use DPFT to model these phenomena in Section 6. The DPFT-based image representation could support the following hardware setup. A set of samples f_{k,l} of an image f, taken at the grid points z_{k,l}, is obtained from a camera with anthropomorphic visual sensors (silicon retina) [63] or from an "exp-polar" scanner with a sampling geometry similar to the distribution density of the retinal ganglion cells. The DPFT in (16) is applied to the samples, and \hat f_{m,n} is efficiently computed with FFT. Next, the IDPFT of \hat f_{m,n}, given in (17), is again computed with FFT. This output from the IDPFT renders the retinotopic image (numerically equal to f_{k,l}) of the retinal samples in the cortical log-polar coordinates.

When the eyes remain fixed, motion of objects is perceived by the successive stimulation of adjacent retinal loci. These image transformations are modeled in the conformal camera by the corresponding covariant transformations of the image representation in terms of DPFT; see the end of Section 4.1. These transformations are not important in modeling perisaccadic perception and are not dealt with in this work. However, they will become important in modeling smooth-pursuit eye movements, which we plan to undertake in the near future.

5.4. Other Approaches to Foveate Vision

Of the numerical approaches to foveate (also called space-variant) vision, involving, for example, the Fourier-Mellin transform or the log-polar Hough transform, the most closely related to our work are the results reported by Schwartz's group at Boston University. We note that the approximation of the retinotopy by a complex logarithm was first proposed by Eric Schwartz in 1977. This group introduced the fast exponential chirp transform (FECT) [64] in their attempt to develop numerical algorithms for space-variant image processing. Both FECT and its inverse were obtained by a change of variables in the spatial and frequency domains applied to the standard Fourier integrals. The discrete FECT was introduced somewhat ad hoc, and some basic components of Fourier analysis, such as the underlying geometry or the Plancherel measure, were not considered. In comparison, the projective Fourier transform (PFT) provides an efficient image representation well adapted to the projective transformations produced in the conformal camera by the group SL(2, C) acting on the image plane by linear-fractional mappings. Significantly, PFT is obtained by restricting geometric Fourier analysis of the Lie group SL(2, C) to the image plane of the conformal camera; thus, the conformal camera comes with its own harmonic analysis. Moreover, PFT is computable by FFT in the log-polar coordinates given by a complex logarithm that approximates the retinotopy. This implies that PFT can integrate the head, eyes, and visual cortex into a single computational system. This aspect is discussed, with special attention to perisaccadic perception, in the remaining part of the paper. Another advantage of PFT is the complex (conformal) geometric analysis underlying the conformal camera; we demonstrated in Section 3.2 the relevance of this geometry to the intermediate-level vision problem of grouping local contours into individual objects of natural scenes.

The other approaches to space-variant vision use geometric transformations, mainly based on a complex logarithmic function, between the nonuniform (retinal) sampling grid and the uniform (cortical) grid for the purpose of developing computer programs for problems in robotic vision. We give only a few examples of such problems: tracking [65], navigation [66], detection of salient regions [67], and disparity estimation [68]. However, in contrast to our projectively covariant image processing carried out with FFT, they share the high computational costs of the geometric transformation process for dynamic scenes.

6. Perisaccadic Perception with DPFT

A sequence of fast saccadic eye movements is necessary to process the details of a scene by successively fixating the fovea on the targets of interest. Given the frequency of three saccades per second and limited computational resources, it is critical that visual information is acquired efficiently, without restarting much of the acquisition process at each fixation. This is critically important in robotic designs based on the foveate vision architecture (silicon retina), and in this section we propose the front-end algorithmic steps for addressing this problem.

The model of perisaccadic perception presented in this section is based on the theory in [29], which states (as is most classically assumed) that an efference copy generated by the SC, a copy of the oculomotor command to rotate the eyes in order to execute the saccade, is used to briefly shift the flashed probes' RFs in FEF/PEF toward the cortical fovea. Because the shift occurs in logarithmic coordinates approximating the retinotopy, the model can also explain the perisaccadic compression observed in laboratory experiments and shown in Figure 5.

We recall the time course of events (Figure 2) that we are going to model. During the eyes' fixation, lasting on average about 180–320 ms (Section 2), the retinal image is sampled by the ganglion cells and sent to cortical areas, including the higher areas in the parietal and frontal lobes. In particular, the next saccade's target is selected in the PEF/FEF areas and its position is computed in the subcortical SC. About 50 ms before the onset of the saccade, during the saccade (30 ms), and for about 50 ms after the saccade, visual sensitivity is reduced, and probes flashed around the impending saccade's target are not perceived at veridical locations; see Figure 5. Instead, a copy of the oculomotor command (efference copy) is used to translate the receptive fields of the flashes recorded in the fovea-centered frame of the current fixation, remapping them into a target-centered frame. This internal remapping results in the illusory compression of the flashes about the target. The cortical locations of the neural correlates of remapping are uncertain; it is only required that these areas be retinotopically organized. These areas, which most likely include PEF/FEF and V4 (and, to a progressively lesser degree, V3, V2, and V1), can be represented here by one retinotopic area [69].

6.1. The Model

The modeling steps are the following.

Step 1 (see Figure 6). The eye, initially fixated at F, is making a saccade to the target located at T. The four probes flashed around the upcoming saccade's target at T are projected onto the retina and sampled by the photoreceptor/ganglion cells to give the set of samples f_{k,l}. Next, the DPFT \hat f_{m,n} is computed by FFT in the log-polar coordinates (u, θ), where u = ln r. The inverse DPFT, computed again by FFT (the gray arrow), renders the image representation in Cartesian log-polar coordinates: the four dots in (u, θ)-coordinates. The fovea, which is shown in yellow in Figure 6, is not included in the log-polar coordinates, for these coordinates approximate only the extrafoveal part of the retina. For simplicity, we can take the radius of the fovea to be 1 so that the u-coordinate starts at u = 0.

Step 2 (see Figure 7). The log-polar image is multiplied by two characteristic set functions, χ_1 and χ_2; the domain of each is shown in Figure 7 in a different color, the blue-enclosed region for χ_1 and the green-enclosed region for χ_2. We obtain two images, f_1 = χ_1 f and f_2 = χ_2 f, representing the cortical half-images into which the image would be divided by the retinal vertical meridian after the fovea landed on the target at T. We recall that the characteristic function χ_S of a set S is defined by the following condition: χ_S takes on the value 1 on S and 0 outside of S. The blue-enclosed image is reflected both in the vertical axis (the line u = u_c, where u_c is the midpoint of the projection of the region onto the u-axis) and in the horizontal axis, and then translated (blue arrow), while the green-enclosed image is only translated in the u-direction (green arrow).

These transformations are shown on the left of the gray arrow in Figure 7, while the results of these transformations (red dots) are shown on its right. The translation in the (u, θ)-coordinates (blue arrow) is obtained by the shift property of the IDPFT, which can be computed by FFT:

f_{k−a,l−b} = (1/(2πδM)) e^{−u_{k−a}} Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} [e^{−i2π(ma/M + nb/N)} \hat f_{m,n}] e^{i2πmk/M} e^{i2πnl/N}.    (22)

The formula in (22) is the standard shift property of the discrete Fourier transform; the cortical image is translated by a pixels in the u-coordinate and by b pixels in the θ-direction, as the blue arrow shows in Figure 7. (Equivalently, the coordinate system is translated by −a pixels in the u-coordinate and −b pixels in the θ-direction.) In (22), the inverse discrete Fourier transform is applied to e^{−i2π(ma/M + nb/N)} \hat f_{m,n}, where δ is the spacing in the u-coordinate (see (21) in Section 5.2) and \hat f_{m,n} is the original Fourier transform with the normalizing area factor 2πδ/N of the log-polar coordinates.

Further, the image reflection about the vertical axis of the region can be done with two consecutive transformations, each computable with FFT. These transformations consist of the reflection (u, θ) -> (−u, −θ) followed by a compensating translation. We note that the reflection can be obtained by applying the Fourier transform twice to the original image, which follows directly from the definition of the Fourier transform. The red dots represent the peripheral receptive fields shifted to the frame centered at the upcoming saccade target.
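Both cortical operations of Step 2, the translation (22) and the reflection, reduce to standard DFT manipulations. The following sketch (the grid sizes and shifts are illustrative) moves a single activated "receptive field" on the (u, θ) grid with the DFT shift property and reflects it by applying the forward FFT twice:

```python
import numpy as np

M, N = 64, 128
img = np.zeros((M, N)); img[20, 40] = 1.0   # one activated RF at (u, theta) index (20, 40)

# Translation by (a, b) pixels via the shift property (22): multiply the spectrum
# by a phase ramp, then invert.
a, b = -7, 5                                # shift toward the cortical fovea in u
m = np.fft.fftfreq(M) * M                   # integer frequency indices
n = np.fft.fftfreq(N) * N
phase = np.exp(-2j * np.pi * (m[:, None] * a / M + n[None, :] * b / N))
shifted = np.fft.ifft2(np.fft.fft2(img) * phase).real
print(np.argwhere(shifted > 0.5))           # [[13 45]]: moved by (a, b), circularly

# Reflection (u, theta) -> (-u, -theta): applying the DFT twice reverses indices,
# F(F(img)) = M*N * img[-k mod M, -l mod N].
reflected = np.fft.fft2(np.fft.fft2(img)).real / (M * N)
print(np.argwhere(reflected > 0.5))         # [[44 88]]
```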

Step 3 (see Figure 8). The perisaccadic compression is obtained by decoding the cortical image representation back to the visual field representation:

z'_{k,l} = e^{u_k − aδ} e^{i(θ_l − b 2π/N)},

where a and b are the shifts in the u- and θ-coordinates, respectively. We see that, under the shift of the coordinate system by (aδ, b 2π/N), the original position of a dot at z = e^{u + iθ} is transformed to z' = e^{−aδ} e^{u + i(θ − b 2π/N)}, resulting in the compression shown in the scene in Figure 8, with red dots referenced by red arrows to the original positions of the flashed probes (black dots). Although this step is not supported by FFT, a commonly used look-up table [65] could efficiently decode the cortical image; see the discussion in Section 7.1. Where and how the brain accomplishes this step is the greatest mystery of primate perception.
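The geometric content of Step 3 is that a pure translation by aδ in the logarithmic u-coordinate decodes to a multiplication by e^{−aδ} in the visual field, that is, to a compression toward the (target-centered) origin. A minimal sketch, with illustrative probe positions and shift:

```python
import numpy as np

# Flashed probes in target-centered retinal (complex) coordinates.
probes = np.array([0.8 + 0.4j, -0.6 + 0.7j, -0.5 - 0.8j, 0.9 - 0.3j])

u, theta = np.log(np.abs(probes)), np.angle(probes)  # encode: cortical log-polar coordinates
shift_u = 0.4                                        # perisaccadic shift toward the cortical fovea

decoded = np.exp((u - shift_u) + 1j * theta)         # decode the shifted representation
print(decoded / probes)                              # all ~0.670 = e^(-0.4): uniform compression
                                                     # toward the saccade target
```

In an implementation, the exponential in the decoding step would typically be tabulated once over the (u, θ) grid, which is the look-up table approach mentioned above.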

In the modeling steps we presented above, the cortical translation shown by the green arrow in Figure 7 gives only a compression of the two corresponding probes in the scene, because that translation has no θ-component (b = 0). However, in the translation shown in Figure 7 by the blue arrow, b ≠ 0, and the corresponding image undergoes both compression and rotation, both needed to have the fovea at the center of the four red dots in the scene shown in Figure 8 when the saccade lands at T. Because of this rotation, we need the two reflections to preserve the original parity of the image.

Although we do not show quantitative results of the modeling steps here, the qualitative results can be seen if we translate the cortical image of the bar in Figure 4(b), say, to the left in the u-coordinate by a number of pixels (pixels are shown in the upper left corner), and mentally trace out its retinal copy using the square lattice in (a) and its log-polar image in (b). We can see that the bar in (a) will be compressed with respect to the origin of the retinal coordinates. We note that, as mentioned in Step 3, we cannot apply FFT to render this compressed retinal image.

The model presented in this section complements the theory proposed in [29]. Experimental results suggest that perisaccadic compression follows a very tight time course, with a duration of about 130 ms and with the maximum mislocalization immediately before the saccade. In the model we propose here, this saccadic dynamics can easily be accounted for: the distance of the shift (in terms of cortical pixels) can be taken as a function of time. Another aspect of perisaccadic compression that is accounted for in our modeling is the fact that not all RFs undergo the shift during the saccadic motion of the eyes. In Step 2, the translations are applied to selected (salient) retinotopic areas. These two aspects of perisaccadic perception are not supported in a natural way by the model presented in [29].

6.2. On Modeling Global Retinotopy

The global retinotopy reflects the anatomical fact that the axons in the optic nerve leaving each eye split along the retinal vertical meridian: the axons originating from the nasal half of the retina cross at the optic chiasm to the contralateral brain hemisphere and join the axons from the temporal half of the other eye's retina, which remain on their eye's side of the brain. This splitting and crossing reorganize the local retinotopy (log-polar mapping) such that the left brain hemisphere receives the right visual field projection and the right brain hemisphere receives the left visual field projection. According to the split theory [61, 70], which provides a greater understanding of cognitive processes in vision than the bilateral theory of overlapping projections, there is a sharp foveal split along the vertical meridian of the hemispherical cortical projections. The two hemispheric projections are connected by a massive corpus callosal bridge of about 200 million neuronal fibers [71].

Although it is crucial for synthesizing the 3D representation from the binocular disparities in the pair of 2D retinal images, we cannot fully address the global retinotopy here because the foveal image representation is not included in our modeling. However, the fovea-less global retinotopy can easily be modeled with DPFT by two reflections (cf. Step 2 in our modeling), both computable with FFT. Figure 9(b) shows the result of the first reflection applied to the peripheral region, and Figure 9(c) shows the result of the subsequent second reflection.

At this point, we can only show graphically what we expect to obtain when the foveal image representation complements the peripheral (log-polar) image representation that we have developed in terms of the projective Fourier transform. To this end, Figure 10 shows the peripheral region (gray) and the central foveal region (yellow). These two regions are connected by a transitional region (shaded with gray lines). The green curve in Figure 10 shows the cortical projection of a straight line making a fixed angle with the retinal horizontal meridian and passing through the center of the fovea.

7. Discussion

Our model, which is based on PFT, uses the approximation of the retinotopy given by the complex logarithmic mapping ln(z + a) with a = 0, where a represents, with the appropriate normalization, the radius of the foveal region removed in our modeling. This mapping transforms PFT into the standard Fourier integral that is computable by FFT, providing a benchmark of efficiency not available in any other computational modeling of perisaccadic perception. Although the foveal system is indispensable to primate vision, it is rather less important to the proper functioning of the visuosaccadic system. In fact, neuronal cells in FEF/PEF, the higher cortical areas implicated in the RF shift mechanism that receive retinotopic projections from the occipital cortex, have larger RFs and primarily code stimuli's spatial locations rather than their features [72]. We should note that, although we use the complex logarithmic approximation for all retinotopically organized brain areas, this approximation is well established only for the first visual cortical areas and the SC of the midbrain [7, 58]. This is justified if we realize that most algorithmic principles employed by natural visual systems need to be reformulated to better fit modern computational algorithms (FFT in our case).

The results obtained in the simulations in [29], where the Schwartz retinotopic mapping was used to approximate the cortical magnification factor, showed a unidirectional shift component of mislocalization superimposed on the perisaccadic compression. However, it was noticed there that this component did not scale with the compression according to the experimental data. The unidirectional shift is absent in the model presented here because, in our approximation of the retinotopy, the parameter a is zero. Since the unidirectional perisaccadic shift has a different neural origin (it is primarily caused by delays in neuronal signals) than perisaccadic compression (caused by remapping), it should not be accounted for by this parameter alone.

Further, in both our model and the model in [29], perceptual compression is attributed to a translation of the origin of the logarithmic coordinate system, which results in a linear relation between the perceived and actual probe positions. Thus, the nonlinearity observed in [19, 21] is not accounted for in our modeling. However, the asymmetry and nonlinearity present in the experimental data could have accidental origins resulting from multiple sources; we mention here three: (1) an asymmetric distribution of photoreceptor/ganglion cell density [73], (2) the average preferred fixation area being located away from the point of the highest cone density by a distance of about half of the central fovea's radius [74], and (3) fluctuations of the cortical surface curvature (and therefore of the lengths of the geodesics) across hemispheres [75]. Given the incomplete understanding of the neural processes underlying perisaccadic perception, it is impossible to distinguish between these "accidental" causes and the real neural mechanisms captured in modeling.

7.1. Relations to Other Models and the Current Research

Two computational theories of trans-saccadic vision that have been proposed in visual neuroscience are related to our modeling, both with a similar functional explanation of perisaccadic mislocalization by the cortical magnification factor. The first theory [29], which motivated our research, was discussed and compared with our model in the previous sections. In summary, our modeling can be seen as complementing the approach proposed in [29] by providing an efficient image representation suitable for processing visual information during saccadic eye movements, and, in particular, for the classically assumed process of active remapping compensating for receptive field displacements.

The second theory [18, 20] explains perisaccadic compression by spatial attention being directed to the target of a planned saccade. The authors proposed an elaborate computational model that assumes that the flashed stimuli's RFs in cortical areas dynamically change position toward the saccade target's RF as a result of gain feedback from the retinotopically organized activity hill of the saccade target in the oculomotor SC layer. This attention directed to the target increases spatial discrimination at the saccade target location before the saccade onset. The perceived spatial distortion of stimuli is the result of the cortical magnification factor of the visuo-cortical mapping (or the retinotopy of the visuomotor pathways) when the position of each stimulus is decoded from the activity of the neural ensemble. Thus, this theory assumes neural processes different from those proposed in [29]: a local and transient change in the gain control around the saccade target, whose retinotopically organized position is represented by a hill of neural activity, induces the perisaccadic distortion of the perceived stimulus location. However, because the circuitry underlying receptive field remapping is widespread and not well understood, it cannot easily be decided whether saccadic remapping is the cause or the consequence of saccadic compression [29].

What really sets our modeling apart from the other models is the fact that computational efficiency is built into the modeling process, as all algorithmic steps (except the last one) involve computations with FFT. This is especially important because the incessant occurrence of saccades and the time needed for the oculomotor system to plan and execute each saccade require that visual information is efficiently processed during each fixation period without repeating, afresh, the whole process at each fixation [29].

All models proposed so far capture only the initial, front-end stage of remapping for a particularly simple scene of flashed probes and, though they explain the perisaccadic mislocalization phenomenon, they leave out the crucial modeling step of the integration of objects' features (pattern, color, etc.) across saccades [30–33] that achieves stability of perception [14, 76]. Many issues must be understood better before this crucial modeling step can be achieved. We give two examples. During scene viewing, a salient map of the landmarks and behaviorally significant objects of the scene is created, and the RF shift updates the retinotopy of only this saliency map [13]. Although it is still unclear what a saliency map should be when viewing complex natural scenes, this points to the possibility of working with sparse visual data when performing Step 3 of the model outlined in Section 6.1. Thus, a look-up table approach [65] could be efficient enough for this step even when viewing a complex scene. Further, the time course of the different stages of visual information processing in trans-saccadic perception and their influences on other cognitive processes are unclear. It is well known that scene gist recognition, when the scene is viewed for 50 ms [77], is critical in the early stage of scene perception, influencing more complex cognitive processes, such as directing our attention within a scene, facilitating object recognition, and influencing long-term memory. Only very recently [78] has it been found that peripheral vision is more useful for recognizing the gist of a scene than central vision (i.e., foveal + parafoveal vision), even though central vision is more efficient per pixel at processing gist.

Although the understanding of the neural mechanisms involved in trans-saccadic perception is incomplete, significant progress in understanding the dynamic interactions taking place between different pathways in the visuosaccadic system has recently been made. In particular, the fundamental principles underlying the perception of objects across saccades have been outlined [13]. Therefore, we should expect major advances in the near future. As a consequence, in robotic vision research, which is still wedged between the limited knowledge of biological visual processing and the technological and software restrictions imposed by current cameras, scanners, and computers, it is becoming more important than ever to propose different, even if competing, perspectives on how to model the known processes involved in trans-saccadic perception. In this article we proposed a comprehensive, biologically mediated engineering approach to modeling an active vision system. Our modeling, which efficiently supports both hard-wired eccentricity-dependent visual resolution and front-end modeling of mechanisms that may contribute to the continuity and stability of trans-saccadic perception, is based on an abstract and less intuitive camera model with an underlying nonmetric (conformal) Möbius geometry. However, our initial study of smooth-pursuit eye movements, which complement fixations and saccades in the scanpath, indicates that the conformal camera with its DPFT-based image representation will also be important in processing visual information during the pursuits.

Further, the effectiveness of the conformal camera's geometry in intermediate-level vision problems and the projectively covariant Fourier analysis, well adapted to the retinotopy, strongly suggest that the DPFT-based image representation should be useful in modeling the neural processes that underlie the transfer of objects' features across saccades and maintain the continuity and stability of perception.

Finally, it was observed that saccades cause not only a compression of space but also a compression of time [79]. In order to preserve visual stability during the saccadic scanpath, receptive fields undergo a fast remapping at the time of saccades. When the speed of this remapping approaches the physical limit of neural information transfer, relativistic-like effects are psychophysiologically observed and may cause space-time compression [80, 81]. Curiously, this suggestion can also be accounted for in our model based on projective Fourier analysis, since the group SL(2, C) of image projective transformations in the conformal camera is the double cover of the group of Lorentz transformations of Einstein's special relativity; see [82] for a simple presentation.

8. Summary

In this article, we presented a comprehensive framework that we have developed for computational vision over the last decade, and we applied this framework to model some of the processes underlying trans-saccadic perception. We have done this by bringing together, in one place, the physiological and behavioral aspects of primate visual perception, the conformal camera's computational harmonic analysis, and the underlying conformal geometry. This allowed us to discuss the conformal camera's effectiveness in modeling a biologically mediated active visual system. First, the conformal camera's geometry fully accounts for the basic concepts of cocircularity and scale invariance employed by the human visual system in solving the difficult intermediate-level vision problems of grouping local elements into individual objects of natural scenes. Second, the conformal camera has its own harmonic analysis—projective Fourier analysis—providing image representation and processing that is well adapted to image projective transformations and to the retinotopic mappings of the brain's visual and oculomotor pathways. This latter assertion follows from the fact that the projective Fourier transform integrates the head, eyes (conformal cameras), and visual cortex into a single computational system. Based on this system, we presented a computational model for some neural processes of perisaccadic perception. In particular, we modeled the presaccadic activity, which, through shifts of the stimuli's current receptive fields to their future postsaccadic locations, is thought to underlie the remapping of the scene from the current foveal frame to the frame at the upcoming saccade target. This remapping uses the motor command of the impending saccade and may help maintain the stability of primate perception in spite of the three saccadic eye movements per second, with the eyeball's maximum speed of 700 deg/sec. Our modeling also accounted for the perisaccadic mislocalization observed by human subjects in laboratory experiments. Finally, we compared our model with other computational approaches to modeling trans-saccadic perception and discussed further developments.

Acknowledgment

The author thanks Dr. Noriyasu Homma for helpful comments.