Active Object Recognition with a Space-Variant Retina
When independent component analysis (ICA) is applied to color natural images, the representation it learns has spatiochromatic properties similar to the responses of neurons in primary visual cortex. Existing models of ICA have only been applied to pixel patches. This does not take into account the space-variant nature of human vision. To address this, we use the space-variant log-polar transformation to acquire samples from color natural images, and then we apply ICA to the acquired samples. We analyze the spatiochromatic properties of the learned ICA filters. Qualitatively, the model matches the receptive field properties of neurons in primary visual cortex, including exhibiting the same opponent-color structure and a higher density of receptive fields in the foveal region compared to the periphery. We also adopt the “self-taught learning” paradigm from machine learning to assess the model’s efficacy at active object and face classification, and the model is competitive with the best approaches in computer vision.
1. Introduction
In humans and other simian primates, central foveal vision has an exceedingly high spatial resolution (acuity) compared to the periphery. This space-variant scheme enables a large field of view, while allowing visual processing to be efficient. The human retina contains about six million cone photoreceptors but sends only about one million axons to the brain. By employing a space-variant representation, the retina is able to greatly reduce the dimensionality of the visual input, with eye movements allowing fine details to be resolved if necessary. The retina’s space-variant representation is reflected in early visual cortex’s retinotopic map. About half of primary visual cortex (V1) is devoted solely to processing the central 15 degrees of visual angle [2, 3]. This enormous overrepresentation of the fovea in V1 is known as cortical magnification.
Neurons in V1 have localized and orientation-sensitive receptive fields (RFs). V1-like RFs can be algorithmically learned using independent component analysis (ICA) [5–8]. ICA finds a linear transformation that makes the outputs as statistically independent as possible, and when ICA is applied to achromatic natural image patches, it produces basis functions that have properties similar to neurons in V1. Moreover, when ICA is applied to color image patches, it produces RFs with V1-like opponent-color characteristics, with the majority of the RFs exhibiting either dark-light opponency, blue-yellow opponency, or red-green opponency [6–8].
Filters learned from unlabeled natural images using ICA and other unsupervised learning algorithms can be used as a replacement for hand-engineered features in computer vision tasks such as object recognition. This is known as self-taught learning when the natural images that the filters are learned from are distinct from the dataset used for evaluating their efficacy. Methods using self-taught learning have achieved state-of-the-art accuracy on many datasets (e.g., [9–12]).
Previous work has focused on applying ICA to square image patches of uniform resolution. Here, we use ICA to learn filters from space-variant image samples acquired using simulated fixations. We analyze the properties of the learned filters, and we adopt the self-taught learning paradigm to assess their efficacy when used for object recognition. We review related models in the discussion.
2. Space-Variant Model of Early Vision
Our model consists of a series of subcomponents, which are depicted in Figure 1. We first describe the space-variant representation we use, and then how we learn the space-variant ICA filters.
2.1. Cone-Like Representation
When our model of space-variant vision fixates a region of an image, it converts the image from standard RGB (sRGB) colorspace to LMS colorspace, which more closely resembles the responses of the long, medium, and short wavelength cone photoreceptors in the human retina. Subsequently, we apply a cone-like nonlinearity to the LMS pixels. This preprocessing helps the model cope with large-scale changes in brightness [6, 10, 14], and it is related to gamma correction. The formulation we use has a parameter that controls the normalization strength. The nonlinearity is shown in Figure 2.
2.2. A Space-Variant Representation
We use Bolduc and Levine’s [16, 17] log-polar model of space-variant vision. Log-polar representations have been used to model both cortical magnification and the retina. Unlike other log-polar models, Bolduc and Levine’s model does not have a foveal blind spot. Moreover, it incorporates overlapping RFs, which produces images of superior quality, and the RFs in the fovea are of uniform size. Each unit in this representation can be interpreted as a bipolar cell, which pools pixels in a cone-like space. The mammalian retina contains at least 10 distinct bipolar cell types, and most of them are diffuse; that is, they pool the responses of multiple cones.
We briefly describe Bolduc and Levine’s [16, 17] model; the full derivation is given in their work. A log-polar mapping is governed by equations for the eccentricity of each ring of RFs from the center of the visual field and for the angular spacing between individual RFs, that is, the grid rays. Bolduc and Levine’s model uses separate equations for the foveal region and the periphery. Outside of the fovea, the ray spacing angle is determined by two parameters: the ratio of the RF size to eccentricity and the amount of RF overlap. The use of the round function in this formula ensures an integer number of grid rays. The eccentricity of each peripheral ring is given in terms of the radius of the fovea and the index of the ring, which runs up to the total number of peripheral layers. The radius of a peripheral RF is in turn a function of its eccentricity.
Foveal RFs are all constrained to be the same size as the RFs in the innermost ring of the periphery. Constraining foveal RFs to be the same size means that the number of RFs in each foveal ring decreases as the center of the retina is approached, in contrast to the peripheral rings, each of which contains the same number of RFs. The eccentricity of each foveal ring and the ray spacing angle between the RFs in that ring are given by separate foveal formulas.
We use normalized circular RFs for the retina, which act as linear filters. Each retina RF is defined by its location and its radius.
The retina we used in experiments is shown in Figure 3. We chose the ratio of RF size to eccentricity and an RF overlap of 50% to be biologically plausible values. We set the fovea’s radius to 7 pixels and we used 15 peripheral layers. These settings yield a retina with a radius of 35 pixels that reduces the dimensionality from 3749 pixels to 1304 retina RFs (296 in the fovea, 1008 in the periphery).
Our images are resized, so that their shortest side is 160 pixels, with the other side rescaled to preserve the image’s aspect ratio. If this canonical size is altered, then the fovea’s radius should be changed as well. This change will not alter the total number of RFs.
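The resizing rule above can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the function name `resized_shape` is hypothetical, and rounding the longer side to the nearest integer is an assumption, since the text does not state how fractional sizes are handled.

```python
def resized_shape(height: int, width: int, shortest_side: int = 160) -> tuple[int, int]:
    """Return (new_height, new_width) so that the shortest side equals `shortest_side`.

    The longer side is rescaled by the same factor to preserve the aspect ratio.
    Rounding to the nearest integer is an assumption of this sketch.
    """
    if height <= width:
        return shortest_side, round(width * shortest_side / height)
    return round(height * shortest_side / width), shortest_side


# A 480 x 640 image keeps its 3:4 aspect ratio after resizing.
landscape = resized_shape(480, 640)
portrait = resized_shape(640, 480)
print(landscape, portrait)
```

Only the target shape is computed here; the interpolation method used for the actual resampling is not specified in the text.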
To use our retina with color images, we sample each color channel independently. After sampling a region of an image with the retina, we subtract each color channel’s mean and then divide the result by the vector’s Euclidean norm. Sampling the image with our retina yields a 3912-dimensional unit length vector of retinal fixation features (1304 dimensions per color channel).
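The normalization of a single fixation can be sketched in a few lines of NumPy. The function name `normalize_fixation` and the random stand-in data are assumptions for illustration; the retina sampling itself is not reproduced here, only the mean removal per channel and the division by the Euclidean norm of the concatenated vector.

```python
import numpy as np


def normalize_fixation(samples: np.ndarray) -> np.ndarray:
    """Normalize retina samples of shape (3, n_rfs), one row per color channel.

    Each channel's mean is subtracted, the channels are concatenated, and the
    resulting vector is divided by its Euclidean norm.
    """
    centered = samples - samples.mean(axis=1, keepdims=True)
    vector = centered.reshape(-1)
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector


# Random values stand in for the 1304 retina RF responses per channel.
rng = np.random.default_rng(0)
fixation = normalize_fixation(rng.random((3, 1304)))
print(fixation.shape, np.linalg.norm(fixation))
```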
2.3. Learning a Space-Variant Model of V1
We learned ICA filters from 584 images from the McGill color image dataset. Each image is randomly fixated 200 times, with each fixation location chosen with uniform probability. The images are not padded, and fixations are constrained to be within images.
Prior to ICA, we first reduce the dimensionality of the fixation data from 3912 dimensions to 1000 dimensions using principal component analysis (PCA), which preserves more than 99.4% of the variance. We then learn ICA filters using the Efficient Fast ICA algorithm. We store the learned ICA filters as the rows of a filter matrix. The learned ICA basis functions are shown in Figure 4.
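One way to draw such fixations is sketched below. It assumes that "within images" means the whole retina (radius 35 pixels) must fit inside the unpadded image, which the text does not state explicitly. The function name `sample_fixations` and its parameters are hypothetical.

```python
import numpy as np


def sample_fixations(height, width, n_fixations=200, retina_radius=35, seed=None):
    """Draw fixation centers (row, col) uniformly at random.

    Assumption: a center is valid only if the full retina lies inside the image,
    so centers are kept at least `retina_radius` pixels away from every border.
    """
    if min(height, width) < 2 * retina_radius + 1:
        raise ValueError("image is too small for the retina")
    rng = np.random.default_rng(seed)
    # Generator.integers excludes the upper bound, so the largest valid center
    # along an axis of length n is n - 1 - retina_radius.
    rows = rng.integers(retina_radius, height - retina_radius, size=n_fixations)
    cols = rng.integers(retina_radius, width - retina_radius, size=n_fixations)
    return np.stack([rows, cols], axis=1)


fixations = sample_fixations(160, 213, seed=0)
print(fixations.shape, fixations.min(axis=0), fixations.max(axis=0))
```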
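The PCA step can be illustrated with a small NumPy sketch that also reports the fraction of variance retained. The toy data (500 samples, 20 dimensions) and the function name `pca_reduce` are assumptions; the experiments use 3912-dimensional fixations reduced to 1000 dimensions, and the subsequent Efficient Fast ICA step is not shown.

```python
import numpy as np


def pca_reduce(data, n_components):
    """Project data of shape (n_samples, n_features) onto its leading principal axes.

    Returns the projected data, the principal axes (one per row), and the
    fraction of the total variance that the retained axes preserve.
    """
    centered = data - data.mean(axis=0)
    # The rows of vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variances = singular_values ** 2
    kept = variances[:n_components].sum() / variances.sum()
    components = vt[:n_components]
    return centered @ components.T, components, kept


# Toy data with five dominant directions plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 20))
data = latent @ mixing + 0.01 * rng.normal(size=(500, 20))
reduced, components, kept = pca_reduce(data, n_components=5)
print(reduced.shape, kept)
```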
2.4. ICA Filter Activation Function
For object recognition, the discriminative power of ICA filters can be increased by taking the absolute value of the responses and then applying the cumulative distribution function (CDF) of generalized Gaussian distributions to the ICA filter responses [10, 12]. We pursue a similar approach, but we use the CDF of the exponential distribution instead. The CDF of the exponential distribution is computationally more efficient to calculate, and it is easier to fit since it has only one parameter. For each ICA filter (that is, each row of the filter matrix), we fit an exponential distribution’s rate parameter to the absolute values of the filter’s responses to the fixations extracted from the McGill dataset. Fitting was done using MATLAB’s “fitdist” function. The final activation of each ICA filter is the fitted exponential CDF evaluated at the absolute value of that filter’s response to the fixation.
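The activation function can be sketched as follows. The experiments fit the distribution with MATLAB's fitdist; this sketch instead uses the closed-form maximum likelihood estimate of the exponential rate, one over the mean of the absolute responses, which is a substitution made for illustration. The function names and the Laplace-distributed stand-in responses are hypothetical.

```python
import numpy as np


def fit_exponential_rates(responses):
    """Fit one exponential rate per filter from responses of shape (n_samples, n_filters).

    The maximum likelihood estimate of the rate is 1 / mean(|response|).
    """
    return 1.0 / np.abs(responses).mean(axis=0)


def ica_activation(responses, rates):
    """Apply the exponential CDF, 1 - exp(-rate * |response|), to each filter response."""
    return 1.0 - np.exp(-rates * np.abs(responses))


# Laplace-distributed stand-in responses; their absolute values are exponential
# with mean 2, so the fitted rates should be close to 0.5.
rng = np.random.default_rng(0)
responses = rng.laplace(scale=2.0, size=(20000, 3))
rates = fit_exponential_rates(responses)
activations = ica_activation(responses, rates)
print(rates, activations.min(), activations.max())
```

The output of the nonlinearity lies between 0 and 1, with a response of zero mapping to zero activation.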
3. Analysis of Learned Receptive Fields
We fit Gabor functions to the ICA filters to analyze their properties. Gabor functions are localized and oriented bandpass filters given by the product of a sinusoid and a Gaussian envelope, and they are a common model for V1 RFs. To perform the fits, we represent the ICA filters in Cartesian space and convert them to grayscale using the Decolorize algorithm, which preserves chromatic contrast. In general, Gabor functions were a good fit to the learned filters, with a median goodness-of-fit value of 0.81; however, 70 of the 1000 fits were poor, and we did not further analyze their spatial properties.
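For reference, a Gabor function of the kind described above can be generated as the product of a cosine carrier and a Gaussian envelope. The parameterization below (wavelength in pixels, orientation in radians, an isotropic envelope by default) is one common convention and is an assumption of this sketch; it is not necessarily the parameterization used for the fits, and the fitting procedure itself is not shown.

```python
import numpy as np


def gabor(size, wavelength, orientation, phase=0.0, sigma=None, aspect=1.0):
    """Return a (size, size) Gabor patch; `size` is assumed to be odd.

    The patch is a sinusoidal carrier along the rotated x axis multiplied by a
    Gaussian envelope. `sigma` defaults to half the wavelength (an assumption).
    """
    sigma = 0.5 * wavelength if sigma is None else sigma
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_rot = x * np.cos(orientation) + y * np.sin(orientation)
    y_rot = -x * np.sin(orientation) + y * np.cos(orientation)
    envelope = np.exp(-(x_rot ** 2 + (aspect * y_rot) ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_rot / wavelength + phase)
    return envelope * carrier


patch = gabor(size=21, wavelength=8.0, orientation=0.0)
print(patch.shape, patch[10, 10])
```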
Figure 6 shows a scatter plot of the peak frequencies and orientations of the Gabor filter fits, revealing that they cover a wide spectrum of orientations and frequencies. While the orientations are relatively evenly covered irrespective of the filter’s location, most of the filters sensitive to higher spatial frequencies are located in the foveal region. We also found that there was a greater number of ICA filters in the foveal region compared to the periphery (see Figure 5), with the RFs getting progressively larger outside of the fovea (see Figure 7).
4. Image Classification with Gnostic Fields
4.1. Gnostic Fields
A gnostic field is a brain-inspired object classification model, based on the ideas of the neuroscientist Jerzy Konorski. An overview of the model is given in Figure 8. Gnostic fields have been shown to achieve state-of-the-art accuracy at image classification using color SIFT features. We use a gnostic field with our space-variant ICA features. We briefly provide the details necessary to implement gnostic fields here; additional information is available in the original description of the model.
A gnostic field’s input is segregated into one or more channels, which helps it cope with irrelevant features. We used three channels: all 1000 ICA filters, the 744 achromatic ICA filters, and the 256 color ICA filters. The feature vector for each channel is the corresponding subset of the dimensions of the full ICA feature vector.
Whitened PCA (WPCA) is applied to each channel independently to learn a decorrelating transformation that normalizes that channel’s variance. The transformation is defined in terms of the eigenvectors of the channel’s covariance matrix, calculated using the fixations from the McGill dataset, the diagonal matrix of the corresponding eigenvalues, the identity matrix, and a regularization parameter. The output is then made unit length, which allows similarity to be measured using dot products. At each time step, this yields a whitened and normalized feature vector for each channel. We also record the location of the fixation, with the coordinates normalized by the image size. To incorporate this location information into the unit length features, we normalize the location vector to unit length and multiply it by a weight that controls the strength of the fixation location’s influence. The weighted location vector is concatenated to the whitened feature vector, and the result is renormalized to unit length.
A gnostic field is made up of multiple gnostic sets, with one set per category. Each gnostic set contains neurons that assess how similar the fixation features are to previous observations from the category. The activity of each neuron in the gnostic set for a given category and channel is the dot product of the neuron’s weight vector with that channel’s fixation features.
The output of the gnostic set for a given category and channel is the activity of its most active neuron, that is, the maximum over the neurons in the set.
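The per-channel preprocessing can be sketched as below. Several details are assumptions: the regularizer is added to the eigenvalues before taking the inverse square root (the usual regularized whitening form), the data mean is subtracted before whitening, and the values `eps=0.01` and `location_weight=0.5` are placeholders rather than the settings used in the experiments. All function names are hypothetical.

```python
import numpy as np


def unit(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v


def fit_wpca(data, eps=0.01):
    """Learn a regularized whitening matrix from data of shape (n_samples, n_features)."""
    mean = data.mean(axis=0)
    cov = np.cov(data - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Assumed form: (eigenvalues + eps * I)^(-1/2) applied after rotating by the eigenvectors.
    whitener = np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, whitener


def fixation_descriptor(features, location, mean, whitener, location_weight=0.5):
    """Whiten and normalize the features, append the weighted location, renormalize."""
    whitened = unit(whitener @ (features - mean))
    weighted_location = location_weight * unit(np.asarray(location, dtype=float))
    return unit(np.concatenate([whitened, weighted_location]))


rng = np.random.default_rng(0)
data = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
mean, whitener = fit_wpca(data)
descriptor = fixation_descriptor(data[0], (0.25, -0.5), mean, whitener)
print(descriptor.shape, np.linalg.norm(descriptor))
```

Because every descriptor has unit length, the dot product between two descriptors is their cosine similarity.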
Max pooling enables the gnostic set to vigorously respond to features matching the category’s training data.
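The computation performed by one gnostic set for one channel reduces to a matrix-vector product followed by a maximum. The sketch below uses two hypothetical unit-length weight vectors in two dimensions to keep the arithmetic visible; the function name `gnostic_set_output` is an assumption.

```python
import numpy as np


def gnostic_set_output(units, features):
    """Return the activity of the most active neuron in a gnostic set.

    units: (n_units, dim) array of unit-length weight vectors.
    features: (dim,) unit-length fixation feature vector.
    """
    activities = units @ features  # one dot product per neuron
    return float(activities.max())


units = np.array([[1.0, 0.0], [0.0, 1.0]])
features = np.array([0.6, 0.8])
output = gnostic_set_output(units, features)
print(output)
```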
Spherical k-means is an unsupervised clustering algorithm for unit length data that is used to learn the localized units for each of the gnostic sets and channels. The number of units in a gnostic set depends on the number of fixations from that category, albeit with fewer units being recruited as the number of fixations increases. To implement this, the number of units learned for a category from a given channel is a function of the total number of fixations from that category and of a parameter that regulates the number of units learned. This function is plotted in Figure 9.
Inhibitive competition is used to suppress the least active gnostic sets. This is implemented by attenuating the activity of the gnostic sets using a threshold. Subsequently, the nonzero responses are normalized, which acts as a form of variance-modulated divisive normalization.
As fixations are acquired over time, the gnostic field accumulates categorical evidence from each channel. Subsequently, the responses from all of these evidence accumulation units are combined across all categories and channels into a single vector. This vector is then made mean zero and normalized to unit length.
A linear multicategory classifier decodes the activity of these pooling units. This allows less discriminative channels to be downweighted and it helps the model cope with confused categories. The model’s predicted category is the one whose weight vector gives the largest response to the pooled vector. The weights were learned with the LIBLINEAR toolbox using Crammer and Singer’s multiclass linear support vector machine formulation, with a low cost parameter (0.0001).
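A basic version of spherical k-means is sketched below: samples are assigned to the center with the largest dot product, and each center is updated to the normalized sum of its members. The initialization from randomly chosen samples, the fixed iteration count, and the handling of empty clusters are assumptions of this sketch, and the rule that sets the number of units per category is not reproduced.

```python
import numpy as np


def spherical_kmeans(data, n_units, n_iters=20, seed=0):
    """Cluster unit-length rows of `data` and return (n_units, dim) unit-length centers."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=n_units, replace=False)].copy()
    for _ in range(n_iters):
        # For unit-length vectors the largest dot product is the closest center.
        assignments = np.argmax(data @ centers.T, axis=1)
        for k in range(n_units):
            members = data[assignments == k]
            if len(members) == 0:
                continue  # keep the previous center when a cluster is empty
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                centers[k] = total / norm
    return centers


rng = np.random.default_rng(1)
samples = rng.normal(size=(200, 5))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
centers = spherical_kmeans(samples, n_units=4)
print(centers.shape, np.linalg.norm(centers, axis=1))
```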
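The pooling and decoding stages can be sketched as follows. The accumulation rule is assumed here to be a plain sum over fixations, which the text does not confirm, and the weight matrix is a hand-picked stand-in for weights that would be learned with LIBLINEAR. The function names are hypothetical.

```python
import numpy as np


def accumulate_evidence(responses_over_time):
    """Pool responses of shape (n_fixations, n_units) into one normalized vector.

    Assumption: evidence is accumulated by summing over fixations. The pooled
    vector is then made mean zero and scaled to unit length.
    """
    pooled = responses_over_time.sum(axis=0)
    pooled = pooled - pooled.mean()
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled


def predict_category(weights, pooled):
    """Return the index of the category whose weight vector responds most strongly."""
    return int(np.argmax(weights @ pooled))


# Two fixations, three evidence accumulation units (one per category, one channel).
responses = np.array([[0.1, 0.9, 0.2],
                      [0.2, 0.8, 0.1]])
pooled = accumulate_evidence(responses)
weights = np.eye(3)  # stand-in for learned classifier weights
prediction = predict_category(weights, pooled)
print(pooled, prediction)
```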
4.2. Face and Object Recognition Experiments
We assess performance of the space-variant ICA features using two computer vision datasets: the Aleix and Robert (AR) face dataset and Caltech-256. Training and testing consisted of extracting 500 fixations per image from random locations without replacement. We did not attempt to tune the number of fixations.
AR contains 4,000 color face images under varying expression, dress (disguise), and lighting conditions. We use images from 120 people, with 26 images each. Example images are shown in Figure 10(a). Results are shown in Figure 11. Our model performs slightly better than the best algorithms.
Caltech-256 consists of images from 256 object categories that were found using Google image search. Example Caltech-256 images are shown in Figure 10(b). The dataset exhibits a large amount of interclass variability. We adopt the standard Caltech-256 evaluation scheme. We train on a variable number of randomly chosen images per category and test on 25 other randomly chosen images per category. We report the mean per-class accuracy over five cross-validation runs in Figure 12.
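Mean per-class accuracy averages the accuracy of each category separately, so that every category counts equally regardless of how many test images it has. A minimal sketch, with hypothetical labels and a hypothetical function name, is given below.

```python
from collections import defaultdict


def mean_per_class_accuracy(true_labels, predicted_labels):
    """Average the per-category accuracies over all categories present in `true_labels`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        correct[truth] += int(truth == guess)
    return sum(correct[c] / total[c] for c in total) / len(total)


truth = ["cat", "cat", "cat", "dog"]
guess = ["cat", "cat", "dog", "dog"]
score = mean_per_class_accuracy(truth, guess)
print(score)
```

In this toy example the overall accuracy is 3/4, whereas the mean per-class accuracy is the average of 2/3 and 1, that is, 5/6.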
We performed an additional experiment on Caltech-256 to assess the impact of omitting the location information in the fixation features. Omitting it caused performance to drop by 3.6% when using 50 training images per category.
To examine how well gnostic fields trained using each channel individually performed compared to our main results using the multichannel model, we performed another experiment with Caltech-256 using 50 training instances per category. The multichannel approach performed best, and the chromatic filters alone worked comparatively poorly. These results are shown in Table 1.
We conducted additional experiments to examine performance as a function of the number of fixations used during testing. These results are shown in Figure 13. For both datasets, performance quickly rises; however, Caltech-256 appears to need more fixations to approach its maximum performance. In both cases, it is likely that choosing fixations in a more intelligent manner would greatly decrease the number of fixations needed (see Section 5).
5. Discussion
We applied ICA to spatially-variant samples of chromatic images. Our goal was to analyze the properties of the learned filters and to assess their efficacy at object recognition using the self-taught learning paradigm.
Our fixation-based approach to object recognition is similar to the NIMBLE model. NIMBLE used a square retina, which pooled ICA filter responses learned from square patches. Instead of a gnostic field, NIMBLE used a Bayesian approach to update its beliefs as it acquired fixations. NIMBLE was unable to scale to large datasets because it compared new fixations using nearest neighbor density estimation to all stored fixations for each category. For example, on Caltech-256 with 500 training fixations per image and 50 training instances per category, NIMBLE would store 25,000 high dimensional fixation features per class, whereas a gnostic field would only learn 1239 gnostic units. This allows gnostic fields to be faster and more memory efficient, while also being more biologically plausible.
Like us, Vincent et al. learned filters from a space-variant representation, but instead of ICA they used an unsupervised learning algorithm that penalized firing rate. Their algorithm also learned Gabor-like filters. They found that RF size increases away from the fovea, and that more filters are learned in the fovea compared to the periphery. While they were primarily interested in the RF properties, it would be interesting to examine how well their filters work for object recognition.
Log-polar representations can be made rotation and scale tolerant with respect to the center of a fixation, since changes in rotation and scale consist of “spinning” the retina or having it “zoom” in or out. Exploiting this could lead to improved object recognition performance, although if used in all situations it is likely to cause a loss of discriminative power, owing to the tradeoff between discriminative power and invariance.
We are currently exploring avenues for developing a better controller for choosing the location of fixations. In our experiments we randomly chose the locations of fixations, but it is likely that significant gains in performance could be obtained by using a smarter controller that chose the next fixation location based on evidence acquired during previous fixations. The controller could also manipulate the rotation and size of the retina, potentially allowing it to increase its tolerance to changes in scale and rotation. One approach to learning a controller is to use reinforcement learning, with the reward function being crafted to reduce uncertainty about the object being viewed as quickly as possible. An alternative to reinforcement learning for fixation control was proposed by Larochelle and Hinton. They developed a special kind of restricted Boltzmann machine that accumulated evidence over time. Their model learned a controller that selected among fixation locations on a grid, with the controller trained to choose the grid location most likely to lead to the correct label prediction.
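The reason rotation and scale become simple in log-polar coordinates can be shown with a short numerical sketch: scaling a point about the fixation center adds a constant to its log radius, and rotating it adds a constant to its angle. The helper names are hypothetical, and the example deliberately avoids angles that wrap around at plus or minus pi.

```python
import math


def to_log_polar(x, y):
    """Map Cartesian coordinates (relative to the fixation center) to (log radius, angle)."""
    return math.log(math.hypot(x, y)), math.atan2(y, x)


def rotate(x, y, angle):
    """Rotate a point counterclockwise about the origin by `angle` radians."""
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


rho, theta = to_log_polar(3.0, 4.0)
rho_scaled, theta_scaled = to_log_polar(6.0, 8.0)  # same point, scaled by 2
rho_rotated, theta_rotated = to_log_polar(*rotate(3.0, 4.0, 0.3))

# Scaling shifts only the log radius; rotation shifts only the angle.
print(rho_scaled - rho, theta_scaled - theta)
print(rho_rotated - rho, theta_rotated - theta)
```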
A better controller would allow us to compare the model’s simulated eye movements to the eye movements of humans when engaged in various visual tasks. We could also explore how changes in the retinal input might impact the way the controller behaves. For example, we could induce an artificial scotoma into our retinal model. Scotomas are regions of diminished visual acuity, which are caused by diseases such as retinitis pigmentosa and age-related macular degeneration. Inducing an artificial scotoma would allow us to examine how the scotoma alters the acquired policy and if the changes are consistent with eye tracking studies in humans that have similar scotomas.
6. Conclusions
Here, for the first time, ICA was applied to a spatially-variant input, and we showed that this produces filters that share many spatiochromatic properties with V1 neurons, including eccentricity properties. Further, we showed that when these features are used with an object recognition system, they rival the best hand-engineered features in discriminative performance, despite being entirely self-taught.
Acknowledgments
The author would like to thank Akinyinka Omigbodun and Garrison Cottrell for feedback on earlier versions of this paper. This work was completed while the author was affiliated with the University of California San Diego. This work was supported in part by NSF Science of Learning Center Grants SBE-0542013 and SMA-1041755 to the Temporal Dynamics of Learning Center.
References
C. A. Curcio and K. A. Allen, “Topography of ganglion cells in human retina,” Journal of Comparative Neurology, vol. 300, no. 1, pp. 5–25, 1990.
P. M. Daniel and D. Whitteridge, “The representation of the visual field on the cerebral cortex in monkeys,” The Journal of Physiology, vol. 159, pp. 203–221, 1961.
Q. V. Le, M. A. Ranzato, R. Monga et al., “Building high-level features using large scale unsupervised learning,” in Proceedings of the International Conference on Machine Learning (ICML '12), pp. 81–88, 2012.
M. D. Fairchild, Color Appearance Models, Wiley Interscience, 2nd edition, 2005.
M. Bolduc and M. D. Levine, “A real-time foveated sensor with overlapping receptive fields,” Real-Time Imaging, vol. 3, no. 3, pp. 195–212, 1997.
M. Bolduc and M. D. Levine, “A review of biologically motivated space-variant data reduction models for robotic vision,” Computer Vision and Image Understanding, vol. 69, no. 2, pp. 170–184, 1998.
E. L. Schwartz, “Spatial mapping in the primate sensory projection: analytic structure and relevance to perception,” Biological Cybernetics, vol. 25, no. 4, pp. 181–194, 1977.
J. P. Jones and L. A. Palmer, “An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex,” Journal of Neurophysiology, vol. 58, no. 6, pp. 1233–1258, 1987.
R. Gattass, C. G. Gross, and J. H. Sandell, “Visual topography of V2 in the Macaque,” Journal of Comparative Neurology, vol. 201, no. 4, pp. 519–539, 1981.
J. Konorski, Integrative Activity of the Brain, University of Chicago Press, Chicago, Ill, USA, 1967.
R. E. Fan, K. W. Chang, C. J. Hsieh, X. R. Wang, and C. J. Lin, “LIBLINEAR: a library for large linear classification,” Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
K. Crammer and Y. Singer, “On the algorithmic implementation of multiclass kernel-based vector machines,” Journal of Machine Learning Research, vol. 2, pp. 265–292, 2001.
A. M. Martinez and R. Benavente, “The AR face database,” Tech. Rep. 24, CVC, 1998.
G. Griffin, A. D. Holub, and P. Perona, “The Caltech-256 object category dataset,” Tech. Rep. CNS-TR-2007-001, Caltech, Pasadena, Calif, USA, 2007.
P. Gehler and S. Nowozin, “On feature combination for multiclass object classification,” in Proceedings of the IEEE 12th International Conference on Computer Vision (ICCV '09), pp. 221–228, IEEE Computer Society, Los Alamitos, Calif, USA, 2009.
A. Bergamo and L. Torresani, “Meta-class features for large-scale object categorization on a budget,” in Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR '12), 2012.
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass, USA, 1998.
H. Larochelle and G. Hinton, “Learning to combine foveal glimpses with a third-order Boltzmann machine,” in Proceedings of the 24th Annual Conference on Neural Information Processing Systems 2010 (NIPS '10), December 2010.