#### Abstract

We describe a practical implementation of Spiking Neural Networks (SNNs) as local ensembles of classifiers. The synaptic time constant is used as the learning parameter that represents the variations learned from a set of training data at the classifier level. The classifier uses a coincidence detection (CD) strategy and is trained with a novel supervised learning method called Prediction, which adjusts the precise timing of output spikes towards the desired spike timing through iterative adaptation of the synaptic time constant. This paper also discusses the approximation of spike timing in the Spike Response Model (SRM) for the purpose of coincidence detection. This approximation significantly speeds up the whole process of learning and classification. Performance evaluations on the AR, FERET, JAFFE, and CK+ face datasets show that the proposed method delivers better face classification performance than a network trained with Supervised Spike-Timing-Dependent Plasticity (STDP). We also found that the proposed method delivers better classification accuracy than nearest neighbor classifiers, ensembles of nearest neighbor classifiers, and Support Vector Machines. Evaluation of several types of spike coding also reveals that latency coding delivers the best result for face classification as well as for classification of other multivariate datasets.

#### 1. Introduction

Donald Hebb first proposed that if two neurons effectively cooperate in an activity, the efficacy of the synapse between them is strengthened. Since this cooperation is more effective if it happens within a specific period of time, the idea of “Hebbian Plasticity” can also be considered a form of coincidence detection or neuronal synchronization between the inputs of the two neurons. Previous studies show that thalamic synchronization has a significant impact on cortical responsiveness and suggest that coincidence detection plays a critical role in sensory information transmission between different brain regions [1] as well as in phosphoinositide signaling [2]. The resulting Long-Term Potentiation (LTP), a phenomenon in which synaptic strength is enhanced following bursts of synaptic activity, is vital for learning and memory [3].

In this paper, we discuss two ways of learning and classification by coincidence detection, namely, (1) learning by weight adaptation in the form of Supervised STDP and (2) learning by synaptic time constant adaptation in the form of a novel approach called Prediction. Both strategies are based on Hebbian plasticity, but their implementations are quite different. Supervised learning rules are used to form the synaptic weights or synaptic time constants that represent the training data, and the trained network is then used for classification.

In the learning stage, the network is presented with a set of positive (negative) samples and is allowed a certain amount of time to fire spikes. If the designated neurons fail to fire, the weights or synaptic time constants are adjusted according to the desired spike timing. This strategy results in higher (lower) weights or synaptic time constants for neurons that barely (easily) fire. In face identification, similar faces compete more than dissimilar faces; this strategy therefore enforces stricter firing conditions on neurons dedicated to facial regions with a high degree of similarity, by imposing smaller weights or synaptic time constants on the corresponding synaptic connections. These restrictions ensure that only highly similar facial regions cause the output neuron to fire. Conversely, the process imposes looser restrictions on dissimilar facial regions.

In practice, a coincidence detection strategy causes neurons connected to similar faces to fire more easily and vice versa. From the point of view of coincidence detection and input synchrony, it can therefore be hypothesized that if the synaptic connection for a presynaptic input pair from similar faces requires a larger weight or a larger synaptic time constant to facilitate output spike firing, the input pair is connected to a facial area with smaller discriminative capacity for face recognition.

Implicitly, this hypothesis assumes that (1) human beings exhibit similar intrapersonal variations and thus (2) the weights and synaptic time constants learned from a set of generic face samples can represent the actual variations that might appear in unseen face samples.

#### 2. Related Works

In this section we briefly review the idea of coincidence detection as proposed by Maass [4]. We then highlight several supervised learning methods for spiking neural networks (SNNs). Subsequently, we discuss the basics of the Spike Response Model (SRM) used in this paper. Finally, we take a closer look at the multilayer supervised learning algorithm based on Spike-Timing-Dependent Plasticity (STDP) proposed by Sporea and Grüning [5].

##### 2.1. Coincidence Detection Overview

Spiking neurons can act as coincidence detectors for incoming input pulses by relaying the synchronized synaptic inputs and exact timing of spikes [4, 6, 7]. Studies of the somatosensory [8] and visual [9] systems suggest that neuronal synchronization is critical in transmitting sensory information. This relies on the fact that synchronous input signals are more effective than asynchronous input signals in producing high output firing rates.

Assuming the input pulses received are encoding some set of numbers, the coincidence detection can determine whether some of these numbers have equal or almost equal values. This operation if carried out on more traditional type of ANN is actually very expensive [4]. By the description of the basic idea of coincidence detection by spiking neurons [4], an output neuron would not fire if the input neurons fire at a temporal distance of but it will fire when the input neurons fire at a temporal distance of . If a set of input neuron is used to encode real numbers , the firing patterns of output neuron denoted as can be used to decode the input. For example, two cases, namely, and , are shown in Figure 1.

**Figure 1:** Coincidence detection by spiking neurons for the two example cases, panels (a) and (b).

Recent work showed that a simple SNN model constructed by integrate-and-fire neurons and single coincidence-detector neuron can precisely read out subthreshold noisy signal [10]. The authors highlight that the two important parameters that will determine reliability and precision of the coincidence detection of the input pulses are the detection time window which can be manipulated by and the threshold. They suggest that it is possible to obtain as much as 100% reliability of the outputs by having an optimal pair of the detection time window and the threshold.

##### 2.2. Supervised Learning Methods for SNN

Supervised learning for SNNs is usually based on traditional gradient descent techniques. However, because of the temporal nature of spiking neurons, modifications or special methods have been introduced to deal specifically with this temporal adaptation problem. Popular supervised learning methods for SNNs include SpikeProp [11, 12], a gradient descent method based on backpropagation of error designed specifically for SNNs, and the Remote Supervised Method (ReSuMe) [13]. Later, Sporea and Grüning [5] extended ReSuMe into a multilayer supervised learning algorithm based on STDP. Benchmark results on the XOR problem and the Iris dataset show that the algorithm works and that the learning rule is flexible enough to learn different spike codings and timing patterns.

Recently, an SNN learning rule called Chronotron was proposed by Florian [14]. Xu et al. [15] then proposed a supervised multispike learning rule with temporal coding based on gradient descent, which aims to solve the problems of error function construction and interference among multiple output spikes during learning. Another learning rule, Spike Pattern Association (SPAN) [16], is based on the Widrow-Hoff learning rule and temporal coding and can associate multiple spatiotemporal spike patterns with a desired output spike pattern. Other methods include statistical methods [17, 18], a linear algebra method [19], an evolutionary method [20], and Analog Spiking Neuron Approximation Backpropagation (ASNAProp) [21].

##### 2.3. Spike Response Model

The formulation of spiking neuron behavior in SNN implementation described by integrate and fire model can be further simplified and represented by SRM [22]. Let be the weight between postsynaptic neuron and presynaptic neuron , is the synaptic time constant, is recovery time constant, is the external current, is the time of presynaptic spikes, and is time of output spike, while , , and are kernels, is Dirac delta function, and . According to [22] the state of membrane potential can be computed from (1) where the kernel is an alpha function computed from (2):

In SRM, each incoming spike from neuron at time will perturb to produce presynaptic potentials (PSP) and the time course of as a result of the perturbation is defined by the kernel . If after the summation of PSPs reaches the threshold , output spike at time is therefore triggered. The form of the spike and the after-spike potential is described by kernel . Then the zero order SRM can be constructed by neglecting the dependence of and upon the argument, so the kernels and are set so that and , respectively. Assuming that there is no external current discharged into the neuron, we let , so now (1) becomes:

Therefore each presynaptic spike evokes a PSP with the same time course, independent of the index of the presynaptic neuron and independent of the last firing time of the postsynaptic neuron. Thus the synaptic efficacies and are the parameters that are responsible to scale the amplitude of the PSPs and their “effective time interval,” respectively.
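As a minimal sketch of the zero-order model described above, the following Python computes the membrane potential as a weighted sum of PSPs that all share the same time course. The exact kernel of (2) is not reproduced in this text, so the normalized alpha function `(s / tau_s) * exp(1 - s / tau_s)`, which peaks at 1 when `s = tau_s`, is an assumption. The function names are illustrative, and the refractory term and external current are omitted.

```python
import math

def alpha_kernel(s, tau_s):
    """Assumed normalized alpha function: 0 for s <= 0, peak value 1 at s = tau_s."""
    if s <= 0.0:
        return 0.0
    return (s / tau_s) * math.exp(1.0 - s / tau_s)

def membrane_potential(t, spike_times, weights, tau_s):
    """Zero-order SRM sketch: weighted sum of identical PSP time courses.

    The refractory kernel and the external current are left out on purpose.
    """
    return sum(w * alpha_kernel(t - t_j, tau_s)
               for w, t_j in zip(weights, spike_times))

tau_s = 5.0
# A single PSP reaches its peak (scaled by the weight) tau_s after the spike.
print(alpha_kernel(tau_s, tau_s))
# Two coincident spikes with unit weights add up at the common peak.
print(membrane_potential(10.0, [5.0, 5.0], [1.0, 1.0], tau_s))
```

Under this assumed kernel, the weight scales the PSP amplitude and the synaptic time constant stretches its time course, which matches the roles the text assigns to the two parameters.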

##### 2.4. Supervised STDP Learning Method

Using neurons described by SRM model in fully connected feed-forward SNN with single hidden layer, this learning rule is based on backpropagation of error [5]. The error is defined as the difference between the actual firing rate and target firing rate for all neurons. It is similar to the standard backpropagation in discrete time but derived as a functional derivative in continuous time. Assuming that the neuron has only single spike train, according to the STDP learning [22–25], the weight change between output and hidden neurons and between hidden neurons and input neurons can be described aswhere is the amplitude, is the time constant, is hidden neuron firing time, is input neuron firing time, is output neuron firing time, and is target neuron firing time. Note that the weight modification rules do not depend on the specific dynamics of the neuron model but only depend on the target, output, and input firing time thus making it applicable to any neuron model.

#### 3. Detailed Description on Coincidence Detection

To illustrate several cases of coincidence detections under different variants of learning parameters such as synaptic time constant , weight , and threshold , consider two presynaptic inputs from 2 presynaptic neurons and fired at times and , respectively, where , having temporal distance between inputs were propagated into a coincidence detection (CD) neuron. Under different values of learning variables, these two presynaptic inputs can either (1) invoke a spike or more than one spike (2) or none at all at the CD neuron. Since is effectively determined by the spatiotemporal pattern of the input, then the objective of a CD neuron would be to facilitate an output spike at only certain range of and depress firing at other instances of .

However, in order to achieve the specified objective of coincidence detection, behaviors of learning parameters such as , , and need to be closely examined. Descriptions on a few possible coincidence detection outcomes of presynaptic inputs ms and ms (thus ms) with respect to different learning parameters are shown in Figure 2. All spikes generated using SRM model with fixed threshold.

Consider the maximum amplitude of PSP evoked by the first presynaptic input, conveniently denoted afterwards as PSP_{1,max}, the weight , and synaptic time constant . From (2), the time at which PSP_{1} would reach its peak can be defined as . Since , then (2) can be written asThus, the PSP_{1,max} can be denoted as

For a CD neuron receiving two presynaptic spikes at a time, to ensure appropriate firing while avoiding firing facilitated by only single presynaptic spike, the proper selection of threshold should follow . It is assumed that there exists a minimum required synaptic time constant that would cause a CD neuron to fire a spike. For any value of larger than it would definitely cause the CD neuron to fire, but with a larger delay. This can be summarized in

There are two distinctive regions which can be defined for the firing behavior of a CD neuron. These two regions are (1) firing region (FG) and (2) nonfiring region (NFG). These regions are illustrated in Figure 3. According to Figure 3, there is 100% probability that any PSP that are strong enough to cause a CD neuron to fire would reach the threshold in FG region, while, in NFG, any neuron that failed to fire in FG region would absolutely not be able to fire in NFG region. The two NFG regions are discontinued by an FG region, where signifies the maximum time interval of FG region. The boundaries for these distinctive regions are given in the following equation, where we defined :

Consider if threshold is set at the highest possible value such that , for this threshold, only PSP resulting from two presynaptic spikes fired at the same time can reach it, where . Based on (3), assuming that the CD neuron is allowed to fire only once, the refractory kernel can be set to be fixed at . Let and ; since , (3) can be written as (9). For complete calculation readers are referred to Appendix A. Consider

As is analytically unsolvable, we can use numerical approximation of function of for , where solution for is found as ; hence . Since , and any PSP that occurs before would yield a lower membrane potential, thus the maximum achievable membrane potential for this coincidence neuron would happen before or precisely at .
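The firing behavior discussed in this section can be illustrated numerically. The sketch below is not the analytical derivation of the paper: it assumes a normalized alpha kernel (peak value 1), equal weights `w` on both inputs, and a threshold chosen between `w` and `2 * w` so that a single PSP can never fire the neuron. All names and parameter values are illustrative.

```python
import math

def alpha_kernel(s, tau_s):
    """Assumed normalized alpha function with peak value 1 at s = tau_s."""
    if s <= 0.0:
        return 0.0
    return (s / tau_s) * math.exp(1.0 - s / tau_s)

def first_spike_time(t1, t2, w, tau_s, theta, t_end=100.0, dt=0.01):
    """Scan a time grid and return the first time the summed PSPs reach theta.

    Returns None when the threshold is never reached before t_end.
    """
    steps = int(round(t_end / dt))
    for k in range(steps + 1):
        t = k * dt
        u = w * alpha_kernel(t - t1, tau_s) + w * alpha_kernel(t - t2, tau_s)
        if u >= theta:
            return t
    return None

w, tau_s, theta = 1.0, 5.0, 1.5   # theta lies between w and 2 * w
for distance in (0.0, 2.0, 30.0):
    print(distance, first_spike_time(0.0, distance, w, tau_s, theta))
```

With these values the neuron fires for the two closely spaced input pairs, fires later for the pair that is 2 ms apart than for the coincident pair, and stays silent when the inputs are 30 ms apart.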

#### 4. Proposed Method

##### 4.1. Output Spike Time Prediction

One major problem of any spiking neuron model is the processing time needed to evaluate the exact level of the membrane potential before a spike is triggered. According to Makino [26], it is difficult to predict firing for a complex neuron model such as SRM, since it involves delayed firing and causality. Furthermore, approximation methods such as firing time prediction can be inexact, while exact simulation is limited to simple models [27]. Makino [26] proposes an event-driven SRM using an incremental partitioning method, which uses linear envelopes of the state variable of a neuron to partition the simulated time. This allows the firing time to be calculated reliably by applying the bisection-combined Newton-Raphson method to each resulting partition.

In the discrete-time approach, the system needs to compute the level of the membrane potential accurately at each discrete time and also to update the state variables for precise output spike timing. If the number of discrete intervals within a given period (also known as the sampling rate in continuous-to-discrete signal conversion) is large, the evaluation takes a considerable amount of time. One way to overcome this problem is to reduce the number of discrete intervals, but this reduces the precision of the resulting output spike timing. Here we propose an approximation method called Output Spike Time Prediction (OSTP) SRM to solve this problem.

Consider the simplified SRM model in (3) where the output neuron is assumed to receive two presynaptic inputs from two presynaptic neurons and at two different times and where . The objective of OSTP is to find the estimated spike time of the output neuron. For convenience, let and so that (3) now becomes

Then, let ; thus ; consequently, . Thus, we can write (10) as (11). For simplification of notation, let and so that the OSTP equation can be described in (12). For complete calculation readers are referred to Appendix B. Then, using the numerical approximation as discussed previously to solve for , the estimated time of spike can be obtained using (13). For ms, ms, and , plot of OSTP equation in (12) with 10 equally spaced values of the term from 0 until 3 is illustrated in Figure 4:

From Figure 4, several lines manage to cross 0 while others do not, which indicates that no solution of could be obtained. The lines that cross 0 (blue lines) are drawn from variables that would produce spike firings while others that do not cross (red-dashed lines) are drawn from variables that would not produce any spike firing. Thus, the precise spike firing time can be approximated given that the numbers of discrete intervals between the boundaries of are sufficient to provide precise estimation. Similar to discrete estimation case, OSTP relies on discrete intervals between in order to produce accurate estimation. However, discrete intervals between in OSTP do not have significant effect on the overall processing time unlike discrete-time which will be seen later.
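The cost difference between scanning every discrete time step and locating a zero crossing numerically can be sketched as follows. This is not the OSTP equation (12), which is not reproduced in this text. It is a generic stand-in that brackets the threshold crossing on a coarse grid and refines it by bisection, assuming a normalized alpha kernel and equal weights. A coarse bracket can miss a very short supra-threshold interval, so the coarse step is an illustrative choice.

```python
import math

def alpha_kernel(s, tau_s):
    """Assumed normalized alpha function with peak value 1 at s = tau_s."""
    if s <= 0.0:
        return 0.0
    return (s / tau_s) * math.exp(1.0 - s / tau_s)

def potential(t, t1, t2, w, tau_s):
    """Summed PSPs of two presynaptic spikes at t1 <= t2 with equal weight w."""
    return w * (alpha_kernel(t - t1, tau_s) + alpha_kernel(t - t2, tau_s))

def dense_scan(t1, t2, w, tau_s, theta, t_end, dt):
    """Discrete-time reference: evaluate every step until the threshold is hit."""
    n = int(round((t_end - t1) / dt))
    for k in range(n + 1):
        t = t1 + k * dt
        if potential(t, t1, t2, w, tau_s) >= theta:
            return t, k + 1            # spike time, number of evaluations
    return None, n + 1

def bracket_and_bisect(t1, t2, w, tau_s, theta, t_end, coarse_dt=1.0, tol=1e-4):
    """Generic root finding: coarse bracket first, then bisection inside it."""
    evals = 0
    lo = t1                            # potential is 0 here, below any theta > 0
    n = int(round((t_end - t1) / coarse_dt))
    for k in range(1, n + 1):
        hi = t1 + k * coarse_dt
        evals += 1
        if potential(hi, t1, t2, w, tau_s) >= theta:
            while hi - lo > tol:       # keep lo below and hi at or above theta
                mid = 0.5 * (lo + hi)
                evals += 1
                if potential(mid, t1, t2, w, tau_s) >= theta:
                    hi = mid
                else:
                    lo = mid
            return hi, evals
        lo = hi
    return None, evals

args = (0.0, 1.0, 1.0, 5.0, 1.5, 50.0)   # t1, t2, w, tau_s, theta, t_end
print(dense_scan(*args, dt=1e-4))
print(bracket_and_bisect(*args))
```

Both routines return nearly the same spike time, but the bracketed search needs a few dozen evaluations of the potential, whereas the dense scan needs one evaluation per grid step up to the crossing.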

##### 4.2. Output of Coincidence Detection Network

As explained earlier, any sufficiently close pair of presynaptic inputs produces output spikes. However, for reliable and accurate output spikes, two additional spikes or “cues,” called the start cue and the end cue, are added to each output spike train. These two cues indicate the start and end of the CD neuron simulation. Without the start cue, the firing delay of each spike cannot be accurately determined. The significance of the start cue has been discussed at length in the literature [11, 28]. The start cue time is recorded when the first presynaptic spike arrives at the CD neuron. Without the end cue, some meaningful information encoded by presynaptic spikes that are unable to evoke any output spike would be lost, which would affect the spike coding accuracy.

Another important element which is particularly important for STDP supervised learning is “imaginary spike.” It is important to mention that imaginary spike is not used for classification; however it is used to compute error signal for supervised adjustment of the weight. This imaginary spike indicates the end of a designated time if a presynaptic input fails to produce a spike within certain period. As shown earlier, the maximum time for presynaptic input pair to attain the highest membrane level and evoke a spike is so the designated time for a spike to fire is set slightly larger than , where

Note that the presynaptic inputs only maintain the temporal integrity of the input, while the spatial integrity is embedded in the spatial location of the presynaptic neuron. To achieve a spatiotemporally reliable output spike pattern, it is vital to sequence the presynaptic inputs. This is done by allowing only a single pair of presynaptic inputs, from neurons sharing the same spatial location, to take part in evoking an output spike at any one time. After an output or imaginary spike is produced, the pair from the next location is allowed to take part in evoking a spike, if any. This process is repeated until the end of the simulation, that is, until all presynaptic inputs have taken part in evoking an output spike.

In this implementation, the computation of exact membrane potential and the term would slow the whole network down. Thus, for simplicity, the current level of membrane potential (for spiking case) and the membrane reset term (for nonspiking case) just before next presynaptic input pair takes part in evoking the output spike are changed to membrane reset constants and , respectively, with respect to (10). This is actually feasible assuming that both spiking and nonspiking cases of coincidence detection are producing spikes (actual and imaginary spikes). The output spike train of a CD neuron using continuous SRM, discrete-time SRM, and OSTP SRM in a CD classification network accepting 9 presynaptic input pairs is shown in Figure 5, where the parameters used are , ms, mV, mV, and mV.

In Figure 5, each bar marks the precise timing of an output spike (blue), imaginary output spike (red), cue (dashed black), sequenced presynaptic input (green), or original presynaptic input (purple). The resulting output spikes for the 3 approaches are exactly the same. Imaginary output spikes are needed only to train the network by the Supervised STDP approach and are not used for classification. The CD neuron produces 2 cues (start and end, the latter co-occurring with the end of the output spike train), 7 spikes, and 2 imaginary spikes.

The number of actual spikes is generally smaller than the number of presynaptic input pairs; however, the total number of spikes (actual + imaginary) equals the number of presynaptic input pairs. Note that the original presynaptic inputs do not keep their spatial integrity intact, since they share several identical presynaptic spike times and tend to cluster together in time. The temporal sequencing of the input makes use of spatial locations by temporally rearranging the presynaptic inputs. Furthermore, it can be observed that the OSTP SRM implementation produces the same spikes as discrete-time SRM.

##### 4.3. Learning Process for Coincidence Detection

We propose a novel learning approach called Prediction, which is carried out by approximating the synaptic time constant required to produce an output spike. Unlike Supervised STDP in [5], this learning process requires only positive (+1) class samples.

The main objective of Prediction is much simpler than that of Supervised STDP: to evoke a spike with a small delay for a matching sample. Additionally, this process is assumed to implicitly depress the spike, or delay it further, for a nonmatching sample, given that the temporal distances between the input pairs are greater than a certain range. Since changes in the synaptic time constant can facilitate, depress, or delay the output spike, using it as the learning parameter should allow this type of training in a coincidence detection spiking neural network.

Assume that the training process needs to approximate the value of that would allow a pair of inputs and from matching face to evoke a spike at CD neuron output at desired time . By letting , from (10) we can have (15). For complete calculation readers are referred to Appendix C. Then, can be solved by finding the zero crossing using the numerical approximation discussed earlier. After that, we can compute the approximation of synaptic time constant using (16). Finally, the change in synaptic time constant for each pixel belongs to local patch such that can be computed using (17), where is total number of training subjects (classes), is positive sample per training subject, and is learning constant. Since the teacher signal depends on the current state of , it evolves along the training process, which gives it dynamic behavior. Additionally, there are instances of function in (15) where the plot does not produce any zero crossing, similar to the no-solution case shown in Figure 4. In that case, the approximated can take the value of since the minimum synaptic time constant required to evoke the spike in a CD neuron is less than or equal to as proven in Appendix D. Since this approach only involves estimation based on positive samples, the corresponding spike error for Prediction can be given simply as the error resulting from sensitivity (true positive rate), which is given as where, for Prediction, the spike errors and . Thus, from (18) the total spike error becomes

##### 4.4. Face Classification Using Coincidence Detection

In order to apply coincidence detection as a classifier, the CD neurons are used as output neurons in a 2-layer feed-forward neural network. Using local ensemble strategy for face recognition employed in [29, 30], each CD neuron is attached to each local patch. Thus the number of CD neurons in the classifier network would be equivalent to total number of local patches . The number of inputs neuron however depends on the dimension, of the local patch. Since this coincidence detection would evaluate the synchronization between a gallery and a probe input, each CD neuron would have presynaptic neurons and connections of . The detailed network connections and elements comprising single CD neuron as output neuron are shown in Figure 6.

The output spike train for each local patch from the CD neuron will then be fed to a summation stage where the outputs of all CD neurons in the network will then be evaluated to produce vectors called Non-Coincidence Factor (NCF), for each local patch. This NCF describes degree of coincidence between an input pair where smaller values of would indicate higher coincidence between the inputs thus higher matching probability and vice versa. Different spike codings can be used to interpret the output spike trains and, here in this paper, the performance of several spike codings, namely, latency, rate by spike count (conveniently denoted simply as “rate” afterwards), rank order (RO), and time to first spike (TTFS) is investigated. Equations (20) to (23) describe several spike codings used and the associated NCF, as follows.

*Latency.*
Consider the following:

*Rate (Spike Count).*
Consider the following:

*RO.*
Consider the following:

*TTFS.*
Consider the following:

At the summation stage, the NCF obtained from different spike codings would indicate the gallery image which has the highest probability to be the correct match of the probe. For a more detailed description on the classification model, the delay and spike firing counts of a CD neuron’s firings caused by classification of a probe image’s patch of dimension, with 10 lateral gallery images’ patches are shown in Figure 7. Note that each bar for each spike coding case represents a gallery image patch, and the first gallery in each case is the actual match for the probe. Based on the figure, we can observe that, for latency, RO, and TTFS coding, lower delay in CD neurons’ firing signifies higher coincidence and hence higher probability of correct match, while, for spike interpretation by rate coding, the firing delay is totally insignificant but higher rate of spike firings would indicate higher coincidence between the probe and the gallery image. Note also in the figure that a misclassification occurs in TTFS case, where the 9th gallery image patch is found to be having the highest coincidence.

Aside from applying the proper spike codings to interpret the output spike trains, two other processes are also carried out at the summation stage in order to add discriminative influence on the final classification. Firstly, after the spike codings are applied, local ensemble strategies are adopted to locally classify the resulting and then are normalized to find the local confidence vectors . Secondly, these confidence vectors will be weighted using discriminants denoted as acquired from the learnt synaptic time constants computed using

Here, the discriminants are then normalized to ensure that would take the values between . Consider a probe needing to be matched to gallery with each local patch dimension using a CD classification network with fully trained , the total weighted confidence of probe as belonging to gallery can be obtained from (25). For illustration purpose, the whole classification network is shown in Figure 8:
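The summation stage can be sketched as a weighted vote over local patches. Equations (24) and (25) are not reproduced in this text, so the sketch below assumes the most direct reading: the total confidence for a gallery is the sum over patches of the discriminant times the local confidence, and the gallery with the largest total is selected. The discriminants are taken as given inputs, and all names and numbers are illustrative.

```python
def total_weighted_confidence(confidences, discriminants):
    """Sum local confidences per gallery, weighting each patch by its discriminant.

    confidences[k][g] is the local confidence of patch k for gallery g,
    discriminants[k] is the (assumed already normalized) weight of patch k.
    """
    n_gallery = len(confidences[0])
    totals = [0.0] * n_gallery
    for conf_k, d_k in zip(confidences, discriminants):
        for g, c in enumerate(conf_k):
            totals[g] += d_k * c
    return totals

def classify(confidences, discriminants):
    """Return the index of the gallery with the highest total weighted confidence."""
    totals = total_weighted_confidence(confidences, discriminants)
    return max(range(len(totals)), key=totals.__getitem__)

# Three local patches, two gallery images (made-up values for illustration).
confidences = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
discriminants = [1.0, 0.25, 0.5]
print(total_weighted_confidence(confidences, discriminants))
print(classify(confidences, discriminants))
```

In this toy example the second patch favors the wrong gallery, but its small discriminant limits its influence on the final decision.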

#### 5. Experimental Results and Discussions

In this section, we conduct a test to validate and evaluate the accuracy of the OSTP approximation and its efficiency. Then we compare the performance of coincidence detection trained by Prediction against coincidence detection trained by Supervised STDP. Subsequently, we investigate the performance of several spike codings used in our proposed coincidence detection. Then, using Principal Component Analysis (PCA) and Gabor features, we assess the performance of the proposed CD classifier against several types of classifiers, namely, the nearest neighbor classifier (*k*NN), ensembles of *k*NN classifiers (soft *k*NN) [30], Support Vector Machine (SVM), and ensembles of SVM classifiers (soft SVM) (inspired by [30]).

We adopt Single Sample per Person (SSPP) face recognition (for a review, see [31]), where only a single image per person is used as gallery. Four publicly available datasets are used for the experiments, namely, the AR, JAFFE, FERET, and CK+ datasets. The AR dataset [32] contains frontal images of 76 males and 60 females with several types of variations such as different illumination conditions, expressions, and partial occlusions. Images were taken in two sessions (S1 and S2) with 13 images per session. We use only 8 expression-variant images (neutral, smile, angry, and scream) and 4 partially occluded images (sunglasses and scarf) from both sessions. The JAFFE dataset [33] contains 212 expression-variant images from 10 female Japanese subjects. There are 7 types of expressions in this dataset, and each subject has at least 3 images for each expression. The FERET dataset [34] consists of 13,539 facial images corresponding to 1,565 subjects, who are diverse in ethnicity, gender, and age. Two subsets were used, namely, *fa* and *fb*, following the standard FERET evaluation protocol [34]. Subset *fa*, containing 1,196 frontal images of 1,196 subjects, was used as gallery, while *fb* (1,195 expression-variant images) was used as probes. The CK+ dataset [35] contains 523 sequences from 123 subjects portraying seven basic expressions (happiness, sadness, surprise, anger, disgust, fear, and contempt). Examples of images from the AR, JAFFE, FERET, and CK+ datasets are shown in Figure 9.

As a standard preprocessing step, all images are aligned and resized to 84 × 84 pixels. Histogram equalization is applied to all images except those with scarves and sunglasses in the AR dataset, because histogram equalization introduces too much irregularity on the occluded parts of the image (i.e., the scarves and sunglasses), consistent with the suggestion in [30]. Each image is partitioned into 144 square local patches using a 7 × 7 scanning window. This results in a total dimension of 7056 pixels per image and a vector dimension of 49 per local patch. Hence the number of afferents that connect input neuron pairs to CD neurons is equal to 7056. Each afferent is assumed to represent a pair of neurons connected to inputs at the same spatial location.
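The patch arithmetic above can be checked with a short sketch. It assumes nonoverlapping 7 × 7 patches taken in row-major order from an 84 × 84 image stored as a list of rows; the function name and the dummy image are illustrative.

```python
def partition_into_patches(image, patch=7):
    """Split a square image (list of rows) into flattened nonoverlapping patches."""
    size = len(image)
    patches = []
    for r in range(0, size, patch):
        for c in range(0, size, patch):
            patches.append([image[r + i][c + j]
                            for i in range(patch)
                            for j in range(patch)])
    return patches

# Dummy 84 x 84 image whose pixel value encodes its own position.
image = [[r * 84 + c for c in range(84)] for r in range(84)]
patches = partition_into_patches(image)
print(len(patches), len(patches[0]), len(patches) * len(patches[0]))  # 144 49 7056
```

Each of the 144 patches would feed one CD neuron, and each of its 49 entries corresponds to one afferent.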

##### 5.1. OSTP Performance Analysis

A test is carried out to determine the accuracy and speed of the OSTP approximation by comparing it with the exact spike firing time obtained from the discrete-time SRM computation in a coincidence neuron. Consider that the sampling frequency or discrete intervals for discrete-time SRM and OSTP SRM is measured for each 1 ms and , respectively; the evaluation of discrete-time SRM uses while OSTP SRM uses . The error for this test is defined as , where denotes the ceiling process. The test is performed on 191000 neurons with different values of threshold, synaptic time constant, and weight. The approximation is correct 99.6% of the time, and OSTP takes 9.2454 seconds while discrete-time SRM takes 498.7757 seconds. This is a reduction in processing time of more than 98% relative to discrete-time SRM.
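The reported reduction follows directly from the two timings. The short check below uses only the figures quoted above; the variable names are illustrative.

```python
ostp_seconds = 9.2454
discrete_seconds = 498.7757

# Fraction of the discrete-time processing time that OSTP saves.
reduction = 1.0 - ostp_seconds / discrete_seconds
# How many times faster OSTP runs on this test.
speedup = discrete_seconds / ostp_seconds

print(f"reduction: {reduction:.2%}, speedup: {speedup:.1f}x")
```

The saving comes out at roughly 98.1%, which corresponds to a speedup of about 54 times.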

In order to investigate further on the effect of discrete intervals and on the performance of the spike firing time approximation, the accuracy, Mean Squared Error (MSE), processing speed, and number of floating points operations of OSTP and discrete-time SRM are compared for different values of and , ranging between 1 and 10000. This test uses pairs of input neurons with different combinations of presynaptic inputs, thresholds, synaptic time constants, and weights. The accuracy and MSE of both OSTP and discrete-time SRM is compared with the exact spike timing , where is obtained by discrete-time using . The approximated spike firing accuracy is calculated based on the following equation while MSE is computed from :

The number of floating-point operations is computed as the total number of numerical operations from the start to the end of the test. The results on the effect of the discrete interval size on the performance of OSTP and discrete-time SRM are shown in Figure 10.
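
The two quality measures can be sketched as follows. The exact accuracy formula is not available in this text, so the sketch assumes that a spike counts as correct when the approximated time and the reference time round up to the same integer step; the MSE is the usual mean of squared differences. The function names and the spike times are made up for illustration.

```python
import math


def spike_time_accuracy(approx_times, exact_times):
    """Percentage of spikes whose rounded-up time matches the rounded-up reference (assumed definition)."""
    if len(approx_times) != len(exact_times) or not exact_times:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(
        1 for a, e in zip(approx_times, exact_times) if math.ceil(a) == math.ceil(e)
    )
    return 100.0 * matches / len(exact_times)


def mean_squared_error(approx_times, exact_times):
    """Mean of squared differences between approximated and reference spike times."""
    if len(approx_times) != len(exact_times) or not exact_times:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - e) ** 2 for a, e in zip(approx_times, exact_times)) / len(exact_times)


# Made-up spike times in ms, for illustration only.
exact = [4.67, 7.12, 9.95, 12.40]
approx = [4.70, 7.10, 10.02, 12.40]

print(spike_time_accuracy(approx, exact))  # 3 of 4 spikes land in the same step
print(mean_squared_error(approx, exact))
```

Under this assumed definition a small timing error can still count as a miss when it crosses a step boundary (9.95 ms versus 10.02 ms here), which is why accuracy and MSE are reported separately.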

**Figure 10.** Effect of the discrete interval size on the performance of OSTP and discrete-time SRM, panels (a) to (d).

According to Figure 10, OSTP achieves accuracy and MSE comparable to the best values of discrete-time SRM at , while requiring minimal processing time and a constant number of floating-point operations. Discrete-time SRM, on the other hand, produces good accuracy and low MSE, but its processing time and floating-point operations grow exponentially with respect to . From Figure 10(b), for all tested , OSTP requires an average of 12.82 seconds to produce all spikes, as opposed to 18.75 × 10^{4} seconds for discrete-time SRM. In fact, according to Figure 10(b), at , the processing time of OSTP is more than 99% shorter than that of discrete-time SRM while the spike timing accuracy remains comparable, as indicated by Figure 10(a). These results highlight the efficiency and performance of the proposed OSTP.
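
The reported timings can be checked with a few lines of arithmetic. The helper name `time_reduction_percent` is illustrative; the four timing values are the ones quoted in this section.

```python
def time_reduction_percent(fast_seconds, slow_seconds):
    """Relative reduction in processing time, in percent."""
    return 100.0 * (1.0 - fast_seconds / slow_seconds)


# Single-interval test: OSTP 9.2454 s vs. discrete-time SRM 498.7757 s.
single_test = time_reduction_percent(9.2454, 498.7757)

# Average over all tested intervals: OSTP 12.82 s vs. discrete-time SRM 18.75e4 s.
all_intervals = time_reduction_percent(12.82, 18.75e4)

print(f"{single_test:.2f}% and {all_intervals:.3f}%")
```

Both values are consistent with the "more than 98%" and "more than 99%" statements in the text.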

##### 5.2. Face Recognition Performance of Prediction and Supervised STDP

An experiment is conducted to examine the recognition accuracy of the CD classifier trained with Supervised STDP and with Prediction. Following the recommendation by Nordlie et al. [36], a tabular description of the experimental setup is given in Table 1.

Each dataset is randomly split into two groups, each containing half of the total number of subjects available. Each split follows a 2-fold cross-validation scheme in which each group is used in turn for training and then for testing, and the average recognition accuracy is taken. This random split is repeated 10 times, and the final average accuracy for both training and test, along with the standard deviation, is recorded in Table 2. For this experiment, the spike coding used to interpret the output spikes of the CD neuron is latency coding. The Baseline accuracy in Table 2 is obtained from the *k*NN approach.
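
The evaluation protocol can be sketched as below. This is a minimal sketch assuming the split is made over subject identifiers; `repeated_two_fold` and `toy_evaluate` are hypothetical names, and `toy_evaluate` merely stands in for training and testing the CD classifier, returning a fixed placeholder accuracy.

```python
import random
import statistics


def repeated_two_fold(subjects, evaluate, repeats=10, seed=0):
    """Repeat a random half/half split; each half is used once for training and once for testing."""
    rng = random.Random(seed)
    per_repeat = []
    for _ in range(repeats):
        shuffled = list(subjects)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        group_a, group_b = shuffled[:half], shuffled[half:]
        # 2-fold: train on A and test on B, then swap, and average the two accuracies.
        fold_accuracies = [evaluate(group_a, group_b), evaluate(group_b, group_a)]
        per_repeat.append(sum(fold_accuracies) / 2.0)
    return statistics.mean(per_repeat), statistics.stdev(per_repeat)


calls = []


def toy_evaluate(train_subjects, test_subjects):
    """Placeholder for 'train on one group, test on the other'; returns a dummy accuracy."""
    calls.append((tuple(train_subjects), tuple(test_subjects)))
    return 0.95


mean_accuracy, std_accuracy = repeated_two_fold(range(100), toy_evaluate)
print(mean_accuracy, std_accuracy, len(calls))
```

With 10 repeats of a 2-fold split, the classifier is trained and tested 20 times in total, and the mean and standard deviation are taken over the 10 per-repeat averages.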

Based on the results presented in Table 2, the CD classifier using either Supervised STDP or Prediction delivers, on average, better recognition accuracy than the Baseline approach. For the test sets, the CD classifier with Supervised STDP delivers an average recognition accuracy of 95.56%, while Prediction reaches 96.31%; both are more than 20% better than the Baseline accuracy. Performance on the training and test samples is comparable, signifying that no overfitting occurs during the training process. On average, Prediction performs slightly better than Supervised STDP, by 0.75 percentage points in recognition accuracy.

##### 5.3. Convergence Analysis

Since both learning methods are iterative algorithms, their performance with respect to the number of iterations needs to be examined. Using AR Scream S1 and Scarf S1, the average recognition accuracy of the training set as the number of iterations grows is shown in Figures 11(a) and 11(b) for Supervised STDP and Prediction, respectively. The corresponding spike error with is presented in Figure 11(c). According to Figure 11(b), latency coding converges to the minimum recognition error immediately at , while RO and TTFS coding converge at . Rate coding is a special case: it converges to the minimum after and produces the lowest error . However, the test set performance for rate coding is not as high; we found a test error of (not shown in Figure 11). This indicates *overfitting* for rate coding, and also signifies that, in rate coding, large data with large dimension would require a very large number of spikes to reliably distinguish each individual class. Thus it is recommended to use to avoid overtraining the synaptic time constant , where the train error and test error are obtained. Comparing Figures 11(a) and 11(b), convergence to the minimum recognition error is achieved faster with Prediction for all types of spike coding. Similarly, from Figure 11(c), the spike error converges to its minimum faster with Prediction than with Supervised STDP.

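**Figure 11.** Training-set recognition performance versus number of iterations for (a) Supervised STDP and (b) Prediction, and (c) the corresponding spike error.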

To investigate the convergence of both methods further, Figure 12 shows the evolution of output spikes from a CD neuron during Supervised STDP and Prediction learning. The neuron receives inputs from 2 populations of neurons (reference and test) sharing a similar spatial location. The 2 populations are acquired from images belonging to the same subject (matching samples). Based on Figure 12, both methods start the training by producing only a small number of spikes at epoch 0. However, Supervised STDP learning requires more epochs (30 to 35) before the output spikes stabilize, while Prediction learning requires only 5 to 15 epochs. Based on convergence alone, Prediction is preferable since it trains and converges faster than Supervised STDP.

##### 5.4. Discriminants from Synaptic Time Constant

The resulting trained synaptic time constant can capture the underlying variations embedded within the training faces. We observed that in Prediction learning the trained synaptic time constant values are lower at synaptic connections attached to face features with higher importance. This conforms to the initial objective of enforcing stricter conditions for spike firing on neurons attached to facial regions with a high degree of similarity, which ensures that only highly similar facial regions cause firing in the output neuron. The distribution of fully trained is shown in Figure 13.

According to Figure 13, facial regions with low discrimination, such as the mouth in the scream set, receive higher values of fully trained , which signifies lower importance to the final classification. This is used in accordance with the feature selection strategy of locally rewarding or penalizing each local NCF obtained from the CD classifier based on the computed discriminants .

##### 5.5. Performance Comparison of Different Spike Codings

From the results presented earlier, one learning method stands out in terms of both recognition accuracy and convergence: Prediction delivers better recognition accuracy than Supervised STDP and also trains faster. Furthermore, Supervised STDP is limited in the type of spike coding it can use, since rank-order coding does not deliver an acceptable result. Thus, the next analysis, on the performance of the CD classifier using different spike codings to interpret the output spikes, is based only on the CD classifier trained using Prediction. Using experimental settings similar to those described in Section 5.2, the recognition accuracy is recorded in Table 3.

According to Table 3, on average, latency coding produces the best result on the test samples with 96.31% accuracy, followed by rate coding with 94.67%, RO coding with 94.30%, and TTFS coding with 93.42%. RO coding works slightly better than latency coding on AR Scarf S1, where it produces 95.90% accuracy as opposed to 95.30% for latency coding. Considering that TTFS uses only the first output spike from each CD neuron, it delivers quite an impressive result, falling on average only around 3% short of latency coding.

In addition, to closely examine how each spike coding interprets the output spike distribution for matching and nonmatching samples, 4 images of 2 subjects from AR Scream S1 and Scarf S1 are used. These images form 2 matching samples and 2 nonmatching samples. Each image pair from the 4 samples is classified by the fully trained CD classifier, and the input and output spike patterns are recorded. The input patterns and output spike interpretations of the different spike codings are given in Figure 14. The codings are applied to the outputs from each local population of input neurons (i.e., 144 local patches).

**Figure 14.** Input patterns and output spike interpretations of the different spike codings, panels (a) and (b).

From Figure 14, the output spike delays rely heavily on the coincidence of the presynaptic input sequence: matching samples produce lower-delay spikes and nonmatching samples produce higher-delay spikes. The rate of output spike firing is also higher in matching samples and slightly lower in nonmatching samples. In the plot, upper face parts are bound to neurons’ afferents at lower locations, while lower face parts are attached to neurons’ afferents at higher locations. Note that, in both matching and nonmatching samples, the output spike delays and spike counts are nearly the same for the upper afferents. At the lower afferents, however, significant changes in delays and spike counts can be observed between matching and nonmatching samples: nonmatching samples produce fewer spikes and higher delays than matching samples. Since a stricter condition is imposed on the upper face part, it is much harder to evoke output spikes when the inputs actually belong to different subjects.

##### 5.6. Results on Face Recognition Using PCA and Gabor Features

For the final experiment, we investigate the performance of our proposed CD classifier against several widely used classifiers. We use two popular feature representation approaches, namely, PCA and Gabor wavelets, to represent the face. The PCA implementation follows the Locally Lateral Subspace (LLS) strategy employed in [29], where 8 PCA features are retained per local patch. Local Gabor features are acquired using the approach adopted in [37], and the resulting Gabor features per local patch are further downsampled by a factor of 3. Soft *k*NN follows the approach detailed in [30], while the soft SVM implementation follows the same sum aggregation of ensembles of classifiers adopted by soft *k*NN [30]. The SVM implementation uses the LibSVM library with an RBF kernel [38]. For the AR, JAFFE, and FERET datasets, the trained CD classifier acquired in Section 5.2 is used, while, for the CK+ dataset, 123 images at the beginning of the first sequence are used as gallery and 577 peak images from each sequence are used as probe in training. Then, the 4 most expressive images from each sequence, resulting in a total of 2290 test images, are used as test samples. The result of this experiment is given in Table 4.
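
The sum aggregation mentioned for soft *k*NN and soft SVM can be sketched as follows. This is only a minimal illustration of summing per-patch class scores before taking the final decision; the details of [30] are not reproduced, and the function name `aggregate_by_sum`, the class labels, and the scores are hypothetical.

```python
def aggregate_by_sum(local_scores):
    """Sum per-class scores over all local classifiers and return (winning label, totals)."""
    totals = {}
    for scores in local_scores:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    winner = max(totals, key=totals.get)
    return winner, totals


# Made-up soft outputs of three local (per-patch) classifiers for two subjects.
local_scores = [
    {"subject_A": 0.90, "subject_B": 0.10},
    {"subject_A": 0.40, "subject_B": 0.60},
    {"subject_A": 0.45, "subject_B": 0.55},
]

winner, totals = aggregate_by_sum(local_scores)
print(winner, totals)
```

In this toy case two of the three patches favor subject B, yet the summed scores favor subject A, which shows how soft aggregation differs from a hard majority vote over patches.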

According to Table 4, for the PCA representation, the CD classifier delivers the best result on all tested datasets, while, for the Gabor representation, the CD classifier gives the best recognition accuracy except on the CK+ dataset. On average, the CD classifier is more than 5% and 11% better than soft *k*NN and soft SVM, respectively, in the PCA representation. For the Gabor representation, the CD classifier is 2% and 13% better than soft *k*NN and soft SVM, respectively. The advantage of the CD classifier is more apparent with PCA than with Gabor features because Gabor features are robust against small spatial perturbations, which already increases the discrimination of facial features. In PCA, the noise due to variations is higher than in Gabor features, so the CD classifier’s ability to elevate the discrimination of PCA features is more obvious.

##### 5.7. Results on Other Multivariate Datasets

To investigate the viability of the proposed CD classifier approach on multivariate data other than face images, another experiment is conducted using the Iris dataset, the Breast Cancer Wisconsin (Diagnostic) dataset, and the Statlog (Landsat Satellite) dataset [39]. The Iris dataset contains 3 classes of 50 vectors each, where each vector has a dimension of 4. The Wisconsin breast cancer dataset contains 569 vectors belonging to two classes, namely, “malignant” and “benign,” where each vector has 32 elements. The Statlog dataset contains the multispectral values of pixels in neighborhoods in a satellite image. There are 6435 vectors of 36 elements to be classified into 7 classes. Using 10-fold cross validation, 10 splits of training/testing are carried out, except for the Statlog dataset, whose training and test data are fixed, and the averages and standard deviations are recorded in Table 5. CD classifier parameters used are , mV, , mV, and mV. Training is done by the Prediction learning method, where is used for classification by latency and rate coding, and is used for classification by RO and TTFS coding.
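
The 10-fold protocol can be sketched as an index generator. This sketch assumes plain random folds without stratification, since the text does not say how the folds are formed; `k_fold_indices` is an illustrative name, and 150 is the Iris sample count (3 classes of 50 vectors).

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k random, non-overlapping folds."""
    rng = random.Random(seed)
    order = list(range(n_samples))
    rng.shuffle(order)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [order[i::k] for i in range(k)]
    for i, test_indices in enumerate(folds):
        train_indices = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_indices, test_indices


splits = list(k_fold_indices(150))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

For 150 samples every split uses 135 vectors for training and 15 for testing, and each vector is tested exactly once across the 10 splits.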

According to the results presented in Table 5, the CD classifier with latency coding produces a slightly superior result, around 2% better than the *k*NN and SVM approaches on all datasets. This holds even though local discrimination is not applicable here: the variations within the data are not as generic as those found in face images, and there is no clear indication of how to divide each piece of data into locally lateral vectors. Even when the division was done by treating each single element of the vector as a local vector, no further improvement in classification was achieved. Furthermore, the variations are more random, so even though the discriminants can be computed, those learnt from the training data would not faithfully represent the variations in the test set.

Additionally, this experiment shows that, for classification of multivariate data, latency coding works best, with an average accuracy 18%, 2%, and 33% better than rate, RO, and TTFS coding, respectively. The inferiority of rate coding stems from the limited maximum encoding capacity of the rate of firing, which is due to the relatively small number of variables compared with the number of samples; that is, for the Iris dataset only 4 variables were available for classification of 120 samples. In contrast, the significantly better results achieved by rate coding on the Wisconsin and Statlog datasets are due to their higher number of variables, 32 and 36, respectively, which increases the maximum encoding capacity of the rate of firing. Meanwhile, RO coding is only slightly inferior to latency coding. The worst average performance is produced by TTFS coding, which fails to capture the underlying similarities between the probe and the gallery because only one spike per CD neuron (the first spike) is considered in this type of coding.

#### 6. Conclusions and Future Works

In this paper, a classifier based on SNN, namely, the coincidence detection (CD) classifier, is proposed, and two learning methods used to train it are presented. A method of optimizing the discrete-time Spike Response Model (SRM) by predicting the output spike time is also discussed in detail. We found that our proposed Output Spike Time Prediction (OSTP) method can produce, from an input pair, an output spike pattern identical to that of discrete-time SRM but with significantly fewer floating-point operations and much shorter processing time, with an average of 12.82 seconds as opposed to 18.75 × 10^{4} seconds for discrete-time SRM over all tested discrete intervals. In addition, we showed that coincidence detection can capture the degree of synchronization between two presynaptic inputs by producing lower-delay output spikes for more synchronized input pairs and higher-delay spikes for less synchronized ones. While the CD classifier produces spikes based on the coincidence of inputs, the closeness between the inputs that triggers an output spike is explicitly determined by training the learning parameter, the synaptic time constant.

In addition, the CD classifier trained with Prediction delivered performance comparable to Supervised STDP while converging faster, with fewer epochs required. We found that latency coding produced the best recognition accuracy at 96.31%, although the other spike codings are not far behind. Furthermore, the distribution of discriminants derived from the learning parameters revealed the ability of Prediction learning to capture the underlying variation within the training faces. Further investigation of the CD classifier using PCA and Gabor features showed that our proposed method performs 5% and 11% better than soft *k*NN and soft SVM, respectively, in the PCA representation, and 2% and 13% better than soft *k*NN and soft SVM, respectively, in the Gabor representation. The experiment on the feasibility of the CD classifier for other multivariate data revealed that the CD classifier with latency coding is around 2% better than the *k*NN and SVM classifiers. Additionally, for the tested multivariate data, latency coding delivers the best result, which is 18%, 2%, and 33% better than rate, RO, and TTFS coding, respectively.

As for future work, we will explore extending the proposed method to object recognition tasks and to temporal recognition of faces from video sequences. We will further study how to embed the global information of a face image together with local patch information so that the resulting classification is more robust against global variations such as pose, age variation, and illumination.

#### Appendices

#### A. Two Presynaptic Spikes Firing at the Same Time

Calculating from (3) to (9). Let :

#### B. Proposed OSTP Equation

Complete calculation of OSTP equation (12) from (10): Then, let and thus. Consequently, becomes Let and :

#### C. Approximating Synaptic Time Constant from SRM Model

Complete calculation from (10) to (15): Let , and

#### D. Proof of Minimum Required to Evoke a Spike

Given a presynaptic input pair and , where , and that the threshold would make the firing possible, that is, , then we can let . To prove this, based on (10) assuming that the spike would fire at time , where , thus (10) becomes Since , the second term on the right is almost 0. In order to ensure definitely that a spike would be evoked, should be at least equal to ; thus by setting Taking , then so that

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

The authors would like to thank Dr. Aleix M. Martinez for providing the AR Face Dataset. Some portions of the research in this paper also use Extended Yale Face Database B. The work presented here is cooperatively sponsored by the Ministry of Higher Education Malaysia, under MyRA Incentive Grant Scheme (MIRGS) no. MIRGS13-02-001-0001 and University of Technology MARA, Malaysia.