Computational Intelligence and Neuroscience

Volume 2015 (2015), Article ID 986574, 18 pages

http://dx.doi.org/10.1155/2015/986574

## Encoding Sequential Information in Semantic Space Models: Comparing Holographic Reduced Representation and Random Permutation

^{1}University of Cambridge, Cambridge CB2 1TN, UK

^{2}Swedish Institute of Computer Science, 164 29 Kista, Sweden

^{3}Redwood Center for Theoretical Neuroscience, University of California, Berkeley, Berkeley, CA 94720, USA

^{4}Indiana University, Bloomington, IN 47405, USA

Received 14 December 2014; Accepted 26 February 2015

Academic Editor: Carlos M. Travieso-González

Copyright © 2015 Gabriel Recchia et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Circular convolution and random permutation have each been proposed as neurally plausible binding operators capable of encoding sequential information in semantic memory. We perform several controlled comparisons of circular convolution and random permutation as means of encoding paired associates as well as encoding sequential information. Random permutations outperformed convolution with respect to the number of paired associates that can be reliably stored in a single memory trace. Performance was equal on semantic tasks when using a small corpus, but random permutations were ultimately capable of achieving superior performance due to their higher scalability to large corpora. Finally, “noisy” permutations in which units are mapped to other units arbitrarily (no one-to-one mapping) perform nearly as well as true permutations. These findings increase the neurological plausibility of random permutations and highlight their utility in vector space models of semantics.

#### 1. Introduction

Semantic space models (SSMs) have seen considerable recent attention in cognitive science both as automated tools to estimate semantic similarity between words and as psychological models of how humans learn and represent lexical semantics from contextual cooccurrences (for a review, see [1]). In general, these models build abstract semantic representations for words from statistical redundancies observed in a large corpus of text (e.g., [2, 3]). As tools, the models have provided valuable metrics of semantic similarity for stimulus selection and control in behavioral experiments using words, sentences, and larger units of discourse [4–6]. As psychological models, the vectors derived from SSMs serve as useful semantic representations in computational models of word recognition, priming, and higher-order comprehension processes [7–12]. In addition, the semantic abstraction algorithms themselves are often proposed as models of the cognitive mechanisms used by humans to learn word meaning from repeated episodic experience, although there has been criticism that this theoretical claim may be overextending the original intention of SSMs [13–15].

A classic example of an SSM is Landauer and Dumais’ [2] latent semantic analysis model (LSA). LSA begins with a word-by-document matrix representation of a text corpus, where a word is represented as a frequency distribution over documents. Next, a lexical association function is applied to dampen the importance of a word proportionate to its entropy across documents (see [16] for a review of functions used in various SSMs). Finally, singular value decomposition is applied to the matrix to reduce its dimensionality. In the reduced representation, a word’s meaning is a vector of weights over the 300 latent dimensions with the largest singular values. The dimensional reduction step has the effect of bringing out latent semantic relationships between words. The resulting space positions words proximally if they cooccur more frequently than would be expected by chance and also if they tend to occur in similar semantic contexts (even if they never directly cooccur).
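The LSA pipeline just described (count, weight, decompose) can be sketched in a few lines of NumPy. This is a toy illustration only, not the implementation of [2]: the four-word counts, the particular log-entropy weighting variant, and the choice of two retained dimensions are assumptions made for the example.

```python
import numpy as np

# Toy word-by-document count matrix: rows = words, columns = documents.
# (Hypothetical counts, for illustration only.)
counts = np.array([
    [2, 0, 1, 0],   # "cat"
    [1, 0, 2, 0],   # "dog"
    [0, 3, 0, 1],   # "stock"
    [0, 1, 0, 2],   # "market"
], dtype=float)

# Entropy-based weighting: damp words spread evenly across documents.
p = counts / counts.sum(axis=1, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    logp = np.where(p > 0, np.log(p), 0.0)
entropy = -(p * logp).sum(axis=1) / np.log(counts.shape[1])
weighted = np.log(counts + 1) * (1 - entropy)[:, None]

# Dimensional reduction via SVD: keep the k largest singular values.
U, S, Vt = np.linalg.svd(weighted, full_matrices=False)
k = 2
word_vectors = U[:, :k] * S[:k]   # reduced word representations

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words used in similar documents end up close in the reduced space.
print(cosine(word_vectors[0], word_vectors[1]))  # cat vs. dog: high
print(cosine(word_vectors[0], word_vectors[2]))  # cat vs. stock: low
```

Because "cat" and "dog" share document contexts while "stock" and "market" occupy a disjoint set, the reduced space places the first pair together and far from the second, mirroring the inductive effect described above.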

More recent SSMs employ sophisticated learning mechanisms borrowed from probabilistic inference [18], holographic encoding [19], minimum description length [20], random indexing [21], and global memory retrieval [22]. However, all are still based on the fundamental notion that lexical semantics may be induced by observing word cooccurrences across semantic contexts [23, 24], and no single model has yet proven itself to be the dominant methodology [1].

Despite their successes both as tools and as psychological models, current SSMs suffer from several shortcomings. Firstly, the models have been heavily criticized in recent literature because they learn only from linguistic information and are not grounded in perception and action; for a review of this debate, see de Vega et al. [25]. The lack of perceptual grounding is clearly at odds with the current literature in embodied cognition, and it limits the ability of SSMs to account for human behavior on a variety of semantic tasks [26]. While the current paper does not address the issue of incorporating perceptual grounding into computational models trained on linguistic data, the issue is discussed at length in several recent papers (e.g., [15, 27–30]). Secondly, SSMs are often criticized as “bag of words” models because (with the exception of several models to be discussed in the next section) they encode only the contexts in which words cooccur, ignoring statistical information about the temporal order of word use within those contexts. Finally, many SSMs have difficulty scaling to linguistic data of a magnitude comparable to what humans experience. In this paper, we simultaneously address order and scalability in SSMs.

*The Role of Word Order in Lexical Semantics*. A wealth of evidence has emphasized the importance of domain-general sequential learning abilities in language processing (see [31, 32], for reviews), with recent evidence suggesting that individual differences in statistical sequential learning abilities may even partially account for variations in linguistic performance [33]. Bag of words SSMs are blind to word order information when learning, and this has been criticized as an “architectural failure” of the models [13] insofar as it was clear a priori that humans utilize order information in almost all tasks involving semantic cognition. For example, interpretation of sentence meaning depends on the sequential usage tendencies of the specific component words [34–40].

One common rebuttal to this objection is that order information is unimportant for many tasks involving discourse [2]. However, this seems to apply mostly to applied problems with large discourse units such as automatic essay grading [41]. A second rebuttal is that SSMs are models of how lexical semantics are learned and represented, but not how words are used to build sentence/phrase meaning [42, 43]. Hence, order is not typically thought of as a part of word learning or representation, but rather how lexical representations are put together for comprehension of larger units of discourse. Compositional semantics is beyond the scope of SSMs and instead requires a process account of composition to build meaning from SSM representations, and this is the likely stage at which order plays a role [9, 11].

However, this explanation is now difficult to defend, given a recent flurry of research in psycholinguistics demonstrating that temporal order information is used by humans when learning about words, and that order is a core information component of the lexical representation of the word itself. The role of statistical information about word order was traditionally thought to apply only to the rules of word usage (grammar) rather than the lexical meaning of the word itself. However, temporal information is now taking a more prominent role in the lexical representation of a word’s meaning. Elman [44] has recently argued that the lexical representations of individual words contain information about common temporal context, event knowledge, and habits of usage (cf. [4, 45–47]). In addition, recent SSMs that integrate word order information have seen greater success at fitting human data in semantic tasks than SSMs encoding only contextual information (e.g., [16, 19, 48–50]).

*The Role of Data Scale in Lexical Semantics*. SSMs have also been criticized due to their inability to scale to realistic sizes of linguistic data [51, 52]. The current corpora that SSMs such as LSA are frequently trained on contain approximately the number of tokens that children are estimated to have experienced in their ambient environment by age three (in the range of 10–30 million), not even including words produced by the child during this time [14, 53]. Given that SSMs are typically evaluated using benchmarks elicited from college-age participants, it would be ideal if they were trained upon a quantity of linguistic input approximating the experience of this age.

However, SSMs that rely on computationally complex decomposition techniques to reveal the latent components in a word-by-document matrix (e.g., singular value decomposition) are not able to scale up to corpora of hundreds of millions of tokens, even with high-end supercomputing resources. Although new methods for scaling up singular value decomposition to larger input corpora have shown promise [54, 55], there will always be a practical upper limit to the amount of data that can be processed when compared to continuous vector accumulation techniques. The problem is exacerbated by the fact that as the size of the corpus increases, the numbers of rows and columns in the matrix both increase significantly: the number of columns grows linearly in proportion to the number of documents, and the number of rows grows approximately in proportion to the square root of the number of tokens (Heaps’ law).

As the availability of text increases, it is an open question whether a better solution to semantic representation is to employ simpler algorithms that are capable of both integrating order information and scaling up to take advantage of large data samples or whether time would better be spent optimizing decomposition techniques. Recchia and Jones [52] demonstrated that although an extremely simple method for assessing the semantic similarity of word pairs (a simplified version of pointwise mutual information) was outperformed by more complex models such as LSA on small text corpora, the simple metric ultimately achieved better fits to human data when it was scaled up to an input corpus that was intractable for LSA. Similarly, Bullinaria and Levy [16] found that simple vector space representations achieved high performance on a battery of semantic tasks, with performance increasing monotonically with the size of the input corpus. In addition, Louwerse and Connell’s [56] simulations indicated that first-order cooccurrence structure in text was sufficient to account for a variety of behavioral trends that had seemed to be indicative of a “latent” learning mechanism, provided that the text learned from was at a sufficiently large scale. These findings were one factor that led these authors to favor simple, scalable algorithms over more complex, nonscalable ones.

The issue of scalability is more than simply a practical concern of computing time. Connectionist models of semantic cognition (e.g., [57, 58]) have been criticized because they are trained on “toy” artificial languages that have desirable structure built-in by the theorist. These small training sets do not contain the complex structure inherent in real natural language. To produce humanlike behavior with an impoverished training set, the models are likely to be positing overly complex learning mechanisms compared to humans who learn from experience with much larger amounts of complex linguistic data. Hence, humans may be using considerably simpler learning mechanisms because much of the requisite complexity to produce their semantic structure is the result of large sampling from a more complex dataset [14, 19]. A model of human learning should be able to learn data at a comparable scale to what humans experience, or it risks being overly complex. As Onnis and Christiansen [59] have noted, many models of semantic learning “assume a computational complexity and linguistic knowledge likely to be beyond the abilities of developing young children” [59, abstract].

The same complexity criticism applies to most current SSMs. Although they learn from real-world linguistic data rather than artificial languages, the amount of data they learn from is only about 5% of what is likely experienced by the college-age participants who produce the semantic data that the models are fit to. Of course, a strict version of this argument assumes equal and unchanging attention to incoming tokens, which is unlikely to be true (see [7, 60]). Hence, to produce a good fit to the human data with impoverished input, we may be developing SSMs that have unnecessary complexity built into them. This suggestion explains why recent research with *simple and scalable* semantic models has found that simple models that scale to large amounts of data consistently outperform computationally complex models that have difficulty scaling (e.g., [51, 52]; cf. [61]).

#### 2. Methods of Integrating Word Order into SSMs

Early work with recurrent neural networks [57, 62–64] demonstrated that paradigmatic similarity between words could be learned across a distributed representation by attending to the sequential surroundings of the word in the linguistic stream. However, this work was limited to small artificial languages and did not scale to natural language corpora. More recently, work by Howard and colleagues with temporal context models [65–68] has shown promise at applying neurally inspired recurrent networks of temporal prediction by the hippocampal system to large real-world language corpora. Tong and colleagues have demonstrated the utility of echo state networks in learning a grammar with long-distance dependencies [69], although their work focused on a corpus of an artificial language similar to that of Elman [70]. In a similar vein, liquid state machines have been successfully trained upon a corpus of conversations obtained from humans performing cooperative search tasks to recognize phrases unfolding in real time [71].

Other noteworthy works on distributional representations of word meaning include “deep learning” methods [72], which have attracted increasing attention in the artificial intelligence and machine learning literature due to their impressive performance on a wide variety of tasks (see [73, 74] for reviews). Deep learning refers to a constellation of related methods for learning functions composed of multiple nonlinear transformations by making use of “deep” (i.e., highly multilayered) neural networks. Intermediate layers, corresponding to intermediate levels of representation, are trained one at a time with restricted Boltzmann Machines, autoencoders, or other unsupervised learning algorithms [72, 75, 76]. These methods have been applied to construct distributed representations of word meaning [77–79] and compositional semantics [80]. Of particular relevance to the present work, recurrent neural networks—referred to as the “temporal analogue” of deep neural networks [81]—have been successfully used to model sequential dependencies in language. By applying a variant of Hessian-free optimization to recurrent neural networks, Sutskever et al. [82] surpassed the previous state-of-the-art performance in character-level language modeling. Similarly, Mikolov et al. [80] achieved new state-of-the-art performance on the Microsoft Research Sentence Completion challenge with a weighted combination of an order-sensitive neural network language model and a recurrent neural network language model.

The improvements in performance achieved by deep learning methods over the past decade and the variety of tasks on which these improvements have been realized are such that deep learning has been referred to as a “breakthrough” in machine learning within academia and the popular press [74]. However, reducing the computational complexity of training deep networks remains an active area of research, and deep networks have not been compared with human performance on “semantic” behavioral tasks (e.g., semantic priming and replicating human semantic judgments) as thoroughly as have most of the SSMs described previously in this section. Furthermore, although deep learning methods have several properties that are appealing from a cognitive perspective [73], researchers in machine learning are typically more concerned with a method’s performance and mathematical properties than its cognitive plausibility. Given the similarity in the ultimate goals of both approaches—the development of unsupervised and semisupervised methods to compute vector representations of word meaning—cognitive scientists and machine learning researchers alike may benefit from increased familiarity with the most popular methods in each other’s fields. This is particularly true given that both fields often settle on similar research questions, for example, how best to integrate distributional lexical statistics with information from other modalities. Similar to findings in cognitive science demonstrating that better fits to human data are achieved when a distributed model is trained simultaneously (rather than separately) on textual data and data derived from perceptual descriptions [27], performance with deep networks is improved when learning features for one modality (e.g., video) with features corresponding to a second modality (e.g., audio) simultaneously rather than in isolation [83].

One of the earliest large-scale SSMs to integrate sequential information into a lexical representation was the Hyperspace Analogue to Language model (HAL; [3]), and it has been proposed that HAL produces lexical organization akin to what a large-scale recurrent network would produce when trained on language corpora [84]. HAL essentially tabulates a word-by-word cooccurrence matrix in which cell entries are inversely weighted by distance within a moving window (typically 5–10 words) slid across a text corpus. A word’s final lexical representation is the concatenation of its row (words preceding target) and column (words succeeding target) vectors from the matrix, normalized by length to reduce the effect of marginal frequency. Typically, columns with the lowest variance are removed prior to concatenation to reduce dimensionality. HAL has inspired several related models for tabulating context word distances (e.g., [50, 85, 86]), and this general class of model has seen considerable success at mimicking human data from sources as diverse as deep dyslexia [87], lexical decision times [88], semantic categorization [15, 89], and information flow [90].
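The distance-weighted tabulation at the heart of HAL can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `hal_matrix`, the linear weighting scheme (a word at distance \(d\) contributes weight \(\mathrm{window} - d + 1\)), the tiny token stream, and the omission of the variance-based column pruning are all assumptions for the example.

```python
import numpy as np

def hal_matrix(tokens, vocab, window=5):
    """HAL-style distance-weighted word-by-word cooccurrence counts.

    M[target, context] accumulates weight (window - d + 1) whenever
    `context` occurs d positions before `target` within the window.
    """
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for t, word in enumerate(tokens):
        for d in range(1, window + 1):
            if t - d < 0:
                break
            M[idx[word], idx[tokens[t - d]]] += window - d + 1
    return M

tokens = "the quick fox jumps over the lazy dog".split()
vocab = sorted(set(tokens))
M = hal_matrix(tokens, vocab, window=3)

# A word's representation is its row (preceding contexts) concatenated
# with its column (succeeding contexts).
i = vocab.index("fox")
fox = np.concatenate([M[i, :], M[:, i]])
print(fox.shape)   # twice the vocabulary size
```

With window 3, the immediately preceding word ("quick" before "fox") receives the maximum weight of 3, and more distant words receive proportionally less, capturing the inverse distance weighting described above.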

Topic models (e.g., [18]) have seen a recent surge of popularity in modeling the semantic topics from which linguistic contexts could be generated. Topic models have been very successful at explaining high-level semantic phenomena such as the structure of word association norms, but they have also previously been integrated with hidden Markov models to simultaneously learn sequential structure [48, 91]. These models either independently infer a word’s meaning and its syntactic category [91] or infer a hierarchical coupling of probability distributions for a word’s topic context dependent on its sequential state. Although promising formal approaches, neither model has yet been applied to model behavioral data.

An alternative approach to encoding temporal information in vector representations is to use vector binding based on high-dimensional random representations (for a review, see [92]). Two random binding models that have been successfully applied to language corpora are the bound encoding of the aggregate language environment model (BEAGLE; [19]) and the random permutation model (RPM; [17]). BEAGLE and RPM can both be loosely thought of as noisy \(n\)-gram models. Each uses a dedicated function to associate two contiguous words in a corpus but may recursively apply the same function to create vectors representing multiple chunks. For example, in the short phrase “Mary loves John,” an associative operator can be used to create a new vector that represents the \(n\)-grams Mary-loves and (Mary-loves)-John. The continuous binding of higher-order \(n\)-grams from a single operator in this fashion is remarkably simple but produces very sophisticated vector representations that contain word transition information. In addition, the associative operations themselves may be inverted to retrieve from memory previously stored associations. Hence, given the probe Mary ____ John, the operation can be inverted to retrieve plausible words that fit this temporal context from the training corpus that are stored in a distributed fashion in the vector. The applications of BEAGLE and RPM to natural language processing tasks have been studied extensively elsewhere. The focus of this current set of experiments is to study their respective association operators in depth.

Rather than beginning with a word-by-document matrix, BEAGLE and RPM each maintain a static, randomly generated *signal* vector for each word in the lexicon. A word’s signal vector is intended to represent the mental representation elicited by its invariant physical properties such as orthography and phonology. In both models, this signal structure is assumed to be randomly distributed across words in the environment, but vectors with realistic physical structure are also now possible and seem to enhance model predictions [93].

BEAGLE and RPM also maintain dynamic *memory* vectors for each word. A word’s memory representation is updated each time it is experienced in a semantic context as the sum of the signal vectors for the other words in the context. By this process, a word’s context is a mixture of the other words that surround it (rather than a frequency tabulation of a document cooccurrence), and words that appear in similar semantic context will come to have similar memory representations as they have had many of the same random signal vectors summed into their memory representations. Thus, the dimensional reduction step in these models is implicitly achieved by superposition of signal vectors and seems to accomplish the same inductive results as those attained by dimensional reduction algorithms such as in LSA, but without the heavy computational requirements [49, 94]. Because they do not require either the overhead of a large word-by-document matrix or computationally intensive matrix decomposition techniques, both BEAGLE and RPM are significantly more scalable than traditional SSMs. For example, encoding with circular convolution in BEAGLE can be accomplished in \(O(D \log D)\) time, where \(D\) is a constant representing the number of dimensions in the reduced representation [95], and in \(O(D)\) time with random permutation. By contrast, the complexity of LSA is \(O(zD)\), where \(z\) is the number of nonzero entries in the matrix and \(D\) is the number of dimensions in the reduced representation [96]. Critically, \(z\) increases roughly exponentially with the number of documents [97]. Scalable and incremental random vector accumulation has been shown to be successful on a range of experimental tasks without being particularly sensitive to the choice of parameters such as dimensionality [21, 94, 98, 99].
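The signal/memory distinction can be made concrete with a short sketch. This is a schematic of random vector accumulation in general, not either model's actual implementation; the vocabulary, dimensionality, window size, and training sentences are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 1024
vocab = ["mary", "loves", "john", "likes"]

# Static random *signal* vectors: one fixed random pattern per word.
signal = {w: rng.standard_normal(dim) / np.sqrt(dim) for w in vocab}
# Dynamic *memory* vectors, updated during training.
memory = {w: np.zeros(dim) for w in vocab}

def train(sentence, window=2):
    words = sentence.split()
    for i, target in enumerate(words):
        for j, context in enumerate(words):
            if i != j and abs(i - j) <= window:
                # Superpose the context word's signal into the target's memory.
                memory[target] += signal[context]

train("mary loves john")
train("mary likes john")

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "loves" and "likes" occur in the same contexts (mary ... john), so the
# same signal vectors are summed into both memory vectors.
print(cosine(memory["loves"], memory["likes"]))
```

Because similarity arises purely from shared summed signal vectors, no matrix decomposition step is required, which is the source of the scalability advantage discussed above.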

To represent statistical information about the temporal order in which words are used, BEAGLE and RPM bind together \(n\)-gram chunks of signal vectors into composite *order* vectors that are added to the memory vectors during training. Integrating information about a word’s sequential context (*where* words tend to appear around a target) in BEAGLE has produced greater fits to human semantic data than only encoding a word’s discourse context (*what* words tend to appear around a target; [19, 49]). Similarly, Sahlgren et al. [17] report superior performance when incorporating temporal information about word order. Hence, in both models, a word’s representation becomes a pattern of elements that reflects both its history of cooccurrence with, and position relative to, other words in linguistic experience. Although BEAGLE and RPM differ in respects such as vector dimensionality and chunk size, arguably the most important difference between them is the binding operation used to create order vectors.

BEAGLE uses the operation of circular convolution to bind together signal vectors into a *holographic reduced representation* (HRR; [95, 100]) of \(n\)-gram chunks that contain each target word. Convolution is a binary operation (denoted by \(\circledast\)) performed on two vectors \(\mathbf{x}\) and \(\mathbf{y}\) such that every element \(z_i\) of \(\mathbf{z} = \mathbf{x} \circledast \mathbf{y}\) is given by

\[ z_i = \sum_{j=0}^{n-1} x_j \, y_{(i-j) \bmod n}, \]

where \(n\) is the dimensionality of \(\mathbf{x}\) and \(\mathbf{y}\). Circular convolution can be seen as a modulo-\(n\) variation of the tensor product of two vectors \(\mathbf{x}\) and \(\mathbf{y}\) such that \(\mathbf{z}\) is of the same dimensionality as \(\mathbf{x}\) and \(\mathbf{y}\). Furthermore, although \(\mathbf{z}\) is dissimilar from both \(\mathbf{x}\) and \(\mathbf{y}\) by any distance metric, approximations of \(\mathbf{x}\) and \(\mathbf{y}\) can be retrieved via the inverse operation of correlation (denoted by \(\oslash\), not related to Pearson’s \(r\)); for example, \(\mathbf{y} \approx \mathbf{z} \oslash \mathbf{x}\). Hence, not only can BEAGLE encode temporal information together with contextual information in a single memory representation, but also it can invert the temporal encoding operation to retrieve grammatical information directly from a word’s memory representation without the need to store grammatical rules (see [19]). Convolution-based encoding and decoding have many precedents in memory modeling (e.g., [101–106]) and have played a key role in models of many other cognitive phenomena as well (e.g., audition [107]; object perception [108]; perceptual-motor skills [109]; reasoning [110]).
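The bind-then-decode cycle can be demonstrated numerically. The sketch below is illustrative only: it uses the standard FFT identity for circular convolution and implements correlation as convolution with the involution of the cue, under the assumption of Plate-style vectors with elements drawn i.i.d. from \(N(0, 1/n)\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1024
# HRR-style random vectors: elements i.i.d. from N(0, 1/n).
x = rng.normal(0, 1 / np.sqrt(n), n)
y = rng.normal(0, 1 / np.sqrt(n), n)

def convolve(a, b):
    """Circular convolution: z_i = sum_j a_j * b_{(i-j) mod n}, via FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def correlate(a, b):
    """Correlation (approximate inverse): decodes b's partner from a binding.
    Implemented as convolution with the involution of a, via conjugate FFT."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

z = convolve(x, y)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(z, y))                 # near 0: the binding resembles neither input
print(cosine(correlate(x, z), y))   # high: decoding with x recovers a noisy y
```

The decoded vector is only an approximation of \(\mathbf{y}\), which is why models that use correlation pair it with a clean-up memory that maps the noisy result back to the nearest stored item.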

In contrast to convolution, RPM employs the unary operation of *random permutation* (RP; [17]) to encode temporal information about a word. RPs are functions \(\Pi\) that map input vectors to output vectors such that the outputs are simply randomly shuffled versions of the inputs:

\[ \Pi \mathbf{x} = \mathbf{y}, \]

such that the expected correlation between \(\mathbf{x}\) and \(\mathbf{y}\) is zero. Just as \(\mathbf{x} \circledast \mathbf{y}\) produces a vector that differs from \(\mathbf{x}\) and \(\mathbf{y}\) but from which approximations of \(\mathbf{x}\) and \(\mathbf{y}\) can be retrieved, the sum of two RPs of \(\mathbf{x}\) and \(\mathbf{y}\), \(\Pi\mathbf{x} + \Pi^2\mathbf{y}\), where \(\Pi^2\mathbf{y}\) is defined as \(\Pi(\Pi(\mathbf{y}))\), produces a vector dissimilar from \(\mathbf{x}\) and \(\mathbf{y}\) but from which approximations of the original \(\mathbf{x}\) and \(\mathbf{y}\) can be retrieved via \(\Pi^{-1}\) and \(\Pi^{-2}\), respectively.
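Position-sensitive binding with permutations can likewise be demonstrated in a few lines. The sketch assumes Gaussian random vectors and encodes two positions as \(\Pi\mathbf{x} + \Pi^2\mathbf{y}\); the seed and dimensionality are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1024
x = rng.standard_normal(n)
y = rng.standard_normal(n)

perm = rng.permutation(n)     # Π: a fixed random shuffling of indices
inv = np.argsort(perm)        # indices that undo the shuffle

P     = lambda v: v[perm]           # Π v
P2    = lambda v: v[perm][perm]     # Π² v = Π(Π(v))
Pinv  = lambda v: v[inv]            # Π⁻¹ v
P2inv = lambda v: v[inv][inv]       # Π⁻² v

# Bind by position and superpose into a single trace.
trace = P(x) + P2(y)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(trace, x))         # near 0: trace resembles neither input
print(cosine(Pinv(trace), x))   # high: Π⁻¹(Πx + Π²y) = x + Πy, i.e., x plus noise
print(cosine(P2inv(trace), y))  # high: Π⁻²(Πx + Π²y) = Π⁻¹x + y, i.e., y plus noise
```

Note that decoding one position leaves the other bound item behind as uncorrelated noise, exactly parallel to the residual noise left by convolution's correlation operator.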

Both convolution and random permutation offer efficient storage properties, compressing order information into a single composite vector representation, and both encoding operations are reversible. However, RPs are much more computationally efficient to compute. In language applications of BEAGLE, the computationally expensive convolution operation is what limits the size of a text corpus that the model can encode. As several studies [16, 17, 52] have demonstrated, scaling a semantic model to more data produces much better fits to human semantic data. Hence, both order information and magnitude of linguistic input have been demonstrated to be important factors in human semantic learning. If RPs prove comparable to convolution in terms of storage capacity, performance on semantic evaluation metrics, and cognitive plausibility, the scalability of RPs to large datasets may afford the construction of vector spaces that better approximate human semantic structure while preserving many of the characteristics that have made convolution attractive as a means of encoding order information.

For scaling to large corpora, the implementation of RPs in semantic space models is more efficient than that of circular convolution. This is partly due to the higher computational complexity of convolution with respect to vector dimensionality. Encoding \(n\)-dimensional bindings with circular convolution can be accomplished in \(O(n \log n)\) time [95] by means of the fast Fourier transform (FFT). The algorithm to bind two vectors \(\mathbf{x}\) and \(\mathbf{y}\) in \(O(n \log n)\) time involves calculating the discrete Fourier transforms of \(\mathbf{x}\) and \(\mathbf{y}\), multiplying them pointwise to yield a new vector \(\mathbf{w}\), and calculating the inverse discrete Fourier transform of \(\mathbf{w}\). In the BEAGLE model, storing a single bigram (e.g., updating the memory vector of “fox” upon observing “red fox”) would require one such binding, as well as the addition of the resulting vector to the memory vector of “fox.”

In contrast, encoding with RPs can be accomplished in \(O(n)\) (i.e., linear) time, as permuting a vector only requires copying the value at every index of the original vector to a different index of another vector of the same dimensionality. For example, the permutation function may state that the first cell in the original vector should be copied to the 1040th cell of the new vector, that the next should be copied to the 239th cell of the new vector, and so on. Thus, this process yields a new vector that contains a shuffled version of the original vector, in a number of steps that scales linearly with vector dimensionality. To update the memory vector of “fox” upon observing “red fox,” RPM would need to apply this process to the environmental vector of “red,” yielding a new shuffled version that would then be added to the memory vector of “fox.”

In addition to the complexity difference, the calculations involved in the FFT implementation of convolution require more time to execute on each vector element than the copy operations involved in random permutation. Combining these two factors means that circular convolution is considerably less efficient than random permutation in practice. In informal empirical comparisons using the FFT routines in a popular open-source mathematics library (http://math.net/), we found circular convolution to be over 70 times slower than random permutation at a vector dimensionality of 2,048. Due to convolution’s greater computational complexity, the gap widened even further as dimensionality increased. These factors made it impossible to perform our simulations with BEAGLE on the large corpus.
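The magnitude of the practical gap is easy to probe informally. The sketch below times NumPy's FFT-based convolution against fancy-indexed permutation at the dimensionality mentioned above; it is not the Math.NET benchmark from the text, and absolute times (and the exact ratio) will vary by machine and library.

```python
import timeit
import numpy as np

n = 2048
rng = np.random.default_rng(3)
x, y = rng.standard_normal(n), rng.standard_normal(n)
perm = rng.permutation(n)

# One circular-convolution binding via FFT vs. one random permutation.
conv = lambda: np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))
shuf = lambda: x[perm]

t_conv = timeit.timeit(conv, number=1000)
t_perm = timeit.timeit(shuf, number=1000)
print(f"convolution: {t_conv:.4f}s  permutation: {t_perm:.4f}s")
```

Whatever the exact ratio on a given machine, permutation's simple index copying reliably beats the three Fourier transforms and pointwise multiply that each convolution binding requires.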

We conducted four experiments intended to compare convolution and RP as means of encoding word order information with respect to performance and scalability. In Experiment 1, we conducted an empirical comparison of the storage capacity and the probability of correct decoding under each method. In Experiment 2, we compared RP with convolution in the context of a simple vector accumulation model equivalent to BEAGLE’s “order space” on a battery of semantic evaluation tasks when trained on a Wikipedia corpus. The model was trained on both the full corpus and a smaller random subset; results improved markedly when RP was allowed to scale up to the full Wikipedia corpus, which proved to be intractable for the convolution-based HRR model. In Experiment 3, we specifically compared BEAGLE to RPM, which differs from BEAGLE in several important ways other than its binding operation, to assess whether using RP in the context of RPM improves performance further. Finally, Experiment 4 demonstrates that similar results can be achieved with random permutations when the constraint that every unit of the input must be mapped to a unique output node is removed. We conclude that RP is a promising and scalable alternative to circular convolution in the context of vector space models of semantic memory and has properties of interest to computational modelers and researchers interested in memory processes more generally.

#### 3. Experiment 1: Associative Capacity of HRR and RP

Plate [95] made a compelling case for the use of circular convolution in HRRs of associative memory, demonstrating its utility in constructing distributed representations with high storage capacity and high probability of correct retrieval. However, the storage capacity and probability of correct retrieval with RPs have not been closely investigated. This experiment compared the probability of correct retrieval of RPs with that of circular convolution and explored how the memory capacity of RPs varies with respect to dimensionality, number of associations stored, and the nature of the input representation.

##### 3.1. Method

As a test of the capacity of convolution-based associative memories, Plate [95, Appendix D] describes a simple paired-associative memory task in which a retrieval algorithm must select the vector that is bound to its associate out of a set of possible random vectors. The retrieval algorithm is provided with a memory vector of the form
$$m = a_1 \circledast b_1 + a_2 \circledast b_2 + \cdots + a_k \circledast b_k$$
that stores a total of $2k$ vectors. All vectors are of dimensionality $n$, and each of $a_i$ and $b_i$ is a normally distributed random vector, i.i.d. with elements sampled from $N(0, 1/n)$. The retrieval algorithm is provided with the memory vector $m$ and the probe $a_i$, and works by first calculating $d = a_i \oslash m$, where $\oslash$ is the *correlation operator* described in detail in Plate [95, pp. 94–97], an approximate inverse of convolution. The algorithm then retrieves the vector in the “clean-up memory” set $E$ that is the most similar to $d$. This is accomplished by calculating the cosine between $d$ and each vector in the set $E$ and retrieving the vector from $E$ for which the cosine is highest. If this vector is not equal to $b_i$, this counts as a retrieval error. We replicated Plate’s method to empirically derive retrieval accuracies for a variety of choices of $k$ and $n$, keeping $|E|$ fixed at 1,000.
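As an illustration of this paired-associate task, the following NumPy sketch (not from the original study; the dimensionality, the number of pairs, and the FFT-based implementation of convolution and correlation are our own illustrative choices) binds $k$ Gaussian pairs by circular convolution, decodes one associate with the correlation operator, and cleans it up against $E$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1024, 5          # dimensionality and number of stored pairs

def cconv(a, b):
    """Circular convolution (binding), computed in the frequency domain."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

def ccorr(a, b):
    """Circular correlation, Plate's approximate inverse of convolution."""
    return np.fft.irfft(np.conj(np.fft.rfft(a)) * np.fft.rfft(b), n=len(a))

# Clean-up memory E: 1,000 Gaussian vectors with elements ~ N(0, 1/n)
E = rng.normal(0.0, 1.0 / np.sqrt(n), size=(1000, n))

# Sample k distinct pairs (a_i, b_i) from E and superpose their bindings
idx = rng.choice(1000, size=2 * k, replace=False)
a, b = E[idx[:k]], E[idx[k:]]
memory = sum(cconv(a[i], b[i]) for i in range(k))

# Decode the associate of a[0], then clean up by cosine against E
noisy = ccorr(a[0], memory)
sims = E @ noisy / (np.linalg.norm(E, axis=1) * np.linalg.norm(noisy))
retrieved = int(np.argmax(sims))  # index into E of the retrieved associate
```

Correlation is implemented here as the conjugated spectrum product, which is equivalent to convolving the memory with the involution of the cue; a retrieval is counted as correct when `retrieved` matches the index of the probe's stored associate.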

Sahlgren et al. [17] bind signal vectors to positions by means of successive self-composition of a permutation function Π and construct memory vectors by superposing the results. In contrast to circular convolution, which requires normally distributed random vectors, random permutations support a variety of possible inputs. Sahlgren et al. employ *random ternary vectors*, so-called because elements take on one of three possible values (+1, 0, or −1). These are sparse vectors or “spatter codes” [111, 112] whose elements are all zero, with the exception of a few randomly placed positive and negative values (e.g., two +1s and two −1s). In this experiment, we tested the storage capacity of an RP-based associative memory first with normally distributed random vectors (*Gaussian vectors*) to allow a proper comparison to convolution and second with random ternary vectors (*sparse vectors*) with a varying number of positive and negative values in the input.
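A random ternary vector of the kind described above can be generated as follows (a minimal sketch; the choice of four nonzero elements, two +1s and two −1s, mirrors the example in the text):

```python
import numpy as np

rng = np.random.default_rng(1)

def ternary_vector(n, nonzeros=4):
    """Sparse ternary vector: half the nonzeros are +1, half are -1."""
    v = np.zeros(n)
    pos = rng.choice(n, size=nonzeros, replace=False)  # distinct positions
    v[pos[: nonzeros // 2]] = 1.0
    v[pos[nonzeros // 2 :]] = -1.0
    return v

v = ternary_vector(1024, nonzeros=4)
```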

As for the choice of the permutation function itself, any function that maps each element of the input onto a different element of the output will do; vector rotation (i.e., mapping element $i$ of the input to element $i+1$ of the output, with the exception of the final element of the input, which is mapped to the first element of the output) may be used for the sake of efficiency [17]. Using the notation of function exponentiation employed in our previous work [17, 113], $\Pi^j$ refers to $\Pi$ composed with itself $j$ times; $\Pi^2(x) = \Pi(\Pi(x))$, $\Pi^3(x) = \Pi(\Pi(\Pi(x)))$, and so forth. The notion of a memory vector of paired associations can then be recast in RP terms as follows:
$$m = \Pi^1 a_1 + \Pi^2 b_1 + \Pi^3 a_2 + \Pi^4 b_2 + \cdots + \Pi^{2k-1} a_k + \Pi^{2k} b_k,$$
where the task again is to retrieve some $a_i$’s associate $b_i$ when presented only with $m$ and $a_i$. A retrieval algorithm for accomplishing this can be described as follows: given a probe vector $a_i$, the algorithm applies the inverse of the initial permutation to memory vector $m$, yielding $\Pi^{-1}(m)$. Next, the cosine between $\Pi^{-1}(m)$ and the probe vector $a_i$ is calculated, yielding a value that represents the similarity between $\Pi^{-1}(m)$ and $a_i$. The previous steps are then iterated: the algorithm calculates the cosine between $\Pi^{-2}(m)$ and $a_i$, between $\Pi^{-3}(m)$ and $a_i$, and so forth, until this similarity value exceeds some high threshold; this indicates that the algorithm has “found” $a_i$ in the memory. At that point, $m$ is unpermuted one more time, yielding $\Pi^{-(j+1)}(m)$ (where $j$ is the exponent at which the threshold was exceeded), a noisy approximation of $a_i$’s associate $b_i$. This approximation can then be compared with clean-up memory to retrieve the original associate $b_i$.
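Taking rotation as the concrete choice of $\Pi$, as in [17], function exponentiation reduces to rotating by $j$ positions in a single step; a small sketch (NumPy assumed, with `np.roll` standing in for rotation):

```python
import numpy as np

def perm(v, times=1):
    """Apply the rotation permutation `times` times (negative = inverse)."""
    return np.roll(v, times)

x = np.arange(8)
composed = perm(perm(perm(x)))  # three successive applications of the rotation
direct = perm(x, 3)             # a single rotation by 3 positions
restored = perm(direct, -3)     # the inverse permutation undoes the rotation
```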

Alternatively, rather than selecting a threshold, $m$ may be unpermuted some finite number of times, having its cosine similarity to $a_i$ stored after each inverse permutation. In Plate’s [95, p. 252] demonstration of the capacity of convolution-based associative memories, the maximal number of pairs stored in a single memory vector was 14; we likewise restrict the maximal number of pairs in a single memory vector to 14 (i.e., 28 vectors total). Let $\Pi^{-j}$ be the inverse permutation for which the cosine was highest. We can permute one more time to retrieve $\Pi^{-(j+1)}(m)$, that is, our noisy approximation $b'$ of $b_i$. This method is appropriate if we always want our algorithm to return an answer (rather than, say, timing out before the threshold is exceeded) and is the method we used for this experiment.
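The finite-iteration decoding procedure can be sketched end to end as follows (a minimal NumPy illustration under our own parameter choices, with rotation standing in for $\Pi$): the memory superposes permuted pairs, the probe is compared against successive inverse permutations of the memory, and the winning exponent plus one further inverse permutation yields the noisy associate passed to clean-up.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 1024, 5  # dimensionality and number of stored pairs

# Clean-up memory E: 1,000 Gaussian vectors with elements ~ N(0, 1/n)
E = rng.normal(0.0, 1.0 / np.sqrt(n), size=(1000, n))
idx = rng.choice(1000, size=2 * k, replace=False)
a, b = E[idx[:k]], E[idx[k:]]

# m = Pi^1 a_1 + Pi^2 b_1 + Pi^3 a_2 + Pi^4 b_2 + ...  (Pi = rotate by one)
memory = np.zeros(n)
for i in range(k):
    memory += np.roll(a[i], 2 * i + 1) + np.roll(b[i], 2 * i + 2)

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Unpermute the memory up to 2k times, recording similarity to the probe
probe = a[0]
sims = [cos(np.roll(memory, -(j + 1)), probe) for j in range(2 * k)]
j_best = int(np.argmax(sims))             # probe found at exponent j_best + 1
noisy_b = np.roll(memory, -(j_best + 2))  # one more inverse permutation

# Clean-up: retrieve the most similar vector in E
cleanup = E @ noisy_b / (np.linalg.norm(E, axis=1) * np.linalg.norm(noisy_b))
retrieved = int(np.argmax(cleanup))
```

Because the probe $a_1$ is stored at exponent 1, the similarity sweep should peak at the first inverse permutation, and one further inverse permutation exposes its associate $b_1$ for clean-up.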

The final clean-up memory step is identical to that used by Plate [95]: we calculate the cosine between $b'$ and each vector in the clean-up memory $E$ and retrieve the vector in $E$ for which this cosine is highest. As when evaluating convolution, we keep $|E|$ (the number of vectors in $E$) fixed at 1,000 while varying the number of stored vectors and the dimensionality $n$.

##### 3.2. Results and Discussion

Five hundred pairs of normally distributed random vectors were sampled with replacement from a pool of 1,000, and the proportion of correct retrievals was computed. All 1,000 vectors in the pool were potential candidates for retrieval; an accuracy level of 0.1% would represent chance performance. Figure 1 reports retrieval accuracies for the convolution-based algorithm, while Figure 2 reports retrieval accuracies for the RP formulation of the task. A 2 (algorithm: convolution versus random permutations) × 4 (dimensionality: 256, 512, 1024, 2048) ANOVA with the number of successful retrievals as the dependent variable revealed a significant main effect of algorithm, with more successful retrievals when using random permutations than when using circular convolution. There was also a significant main effect of dimensionality; the interaction between algorithm and dimensionality was not significant. Post hoc Tukey’s HSD tests showed a significantly lower number of successful retrievals with vectors of dimensionality 256 than with any other vector dimensionality at an alpha of 0.05. All other comparisons were not significant.