Abstract

This paper reviews evidence for the idea that much of human learning, perception, and cognition may be understood as information compression and often more specifically as “information compression via the matching and unification of patterns” (ICMUP). Evidence includes the following: information compression can mean selective advantage for any creature; the storage and utilisation of the relatively enormous quantities of sensory information would be made easier if the redundancy of incoming information was to be reduced; content words in natural languages, with their meanings, may be seen as ICMUP; other techniques for compression of information—such as class-inclusion hierarchies, schema-plus-correction, run-length coding, and part-whole hierarchies—may be seen in psychological phenomena; ICMUP may be seen in how we merge multiple views to make one, in recognition, in binocular vision, in how we can abstract object concepts via motion, in adaptation of sensory units in the eye of Limulus, the horseshoe crab, and in other examples of adaptation; the discovery of the segmental structure of language (words and phrases), grammatical inference, and the correction of over- and undergeneralisations in learning may be understood in terms of ICMUP; information compression may be seen in the perceptual constancies; there is indirect evidence for ICMUP in human cognition via kinds of redundancy, such as the decimal expansion of π, which are difficult for people to detect; much of the structure and workings of mathematics—an aid to human thinking—may be understood in terms of ICMUP; and there is additional evidence via the SP Theory of Intelligence and its realisation in the SP Computer Model. Three objections to the main thesis of this paper are described, with suggested answers. These ideas may be seen to be part of a “Big Picture” with six components, outlined in the paper.

1. Introduction

“Fascinating idea! All that mental work I’ve done over the years, and what have I got to show for it? A goddamned zipfile! Well, why not, after all?” (John Winston Bush, 1996).

This paper describes empirical evidence for the idea that much of human learning, perception, and cognition may be understood as information compression. (This paper updates, revises, and extends the discussion in [1, Chapter 2] but with the main focus on human learning, perception, and cognition.) To be more specific, evidence will be presented that much of human learning, perception, and cognition may be understood as information compression via the discovery of patterns that match each other, with the merging or “unification” of two or more instances of any pattern to make one. References will also be made to the SP Theory of Intelligence and its realisation in the SP Computer Model in which information compression has a central role (Section 2.2.1).

Although this paper is primarily about information compression in human brains, it seems that similar principles apply throughout the nervous system and throughout much of the animal kingdom. Accordingly, this paper has things to say here and there about the workings of neural tissue outside the human brain and in nonhuman species.

1.1. Abbreviations

For the sake of brevity in this paper: “information compression” may be shortened to “IC”; the expression “information compression via the matching and unification of patterns” may be referred to as “ICMUP”; and “human learning, perception, and cognition” may be “HLPC”.

The main thesis of this paper—that much of HLPC may be understood as IC—may be referred to as “ICHLPC”.

For reasons given in Section 2.2, the name “SP” stands for Simplicity and Power.

The SP Theory of Intelligence, with its realisation in the SP Computer Model, may be referred to, together, as the SP System.

1.2. Presentation

In this paper, the next section (Section 2) describes some of the background to this research and some relevant general principles; the next-but-one section (Section 3) describes related research; Sections 4 to 20 describe relatively direct empirical evidence in support of ICHLPC; and Section 21 summarises indirect support for ICHLPC via the SP Theory of Intelligence.

Appendix A, referenced from Section 2.3 and elsewhere, gives some mathematical details related to ICMUP and the SP System.

Appendix B, referenced from Section 3.1.1 and elsewhere, describes Horace Barlow’s change of view about the significance of IC in mammalian learning, perception, and cognition, with comments.

Appendix C, referenced from Section 22 and elsewhere, describes apparent contradictions of ideas in this paper and how they may be resolved.

2. Background and General Principles

This section provides some background to this paper and summarises some general principles that have a bearing on ICHLPC and the programme of research of which this paper is a part.

2.1. Seven Variants of “Information Compression via the Matching and Unification of Patterns” (ICMUP)

This subsection fills out the concept of ICMUP, starting with the essentials, described in Section 2.1.1, next. Six variants of the basic idea are described in Sections 2.1.2 to 2.1.7.

While care has been taken in this programme of research to avoid unnecessary duplication of information across different publications, the importance of the following seven variants of ICMUP has made it necessary, for the sake of clarity, to describe them quite fully both in this paper and also in [2].

2.1.1. Basic ICMUP

The main idea in ICMUP is illustrated in the top part of Figure 1. Here, a stream of raw data may be seen to contain two instances of the pattern “INFORMATION”. Subjectively, we “see” this immediately. But, in a computer or a brain, the discovery of that kind of replication of patterns must necessarily be done by some kind of searching for matches between patterns.

In itself, the detection of repeated patterns is not very useful. But by merging or “unifying” the two instances of “INFORMATION” in Figure 1, we may create the single instance shown below the raw data, thus achieving some compression of information in the raw data (Appendix A.1).

Other relevant points include the following:

(i) Repetition of patterns and “redundancy” in information. From the perspective of ICMUP, the concept of redundancy in information may be seen as the occurrence of two or more arrays of symbols that match each other. As noted in Section 2.2.2 below, redundancy may take the form of good partial matches between patterns as well as exact matches between patterns.

(ii) A threshold on frequency of occurrence. With regard to the previous point, an important qualification is that, for a given repeating array of symbols, A, to represent redundancy within a given body of information, I, A’s frequency of occurrence within I must be higher than what would be expected by chance for an array of the same size [1, Sections 2.2.8.3 and 2.2.8.4].

(iii) Frequencies and sizes of patterns. In connection with the preceding point, the minimum frequency needed to exceed the threshold is smaller for large patterns than it is for small patterns. Contrary to the common assumption that large frequencies are needed to attain statistical significance, frequencies as small as 2 can be statistically significant with patterns of quite moderate size or larger; and large patterns of a given frequency yield more compression than small ones of the same frequency (Appendix A.1; [1, Section 2.2.8.4]).

(iv) The concept of a “chunk” of information. A discrete pattern like “INFORMATION” is often referred to as a chunk of information, a term that gained prominence in psychology largely because of its use by George Miller in his influential paper The magical number seven, plus or minus two [3]. Miller did not use terms like “unification” or “IC”, and he expresses some uncertainty about the significance of the concept of a chunk: “The contrast of the terms bit and chunk also serves to highlight the fact that we are not very definite about what constitutes a chunk of information” (p. 93, emphasis in the original). However, he describes how chunking of information may achieve something like compression of information: “… we must recognize the importance of grouping or organizing the input sequence into units or chunks. Since the memory span is a fixed number of chunks, we can increase the number of bits of information that it contains simply by building larger and larger chunks, each chunk containing more information than before” (p. 93, emphasis in the original) and “ … the dits and dahs are organized by learning into patterns and … as these larger chunks emerge the amount of message that the operator can remember increases correspondingly” (p. 93, emphasis in the original).

(v) Basic ICMUP means lossy compression of information. A point to notice about basic ICMUP of a body of information, I, is that, without the kind of code described in Section 2.1.2 below, it must always be “lossy”, meaning that nonredundant information in I will be lost. This is because, in the unification of two or more matching patterns in I, information is lost about the location of the following: all but one of those patterns if the unified chunk is stored in one of the original locations within I, or alternatively all of those patterns if the unified chunk is stored outside I.
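
To make the basic idea concrete, here is a minimal Python sketch, written for this discussion and not part of the SP Computer Model; the example data and the function are invented for illustration. It finds exact repeats of a given pattern in a body of raw data and unifies them into a single stored copy, reporting the saving in symbols, and it deliberately ignores the problem of lost locations noted in point (v) above, which is addressed by chunking-with-codes in Section 2.1.2.

```python
# A minimal, hypothetical sketch of basic ICMUP: find matching instances of a
# pattern in raw data and "unify" them into a single stored copy.
# It is not part of the SP Computer Model; it only illustrates the idea.

def unify_pattern(raw: str, pattern: str):
    """Remove every occurrence of `pattern` from `raw`, keeping one unified copy.

    Returns (residue, unified_copy, count, saving_in_symbols).
    """
    count = raw.count(pattern)
    if count < 2:                      # a single occurrence is not redundancy
        return raw, None, count, 0
    residue = raw.replace(pattern, "") # lossy: positions of the instances are lost
    saving = count * len(pattern) - len(pattern)  # keep just one unified copy
    return residue, pattern, count, saving

raw_data = "xxINFORMATIONyyyyINFORMATIONzz"
residue, chunk, n, saving = unify_pattern(raw_data, "INFORMATION")
print(residue)   # -> "xxyyyyzz"
print(chunk, n)  # -> "INFORMATION" 2
print(saving)    # -> 11 symbols saved (22 in the raw data, 11 in the unified copy)
```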

2.1.2. Chunking-with-Codes

The key idea with the chunking-with-codes variant of ICMUP is that each unified chunk of information (Section 2.1.1) receives a relatively short name, identifier, or code, and that code is used as a shorthand for the chunk of information wherever it occurs.

As already noted, this idea is illustrated in Figure 1, where, in the middle of the figure, the relatively short code or identifier “w62” is attached to a copy of the “chunk” “INFORMATION”, and we may suppose that the pairing of code and unified chunk would be stored in some kind of “dictionary”, separate from the main body of data. Then, under the heading “Compressed data” at the bottom of the figure, each of the two original instances of “INFORMATION” is replaced by the short code “w62” yielding an overall compression of the original data.
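
The scheme can be sketched in a few lines of Python, purely for illustration: the code “w62” follows Figure 1, while the example data and function names are invented. The sketch shows how a small “dictionary” of code-chunk pairs makes the compression lossless, unlike basic ICMUP.

```python
# Hypothetical sketch of chunking-with-codes, in the spirit of Figure 1.

def chunk_with_code(raw: str, chunk: str, code: str):
    """Store `chunk` in a dictionary under `code` and replace its occurrences in `raw`."""
    dictionary = {code: chunk}
    compressed = raw.replace(chunk, code)   # assumes the code does not occur in the data
    return dictionary, compressed

def decompress(compressed: str, dictionary: dict) -> str:
    """Recover the original data by replacing each code with its chunk."""
    for code, chunk in dictionary.items():
        compressed = compressed.replace(code, chunk)
    return compressed

raw_data = "xxINFORMATIONyyyyINFORMATIONzz"
dictionary, compressed = chunk_with_code(raw_data, "INFORMATION", "w62")
print(compressed)                                       # -> "xxw62yyyyw62zz"
assert decompress(compressed, dictionary) == raw_data   # lossless, unlike basic ICMUP
```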

Examples of chunking-with-codes from this paper are the use of “ICMUP” as a shorthand for “information compression via the matching and unification of patterns” and “HLPC” as a shorthand for “human learning, perception, and cognition”.

The chunking-with-codes variant of ICMUP overcomes the weakness of basic ICMUP noted at the end of Section 2.1.1: that it loses nonredundant information about the locations of chunks in the original data, I. The problem may be remedied with chunking-with-codes because copies of the code for a given chunk may be used to mark the locations of each instance of the chunk within I.

Another point of interest is that, with the chunking-with-codes technique, compression of information may be optimised by assigning shorter codes to more frequent chunks and longer codes to rarer chunks, in accordance with some such scheme as Shannon-Fano-Elias coding [4, Section 5.9].
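
As a rough illustration of that principle, the following Python fragment assigns code lengths in the spirit of Shannon-Fano-Elias coding, in which a chunk with probability p receives a code of about ⌈log2(1/p)⌉ + 1 bits; the chunks and their frequencies are invented for the example.

```python
# Hypothetical illustration: shorter codes for more frequent chunks.
# Shannon-Fano-Elias coding gives a chunk of probability p a code of about
# ceil(log2(1/p)) + 1 bits [4, Section 5.9].

import math

chunk_counts = {"INFORMATION": 50, "COMPRESSION": 20, "PATTERN": 5}  # invented counts
total = sum(chunk_counts.values())

for chunk, count in sorted(chunk_counts.items(), key=lambda kv: -kv[1]):
    p = count / total
    code_length = math.ceil(math.log2(1 / p)) + 1
    print(f"{chunk}: p = {p:.2f}, code length = {code_length} bits")
# The most frequent chunk receives the shortest code, the rarest the longest.
```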

Similar principles may be applied in the other variants of ICMUP described in Sections 2.1.3 to 2.1.7 below.

2.1.3. Schema-Plus-Correction

The schema-plus-correction variant of ICMUP is like chunking-with-codes but the unified chunk of information may have variations or “corrections” on different occasions.

An example from everyday life is a menu in a restaurant or café. This provides an overall framework, something like “starter, main course, pudding”, which may be seen as a chunk of information. Each of the three elements of the menu may be seen as a place where each customer may make a choice or “correction” to the menu. For example, one customer may choose “starter(soup), main course(fish), pudding(apple pie)” while another customer may choose “starter(salad), main course(vegetable hotpot), pudding(ice cream)”, and so on.

The schema-plus-correction variant of ICMUP may achieve compression of information via two mechanisms:

(i) The schema may itself have a short code. In our menu example, each menu may have a short code such as “bm” for the breakfast menu, “lm” for the lunch-time menu, and so on.

(ii) Each “correction” may have a short code. Again with our menu example, options such as “soup”, “fish”, and so on may each have a short code such as “s” for soup, “f” for fish, and so on.

With those two devices, a customer’s order such as “[lunch-time-menu: starter(soup), main course(fish), pudding(apple pie)]” may be reduced to something like “[lm: s, f, ap]”.
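
That encoding can be sketched in Python as follows, using the hypothetical schema code, slot names, and option codes from the menu example above.

```python
# Hypothetical sketch of schema-plus-correction with the menu example.

schemas = {  # each schema has a short code, a name, and an ordered list of slots
    "lm": ("lunch-time-menu", ["starter", "main course", "pudding"]),
}
options = {  # each "correction" has a short code
    "s": "soup", "f": "fish", "ap": "apple pie",
    "sa": "salad", "vh": "vegetable hotpot", "ic": "ice cream",
}

def expand(order: list) -> str:
    """Expand a compressed order such as ["lm", "s", "f", "ap"] into full text."""
    schema_code, *correction_codes = order
    name, slots = schemas[schema_code]
    filled = [f"{slot}({options[c]})" for slot, c in zip(slots, correction_codes)]
    return f"[{name}: " + ", ".join(filled) + "]"

print(expand(["lm", "s", "f", "ap"]))
# -> "[lunch-time-menu: starter(soup), main course(fish), pudding(apple pie)]"
```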

2.1.4. Run-Length Coding

The run-length coding variant of ICMUP may be used with any sequence of two or more copies of a pattern where each copy except the first one follows immediately after the preceding copy. In that case, it is only necessary to record one copy of the pattern with the number of copies or with symbols or “tags” to mark the start and end of the sequence.

For example, a repeated pattern like

INFORMATIONINFORMATIONINFORMATIONINFORMATIONINFORMATION

may be reduced to something like “INFORMATION(×5)” (where “×5” records the number of instances of “INFORMATION”). Alternatively, the sequence may be reduced to something like “p INFORMATION* #p”, where “*” means that the pattern “INFORMATION” is repeated an unspecified number of times, and “p … #p” specifies where the sequence begins and where it stops.
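
Here is a minimal Python sketch of run-length coding, assuming that the repeating pattern is already known; it records one copy of the pattern together with a count, as in “INFORMATION(×5)”.

```python
# Hypothetical sketch of run-length coding for a known repeating pattern.

def run_length_encode(data: str, pattern: str):
    """If `data` is an unbroken run of `pattern`, return (pattern, count)."""
    count, remainder = divmod(len(data), len(pattern))
    if remainder == 0 and data == pattern * count:
        return pattern, count
    raise ValueError("data is not an unbroken run of the pattern")

def run_length_decode(pattern: str, count: int) -> str:
    return pattern * count

data = "INFORMATION" * 5
pattern, count = run_length_encode(data, "INFORMATION")
print(f"{pattern}(x{count})")                     # -> "INFORMATION(x5)"
assert run_length_decode(pattern, count) == data  # lossless
```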

2.1.5. Class-Inclusion Hierarchy with Inheritance of Attributes

With the class-inclusion hierarchy variant of ICMUP, there is a hierarchy of classes and subclasses, with “attributes” at each level. At every level except the top level, each subclass “inherits” the attributes of all the higher levels.

For example, in simplified form, the class “motorised vehicle” contains subclasses like “road vehicle” and “rail vehicle”; the class “road vehicle” contains subclasses like “bus”, “lorry”, and “car”, and so on. An attribute like “contains engine” would be assigned to the top level (“motorised vehicle”) and would be inherited by all lower-level classes, thus avoiding the need to record that information repeatedly at every level in the hierarchy; the same applies to attributes recorded at lower levels. Thus a class-inclusion hierarchy with inheritance of attributes combines IC with inference, in accordance with the close relation between those two things, noted in Section 2.5.
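
The vehicle example can be made concrete with a small Python sketch in which each class stores only the attributes introduced at its own level and inherits the rest from the classes above it; the classes and attributes are simplified and partly invented.

```python
# Hypothetical sketch of a class-inclusion hierarchy with inheritance of attributes.
# Each class records only its own attributes; the rest are inherited from above.

hierarchy = {  # class -> (parent, attributes recorded at this level only)
    "motorised vehicle": (None, {"contains engine"}),
    "road vehicle":      ("motorised vehicle", {"has road wheels"}),
    "rail vehicle":      ("motorised vehicle", {"runs on rails"}),
    "car":               ("road vehicle", {"carries a few passengers"}),
    "bus":               ("road vehicle", {"carries many passengers"}),
}

def attributes(cls: str) -> set:
    """Collect the attributes of `cls`, including all inherited attributes."""
    collected = set()
    while cls is not None:
        parent, attrs = hierarchy[cls]
        collected |= attrs
        cls = parent
    return collected

print(sorted(attributes("car")))
# -> ['carries a few passengers', 'contains engine', 'has road wheels']
# "contains engine" is stored once but applies to every class in the hierarchy.
```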

Of course there are many subtleties in the way people use class-inclusion hierarchies, such as cross-classification, “polythetic” or “family resemblance” concepts (in which no single attribute is necessarily present in every member of the given category and there need be no single attribute that is exclusive to that category [5]), and the ability to recognise that something belongs in a class despite errors of omission, commission, or substitution. The way in which the SP System can accommodate those kinds of subtleties is discussed in [1, Sections 2.3.2, 6.4.3, 12.2, and 13.4.6.2].

2.1.6. Part-Whole Hierarchy with Inheritance of Contexts

The part-whole hierarchy variant of ICMUP is like a class-inclusion hierarchy with inheritance of attributes except that the hierarchical structure represents the parts and subparts of some class or entity, and any given part inherits information about the context which it shares with all its siblings on the same level. A part-whole hierarchy promotes economy by sidestepping the need for each part of an entity at any given level to store full information about the higher-level structures of which it is a part, since that information is the same for all the sibling parts at that level.

A simple example is the way that a “person” has parts like “head”, “body”, “arms”, and “legs”, while an arm may be divided into “upper arm”, “forearm”, “hand”, and so on. In a structure like this, inheritance means that if one hears that a given person has an injury to his or her hand, one can infer immediately that that person’s “arm” has been injured and indeed his or her whole “person”.
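
In the same spirit, the following Python sketch, with a simplified and partly invented part-whole structure, shows how the context of any part, meaning the chain of larger structures that contain it, may be recovered by following the hierarchy upwards instead of being stored with every part.

```python
# Hypothetical sketch of a part-whole hierarchy with inheritance of contexts.
# Each part records only the whole of which it is directly a part; the context
# of a part is recovered by walking up the hierarchy.

parts = {  # part -> the whole of which it is directly a part
    "person": None,
    "head": "person", "body": "person", "arm": "person", "leg": "person",
    "upper arm": "arm", "forearm": "arm", "hand": "arm",
}

def context(part: str) -> list:
    """Return the chain of wholes containing `part`, from smallest to largest."""
    chain = []
    whole = parts[part]
    while whole is not None:
        chain.append(whole)
        whole = parts[whole]
    return chain

print(context("hand"))   # -> ['arm', 'person']
# An injury to the hand is, by inheritance of context, an injury to the arm and
# to the person as a whole, without storing that information with "hand".
```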

2.1.7. SP-Multiple-Alignment as a Generalised Version of ICMUP

The seventh of the versions of ICMUP considered in this paper is the concept of SP-multiple-alignment, described in Section 2.2.2 below.

SP-multiple-alignment may be seen to be a generalised version of ICMUP which encompasses the other six versions described in Sections 2.1.1 to 2.1.6.

This versatility in modelling other versions of ICMUP is not altogether surprising since SP-multiple-alignment is largely responsible for the SP System’s versatility in diverse aspects of intelligence (including diverse kinds of reasoning) and in the representation of diverse kinds of knowledge, and for the system’s potential for the seamless integration of diverse aspects of intelligence and diverse kinds of knowledge, in any combination (Section 2.2.5).

2.2. The SP Theory of Intelligence

Readers will see that the paper contains references to the SP Theory of Intelligence, its realisation in the SP Computer Model, and associated ideas, especially the concept of SP-multiple-alignment. But it must be emphasised that the SP Theory is not the main focus of the paper. Instead, it is relevant for subsidiary reasons:

(i) Empirical evidence for ICHLPC strengthens empirical support for the SP Theory. Since IC and, more specifically, ICMUP are central in the SP Theory, empirical evidence for ICHLPC (presented in Sections 4 to 20) strengthens empirical support for the SP Theory, viewed as a theory of HLPC.

(ii) Direct empirical evidence for the SP Theory provides indirect evidence for ICHLPC. Direct empirical evidence for the SP Theory—summarised in Section 2.2.5—provides indirect evidence for ICHLPC which is additional to that in Sections 4 to 20 (see Section 21).

(iii) Clarifying theoretical issues related to HLPC. The SP Computer Model, which may be seen as a working model of several aspects of HLPC, can help to clarify theoretical issues related to HLPC. It has, for example, proved useful in understanding issues discussed in Appendices B and C.

For those reasons, an outline of the theory is appropriate here.

2.2.1. Outline of the SP Theory of Intelligence: Introduction

The SP Theory of Intelligence and its realisation in the SP Computer Model—the SP System—is a unique attempt to simplify and integrate observations and concepts across artificial intelligence, mainstream computing, mathematics, and human learning, perception, and cognition, with IC as a unifying theme. This broad scope for the SP programme of research has been adopted for reasons summarised in Section 2.6 below.

As mentioned in Section 1.1, the name “SP” stands for Simplicity and Power. This is because compression of any given body of information, I, may be seen as a process of reducing informational “redundancy” in I and thus increasing its “simplicity”, while retaining as much as possible of its nonredundant expressive “power”.

The SP Theory, the SP Computer Model, and some applications are described quite fully in [6] and much more fully in [1]. Details of other publications about the SP System, most with download links, may be found on http://www.cognitionresearch.org/sp.htm. A download link for the source code of SP71, the latest version of the SP Computer Model, may be found under the heading “SOURCE CODE” near the bottom of that page.

The SP Theory is conceived as a brain-like system as shown schematically in Figure 2. The system receives New information via its senses and stores some or all of it in compressed form as Old information.

All kinds of knowledge or information in the SP System are represented with arrays of atomic SP-symbols in one or two dimensions called SP-patterns. At present, the SP Computer Model works only with one-dimensional SP-patterns but it is envisaged that, at some stage, it will be generalised to work with two-dimensional SP-patterns.

2.2.2. SP-Multiple-Alignment

A central part of the SP System is the powerful concept of SP-multiple-alignment, outlined here. The concept is described more fully in [6, Section 4] and [1, Sections 3.4 and 3.5].

The concept of SP-multiple-alignment in the SP System is derived from the concept of “multiple sequence alignment” in bioinformatics (see, e.g., [7]). That latter concept means an arrangement of two or more DNA sequences or sequences of amino-acid residues so that, by judicious “stretching” of sequences in a computer, symbols that match from row to row are aligned—as illustrated in Figure 3. A “good” multiple sequence alignment is one with a relatively high value for some metric related to the number of symbols that have been brought into line.

For a given set of sequences, finding or creating “good” multiple sequence alignments amongst the many possible “bad” ones is normally a complex process—normally too complex to be solved by exhaustive search. For that reason, bioinformatics programs for finding good multiple sequence alignments use heuristic methods, building multiple sequence alignments in stages and discarding low-scoring multiple sequence alignments at the end of each stage, with backtracking or something equivalent to improve the robustness of the search.

With such methods, it is not normally possible to guarantee that the best possible multiple sequence alignment has been found, but it is normally possible to find multiple sequence alignments that are good enough for practical purposes.
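
The flavour of this staged, heuristic style of search may be conveyed with a toy Python sketch that aligns just two sequences by extending partial alignments and keeping only the best few at each stage; it is a much-simplified stand-in, not the algorithms used in bioinformatics or in the SP Computer Model.

```python
# Toy illustration of staged heuristic search for good alignments: at each
# stage, partial alignments are extended and only the best few are kept.
# A much-simplified stand-in, not the SP or bioinformatics algorithms.

BEAM = 3  # how many partial alignments survive each stage

def align(a: str, b: str) -> int:
    # a state is (i, j, matched): positions reached in a and b, symbols matched
    states = [(0, 0, 0)]
    for _ in range(len(a) + len(b)):
        expanded = set()
        for i, j, m in states:
            if i == len(a) and j == len(b):
                expanded.add((i, j, m))              # complete alignment, keep as-is
                continue
            if i < len(a) and j < len(b) and a[i] == b[j]:
                expanded.add((i + 1, j + 1, m + 1))  # match and unify one symbol
            if i < len(a):
                expanded.add((i + 1, j, m))          # leave a[i] unmatched
            if j < len(b):
                expanded.add((i, j + 1, m))          # leave b[j] unmatched
        # discard all but the BEAM highest-scoring partial alignments
        states = sorted(expanded, key=lambda s: s[2], reverse=True)[:BEAM]
    return max(s[2] for s in states)

print(align("INFORMATION", "INFRMAION"))  # number of matched symbols found
```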

The two main differences between the concept of SP-multiple-alignment in the SP System and the concept of multiple sequence alignment in bioinformatics are the following:

(i) New and Old information. With an SP-multiple-alignment, one of the SP-patterns (sometimes more than one) is New information from the system’s environment (see Figure 2), and the remaining SP-patterns are Old information, meaning information that has been previously stored (also shown in Figure 2).

(ii) Encoding New information economically in terms of Old information. In the creation of SP-multiple-alignments, the aim is to build ones that, in each case, allow the New SP-pattern (or SP-patterns) to be encoded economically in terms of the Old SP-patterns in the given SP-multiple-alignment. In each case, there is an implicit merging or unification of SP-patterns or parts of SP-patterns that match each other, as described in [6, Section 4.1] and [1, Section 3.5].

In the SP-multiple-alignment shown in Figure 4, one New SP-pattern is shown in row 0, and Old SP-patterns, drawn from a repository of Old SP-patterns, are shown in rows 1 to 9. By convention, the New SP-pattern(s) is always shown in row 0 and the Old SP-patterns are shown in the other rows, one SP-pattern per row.

In this example, the New SP-pattern is a sentence and the Old SP-patterns in rows 1 to 9 represent grammatical structures including words. The overall effect of the SP-multiple-alignment is to “parse” or analyse the sentence into its constituent parts and subparts, with each part marked with a category like “NP” (meaning “noun phrase”), “N” (meaning “noun”), “VP” (meaning “verb phrase”), and so on. But, as described in Section 2.2.5, the SP-multiple-alignment construct can do much more than parse sentences.

Each SP-multiple-alignment is evaluated in terms of how well it provides for the New SP-pattern in row 0 to be encoded economically in terms of the Old SP-patterns in the other rows. An SP-multiple-alignment is “good” if the encoding is indeed economical. Details of how this is done are described in Appendix A.4.

With SP-multiple-alignments in the SP System, as with multiple sequence alignments in bioinformatics, the process of finding “good” SP-multiple-alignments is too complex for exhaustive search, so it is normally necessary to use heuristic methods—which means that, as before, the best possible results may be missed but it is normally possible to find SP-multiple-alignments that are reasonably good.

At the heart of SP-multiple-alignment is a process for finding good full and partial matches between SP-patterns, described quite fully in [1, Appendix A]. As in the building of SP-multiple-alignments, heuristic search is an important part of the process of finding good full and partial matches between SP-patterns. Some details with relevant calculations are given in Appendix A.8.

As noted in Section 2.1.7, the concept of SP-multiple-alignment may be seen to be a generalised version of ICMUP, which encompasses all the other six variants of ICMUP described in Section 2.1.

2.2.3. Unsupervised Learning in the SP System

Unsupervised learning in the SP System is described in [6, Section 5] and [1, Chapter 9]. In brief, it means searching for one or more collections of Old SP-patterns called grammars which are relatively good for the economical encoding of a given set of New SP-patterns.

As with the building of SP-multiple-alignments (Section 2.2.2), the process of finding good full and partial matches between SP-patterns [1, Appendix A], and many other AI programs, unsupervised learning in the SP System uses heuristic techniques: doing the search in stages and, at each stage, concentrating the search in the most promising areas and cutting out the rest.

Some of the details of relevant calculations are given in Appendix A.7.
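
To give a feel for that kind of search, the following Python sketch scores candidate “grammars” of chunk patterns by the combined size of the grammar and of the data after each chunk has been replaced by a short code; the data, the candidate grammars, and the crude sizing scheme are invented for the example, and the sketch is not the calculation used in the SP Computer Model.

```python
# Hypothetical MDL-style scoring for candidate "grammars" of chunk patterns:
# the winner minimises (size of grammar) + (size of data encoded with it).
# An illustration only, not the calculations of the SP Computer Model.

def encoded_size(data: str, grammar: list) -> int:
    """Size in symbols of the grammar plus the data, with each chunk replaced by a 2-symbol code."""
    grammar_size = sum(len(chunk) + 2 for chunk in grammar)  # each chunk plus its code
    for n, chunk in enumerate(grammar):
        data = data.replace(chunk, f"#{n}")                  # 2-symbol code, e.g. "#0"
    return grammar_size + len(data)

data = "thecatsatonthemat" * 3
candidates = [
    [],                       # no grammar: the data is stored raw
    ["thecatsatonthemat"],    # one large chunk
    ["thecat", "themat"],     # two smaller chunks
]
for grammar in candidates:
    print(grammar, encoded_size(data, grammar))
# The grammar with the smallest total provides the most economical encoding.
```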

As discussed in Section 2.2.4, below, learning in the SP System is quite different from the popular “Hebbian” learning, often characterised as “Cells that fire together wire together”, and it is quite different from how deep learning systems learn. (Hebb’s original version of his learning rule is “When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.” [9, p. 62])

2.2.4. SP-Neural

Functionality that is similar to that of the SP System may be realised in a “neural” sister to the SP System called SP-Neural, expressed in terms of neurons and their interconnections [8], as illustrated in Figure 5. Although the main elements of SP-Neural have been defined, there are details to be filled in. As with the development of the SP Theory itself, it is likely that many insights may be gained by building computer models of SP-Neural.

An important point here is that SP-Neural is quite different from the kinds of “artificial neural network” that are popular in computer science, including those that provide the basis for “deep learning” [10].

It is relevant to mention that Section V of [11] describes thirteen problems with deep learning in artificial neural networks and how, with the SP System, those problems may be overcome. The SP System also provides a comprehensive solution to a fourteenth problem with deep learning—“catastrophic forgetting”—meaning the way in which new learning in a deep learning system wipes out old memories.

Probably, SP-Neural’s closest relative is Donald Hebb’s [9] concept of a “cell assembly” but, since learning in SP-Neural is likely to be modelled on learning in the SP System (Section 2.2.3), it will be quite different from Hebbian learning and also quite different from learning in deep learning systems. More loosely, SP-Neural, when it is more fully developed, is likely to bear a superficial resemblance to Alan Turing’s concept of an “unorganised” machine [12] because its neural tissues would become progressively more organised as it learns.

2.2.5. Strengths and Potential of the SP System

Largely because of the versatility of the SP-multiple-alignment construct, the SP System has strengths and potential in modelling several aspects of HLPC, as outlined here:

(i) Versatility in aspects of intelligence. The SP System has strengths in several aspects of human-like intelligence including: unsupervised learning; the analysis and production of natural language; pattern recognition that is robust in the face of errors in data; pattern recognition at multiple levels of abstraction; computer vision; best-match and semantic kinds of information retrieval; several kinds of reasoning (next bullet point); planning; and problem solving.

(ii) Versatility in reasoning. Strengths of the SP System in reasoning include: one-step “deductive” reasoning; chains of reasoning; abductive reasoning; reasoning with probabilistic networks and trees; reasoning with “rules”; nonmonotonic reasoning and reasoning with default values; Bayesian reasoning with “explaining away” (This means “If A implies B, C implies B, and B is true, then finding that C is true makes A less credible. In other words, finding a second explanation for an item of data makes the first explanation less credible” [13, p. 7]. See also [6, Section 10.2] and [1, Section 7.8].); causal reasoning; reasoning that is not supported by evidence; the already-mentioned inheritance of attributes in class hierarchies; and inheritance of contexts in part-whole hierarchies. There is also potential in the SP System for spatial reasoning and for what-if reasoning. Probabilities for inferences may be calculated in a straightforward manner (Appendix A.6).

(iii) Versatility in the representation and processing of knowledge. The SP System has strengths in the representation and processing of several different kinds of knowledge including the syntax of natural languages; class-inclusion hierarchies (with or without cross-classification); part-whole hierarchies; discrimination networks and trees; if-then rules; entity-relationship structures; relational tuples; and concepts in mathematics, logic, and computing, such as “function”, “variable”, “value”, “set”, and “type definition”. With the addition of two-dimensional SP-patterns to the SP System, there is potential to represent such things as photographs, diagrams, structures in three dimensions, and procedures that work in parallel.

(iv) Seamless integration of diverse aspects of intelligence and diverse kinds of knowledge, in any combination. Because the SP System’s versatility (in diverse aspects of intelligence and in the representation of diverse kinds of knowledge) flows from one relatively simple framework—SP-multiple-alignment—the system has clear potential for the seamless integration of diverse aspects of intelligence and diverse kinds of knowledge, in any combination. That kind of seamless integration appears to be essential in modelling the fluidity, versatility, and adaptability of the human mind.

Figure 6 shows schematically how the SP System, with SP-multiple-alignment centre stage, exhibits versatility and integration.

There are more details in [6] and much more detail in [1]. Distinctive features and advantages of the SP System are described quite fully in [11].

How absolute and relative probabilities for SP-multiple-alignments may be calculated (for use in reasoning and other aspects of AI) is detailed in Appendix A.6.

2.2.6. Potential Benefits and Applications of the SP System

Apart from its strengths and potential in modelling aspects of the human mind, it appears that, in more humdrum terms, the SP System has several potential benefits and applications. These include: helping to solve nine problems with big data; helping to develop intelligence in autonomous robots; the development of an intelligent database system; medical diagnosis; computer vision and natural vision; suggesting avenues for investigation in neuroscience; commonsense reasoning; and more. Details of relevant papers, with download links, may be found on http://www.cognitionresearch.org/sp.htm.

2.3. Avoiding Too Much Dependence on Mathematics

Many approaches to IC have a mathematical flavour (see, e.g., [14]). Much the same is true of concepts of inference and probability which, as outlined in Section 2.5, are closely related to IC.

In the SP programme of research, the orientation is different. The SP Theory attempts to get below or behind the mathematics of other approaches to IC and to concepts of inference and probability: it attempts to focus on ICMUP, the relatively simple, “primitive” idea that information may be compressed by finding two or more patterns that match each other, and merging or “unifying” them so that multiple instances of the pattern are reduced to one.

That said, there is some mathematics associated with ICMUP, and there is some more which is incorporated in the SP Computer Model. They are described in Appendix A and referenced at appropriate points throughout this paper.

There are four main reasons for this focus on ICMUP and the avoidance of too much dependence on mathematics:

(i) Opening the door to nonmathematical mechanisms for compression of information. Since ICMUP is relatively “concrete” and less abstract than the more mathematical approaches to IC, it may open the door to nonmathematical mechanisms for compression of information which may otherwise be overlooked. Here are two putative examples:

(a) SP-multiple-alignment. The concept of SP-multiple-alignment (Section 2.2.2) is founded on ICMUP and is not a recognised part of today’s mathematics—but it has been proven to be effective in the compression of information, it makes possible a relatively straightforward approach to the calculation of probabilities for inferences (Appendix A.6), and it facilitates the modelling of several aspects of HLPC (Section 2.2.5, [1, 6]).

(b) ICMUP in SP-Neural. Because SP-Neural (Section 2.2.4) is derived from the SP Theory, ICMUP is implicit in how, when it is more fully developed, SP-Neural is likely to work.

(ii) Avoiding the use of mathematics in describing the foundations of mathematics. The SP Theory aims to be, amongst other things, a theory of the foundations of mathematics [2], so it would not be appropriate for the theory to be too dependent on mathematics.

(iii) The SP Theory is not founded on the concept of a universal Turing machine. While the SP Theory has benefitted from valuable insights gained from mathematically oriented research on Algorithmic Probability Theory, Algorithmic Information Theory, and related work (Section 3.2), it differs from that work in that it is not founded on the concept of a “universal Turing machine”. Instead, a focus on ICMUP has yielded a new theory of computing and cognition, founded on ICMUP and SP-multiple-alignment, with the generality of the universal Turing machine [1, Chapter 4] but with strengths in the modelling of human-like intelligence which, as Alan Turing acknowledged [12, 15], are missing from the universal Turing machine (Section 2.2.5, [1, 6]).

(iv) ICMUP is not obvious in such techniques as wavelet compression and arithmetic coding. At some abstract level, it may be that all mathematically based techniques for compression of information are founded on ICMUP. And if the thesis of [2] is true, then all such techniques will indeed have an ICMUP foundation. But, nevertheless, techniques for the compression of information such as wavelet compression or arithmetic coding seem far removed from the simple idea of finding patterns that match each other and merging them into a single instance.

The SP System, including the concepts of SP-multiple-alignment and ICMUP, provides a novel approach to concepts of IC and probability (Section 2.5) which appears to have potential as an alternative to more widely recognised methods in these areas.

2.4. Empirical Evidence and Quantification

Although quantification of empirical evidence can in some studies be necessary or at least useful, it appears that, with most of the evidence presented in this paper (except that in Sections 15 and 16), quantification would not be feasible or useful. In any case, attempts at quantification would be a distraction from the main thrust of the paper: that many examples of IC in HLPC are staring us in the face without the need for quantification.

As an example (from Section 6), a name like “New York” is, in the manner of chunking-with-codes, a relatively brief “code” for the enormously complex “chunk” of information which is the structure and workings of the city itself. Similar things can be said about most other names for things, and also “content” words like “house”, “table”, and so forth. In short, natural language may be seen to be a very powerful means of compressing information via the chunking-with-codes technique—and this is clear without the need for quantification.

2.5. IC and Concepts of Inference and Probability

It has been recognised for some time that there is an intimate relation between IC and concepts of inference and probability (Appendix A.2; [16–19]).

In case this seems obscure, it makes sense in terms of ICMUP. A pattern that is repeated is one that invites ICMUP but it is also one that, via inductive reasoning, suggests possible inferences:

(i) Any repeating pattern provides a basis for prediction. Any repeating pattern—such as the association between black clouds and rain—provides a basis for prediction: black clouds suggest that rain may be on the way, and probabilities may be derived from the number of repetitions.

(ii) Inferences via partial matching. With basic ICMUP and its variants, inferences may be made when one new pattern from the environment matches part of a stored, unified pattern. If, for example, we see “INFORMA”, we may guess, on the strength of the stored pattern, “INFORMATION” (Figure 1), that the letters “TION” are likely to follow. This idea is sometimes called “prediction by partial matching” [20]. Of course, the pattern may be completed in a similar way if the incoming information is “INFORMAN”, “INMATION”, “INFRMAION”, and so on.

(iii) The SP System is designed to find partial matches as well as exact matches. Because of the need to make inferences like those just described, and because a prominent feature of human perception is that we are rather good at finding good partial matches between patterns as well as exact matches, the SP System, including the process for building SP-multiple-alignments, is designed to search for redundancy in the form of good partial matches between patterns, as well as redundancy in the form of exact matches. This is done with a version of dynamic programming, described in [1, Appendix A]; a toy illustration follows this list.
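
As a toy illustration of inference via partial matching, the Python sketch below scores each stored pattern by the length of its longest common subsequence with an incoming fragment, computed by dynamic programming, and predicts the best-matching stored pattern; the stored patterns are invented, and the method is a simple stand-in for the matching process in the SP Computer Model.

```python
# Toy illustration of "prediction by partial matching": incoming fragments such
# as "INFORMA" or "INFRMAION" are matched against stored patterns, and the best
# partial match is used to predict the full pattern. A simple LCS-based
# stand-in, not the dynamic programming of the SP Computer Model.

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if x == y
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[-1][-1]

stored_patterns = ["INFORMATION", "INFERENCE", "COMPRESSION"]

def predict(fragment: str) -> str:
    """Return the stored pattern that best matches the incoming fragment."""
    return max(stored_patterns, key=lambda p: lcs_length(fragment, p))

for fragment in ["INFORMA", "INFRMAION", "INMATION"]:
    print(fragment, "->", predict(fragment))   # each suggests "INFORMATION"
```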

There is a lot more detail about how this works with the SP-multiple-alignment concept in Appendix A.6 [6, Section 4.4] and [1, Section 3.7 and Chapter 7]. The SP System has proven to be an effective alternative to Bayesian theory in explaining such phenomena as “explaining away” ([6, Section 10.2], [1, Section 7.8]).

As indicated in Section 4, the close connection between IC and concepts of inference and probability makes sense in terms of biology.

2.6. The Big Picture

The credibility of the ICHLPC thesis of this paper is strengthened by its position in a “Big Picture” of the importance of IC in at least six areas:

(i) Evidence for IC as a unifying principle in human learning, perception, and cognition. This paper describes relatively direct empirical evidence for IC (and more specifically ICMUP) as a unifying principle in HLPC.

(ii) IC in the SP Theory of Intelligence. ICMUP is central in the SP Theory of Intelligence (Section 2.2) which itself has much empirical and analytical support, summarised in Section 2.2.5, with pointers to where further information may be found.

(iii) IC in neuroscience. Because of its central role in the SP System, IC is central in SP-Neural (Section 2.2.4) and may thus have an important role in neuroscience.

(iv) IC and concepts of inference and probability. It is known that there is an intimate relation between IC and concepts of inference and probability (Section 2.5).

(v) IC as a foundation for mathematics. The paper “Mathematics as information compression via the matching and unification of patterns” [2] argues that much of mathematics, perhaps all of it, may be understood in terms of ICMUP.

(vi) IC as a unifying principle in science. It is widely agreed that “Science is, at root, just the search for compression in the world” [21, p. 247], with variations such as “Science may be regarded as the art of data compression” [19, p. 585], and more.

The Big Picture, as just outlined, is important for reasons summarised here:

(i) You can’t play 20 questions with nature and win. In his famous essay, “You can’t play 20 questions with nature and win”, Allen Newell [22] writes about the sterility of developing theories in narrow fields and calls for each researcher to focus on “a genuine slab of human behaviour” (p. 303). (Newell’s essay and his book Unified Theories of Cognition [23] led to many attempts by himself and others to develop such theories. But the difficulty of reaching agreement on a comprehensive framework for general, human-like AI is suggested by the following observation in [24, Locations 43–52]: “Despite all the current enthusiasm in AI, the technologies involved still represent no more than advanced versions of classic statistics and machine learning.” And what follows [24, Location 52] seems to confirm the persistence of the long-standing fragmentation of AI: “Behind the scenes, however, many breakthroughs are happening on multiple fronts: in unsupervised language and grammar learning, deep-learning, generative adversarial methods, vision systems, reinforcement learning, transfer learning, probabilistic programming, blockchain integration, causal networks, and many more”.)

(ii) Ockham’s razor. Newell’s exhortation accords with a slightly extended version of Ockham’s razor: in developing simple theories of empirical phenomena, we should concentrate on those with the greatest explanatory range. Such theories will, naturally, be more useful than those with narrow scope, but, in addition, it seems that they are often relatively robust in the face of new evidence.

(iii) If you can’t solve a problem, enlarge it. In a similar vein, President Eisenhower is reputed to have said: “If you can’t solve a problem, enlarge it”, meaning that putting a problem in a broader context may make it easier to solve. Good solutions to a problem may be hard to see when the problem is viewed through a keyhole but become visible when the door is opened.

In keeping with these three reasons, the Big Picture is important in showing the potential of IC as a unifying principle across a wide canvas, including the six areas mentioned above.

Each of the six components of the Big Picture has support via empirical and analytical evidence which is specific to that component. In addition, the six components are mutually supportive in the sense that the credibility of any one of them, including the main ICHLPC thesis of this paper, is strengthened via its position in the Big Picture.

Implications of the Big Picture include, for example, the conclusion that IC should be a key part of any and all proposals for general, human-like AI, for theories of human learning, perception, and cognition, and for theories of cognitive neuroscience.

2.7. Volumes of Data and Speeds of Learning

As noted in Section 2.1.1, large patterns may exceed the threshold for redundancy at a lower frequency than small patterns. With a complex pattern, such as an image of a person or a tree, there can be significant redundancy in a mere 2 occurrences of the pattern.

If redundancies can be detected via patterns that occur only 2 or 3 times in a given sample of data, unsupervised learning may prove to be effective with smallish amounts of data. This may help to explain why, in contrast to the very large amounts of data that are apparently required for success with deep learning, children and non-deep-learning types of learning program can do useful things with relatively tiny amounts of data [11, Section V-E].
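
The point can be illustrated with some simple arithmetic, using invented figures for the alphabet size and the size of the body of data: the Python sketch below estimates the expected number of chance occurrences of a specific pattern of a given size, showing that, for large patterns, even a frequency of 2 is far above what chance would lead us to expect.

```python
# Rough illustration (with invented numbers) of why only two occurrences of a
# large pattern can be significant: the expected number of chance occurrences
# of a specific pattern of length n, over an alphabet of size A, in a body of
# data of length L, is about (L - n + 1) / A**n.

A = 26            # alphabet size (assumed)
L = 10**9         # size of the body of data in symbols (assumed)

for n in (3, 5, 10, 20):
    expected = (L - n + 1) / A**n
    print(f"pattern length {n:2d}: expected chance occurrences = {expected:.3g}")
# For n = 3 the expectation is in the tens of thousands, so 2 occurrences mean
# little; for n = 10 or 20 the expectation is far below 1, so even 2
# occurrences represent substantial redundancy.
```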

In this connection, neuroscientist David Cox has been reported as saying: “To build a dog detector [with a deep learning system], you need to show the program thousands of things that are dogs and thousands that aren’t dogs. My daughter only had to see one dog” and, the report says, she has been happily pointing out puppies ever since. (“Inside the moonshot effort to finally figure out the brain”, MIT Technology Review, 2017-10-12, https://bit.ly/2wRxsOg.)

This issue relates to the way in which a camouflaged animal is likely to become visible when it moves relative to its background (Section 12). As with random-dot stereograms (Section 11), only two images that are similar but not the same are needed to reveal hidden structure.

2.8. Emotions and Motivations

A point that deserves emphasis is that while this paper is part of a programme of research aiming for simplification and integration of observations and ideas in HLPC and related fields, it does not aspire to be a comprehensive view of human psychology. In particular, it does not attempt to say anything about emotions or motivations, despite their undoubted importance and relevance to many aspects of human psychology, including cognitive psychology. That said, it seems possible that IC might apply to emotions or motivations in the same way that it may be applied to sensory data and our concepts about the world.

3. Related Research

An early example of thinking relating to IC in HLPC was the suggestion by William of Ockham in the 14th century that “Entities are not to be multiplied beyond necessity”. Later, Isaac Newton wrote that “Nature is pleased with simplicity” [25, p. 320], Albert Einstein wrote that “A theory is more impressive the greater the simplicity of its premises, the more different things it relates, and the more expanded its area of application” (quoted in [26, p. 512]), and more. Research with a more direct bearing on ICHLPC began in the 1950s and 1960s after the publication of Claude Shannon’s [16] “theory of communication” (later called “information theory”) and was partly inspired by it.

In the two subsections that follow, there is a rough distinction between research with the main focus on issues in HLPC and neuroscience and research that concentrates on issues in mathematics and computing. In both sections, research is described roughly in the order in which it was published.

In this related research, the prevailing view of information, compression of information, and probabilities is that they are things to be defined and analysed in mathematical terms. This perspective has yielded some useful insights but, as suggested in Section 2.3, there are potential advantages in the ICMUP perspective adopted in the SP research. This ICMUP perspective is what chiefly distinguishes the evidence that provides the main thrust of this paper from the related research described in this section.

3.1. Psychology-Related and Neuroscience-Related Research

Research relating to IC and HLPC and neuroscience may be divided roughly into two parts: early research initiated in the 1950s and 1960s by Fred Attneave, Horace Barlow and others and then, after a relative lull in activity, later research from the 1990s onwards.

3.1.1. Early Psychology-Related and Neuroscience-Related Research

In a paper called “Some informational aspects of visual perception”, Fred Attneave [27] describes evidence that visual perception may be understood in terms of the distinction between areas in a visual image where there is much redundancy and boundaries between those areas where nonredundant information is concentrated: “… information is concentrated along contours (i.e., regions where color changes abruptly) and is further concentrated at those points on a contour at which its direction changes most rapidly (i.e., at angles or peaks of curvature)” [27, p. 184].

For those reasons, he suggests that “Common objects may be represented with great economy and fairly striking fidelity by copying the points at which their contours change direction maximally and then connecting these points appropriately with a straight edge” [27, p. 185]. And he illustrates the point with a drawing of a sleeping cat reproduced in Figure 7.

And he concludes with the suggestion that perception may be seen as economical description: “It appears likely that a major function of the perceptual machinery is to strip away some of the redundancy of stimulation, to describe or encode incoming information in a form more economical than that in which it impinges on the receptors” [27, p. 189].

Satosi Watanabe picked up the baton in a paper called “Information-theoretical aspects of inductive and deductive inference” [28]. He later wrote about the role of IC in pattern recognition [29, 30].

At about this time, Horace Barlow published a paper called “Sensory mechanisms, the reduction of redundancy, and intelligence” [31] in which he argued, on the strength of the large amounts of sensory information being fed into the [mammalian] central nervous system, that “the storage and utilization of this enormous sensory inflow would be made easier if the redundancy of the incoming messages was reduced” (p. 537). And he draws attention to evidence that, in mammals at least, each optic nerve is too small, by a wide margin, to carry reasonable amounts of the information impinging on the retina unless there is considerable compression of that information [31, p. 548].

In the paper, Barlow makes the interesting suggestion that “… the mechanism that organises [the large size of the sensory inflow] must play an important part in the production of intelligent behaviour” (p. 555), and in a later paper [32, p. 210] he writes the following:

“… the operations required to find a less redundant code have a rather fascinating similarity to the task of answering an intelligence test, finding an appropriate scientific concept, or other exercises in the use of inductive reasoning. Thus, redundancy reduction may lead one towards understanding something about the organization of memory and intelligence, as well as pattern recognition and discrimination”.

These prescient insights into the significance of IC for the workings of human intelligence, with further discussion in [33], are a strand of thinking that has been carried through into the SP Theory of Intelligence, with a wealth of supporting evidence, summarised in Section 2.2.5. (When I was an undergraduate at Cambridge University, it was fascinating lectures by Horace Barlow about the significance of IC in the workings of brains and nervous systems that first got me interested in those ideas.)

Barlow developed these and related ideas over a period of years in several papers, some of which are referenced in this paper. However, in [34], he adopted a new position, arguing that

“… the [compression] idea was right in drawing attention to the importance of redundancy in sensory messages because this can often lead to crucially important knowledge of the environment, but it was wrong in emphasizing the main technical use for redundancy, which is compressive coding. The idea points to the enormous importance of estimating probabilities for almost everything the brain does, from determining what is redundant to fuelling Bayesian calculations of near optimal courses of action in a complicated world” (p. 242).

While there are some valid points in what Barlow says in support of his new position, his overall conclusions appear to be wrong. His main arguments are summarised in Appendix B, with what I’m sorry to say are my critical comments after each one. (I feel apologetic about this because, as I mentioned, Barlow’s lectures and his earlier research relating to IC in brains and nervous systems have been an inspiration for me over many years.)

3.1.2. Later Psychology-Related and Neuroscience-Related Research

Like the earlier studies, later studies relating to IC in brains and nervous systems have little to say about ICMUP. But they help to confirm the importance of IC in HLPC and thus provide support for ICHLPC. A selection of publications are described briefly here.

Ruma Falk and Clifford Konold [35] describe the results of experiments indicating that the perceived randomness of a sequence is better predicted by various measures of its encoding difficulty than by its objective randomness. They suggest that judging the extent of a sequence’s randomness is based on an attempt to encode it mentally and that the subjective experience of randomness may result when that kind of attempt fails.

Jose Hernández-Orallo and Neus Minaya-Collado [36] propose a definition of intelligence in terms of IC. At the most abstract level, it chimes with remarks by Horace Barlow quoted in Section 3.1.1, and indeed it is consonant with the SP Theory itself. But the proposal shows no hint of how to model the kinds of capabilities that one would expect to see in any artificial system that aspires to human-like intelligence.

Nick Chater, with others, has conducted extensive research on HLPC, compression of information, and concepts of probability, generally with an orientation towards Algorithmic Information Theory, Bayesian theory, and related ideas. For example:

(i) Chater [37] discusses how “simplicity” and “likelihood” principles for perceptual organisation may be reconciled, with the conclusion that they are equivalent. He suggests that “the fundamental question is whether, or to what extent, perceptual organization is maximizing simplicity and maximizing likelihood” (p. 579).

(ii) Chater [38] discusses the idea that the cognitive system imposes patterns on the world according to a simplicity principle, meaning that it chooses the pattern that provides the briefest representation of the available information. Here, the word “pattern” means essentially a theory or system of one or more rules, a meaning which is quite different from the meaning of “pattern” or “SP-pattern” in the SP research, which simply means an array of atomic symbols in one or two dimensions. There is further discussion in [39].

(iii) Emmanuel Pothos and Nick Chater [40] present experimental evidence in support of the idea that, in sorting novel items into categories, people prefer the categories that provide the simplest encoding of these items.

(iv) Nick Chater and Paul Vitányi [41] describe how the “simplicity principle” allows the learning of language from positive evidence alone, given quite weak assumptions, in contrast to results on language learnability in the limit [42]. There is further discussion in [43].

(v) Editors Nick Chater and Mike Oaksford [44] present a variety of studies using Bayesian analysis to understand probabilistic phenomena in HLPC.

(vi) Paul Vitányi and Nick Chater [45] discuss whether it is possible to infer a probabilistic model of the world from a sample of data from the world and, via arguments relating to Algorithmic Information Theory, they reach positive conclusions.

Jacob Feldman [46] describes experimental evidence that when people are asked to learn “Boolean concepts”, meaning categories defined by logical rules, the subjective difficulty of learning a concept is directly proportional to its “compressibility”, meaning the length of the shortest logically equivalent formula.

Don Donderi [47] presents a review of concepts that relate to the concept of “visual complexity”. These include Gestalt psychology, Neural Circuit Theory, Algorithmic Information Theory, and Perceptual Learning Theory. The paper includes discussion of how these and related ideas may contribute to an understanding of human performance with visual displays.

Vivien Robinet and coworkers [48] describe a dynamic hierarchical chunking mechanism, similar to the MK10 Computer Model (Section 15). The theoretical orientation of this research is towards Algorithmic Information Theory, while the MK10 Computer Model embodies ICMUP.

From analysis and experimentation, Nicolas Gauvrit and others [49] conclude that how people perceive complexity in images seems to be partly shaped by the statistics of natural scenes. In [50], a slightly different grouping of authors, with Gauvrit as lead author, describes how it is possible to overcome the apparent shortcoming of Algorithmic Information Theory in estimating the complexity of short strings of symbols, and shows how the method may be applied to examples from psychology.

In a review of research on the evolution of natural language, Simon Kirby and others [51] describe evidence that transmission of language from one person to another has the effect of developing structure in language, where “structure” may be equated with compressibility. On the strength of further research, the authors of [52] conclude that increases in compressibility arise from learning processes (storing patterns in memory), whereas reproducing patterns leads to random variations in language.

On the strength of a theoretical framework, an experiment, and a simulation, Benoît Lemaire and coworkers [53] argue that the capacity of the human working memory may be better expressed as a quantity of information rather than a fixed number of chunks.

In related work, Fabien Mathy and Jacob Feldman [54] redefine George Miller’s [3] concept of a “chunk” in terms of Algorithmic Information Theory as a unit in a “maximally compressed code”. On the strength of experimental evidence, they suggest that the true limit on short-term memory is about 3 or 4 distinct chunks, equivalent to about 7 uncompressed items (of average compressibility), consistent with George Miller’s famous magical number.

And Mustapha Chekaf and coworkers [55] describe evidence that people can store more information in their immediate memory if it is “compressible” (meaning that it conforms to a rule such as “all numbers between 2 and 6”) than if it is not compressible. They draw the more general conclusion that immediate memory is the starting place for compressive recoding of information.

In addition to these several studies, there is quite a large body of research which relates to the concept of “efficient coding” in brains and nervous systems. This includes the studies described in the following paragraphs.

Tiberiu Teşileanu, Bence Ölveczky, and Vijay Balasubramanian [56] developed a computer model of efficient two-stage learning, which proved accurate against data on song learning in songbirds.

Ann Hermundstad and colleagues [57] found evidence in support of the propositions that efficient coding extends to higher-order sensory features and that more neural resources are applied when sensory data is limited.

Vijay Balasubramanian [58] argues that the remarkable energy efficiency of the brain is achieved in part through the dedication of specialized circuit elements and architectures to specific computational tasks, in a hierarchy stretching from the scale of neurons to the scale of the entire brain, and that these structures are learned via an evolutionary process.

Francisco Heras and colleagues [59] provide evidence for mechanisms promoting energy efficiency in the workings of blowfly photoreceptors.

Biswa Sengupta and colleagues [60] investigate why the conversion of “graded” potentials in the brain’s neural circuits to “action” potentials in those circuits is accompanied by substantial information loss and how this changes energy efficiency.

Simon Laughlin and Terrence Sejnowski [61] describe some of “the geometric, biophysical, and energy constraints that have governed the evolution of cortical networks”, how “nature has optimized the structure and function of cortical networks with design principles similar to those used in electronic networks”, and how “the brain … exploits the adaptability of biological systems to reconfigure in response to changing needs”.

Joseph Atick [62] reviews evidence relating to the principle that efficiency of information representation may be a design principle for sensory processing. In particular, it appears that this principle applies to large monopolar cells in the fly’s visual system and retinal coding in mammals in the spatial, temporal, and chromatic domains.

Joseph Atick and Norman Redlich [63] argue that the goal of processing in the retina is to transform the visual input as much as possible into a “statistically independent” form as a first step in creating a compressed representation in the cortex, as suggested by Horace Barlow. But the amount of compression that can be achieved in the retina is reduced by the need to suppress noise in the sensory input.

Adrienne Fairhall and colleagues [64] consider evidence relating to the optimisation of neural coding when the statistics of sensory data is changing. They conclude that “the speed with which information is optimized and ambiguities are resolved approaches the physical limit imposed by statistical sampling and noise”.

Naama Brenner and colleagues [65] show that the input/output relation of a sensory system in a dynamic environment changes with the statistical properties of the environment. More specifically, when the dynamic range of inputs changes, the input/output relation rescales so as to match the dynamic range of responses to that of the inputs. And the scaling of the input/output relation is set to maximize information transmission for each distribution of signals.

William Bialek and colleagues [66] review progress on the question: “Does the brain construct an efficient representation of the sensory world?” In their answer to this question they take account of the biological value of sensory information, and they report preliminary evidence from studies of the fly’s visual system which appear to support their view.

Stephanie Palmer and colleagues [67] show that efficient predictive computation starts at the earliest stages of the visual system and that this is true of nearly every cell in the retina and beyond. “Efficient representation of predictive information is a candidate principle that can be applied at each stage of neural computation”.

Bruno Olshausen and David Field [68] discuss how “sparse coding” (the encoding of sensory information using a small number of active neurons at any given point in time) may confer several advantages and that there is evidence that “sparse coding could be a ubiquitous strategy employed in several different modalities across different organisms”.

The same two authors, in [69], discuss the problem of how images can best be encoded and transmitted, with particular emphasis on how the eye and brain process visual information. They remark that “computer scientists and engineers now focusing on the problem of image compression should keep abreast of emerging results in neuroscience. At the same time, neuroscientists should pay close attention to current studies of image processing and image statistics”.

Kristin Koch and colleagues [70] consider the question: how much information does the retina send to the brain and how is it apportioned among different cell types? They conclude that “with approximately $10^6$ ganglion cells, the human retina would transmit data at roughly the rate of an Ethernet connection”. This figure appears to be for the amount of information that is transmitted after decompression.

3.2. Mathematics-Related and Computer-Related Research

Other bodies of research, with an emphasis on issues in mathematics and computing, including artificial intelligence, can be helpful in the understanding of IC in brains and nervous systems. These include the following:
(i) Ray Solomonoff developed Algorithmic Probability Theory, showing the intimate relation between IC and inductive inference [17, 18] (Section 2.5).
(ii) Chris Wallace, with others, explored the significance of IC in classification and related areas (see, e.g., [71–73]).
(iii) Gregory Chaitin and Andrei Kolmogorov, working independently, developed Algorithmic Information Theory, building on the work of Ray Solomonoff. The main idea here is that the information content of a string of symbols is equivalent to the length of the shortest computer program that anyone has been able to devise that describes the string.
(iv) Jorma Rissanen has developed related ideas in [74, 75] and other publications.

A detailed description of these and related bodies of research may be found in [19].

In research on deep learning in artificial neural networks, well reviewed by Jürgen Schmidhuber [10], there is some recognition of the importance of IC (in [10, Sections 4.2, 4.4, and 5.6.3]), but it appears that the idea is not well developed in deep learning systems.

Marcus Hutter, with others, [76–78] has developed the “AIXI” model of intelligence based on Algorithmic Probability Theory and Sequential Decision Theory. He has also initiated the “Hutter Prize”, a competition with €50,000 of prize money, for lossless compression of a given sample of text. The competition is motivated by the idea that “being able to compress well is closely related to acting intelligently, thus reducing the slippery concept of intelligence to hard file size numbers”. (From http://www.hutter1.net, retrieved 2017-10-10.) This is an interesting project which may yet lead to general, human-level AI.

4. IC and Biology

This section and those that follow (up to and including Section 21) describe evidence that, in varying degrees, lends support to the ICHLPC perspective. Most of this evidence comes directly from observations of people, but some of it comes from studies of animals—with the expectation that similar principles would be true of people.

First, let us take an abstract view of why IC might be important in people and other animals. In terms of biology, IC can confer a selective advantage on any creature: it allows the creature to store more information in a given storage space, or to use less storage space for a given amount of information; and it speeds up the transmission of any given volume of information along nerve fibres, thus speeding up reactions, or reduces the bandwidth needed for the transmission of the same volume of information in a given time.

Perhaps more important than the impact of IC on the storage or transmission of information is the close connection, outlined in Section 2.5, between IC and concepts of inference and probability. Compression of information provides a means of predicting the future from the past and estimating probabilities so that, for example, an animal may learn to predict where food may be found or where there may be dangers.

As mentioned in Section 2.5, the close connection between IC and concepts of inference and probability makes sense in terms of ICMUP: any repeating pattern can be a basis for inferences, and the probabilities of such inferences may be derived from the number of repetitions of the given pattern.

Being able to make inferences and estimate probabilities can mean large savings in the use of energy and other benefits in terms of survival.

5. Sensory Inflow, Redundancy, and the Transmission and Storage of Information

As mentioned in Section 3.1.1, Fred Attneave [27] describes how visual perception may be understood in terms of the distinction between areas in a visual image where there is much redundancy and boundaries between those areas where nonredundant information is concentrated. And he suggests that visual perception may be understood, at least in part, as the economical description of sensory input.

Also mentioned in the same section is Horace Barlow’s [31] argument that compression of sensory information is needed to cope with the large volumes of such information and, more specifically, his recognition that, without compression of the information falling on the retina, each optic nerve would be too small to transmit reasonable amounts of that information to the brain [31, p. 548].

6. Chunking-with-Codes

ICMUP is so much embedded in our thinking and seems so natural and obvious that it is easily overlooked. This section, with Sections 7 and 8, describes some examples.

In the same way that “TFEU” may be a convenient code or shorthand for the rather cumbersome expression “Treaty on the Functioning of the European Union” (Appendix C.1.2), a name like “New York” is, as previously noted in Section 2.4, a compact way of referring to the many things and activities in that renowned city and likewise for the many other names that we use: “Nelson Mandela”, “George Washington”, “Mount Everest”, and so on.
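To make the economy concrete, here is a minimal sketch of chunking-with-codes in the computational sense. It is not a model of how brains do this; the text, the chunk, and the code name are invented for illustration.

```python
# A minimal sketch of chunking-with-codes: a recurring chunk of text is
# stored once in a "dictionary" and each later occurrence is replaced by
# a short code. Text, chunk, and code name are invented for illustration.

text = ("the treaty on the functioning of the european union says ... "
        "under the treaty on the functioning of the european union ... "
        "as required by the treaty on the functioning of the european union")

chunk = "the treaty on the functioning of the european union"
code = "%TFEU"                       # a short code standing for the chunk

dictionary = {code: chunk}           # the chunk is stored just once
encoded = text.replace(chunk, code)  # every occurrence becomes the code

def decode(s, d):
    """Recover the original text by expanding each code."""
    for c, ch in d.items():
        s = s.replace(c, ch)
    return s

assert decode(encoded, dictionary) == text
print(len(text), "characters before;",
      len(encoded) + len(chunk), "after (encoding plus dictionary)")
```

The saving comes from storing the long chunk once and referring to it thereafter by the short code.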

The “chunking-with-codes” variant of ICMUP (Section 2.1.2) permeates our use of natural language, both in its surface forms and in the way in which surface forms relate to meanings. (Although natural language provides a very effective means of compressing information about the world, it is not free of redundancy. And redundancy has a useful role to play in, for example, enabling us to understand speech in noisy conditions and in learning the structure of language. How this apparent inconsistency may be resolved is discussed in Appendix C.2.)

Because of its prominence in natural language and because of its intrinsic power, chunking-with-codes is probably important in nonverbal aspects of our thinking, as may be inferred from empirical support for the SP System and its strengths in several aspects of intelligence (Section 2.2.5). (Contrary to the view which is sometimes expressed that thinking is not possible without language, there is evidence in [79] for nonverbal thinking by congenitally deaf people without knowledge of written or spoken natural language, and there is further evidence in [80] for nonverbal thinking in people and in animals.)

Ever since George Miller’s influential paper [3], the concept of a “chunk” has been the subject of much research in psychology and related disciplines (see, e.g., [81–84]).

Principles outlined in this section are likely to apply also to variants of ICMUP discussed in Sections 7 and 8 below.

7. Class-Inclusion Hierarchies

As with chunking-with-codes, class-inclusion hierarchies, with variations such as cross-classification, are prominent in our use of language and in our thinking. Benefits arise from economies in the storage of information and in inferences via inheritance of attributes, in accordance with the “class-inclusion hierarchies” variant of ICMUP (Section 2.1.5).

As with chunking-with-codes, names for classes of things provide for great economies in our use of language: most “content” words (nouns, verbs, adjectives, and adverbs) in our everyday language stand for classes of things and, as such, are powerful aids to economical description.

Imagine how cumbersome things would be if, on each occasion that we wanted to refer to a “table”, we had to say something like “A horizontal platform, often made of wood, used as a support for things like food, normally with four legs but sometimes three, …”, like the slow Entish language of the Ents in Tolkien’s The Lord of the Rings. (J. R. R. Tolkien, The Lord of the Rings, London: HarperCollins, 2005, Kindle edition. For a description of Entish, see, e.g., page 480. See also, pages 465, 468, 473, 477, 478, 486, and 565.) Similar things may be said for verbs like “speak” or “dance”, adjectives like “artistic” or “exuberant”, and adverbs like “quickly” or “carefully”.
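The economies described above, and inference via inheritance of attributes, are easy to illustrate computationally. The following is a minimal sketch only; the classes and attributes are invented for the example and are not drawn from the SP research.

```python
# A minimal sketch of a class-inclusion hierarchy with inheritance of
# attributes. Shared attributes are stored once, at the most general
# class that has them, and are inherited by subclasses and instances.

class Animal:
    breathes = True

class Bird(Animal):
    has_wings = True
    can_fly = True           # a default that subclasses may override

class Penguin(Bird):
    can_fly = False          # a "correction" to the inherited default

tweety = Bird()
pingu = Penguin()

# Inferences via inheritance: nothing about breathing is stored for
# Bird, Penguin, or the two instances, yet it can be inferred.
print(tweety.breathes, tweety.can_fly)   # True True
print(pingu.breathes, pingu.can_fly)     # True False
```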

Classes and categories have been the subject of much research in psychology and related disciplines over several decades (see, e.g., [85–87]).

8. Schema-Plus-Correction, Run-Length Coding, and Part-Whole Hierarchies

As with chunking-with-codes and class-inclusion hierarchies, it seems natural to conceptualise things in terms of other techniques described in Section 2.1. In all cases, there is clear potential for substantial economies in how knowledge is represented and for the making of useful inferences.

8.1. Schema-Plus-Correction

As mentioned in Section 2.1.3, a menu in a restaurant or café is an obvious example of the schema-plus-correction device in everyday thinking. Other examples are the uses of forms to gather information about candidates for a job, the features of a house for sale, a check-list for repairs on a car, and so on. And knowledge of almost any skill such as baking a cake, gardening, or woodwork may be seen as a schema that may be tailored for a specific task—such as baking a coffee-and-walnut cake—by plugging in values for that task.

An interesting example of schema-plus-correction in everyday life is the UK shipping forecast which leaves out most of the schema and gives only the corrections to the schema. So, for example, “good, becoming moderate or poor” refers to visibility without mentioning that word; “moderate or rough” refers to the state of the sea, without mentioning that expression; figures for wind speed are given without mentioning that they refer to the Beaufort wind force scale; a word like “later” means a time that is more than 12 hours from the time the forecast was issued; and so on.

8.2. Run-Length Coding

If anything is repeated, especially if it is repeated a large number of times, it seems natural and obvious to describe the repetition with a form of run-length coding. For example, an instruction to walk from one place to another may be “From the old oak tree keep walking until you see the river”. Here, “the old oak tree” marks the start of the repetition, “keep walking” describes the repeated operation of putting one foot in front of the other, and “until you see the river” marks the end of the repetition.
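Here is a minimal sketch of run-length coding in the computational sense, using the walking example above; the encoding format is invented for illustration.

```python
# A minimal sketch of run-length coding: each run of a repeated item is
# reduced to one copy of the item plus a count of the repetitions.

from itertools import groupby

def rle_encode(items):
    """Encode a sequence as (item, run_length) pairs."""
    return [(item, len(list(run))) for item, run in groupby(items)]

def rle_decode(pairs):
    """Recover the original sequence from the (item, count) pairs."""
    return [item for item, n in pairs for _ in range(n)]

walk = ["old oak tree"] + ["step"] * 1000 + ["river"]
encoded = rle_encode(walk)

print(encoded)            # [('old oak tree', 1), ('step', 1000), ('river', 1)]
assert rle_decode(encoded) == walk
```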

8.3. Part-Whole Hierarchies

As with class-inclusion hierarchies, part-whole hierarchies are prominent in our language and in our thinking. In describing anything that is more complex than “very simple”, such as a house or a car, it seems natural and obvious to divide it into parts and subparts through as many levels as are needed, thus promoting economies and the making of inferences as described in Section 2.1.6.

9. Merging Multiple Views to Make One

Here is another example of something that is so familiar that we are normally not aware that it is part of our perceptions and thinking.

If, when we are looking at something, we close our eyes for a moment and open them again, what do we see? Normally, it is the same as what we saw before. But creating a single view out of the before and after views means unifying the two patterns to make one and thus compressing the information, as shown schematically in Figure 8. (It is true that people may, on occasion, not detect large changes to objects and scenes (“change blindness”) [88] and that, without attention, we may not even perceive objects (“inattentional blindness”) [89], but it is also true that we can detect differences between pairs of images that are similar but not identical—which means that we can also detect the similarities between such pairs of images. That ability to detect similarities, together with our ordinary experience that we normally merge multiple views to make one, as described in the main text, implies that compression of information is an important part of visual perception.)

It seems so simple and obvious that if we are looking at a landscape like the one in the figure, there is just one landscape even though we may look at it two, three, or more times. But if we did not unify successive views we would be like an old-style cine camera that simply records a sequence of frames, without any kind of analysis or understanding that, very often, successive frames are identical or nearly so.

10. Recognition

With the kind of merging of views just described, we do not bother to give it a name. But if the interval between one view and the next is hours, months, or years, it seems appropriate to call it “recognition”. In cases like that, it is more obvious that we are relying on memory, as shown schematically in Figure 9. Notwithstanding the undoubted complexities and subtleties in how we recognise things, the process may be seen in broad terms as ICMUP: matching incoming information with stored knowledge, merging or unifying patterns that are the same, and thus compressing the information.

If we did not compress information in that way, our brains would quickly become cluttered with millions of copies of things that we see around us—people, furniture, cups, trees, and so on—and likewise for sounds and other sensory inputs.

As mentioned earlier, Satosi Watanabe has explored the relationship between pattern recognition and IC [29, 30].

11. Binocular Vision

ICMUP may also be seen at work in binocular vision:

“In an animal in which the visual fields of the two eyes overlap extensively, as in the cat, monkey, and man, one obvious type of redundancy in the messages reaching the brain is the very nearly exact reduplication of one eye’s message by the other eye.” [32, p. 213].

In viewing a scene with two eyes, we normally see one view and not two. This suggests that there is a matching and unification of patterns, with a corresponding compression of information. A sceptic might say, somewhat implausibly, that the one view that we see comes from only one eye. But that sceptical view is undermined by the fact that, normally, the one view gives us a vivid impression of depth that comes from merging the two slightly different views from both eyes.

Strong evidence that, in stereoscopic vision, we do indeed merge the views from both eyes comes from a demonstration with “random-dot stereograms”, as described in [90, Section 5.1] (see also Appendix A.3).

In brief, each of the two images shown in Figure 10 is a random array of black and white pixels, with no discernable structure, but they are related to each other as shown in Figure 11: both images are the same except that a square area near the middle of the left image is further to the left in the right image.

When the images in Figure 10 are viewed with a stereoscope, projecting the left image to the left eye and the right image to the right eye, the central square appears gradually as a discrete object suspended above the background.
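The construction just described is easy to reproduce computationally. The following sketch builds a random left image and derives the right image by shifting a central square, in the manner of Figure 11; the image size, square size, and disparity are illustrative values only.

```python
# A minimal sketch of how a random-dot stereogram pair may be built:
# the right image is a copy of the left image except that a central
# square region is shifted horizontally by a small disparity.

import numpy as np

rng = np.random.default_rng(0)

SIZE, SQUARE, DISPARITY = 64, 20, 4           # illustrative values
left = rng.integers(0, 2, size=(SIZE, SIZE))  # random black/white pixels

right = left.copy()
lo = (SIZE - SQUARE) // 2
hi = lo + SQUARE
# Shift the central square a few pixels to the left in the right image...
right[lo:hi, lo - DISPARITY:hi - DISPARITY] = left[lo:hi, lo:hi]
# ...and fill the uncovered strip with fresh random dots.
right[lo:hi, hi - DISPARITY:hi] = rng.integers(0, 2, size=(SQUARE, DISPARITY))

# Neither image has any visible structure on its own; the square exists
# only in the relationship between the two images.
print(np.array_equal(left, right))            # False
```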

Although this illustrates depth perception in stereoscopic vision—a subject of some interest in its own right—the main interest here is on how we see the central square as a discrete object. There is no such object in either of the two images individually. It exists purely in the relationship between the two images, and seeing it means matching one image with the other and unifying the parts which are the same.

This example shows that, although the matching and unification of patterns is a usefully simple idea, there are interesting subtleties and complexities that arise in finding a good match when the two patterns are similar but not identical.

11.1. Finding a Good Match

Seeing the central object in a random-dot stereogram means finding a good match between relevant pixels in the central area of the left and right images and likewise for the background. Here, a good match is one that yields a relatively high level of IC. Since there is normally an astronomically large number of alternative ways in which combinations of pixels in one image may be aligned with combinations of pixels in the other image, it is not normally feasible to search through all the possibilities exhaustively.

11.2. The Best Is the Enemy of the Good

As with the SP System (Sections 2.2.1 to 2.2.3) and many problems in artificial intelligence, the best is the enemy of the good. Instead of looking for the perfect solution—which may lead to outright failure—we can do better, achieving something useful on most occasions by looking for solutions that are good enough for practical purposes. With this kind of problem, acceptably good solutions can often be found in a reasonable time with heuristic search. One such method for the analysis of random-dot stereograms has been described by Marr and Poggio [92].
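As an illustration of that point, here is a much-simplified sketch of heuristic stereo matching. It is not the Marr-Poggio algorithm: it simply tries a handful of candidate disparities for each small patch of the left image and keeps the one that gives the most matching pixels, which is usually good enough to reveal the hidden square in a synthetic stereogram like the one sketched earlier.

```python
# A much-simplified sketch of heuristic stereo matching: for each small
# patch of the left image, try only a few candidate disparities and keep
# the one with the most matching pixels. A constrained, "good enough"
# search rather than an exhaustive one.

import numpy as np

rng = np.random.default_rng(1)
SIZE, SQUARE, D = 64, 24, 4
left = rng.integers(0, 2, size=(SIZE, SIZE))
right = left.copy()
lo, hi = (SIZE - SQUARE) // 2, (SIZE - SQUARE) // 2 + SQUARE
right[lo:hi, lo - D:hi - D] = left[lo:hi, lo:hi]            # shifted square
right[lo:hi, hi - D:hi] = rng.integers(0, 2, size=(SQUARE, D))

PATCH, MAX_D = 8, 6
disp = np.zeros((SIZE // PATCH, SIZE // PATCH), dtype=int)
for i in range(0, SIZE, PATCH):
    for j in range(0, SIZE, PATCH):
        block = left[i:i + PATCH, j:j + PATCH]
        scores = []
        for d in range(MAX_D + 1):                          # candidate disparities
            if j - d < 0:
                scores.append(-1)
                continue
            cand = right[i:i + PATCH, j - d:j - d + PATCH]
            scores.append(int(np.sum(block == cand)))       # pixels that match
        disp[i // PATCH, j // PATCH] = int(np.argmax(scores))

# Patches lying wholly inside the central square show disparity 4;
# wholly-background patches show 0.
print(disp)
```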

12. Abstracting Object Concepts via Motion

It seems likely that the kinds of processes that enable us to see a hidden object in a random-dot stereogram also apply to how we see discrete objects in the world. The contrast between the relatively stable configuration of features in an object such as a car, compared with the variety of its surroundings as it travels around, seems to be an important part of what leads us to conceptualise the object as an object [90, Section 5.2].

Any creature that depends on camouflage for protection—by blending with its background—must normally stay still. As soon as it moves relative to its surroundings, it is likely to stand out as a discrete object ([90, Section 5.2], see also Section 2.7).

The idea that IC may provide a means of discovering “natural” structures in the world—such as the many objects in our visual world—has been dubbed the “DONSVIC” principle: the discovery of natural structures via information compression [6, Section 5.2]. Of course, the word “natural” is not precise, but it has enough precision to be a meaningful name for the process of learning the kinds of concepts which are the bread-and-butter of our everyday thinking.

Similar principles may account for how young children come to understand that their first language (or languages) is composed of words (Section 15).

13. Adaptation in the Eye of Limulus and Run-Length Coding

IC may also be seen at work low down in the workings of vision. Figure 12 shows a recording from a single sensory cell (ommatidium) in the eye of a horseshoe crab (Limulus polyphemus), first when the background illumination is low, then when a light is switched on and kept on for a while, and later switched off—shown by the step function at the bottom of the figure.

Perhaps contrary to what one might expect—a low rate of firing when illumination is low—the ommatidium fires at a moderate “background” rate of about 20 impulses per second when the illumination is low (shown at the left of the figure). When the light is switched on, the rate of firing increases sharply but instead of staying high while the light is on (as one might expect), it drops back almost immediately to the background rate. The rate of firing remains at that level until the light is switched off, at which point it drops sharply and then returns to the background level, a mirror image of what happened when the light was switched on.

In connection with the main theme of this paper, a point of interest is that the positive spike when the light is switched on and the negative spike when the light is switched off have the effect of marking boundaries, first between dark and light and later between light and dark. In effect, this is a form of run-length coding (Section 2.1.4). At the first boundary, the positive spike marks the fact of the light coming on. As long as the light stays on, there is no need for that information to be constantly repeated, so there is no need for the rate of firing to remain at a high level. Likewise, when the light is switched off, the negative spike marks the transition to darkness and, as before, there is no need for constant repetition of information about the new low level of illumination. (It is recognised that this kind of adaptation in eyes is a likely reason for small eye movements when we are looking at something, including sudden small shifts in position (“microsaccades”), drift in the direction of gaze, and tremor [94]. Without those movements, there would be an unvarying image on the retina so that, via adaptation, what we are looking at would soon disappear!)
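The coding idea behind this kind of adaptation can be sketched very simply. The following is not a model of the ommatidium or of its inhibitory circuitry; it merely shows how signalling only the changes in a signal, rather than its level at every moment, removes most of the repetition while preserving the boundaries.

```python
# A crude sketch of the coding idea behind adaptation: instead of
# signalling the light level at every moment, signal only the changes,
# which mark the boundaries between "dark" and "light".

light = [0] * 50 + [1] * 100 + [0] * 50        # light off, on, off again

changes = [(t, light[t] - light[t - 1])        # +1 at onset, -1 at offset
           for t in range(1, len(light))
           if light[t] != light[t - 1]]

print(changes)                                 # [(50, 1), (150, -1)]

# The original 200 samples can be recovered from the initial value plus
# the two change events, so almost all of the repetition has been removed.
reconstructed = []
level = light[0]
events = dict(changes)
for t in range(len(light)):
    level += events.get(t, 0)
    reconstructed.append(level)
assert reconstructed == light
```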

Another point of interest is that this pattern of responding—adaptation to constant stimulation—can be explained via the action of inhibitory nerve fibres that bring the rate of firing back to the background rate when there is little or no variation in the sensory input [95].

Inhibitory mechanisms are widespread in the brain [96, p. 45] and it appears that, in general, their role is to reduce or eliminate redundancies in information ([8, Section 9]), in keeping with the main theme of this paper.

14. Other Examples of Adaptation

Adaptation is also evident at the level of conscious awareness. If, for example, a fan starts working nearby, we may notice the hum at first but then adapt to the sound and cease to be aware of it. But when the fan stops, we are likely to notice the new quietness at first but adapt again and stop noticing it.

Another example is the contrast between how we become aware if something or someone touches us but we are mostly unaware of how our clothes touch us in many places all day long. We are sensitive to something new and different and we are relatively insensitive to things that are repeated.

As with adaptation in the eye of Limulus, these other kinds of adaptation may be seen as examples of the run-length coding technique for compression of information.

15. Discovering the Segmental Structure of Language

There is evidence that much of the segmental structure of language—words and phrases—may be discovered via ICMUP, as described in the following two subsections. To the extent that these mechanisms model aspects of HLPC, they provide evidence for ICHLPC.

With regard to Section 2.4, about the possible role of quantification in empirical evidence for ICHLPC, the MK10 Computer Model, designed for the discovery of segmental structure in language and outlined below, assigns a central role to the quantification of frequencies with which basic symbols such as letters, or sequences of symbols, occur in any given sample of language.

15.1. The Word Structure of Natural Language

As can be seen in Figure 13, people normally speak in “ribbons” of sound, without gaps between words or other consistent markers of the boundaries between words. In the figure—the waveform for a recording of the spoken phrase “on our website”—it is not obvious where the word “on” ends and the word “our” begins and likewise for the words “our” and “website”. Just to confuse matters, there are three places within the word “website” which look as if they might be word boundaries.

Given that words are not clearly marked in the speech that young children hear, how do they get to know that language is composed of words? Learning to read could provide an answer but it appears that young children develop an understanding that language is composed of words well before the age when, normally, they are introduced to reading. Perhaps more to the point is that there are still, regrettably, many children throughout the world that are never introduced to reading but, in learning to talk and to understand speech, they inevitably develop a knowledge of the structure of language, including words. (It has been recognised for some time that skilled speakers of any language have an ability to create or recognise sentences that are grammatical but new to the world. Chomsky’s well-known example of such a sentence is Colorless green ideas sleep furiously. [97, p. 15], which, when it was first published, was undoubtedly novel. This ability to create or recognise grammatical but novel sentences implies that knowledge of a language means knowledge of words as discrete entities that can form novel combinations.)

In keeping with the main theme of this paper, the MK10 Computer Model [98, p. 193], which works largely via ICMUP, provides an answer: it can reveal much of the word structure in an English-language text from which all spaces and punctuation have been removed [6, Section 5.2]. It is true that there are added complications with speech, but it seems likely that similar principles apply.

This discovery of word structure by the MK10 program, illustrated in Figure 14, is achieved without the aid of any kind of externally supplied dictionary or other information about the structure of English. The program builds its own dictionary via “unsupervised” learning using only the unsegmented sample of English with which it is supplied. It learns without the assistance of any kind of “teacher”, or data that is marked as “wrong”, or the grading of samples from simple to complex (cf. [42]).
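The flavour of this kind of learning can be conveyed with a much-simplified sketch. The following is not the MK10 program, whose workings are described in [98]; it merely applies the same broad idea, repeatedly finding the most frequent pair of adjacent units in an unsegmented text and unifying it into a new chunk, so that word-like chunks emerge from raw letters. The sample text is invented for illustration.

```python
# A much-simplified sketch of hierarchical chunking (not the MK10
# program itself): repeatedly find the most frequent pair of adjacent
# units in an unsegmented text and merge it into a new chunk.

from collections import Counter

def learn_chunks(text, n_merges=20):
    units = list(text)                          # start from single letters
    for _ in range(n_merges):
        pairs = Counter(zip(units, units[1:]))  # frequencies of adjacent pairs
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:                            # nothing repeats: stop
            break
        merged, i = [], 0
        while i < len(units):                   # unify every occurrence of the pair
            if i + 1 < len(units) and units[i] == a and units[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(units[i])
                i += 1
        units = merged
    return units

sample = "thedogseesthecatthecatseesthedog" * 3
print(learn_chunks(sample))    # prints the text re-segmented into learned chunks
```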

Statistical tests show that the correspondence between the computer-assigned word structure and the original (human) division into words is significantly better than chance.

Two aspects of the MK10 model strengthen its position as a model of what children do in learning the segmental structure of language [98, p. 200]: the growth in the lengths of words learned by the program corresponds quite well with the same measure for children; and the pattern of changing numbers of new words that are learned by the program at different stages corresponds quite well with the equivalent pattern for children.

Discovering the word structure of language via ICMUP is another example of the DONSVIC principle, mentioned in Section 12—because words are the kinds of “natural” structure which are the subject of the DONSVIC principle and because ICMUP provides a key to how they may be discovered.

15.2. The Phrase Structure of Natural Language

In addition to its achievements in learning the word structure of natural language, the MK10 Computer Model, featured in Section 15.1, does quite a good job at discovering the phrase structure of unsegmented text in which each word has been replaced by a symbol representing the grammatical class of the word [98, p. 194]. An example is shown in Figure 15. As before, the program works without any prior knowledge of the structure of English and, apart from the initial assignment of word classes, it works in unsupervised mode without the assistance of any kind of “teacher” or anything equivalent. As before, statistical tests show that the correspondence between computer-assigned and human-assigned structures is statistically significant. (Thanks to Dr. Isabel Forbes, a person qualified in theoretical linguistics, for the assignment of grammatical class symbols to words in the given text and for phrase-structure analyses of the text.)

Since ICMUP is central in the workings of the MK10 Computer Model, this result suggests that ICMUP may have a role to play not merely in discovering the phrase structure of language but more generally in discovering the grammatical structure of language.

16. Grammatical Inference

Regarding the last point from the previous section, it seems likely that learning the grammar of a language may also be understood in terms of ICMUP. Evidence in support of that expectation comes from research with two programs designed for grammatical inference:
(i) The SNPR Computer Model. The SNPR Computer Model, which was developed from the MK10 Computer Model, can discover plausible grammars from samples of English-like artificial languages [98, pp. 181–185]. This includes the discovery of segmental structures, classes of structure, and abstract patterns. ICMUP is central in how the program works.
(ii) The SP Computer Model. The SP Computer Model, one of the main products of the SP programme of research, achieves results at a similar level to that of SNPR. As before, ICMUP is central in how the program works. With the solution of some residual problems, outlined in [6, Section 3.3], there seems to be a real possibility that the SP System will be able to discover plausible grammars from samples of natural language. Also, it is anticipated that, with further development, the program may be applied to the learning of nonsyntactic “semantic” knowledge and the learning of grammars in which syntax and semantics are integrated.

What was the point of developing the SP Computer Model when it does no better at grammatical inference than the SNPR Computer Model? The reason is that the SNPR Computer Model, which was designed for the discovery of syntactic structures and worked mainly via the building of hierarchical structures, was not compatible with the new and much more ambitious goal of the SP programme of research: to simplify and integrate observations and concepts across artificial intelligence, mainstream computing, mathematics, and HLPC. What was needed was a new organising principle that would accommodate hierarchical structures and several other kinds of structure as well.

It turns out that the SP-multiple-alignment concept is much more versatile than the hierarchical organising principle in the SNPR program, providing for several aspects of intelligence and the representation and processing of a variety of knowledge structures, of which hierarchical structures are only one kind (Section 2.2.5). It appears that the SP System provides a much firmer foundation for the development of human-level intelligence than the SNPR Computer Model or indeed deep learning models, as discussed in [11, Section V].

With regard to Section 2.4 about the possible role of quantification in empirical evidence for ICHLPC, the SNPR Computer Model and the SP Computer Model, like the MK10 Computer Model (Section 15), both have a central role for quantification of the frequencies with which basic symbols such as letters, or contiguous or broken patterns of symbols, occur in any given sample of data.

17. Generalisation, the Correction of Wrong Generalisations, and “Dirty Data”

Issues relating to generalisation in learning are best described with reference to the Venn diagram shown in Figure 16. That figure relates to the unsupervised learning of a natural language but it appears that generalisation issues in other areas of learning are much the same.

The evidence to be described derives largely from the SNPR Computer Model and the SP Computer Model. Since both models are founded on ICMUP, evidence that they have human-like capabilities with generalisation and related phenomena may be seen as evidence in support of ICHLPC.

In the figure, the smallest envelope shows the finite but large sample of “utterances” from which a young child learns his or her native language (which we shall call L)—where an “utterance” is a speech sound of any kind, and the speakers from which a young child learns are adults or older children (To keep things simple in this discussion we shall assume that each child learns only one first language, although many children learn two or more first languages.). The middle-sized envelope shows the (infinite) set of utterances in L, and the largest envelope shows the (infinite) set of all possible utterances, including those that are in L and those which are not. “Dirty data” are the many “ungrammatical” utterances that children normally hear—outside the envelope for L but inside the envelope representing the utterances from which a young child learns.

The child generalises “correctly” when he or she infers L, and only L, from the finite sample he or she has heard, including dirty data. Anything that spills over into the outer envelope, like “mouses” as the plural of “mouse” or “buyed” as the past tense of “buy”, is an overgeneralisation, while failure to learn the whole of L represents undergeneralisation.

In connection with the foregoing summary of concepts relating to generalisation, there are three main problems:
(i) Generalisation without overgeneralisation. How can we generalise our knowledge without overgeneralisation, and this in the face of evidence that children can learn their first language or languages without the correction of errors by parents or teachers or anything equivalent? (Evidence comes chiefly from children who learned language without the possibility that anyone might correct their errors. Christy Brown was a cerebral-palsied child who not only lacked any ability to speak but whose bodily handicap was so severe that for much of his childhood he was unable to demonstrate that he had normal comprehension of speech and nonverbal forms of communication [99]. Hence, his learning of language must have been achieved without the possibility that anyone might correct errors in his spoken language.)
(ii) Generalisation without undergeneralisation. How can we generalise our knowledge without undergeneralisation? As before, there is evidence that learning of a language can be achieved without explicit teaching.
(iii) Dirty data. How can we learn correct knowledge despite errors in the examples we hear? Again, it appears that this can be done without correction of errors.

These things are discussed quite fully in [1, Section 9.5.3] and [6, Section 5.3]. There is also relevant discussion in [11, Section V-H and XI-C].

In brief, IC provides an answer to all three problems like this: for a given body of raw data, I, compress it thoroughly via unsupervised learning; the resulting compressed version of I may be split into two parts, a grammar and an encoding of I in terms of the grammar; normally, the grammar generalises correctly without over- or undergeneralisation, and errors in I are weeded out; the encoding may be discarded.

This scheme is admirably simple, but, so far, the evidence in support of it is only informal, derived largely from informal experiments with English-like artificial languages with the SNPR Computer Model of language learning [98, pp. 181–185] and the SP Computer Model [1, Section 9.5.3].

The weeding out of errors via this scheme may seem puzzling, but errors, by their nature, are rare. The grammar retains the repeating parts of I (which are relatively common), while the encoding contains the nonrepeating parts including most of the errors. “Errors” that are not rare acquire the status of “dialect” and cease to be regarded as errors.
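The frequency intuition behind this can be shown with a toy sketch. It is not the SNPR or SP learning process; it only illustrates that patterns which repeat survive into the “grammar”, while one-off errors fall into the residue that is discarded. The utterances and the error are invented.

```python
# A toy sketch of the idea that compression can weed out "dirty data":
# patterns that repeat go into the grammar, one-off errors do not.
# The utterances and the error are invented for illustration.

from collections import Counter

utterances = (["the cat sat on the mat"] * 5 +
              ["the dog sat on the mat"] * 4 +
              ["teh cat sat on the mat"])           # a rare, erroneous form

counts = Counter(utterances)

grammar = {u for u, n in counts.items() if n > 1}   # repeating patterns only
residue = {u for u, n in counts.items() if n == 1}  # goes with the encoding

print("grammar:", grammar)     # the two well-formed utterances
print("discarded:", residue)   # the one-off error
```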

A problem with research in this area is that the identification of any over- or undergeneralisations produced by the above scheme or any other model depends largely on human intuitions. But this is not so very different from the long-established practice in research on linguistics of using human judgements of grammaticality to establish what any given person knows about a particular language.

The problem of generalising our learning without over- or undergeneralisation applies to the learning of a natural language and also to the learning of such things as visual images. It appears that the solution outlined here has distinct advantages compared with, for example, what appear to be largely ad hoc solutions that have been proposed for deep learning in artificial neural networks [11, Section V-H].

As noted above, evidence for human-like generalisation with the SNPR and SP computer models, without either over- or undergeneralisation, may be seen as evidence in support of ICMUP as a unifying principle in HLPC.

18. Perceptual Constancies

It has long been recognised that our perceptions are governed by constancies:
(i) Size constancy. To a large extent, we judge the size of an object to be constant despite wide variations in the size of its image on the retina [100, pp. 40-41].
(ii) Lightness constancy. We judge the lightness of an object to be constant despite wide variations in the intensity of its illumination [100, p. 376].
(iii) Colour constancy. We judge the colour of an object to be constant despite wide variations in the colour of its illumination [100, p. 402].

These kinds of constancy, and others such as shape constancy and location constancy, may each be seen as a means of encoding information economically: it is simpler to remember that a particular person is “about my height” than many different judgements of size, depending on how far away that person is. In a similar way, it is simpler to remember that a particular object is “black” or “red” than all the complexity of how its lightness or its colour changes in different lighting conditions.

By filtering out variations due to viewing distance or to the intensity or colour of incident light, we facilitate ICMUP. In watching a football match, for example, this simplifies the process of establishing that there is (normally) just one ball on the pitch, rather than many different balls depending on viewing distance, on whether the ball is in a bright or shaded part of the pitch, and so on.

19. Kinds of Redundancy That People Find Difficult or Impossible to Detect

Although the matching and unification of patterns is often effective in the detection and reduction of redundancy in information, there are kinds of redundancy that are not easily revealed via ICMUP. It seems that those kinds of redundancy are also ones that people find difficult or impossible to detect. A well-known example is the decimal representation of π, which appears to most people to be entirely random, but which can be created by a simple program so that, in terms of Algorithmic Information Theory, it contains much redundancy.

At first sight, this observation seems to contradict the main thesis of this paper that much of HLPC may be understood as IC. But there is nothing in the ICHLPC thesis to say that people can or should be able to detect all kinds of redundancy via ICMUP. And the apparent randomness of the decimal representation of π suggests that any natural or artificial system that works via ICMUP would fail to detect the redundancy in data of that kind.

In short, what appears at first sight to be evidence against ICHLPC turns out to be evidence in support of that thesis: the failure of most people to detect the redundancy in the decimal representation of π may be explained via the ICHLPC thesis, together with the apparent weakness of ICMUP in discovering and reducing that kind of redundancy.

20. Mathematics

A discussion of mathematics may seem out of place in a paper about ICHLPC, but mathematics is relevant because it has been developed over many years as an aid to human thinking. For that reason, in the spirit of George Boole’s An Investigation of the Laws of Thought [101], a consideration of the organisation and workings of mathematics is relevant to ICHLPC. (Another book whose title suggests that it is relevant to human thinking is William Thomson’s “Outline of the Laws of Thought” [102], although his orientation is more towards concepts in logic than concepts in mathematics.)

In [2] it is argued that much of mathematics, perhaps all of it, may be seen as a set of techniques for the compression of information via the matching and unification of patterns and their application. In case this seems implausible, we have the following:
(i) An equation as a compressed representation of data. An equation like Albert Einstein’s $E = mc^2$ may be seen as a very compressed representation of what may be a very large set of data points relating energy ($E$) and mass ($m$), with the speed of light ($c$) as a constant. Similar things may be said about such well-known equations as the equations of motion derived from Newton’s second law of motion (e.g., $s = ut + \frac{1}{2}at^2$), Pythagoras’s equation ($a^2 + b^2 = c^2$), Boyle’s law ($pV = k$), and the equation of motion for a charged particle in electric and magnetic fields ($\mathbf{F} = q(\mathbf{E} + \mathbf{v} \times \mathbf{B})$).
(ii) Variants of ICMUP may be seen at work in mathematical notations. The second, third, and fourth of the variants of ICMUP outlined in Section 2.1 may be seen at work in mathematical notations. For example, multiplication as repeated addition may be seen as an example of run-length coding, as illustrated below.
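As a small worked illustration of that last point, the repeated element in a sum of identical terms may be written once, together with a count of the repetitions:

```latex
% Multiplication as run-length coding of repeated addition:
% the repeated element a is written once, with a count n.
\underbrace{a + a + \dots + a}_{n \text{ terms}} \;=\; n \times a
```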

Owing to the close connections between logic and mathematics and between computing and mathematics, it seems likely that similar principles apply in logic and in computing [2, Section 4].

Although in this research it has seemed necessary to avoid too much dependence on mathematics (for reasons outlined in Section 2.3), there is now the interesting possibility that the scope of mathematics may be greatly extended by incorporating within it such concepts as SP-multiple-alignment and other elements of the SP Theory [2, Section 7].

21. Evidence for ICHLPC via the SP System

Another strand of empirical evidence for ICHLPC is via the SP System and the central role within it of SP-multiple-alignment (Section 2.2.2), a variant of ICMUP which, as described in Section 2.1.7, encompasses the six others described in Section 2.1.

The evidence for ICHLPC via the SP System derives largely from the strengths of the SP System in modelling several aspects of HLPC, summarised in Section 2.2.5 and described in more detail in [6] and in [1].

22. Some Apparent Contradictions and How They May Be Resolved

The idea that IC is fundamental in HLPC, and also in the SP Theory as a theory of HLPC, seems to be contradicted by the following:
(i) The ways in which people may create redundant copies of information as well as how they may compress information.
(ii) The fact that redundancy in information is often useful in detecting and correcting errors and in the storage and processing of information.
(iii) A less direct challenge to ICHLPC, and to the SP Theory as a theory of HLPC, is the persuasive evidence, described by Gary Marcus [103], that in many respects the human mind is a kluge, meaning “a clumsy or inelegant—yet surprisingly effective—solution to a problem” (p. 2).

These apparent contradictions and how they may be resolved are discussed in Appendix C.

23. Conclusion

This paper presents evidence for the idea that much of human learning, perception, and cognition (HLPC) may be understood as IC, often via the matching and unification of patterns.

The paper is part of a programme of research developing the SP Theory of Intelligence and its realisation in the SP Computer Model—a theory which aims to simplify and integrate observations and concepts across artificial intelligence, mainstream computing, mathematics, and HLPC.

Since IC is central in the SP Theory, evidence for IC in HLPC, presented in this paper in Sections 4 to 20 inclusive (but excluding Section 21), strengthens empirical support for the SP Theory, viewed as a theory of HLPC.

More direct empirical evidence for the SP Theory as a theory of HLPC—summarised in Section 2.2.5—provides evidence for the IC in HLPC thesis which is additional to that in Sections 4 to 20 inclusive.

Four possible objections to the IC in HLPC thesis, and the SP Theory, are described in Appendix C, with answers to those objections.

The ideas developed in this research may be seen to be part of a “Big Picture” of the importance of IC in at least six areas, outlined in Section 2.6.

Appendix

A. Mathematics Associated with ICMUP and Mathematics Incorporated in the SP System

As mentioned in Section 2.3, this appendix details some mathematics associated with ICMUP and some of the mathematics incorporated in the SP System.

A.1. Searching for Repeating Patterns

At first sight, the process of searching for repeating patterns (Sections 2.1.1 and 2.2.2) is simply a matter of comparing one pattern with another to see whether they match each other or not. But there are, typically, many alternative ways in which patterns within a given body of information, I, may be compared—and some are better than others. We are interested in finding those matches between patterns that, via unification, yield most compression—and a little reflection shows that this is not a trivial problem [1, Section 2.2.8.4].

Maximising the amount of redundancy found means maximising $$R = \sum_{i=1}^{n} s_i \times (f_i - 1),$$ where $f_i$ is the frequency of the $i$th member of a set of $n$ patterns and $s_i$ is its size in bits. Patterns that are both big and frequent are best. This equation applies irrespective of whether the patterns are coherent substrings or patterns that are discontinuous within I.

Maximising $R$ means searching the space of possible unifications for the set of big, frequent patterns which gives the best value. For a sequence containing $N$ symbols, the number of possible subsequences (including single symbols and all composite patterns, both coherent and fragmented) is $$N_s = 2^N - 1.$$

The number of possible comparisons is the number of possible pairings of subsequences, which is $$N_c = \frac{N_s \times (N_s - 1)}{2}.$$

For all except the very smallest values of $N$, the value of $N_s$ is very large and the corresponding value of $N_c$ is huge. In short, the abstract space of possible comparisons between patterns, and thus the space of possible unifications, is, in the great majority of cases, astronomically large.
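A tiny computational sketch makes the growth of this space vivid, evaluating the two quantities above for a few small values of $N$:

```python
# Evaluate the number of possible subsequences (N_s) of a sequence of N
# symbols, and the number of possible pairings of those subsequences
# (N_c), for a few small values of N.

for N in (5, 10, 20, 40):
    n_s = 2 ** N - 1                  # possible subsequences
    n_c = n_s * (n_s - 1) // 2        # possible pairings of subsequences
    print(f"N = {N:2d}:  N_s = {n_s:,}  N_c = {n_c:,}")
```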

Since the space is normally so large, it is not feasible to search it exhaustively. For that reason, we cannot normally guarantee to find the theoretically ideal answer, and normally we cannot know whether or not we have found the theoretically ideal answer.

In general, we need to use heuristic methods in searching—conducting the search in stages and discarding all but the best results at the end of each stage—and we must be content with answers that are “reasonably good”.

A.2. Information, Compression of Information, Inductive Inference, and Probabilities

Solomonoff [17] seems to have been one of the first people to recognise the close connection that exists between IC and inductive inference (Section 2.5): predicting the future from the past and calculating probabilities for such inferences. The connection between them—which may at first sight seem obscure—lies in the redundancy-as-repetition-of-patterns view of redundancy and how that relates to IC (Section 2.1, [1, Section 2.2.11]):
(i) Patterns that repeat within I represent redundancy in I, and IC can be achieved by reducing multiple instances of any pattern to one.
(ii) When we make inductive predictions about the future, we do so on the basis of repeating patterns. For example, the repeating pattern “Spring, Summer, Autumn, Winter” enables us to predict that, if it is Spring time now, Summer will follow.

Thus IC and inductive inference are closely related to concepts of frequency and probability. Here are some of the ways in which these concepts are related:
(i) Probability has a key role in Shannon’s concept of information. In that perspective, the average quantity of information conveyed by one symbol in a sequence is $$H = -\sum_{i=1}^{a} p_i \log p_i,$$ where $p_i$ is the probability of the $i$th type in the alphabet of $a$ available alphabetic symbol types. If the base for the logarithm is 2, then the information is measured in “bits”.
(ii) Measures of frequency or probability are central in techniques for economical coding such as the Huffman method [4, Section 5.6] or the Shannon-Fano-Elias method [4, Section 5.9].
(iii) In the redundancy-as-repetition-of-patterns view of redundancy and IC, the frequencies of occurrence of patterns in I are a main factor (with the sizes of patterns) that determines how much compression can be achieved.
(iv) Given a body of (binary) data that has been “fully” compressed (so that it may be regarded as random or nearly so), its absolute probability may be calculated as $2^{-L}$, where $L$ is the length (in bits) of the compressed data (see the sketch below).
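Here is the sketch referred to above: a minimal computation of the average information per symbol, and of the probability $2^{-L}$ of a fully compressed binary string. Logarithms are to base 2, so results are in bits; the probabilities and the code length are invented for illustration.

```python
# A minimal sketch of two of the relationships above: Shannon's average
# information per symbol for a given set of symbol probabilities, and
# the absolute probability 2**(-L) of a fully compressed binary string
# of length L bits.

from math import log2

def entropy(probabilities):
    """Average information per symbol, in bits."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.25, 0.25]))     # 1.5 bits per symbol
print(entropy([0.25] * 4))            # 2.0 bits per symbol

L = 20                                # length of a fully compressed string
print(2 ** -L)                        # its absolute probability, about 1e-06
```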

Probability and IC may be regarded as two sides of the same coin. That said, they provide different perspectives on a range of problems. In this research, the IC perspective—with redundancy-as-repetition-of-patterns—seems to be more fruitful than viewing the same problems through the lens of probability. In the first case, one can see relatively clearly how compression may be achieved by the primitive operation of unifying patterns, whereas these ideas are obscured when the focus is on probabilities.

A.3. Random-Dot Stereograms

A particularly clear example of the kind of search described in Appendix A.1 is what the brain has to do to enable one to see the figure in the kinds of random-dot stereogram described in Section 11.

In this case, assuming that the left image has the same number of pixels as the right image, the size of the search space is $$N_c = N_s \times N_s,$$ where $N_s$ is the number of possible patterns in each image, calculated in the same way as was described in Appendix A.1. The fact that the images are two-dimensional needs no special provision because the original equations cover all combinations of atomic symbols.

For any stereogram with a realistic number of pixels, this space is very large indeed. Even with the very large processing power represented by the neurons in the brain, it is inconceivable that this space can be searched in a few seconds and to such good effect without the use of heuristic methods.

David Marr [104, Chapter 3] describes two algorithms that solve this problem. In line with what has just been said, both algorithms rely on constraints on the search space and both may be seen as incremental search guided by redundancy-related metrics.

A.4. Coding and the Evaluation of SP-Multiple-Alignments in Terms of IC

Given an SP-multiple-alignment like one of the two shown in Figure 4 (Section 2.2.2), one can derive a code SP-pattern from the SP-multiple-alignment in the following way:
(1) Scan the SP-multiple-alignment from left to right looking for columns that contain an SP-symbol by itself, not aligned with any other symbol.
(2) Copy these SP-symbols into a code pattern in the same order that they appear in the SP-multiple-alignment.

The code SP-pattern derived in this way from the SP-multiple-alignment shown in Figure 4 is “S 0 2 4 3 7 6 1 #S”. This is, in effect, a compressed representation of those symbols in the New pattern which form hits with Old symbols in the SP-multiple-alignment.

Given a code SP-pattern derived in this way, we may calculate a “compression difference” as $$CD = B_N - B_E$$ or a “compression ratio” as $$CR = B_N / B_E,$$ where $B_N$ is the total number of bits in those symbols in the New pattern which form hits with Old symbols in the SP-multiple-alignment and $B_E$ is the total number of bits in the code SP-pattern (the “encoding”) which has been derived from the SP-multiple-alignment as described above.

In each of these equations, $B_N$ is calculated as $$B_N = \sum_{i=1}^{h} s_i,$$ where $s_i$ is the size of the code for the $i$th symbol in the sequence of $h$ symbols comprising those symbols within the New pattern which form hits with Old symbols within the SP-multiple-alignment (Appendix A.5).

$B_E$ is calculated as $$B_E = \sum_{i=1}^{c} s_i,$$ where $s_i$ is the size of the code for the $i$th symbol in the sequence of $c$ symbols in the code pattern derived from the SP-multiple-alignment (Appendix A.5).
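A tiny sketch of these calculations, with invented bit values for the symbol codes:

```python
# Compute the compression difference (CD) and compression ratio (CR)
# from the sizes in bits of the codes for the New symbols that form
# hits, and for the symbols of the derived code SP-pattern. The bit
# values are invented for illustration.

new_hit_symbol_bits = [4.5, 3.2, 6.1, 2.8, 5.0, 4.4]   # sizes s_i for B_N
code_pattern_bits = [3.9, 2.6, 4.1]                    # sizes s_i for B_E

B_N = sum(new_hit_symbol_bits)
B_E = sum(code_pattern_bits)

CD = B_N - B_E          # compression difference
CR = B_N / B_E          # compression ratio

print(f"B_N = {B_N:.1f} bits, B_E = {B_E:.1f} bits, CD = {CD:.1f}, CR = {CR:.2f}")
```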

A.5. Encoding Individual Symbols

The simplest way to encode individual symbols in the New pattern and the set of Old patterns in an SP-multiple-alignment is with a “block” code using a fixed number of bits for each symbol. But the SP Computer Model uses variable-length codes for symbols, assigned in accordance with the Shannon-Fano-Elias coding scheme [4, Section 5.9], so that the shortest codes represent the most frequent alphabetic symbol types and vice versa. Although this scheme is slightly less efficient than the well-known Huffman scheme, it has been adopted because it avoids some anomalous results that can arise with the Huffman scheme.

For the Shannon-Fano-Elias calculation, the frequency of each alphabetic symbol type, $F$, is calculated as $$F = \sum_{i=1}^{P} f_i \times o_i,$$ where $f_i$ is the (notional) frequency of the $i$th pattern in the collection of Old SP-patterns (the grammar) used in the creation of the given SP-multiple-alignment, $o_i$ is the number of occurrences of the given symbol in the $i$th SP-pattern in the grammar, and $P$ is the number of SP-patterns in the grammar.
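A small sketch of this calculation for a single alphabetic symbol type, with an invented grammar and invented notional frequencies:

```python
# Sum, over the SP-patterns in the grammar, of the pattern's notional
# frequency times the number of occurrences of the given symbol in that
# pattern. The grammar and frequencies are invented for illustration.

grammar = [                      # (SP-pattern as a list of symbols, frequency)
    (["N", "P", "b", "o", "y", "#N"], 300),
    (["N", "P", "t", "o", "y", "#N"], 120),
    (["V", "e", "n", "j", "o", "y", "#V"], 80),
]

def symbol_frequency(symbol, grammar):
    return sum(freq * pattern.count(symbol) for pattern, freq in grammar)

print(symbol_frequency("o", grammar))   # 300*1 + 120*1 + 80*1 = 500
print(symbol_frequency("y", grammar))   # 500 as well
```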

A.6. Calculation of Probabilities Associated with any Given SP-Multiple-Alignment

As may be seen in [1, Chapter 7], the formation of SP-multiple-alignments in the SP framework supports a variety of kinds of probabilistic reasoning. The core idea is that any Old symbol in a SP-multiple-alignment that is not aligned with a New symbol represents an inference that may be drawn from the SP-multiple-alignment. This section describes how absolute and relative probabilities for such inferences may be calculated.

A.6.1. Absolute Probabilities

Any sequence of $L$ symbols, drawn from an alphabet of $|A|$ alphabetic types, represents one point in a set of $N$ points, where $N$ is calculated as $N = |A|^L$. If we assume that the sequence is random or nearly so, which means that the $N$ points are equi-probable or nearly so, the probability of any one point (which represents a sequence of length $L$) is close to $p_{ABS} = |A|^{-L}$. In the SP Computer Model, where symbol codes are expressed in bits, the value of $|A|$ is 2 and $L$ is $B_E$, the size in bits of the code derived from the SP-multiple-alignment (Appendix A.4).

This equation may be used to calculate the absolute probability, $p_{ABS}$, of the code derived from the SP-multiple-alignment as described in Appendix A.4. The value of $p_{ABS}$ may also be regarded as the absolute probability of any inferences that may be drawn from the SP-multiple-alignment, as described in [1, Section 7.2.2].

A.6.2. Relative Probabilities

The absolute probabilities of SP-multiple-alignments, calculated as described in the last subsection, are normally very small and not very interesting in themselves. From the standpoint of practical applications, we are normally interested in the relative values of probabilities, not their absolute values.

The procedure for calculating relative values for probabilities ($p_{REL}$) is as follows:
(1) For the SP-multiple-alignment which has the highest $CD$ (which we shall call the reference SP-multiple-alignment), identify the symbols from New which are encoded by the SP-multiple-alignment. We will call these symbols the reference set of symbols in New.
(2) Compile a reference set of SP-multiple-alignments which includes the SP-multiple-alignment with the highest $CD$ and all other SP-multiple-alignments (if any) which encode exactly the reference set of symbols from New, neither more nor less.
(3) The SP-multiple-alignments in the reference set are examined to find and remove any rows which are redundant in the sense that all the symbols appearing in a given row also appear in another row in the same order. (If Old is well compressed, this kind of redundancy amongst the rows of an SP-multiple-alignment should not appear very often.) Any SP-multiple-alignment which, after editing, matches another SP-multiple-alignment in the set is removed from the set.
(4) Calculate the sum of the values of $p_{ABS}$ in the reference set of SP-multiple-alignments: $p_{SUM} = \sum_{i=1}^{R} p_{ABS_i}$, where $R$ is the size of the reference set of SP-multiple-alignments and $p_{ABS_i}$ is the value of $p_{ABS}$ for the $i$th SP-multiple-alignment in the reference set.
(5) For each SP-multiple-alignment in the reference set, calculate its relative probability as $p_{REL_i} = p_{ABS_i} / p_{SUM}$.

The values of $p_{REL}$ calculated as just described seem to provide an effective means of comparing the SP-multiple-alignments in the reference set. Normally, the reference set will comprise those SP-multiple-alignments which encode the same set of symbols from New as the SP-multiple-alignment which has the best overall $CD$.
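
To make the arithmetic concrete, the following C sketch computes $p_{ABS} = 2^{-B_E}$ for each member of a notional reference set of SP-multiple-alignments (taking $|A| = 2$ and $L = B_E$, as above) and then normalises those values to give $p_{REL}$. The $B_E$ values are invented for illustration and are not taken from the SP Computer Model.

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Illustrative encoding sizes, B_E, in bits, for a reference set of
       three SP-multiple-alignments that all encode the same symbols
       from New. */
    double B_E[] = { 20.0, 22.5, 25.0 };
    int R = 3;                          /* size of the reference set */

    double p_abs[3], p_sum = 0.0;
    for (int i = 0; i < R; i++) {
        p_abs[i] = pow(2.0, -B_E[i]);   /* p_ABS = |A|^-L with |A| = 2, L = B_E */
        p_sum += p_abs[i];
    }

    for (int i = 0; i < R; i++) {
        double p_rel = p_abs[i] / p_sum;  /* relative probability */
        printf("alignment %d: p_ABS = %.3g, p_REL = %.3f\n", i, p_abs[i], p_rel);
    }
    return 0;
}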

A.7. Sifting and Sorting of SP-Patterns in Unsupervised Learning in the SP System

In the process of unsupervised learning in the SP System (Section 2.2.3, [1, Chapter 9]), which starts with a set of New SP-patterns, there is a process of sifting and sorting Old SP-patterns that are created by the SP System to develop one or more alternative collections of Old SP-patterns (grammars), each one of which scores well in terms of its capacity for the economical encoding of the given set of New SP-patterns.

When all the New SP-patterns have been processed in this way, there is a set of full SP-multiple-alignments, divided into disjoint subsets, one subset for each SP-pattern from the given set of New SP-patterns. From these SP-multiple-alignments, the program computes the frequency of occurrence of each Old SP-pattern, $P_i$, as $f_i = \sum_{j=1}^{n} F_{ij}$, where $F_{ij}$ is the maximum number of times that $P_i$ appears in any one of the SP-multiple-alignments in the subset $a_j$, and $n$ is the number of subsets.

The program also compiles an alphabet of the alphabetic symbol types, $t_1 \ldots t_m$, in the Old SP-patterns and, following the principles just described, computes the frequency of occurrence of each alphabetic symbol type, $t_k$, as $f_k = \sum_{j=1}^{n} F_{kj}$, where $F_{kj}$ is the maximum number of times that $t_k$ appears in any one SP-multiple-alignment in subset $a_j$. From these values, the encoding cost of each alphabetic symbol type is computed using the Shannon-Fano-Elias method as before [4, Section 5.9].
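
The following C sketch illustrates the frequency calculation for a single Old SP-pattern: for each subset of SP-multiple-alignments, it finds the maximum number of appearances of that SP-pattern in any one SP-multiple-alignment and then sums those maxima. The counts and array sizes are invented for illustration and are not taken from the SP Computer Model.

#include <stdio.h>

/* For one Old SP-pattern: counts of how many times it appears in each
   SP-multiple-alignment, grouped into subsets (one subset per New
   SP-pattern). The numbers are invented for illustration. */
#define N_SUBSETS 3
#define N_ALIGNMENTS 4

int main(void)
{
    int counts[N_SUBSETS][N_ALIGNMENTS] = {
        { 1, 2, 0, 1 },   /* subset for the 1st New SP-pattern */
        { 0, 1, 1, 0 },   /* subset for the 2nd New SP-pattern */
        { 3, 1, 2, 2 }    /* subset for the 3rd New SP-pattern */
    };

    int f = 0;  /* frequency of the Old SP-pattern */
    for (int j = 0; j < N_SUBSETS; j++) {
        int max_j = 0;  /* maximum count in any one alignment of subset j */
        for (int k = 0; k < N_ALIGNMENTS; k++)
            if (counts[j][k] > max_j) max_j = counts[j][k];
        f += max_j;     /* f_i = sum over subsets of the per-subset maximum */
    }

    printf("frequency of the Old SP-pattern: %d\n", f);  /* 2 + 1 + 3 = 6 */
    return 0;
}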

In the process of building alternative grammars, the tree of such alternatives is pruned periodically to keep it within reasonable bounds. Values for $G$, $E$, and $G + E$ (which we will refer to as $T$) are calculated for each grammar and, at each stage, grammars with high values for $T$ are eliminated.

For a given grammar comprising SP-patterns $P_1 \ldots P_g$, the value of $G$ is calculated as $G = \sum_{i=1}^{g} \sum_{j=1}^{L_i} s_{ij}$, where $L_i$ is the number of symbols in the $i$th SP-pattern and $s_{ij}$ is the encoding cost of the $j$th SP-symbol in that SP-pattern.

Given that each grammar is derived from a set of SP-multiple-alignments (one SP-multiple-alignment for each SP-pattern from New), the value of $E$ for the grammar is calculated as $E = \sum_{i=1}^{N} e_i$, where $e_i$ is the size, in bits, of the code SP-pattern derived from the $i$th SP-multiple-alignment and $N$ is the number of SP-multiple-alignments in the set.
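
The following C sketch illustrates the scoring of one grammar: it sums the encoding costs of the symbols in each SP-pattern to give $G$, sums the sizes of the code SP-patterns to give $E$, and reports $T = G + E$. All numerical values are invented for illustration and the sketch is not part of the SP Computer Model.

#include <stdio.h>

int main(void)
{
    /* Illustrative encoding costs (in bits) of the symbols in a tiny
       grammar of two SP-patterns, and the sizes (in bits) of the code
       SP-patterns derived from three SP-multiple-alignments. */
    double pattern1[]   = { 2.0, 4.5, 3.0 };
    double pattern2[]   = { 2.0, 5.5, 3.0, 3.0 };
    double code_sizes[] = { 12.0, 9.5, 14.0 };

    double G = 0.0;  /* size of the grammar */
    for (int j = 0; j < 3; j++) G += pattern1[j];
    for (int j = 0; j < 4; j++) G += pattern2[j];

    double E = 0.0;  /* size of the encoding of New in terms of the grammar */
    for (int i = 0; i < 3; i++) E += code_sizes[i];

    double T = G + E;  /* grammars with high values of T are eliminated */
    printf("G = %.1f, E = %.1f, T = %.1f bits\n", G, E, T);
    return 0;
}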

A.8. Finding Good Matches between Two Sequences of Symbols

At the heart of the SP Computer Model is a process for finding good matches between two sequences of symbols, mentioned in Section 2.2.2 and described quite fully in [1, Appendix A]. What has been developed is a version of dynamic programming with the advantage that it can find two or more good matches between sequences, not just one good match.

The search process uses a measure of probability as its metric. This metric provides a means of guiding the search which is effective in practice and appears to have a sound theoretical basis. To define the metric and to justify it theoretically, it is necessary first to define the terms and variables on which it is based:
(i) A sequence of matches between two sequences, sequence1 and sequence2, is called a “hit sequence”.
(ii) For each hit sequence $h_1 \ldots h_n$, there is a corresponding series of gaps, $g_1 \ldots g_n$. For any one hit, $h_i$, the corresponding gap is $g_i = q_i + d_i$, where $q_i$ is the number of unmatched characters in the query between the query character for the given hit in the series and the query character for the immediately preceding hit, and $d_i$ is the equivalent gap in the database; $g_1$ is taken to be 0.
(iii) $|A|$ is the size of the alphabet of symbol types used in sequence1 and sequence2.
(iv) $p$ is the probability of a match between any one symbol in sequence1 and any one symbol in sequence2 on the null hypothesis that all hits are equally probable at all locations. Its value is calculated as $p = 1/|A|$.

Using these definitions, the probability of any hit sequence of length $n$ is calculated as $p_n = \prod_{i=1}^{n} p\,(1 - p)^{g_i}$.

With this equation, it is relatively easy to calculate the probability of the hit sequence up to and including any hit by using the stored probability of the hit sequence up to and including the immediately preceding hit.
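
The following C sketch illustrates that incremental calculation, assuming the product form of the equation above: after each hit, the stored probability is multiplied by $p (1 - p)^{g_i}$ for the new hit. The gap values and the alphabet size are invented for illustration and the sketch is not part of the SP Computer Model.

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Illustrative hit sequence: gaps (q_i + d_i) for each of five hits,
       with g_1 taken to be 0, and an alphabet of 26 symbol types. */
    int    gaps[] = { 0, 2, 0, 5, 1 };
    int    n = 5;
    double p = 1.0 / 26.0;   /* probability of a chance match, p = 1/|A| */

    double prob = 1.0;
    for (int i = 0; i < n; i++) {
        /* update the stored probability using only the new hit's gap */
        prob *= p * pow(1.0 - p, gaps[i]);
        printf("after hit %d: probability = %.3g\n", i + 1, prob);
    }
    return 0;
}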

B. Barlow’s Change of View about the Significance of IC in Mammalian Learning, Perception, and Cognition, with Comments

As noted in Section 3.1.1, Horace Barlow [34, p. 242] argued that “… the [compression] idea was right in drawing attention to the importance of redundancy in sensory messages … but it was wrong in emphasizing the main technical use for redundancy, which is compressive coding.” His main arguments follow, with my comments after each one, flagged with “JGW”.

B.1. “Redundancy Is Not Something Useless That Can Be Stripped off and Ignored”

“It is important to realize that redundancy is not something useless that can be stripped off and ignored. An animal must identify what is redundant in its sensory messages, for this can tell it about structure and statistical regularity in its environment that are important for its survival.” [34, p. 243], and “It is … knowledge and recognition of … redundancy, not its reduction, that matters.” [34, p. 244].

JGW: Barlow is right to say that knowledge of and recognition of redundancy is important “for this can tell [an animal] about structure and statistical regularity in its environment that are important for its survival”. In keeping with that remark, knowledge of the frequency of occurrence of any pattern may serve in the calculation of absolute and relative probabilities ([1, Section 3.7], [6, Section 4.4]) and it can be the key to the correction of errors, as Barlow mentions in the quote from him in the heading of Appendix B.2.

But, in the SP System, redundancy is not treated as “something useless that can be stripped off and ignored”. Patterns that repeat are reduced to a single instance and the frequency of occurrence of that single instance is recorded. The existence of single instances like that, each with a record of its frequency of occurrence, is very important, both in the way that the SP System builds its model of the world and also in the way that it makes inferences and calculates probabilities of those inferences.

As noted in Section 10, if we did not compress sensory information, “our brains would quickly become cluttered with millions of copies of things that we see around us—people, furniture, cups, trees, and so on—and likewise for sounds and other sensory inputs”. And, as noted in Section 3.1.1, Barlow himself has pointed out that the mismatch between the relatively large amounts of information falling on the retina and the relatively small transmission capacity of the optic nerve suggests that sensory information is likely to be compressed [31, p. 548]. And he has also pointed out that, in animals like cats, monkeys, and humans, “one obvious type of redundancy in the messages reaching the brain is the very nearly exact reduplication of one eye’s message by the other eye” [32, p. 213], and because we normally see one view, not two, the duplication implies that the two views are merged and thus compressed. In general, the evidence presented in Sections 4 to 21 points strongly to IC as a prominent feature of HLPC.

B.2. “Redundancy Is Mainly Useful for Error Avoidance and Correction”

JGW: The heading above, from [34, p. 244], implies that compression of information via the reduction of redundancy is relatively unimportant, in keeping with the quotes from Barlow in the previous subsection.

Redundancy can certainly be useful in the avoidance of or correction of errors (Appendix C.2). But experience in the development and application of the SP Computer Model has shown that compression of information via the reduction of redundancy is also needed for such tasks as the parsing of natural language, pattern recognition, and grammatical inference. And compression of information may on occasion be intimately related to the correction of errors of omission, commission, and substitution, as described in Appendix C.2 and illustrated in Figure 19 (see also [6, Section 4.2.2] and [1, Section 6.2]).

B.3. “There Are Very Many More Neurons at Higher Levels in the Brain” and “Compressed, Non-Redundant, Representation Would Not Be at All Suitable for the Kinds of Task That Brains Have to Perform”

Following the remark that “This is the point on which my own opinion has changed most, partly in response to criticism and partly in response to new facts that have emerged.” [34, p. 244], Barlow writes:

“Originally both Attneave and I strongly emphasized the economy that could be achieved by recoding sensory messages to take advantage of their redundancy, but two points have become clear since those early days. First, anatomical evidence shows that there are very many more neurons at higher levels in the brain, suggesting that redundancy does not decrease, but actually increases. Second, the obvious forms of compressed, non-redundant, representation would not be at all suitable for the kinds of task that brains have to perform with the information represented; …” [34, pp. 244–245].

and

“I think one has to recognize that the information capacity of the higher representations is likely to be greater than that of the representation in the retina or optic nerve. If this is so, redundancy must increase, not decrease, because information cannot be created.” [34, p. 245].

JGW: There seem to be two problems here:
(i) The likelihood that there are “very many more neurons at higher levels in the brain [than at the sensory levels]” and that “the information capacity of the higher representations is likely to be greater than that of the representation in the retina or optic nerve” need not invalidate ICHLPC. It seems likely that many of the neurons at higher levels are concerned with the storage of one’s accumulated knowledge over the period from one’s birth to one’s current age ([1, Chapter 11], [8, Section 4]). By contrast, neurons at the sensory level would be concerned only with the processing of sensory information at any one time. Although knowledge in one’s long-term memory stores is likely to be highly compressed and only a partial record of one’s experiences, it is likely, for most of one’s life except early childhood, to be very much larger than the sensory information one is processing at any one time. Hence, it should be no surprise to find many more neurons at higher levels than at the sensory level.
(ii) For reasons given in Appendix B.4, next, there are reasons for doubting the proposition that “the obvious forms of compressed, nonredundant, representation would not be at all suitable for the kinds of task that brains have to perform with the information represented”.

B.4. “Compressed Representations Are Unsuitable for the Brain”

Under the heading above, Barlow writes:

“The typical result of a redundancy-reducing code would be to produce a distributed representation of the sensory input with a high activity ratio, in which many neurons are active simultaneously, and with high and nearly equal frequencies. It can be shown that, for one of the operations that is most essential in order to perform brain-like tasks, such high activity-ratio distributed representations are not only inconvenient, but also grossly inefficient from a statistical viewpoint …” [34, p. 245].

JGW: With regard to these points:
(i) It is not clear why Barlow should assume that a redundancy-reducing code would, typically, produce a distributed representation, or that compressed representations are unsuitable for the brain. The SP System is dedicated to the creation of nondistributed compressed representations which work very well in several aspects of intelligence, as outlined in Section 2.2.5 with pointers to where fuller information may be found. And in [8] it is argued that, in SP-Neural, such representations can be mapped on to plausible structures of neurons and their interconnections that are quite similar to Donald Hebb’s [9] concept of a “cell assembly”.
(ii) With regard to efficiency:
(a) It is true that deep learning in artificial neural networks [10], with their distributed representations, is often hungry for computing resources, with the implication that such networks are inefficient. But otherwise they are quite successful with certain kinds of task, and there appears to be scope for increasing their efficiency [105].
(b) The SP System demonstrates that its compressed localist representations are efficient and effective in a variety of kinds of task, as outlined in Section 2.2.5 with pointers to where fuller information may be found.

C. Some Apparent Contradictions of ICHLPC and the SP Theory, and How They May Be Resolved

The apparent contradictions of ICHLPC, and of the SP Theory as a theory of HLPC, that were mentioned in Section 22 are discussed in the following three subsections, with suggested answers to those apparent contradictions.

C.1. Redundancy May Be Created by Human Brains and via Mathematics and Computing

Any person may create redundancy by simply repeating any action, including any portion of speech or writing. Although this seems to contradict the ICHLPC thesis, the contradiction may be resolved as described in the following subsections.

C.1.1. Creating Redundancy via IC

With a computer, it is very easy to create information containing large amounts of redundancy and to do it by a process which may itself be seen to entail the compression of information.

We can, for example, make a “call” to the function defined in Algorithm 1, using the pattern “oranges_and_lemons(100)”. The effect of that call is to print out a highly redundant sequence containing 100 copies of the expression “Oranges and lemons, Say the bells of St. Clement’s;”.

#include <stdio.h>
/* Print x copies of the line, using recursion. */
void oranges_and_lemons(int x)
{
    printf("Oranges and lemons, Say the bells of St. Clement's; ");
    if (x > 1) oranges_and_lemons(x - 1);
}

Taking things step by step, this works as follows:
(1) The pattern “oranges_and_lemons(100)” is matched with the pattern “void oranges_and_lemons(int x)” in the first line of the function.
(2) The two instances of “oranges_and_lemons” are unified and the value 100 is assigned to the variable x. The assignment may also be understood in terms of the matching and unification of patterns, but the details would be a distraction from the main point here.
(3) The instruction “printf("Oranges and lemons, Say the bells of St. Clement's; ");” in the function has the effect of printing out “Oranges and lemons, Say the bells of St. Clement's; ”.
(4) Then, if x > 1, the instruction “oranges_and_lemons(x - 1)” has the effect of calling the function again but this time with 99 as the value of x (because of the expression “x - 1” in the pattern “oranges_and_lemons(x - 1)”, meaning that 1 is to be subtracted from the current value of x).
(5) Much as with the first call to the function (item 1, above), the pattern “oranges_and_lemons(99)” is matched with the pattern “void oranges_and_lemons(int x)” in the first line of the function.
(6) Much as before, the two instances of “oranges_and_lemons” are unified and the value 99 is assigned to the variable x.
(7) This cycle continues until the value of x - 1 is 0.

Where does compression of information come in? It happens mainly when one copy of “oranges_and_lemons” is matched and unified with another copy so that, in effect, two copies are reduced to one.

There is more about recursion in Appendix C.1.4 below.

C.1.2. A Simple Example of “Decompression by Compression”

In the retrieval of compressed information, the chunking-with-codes idea outlined in Section 2.1.2 provides a simple example of decompression by compression, as illustrated in the sketch below:
(i) Compression of information. If, for example, a document contains many instances of the expression “Treaty on the Functioning of the European Union”, we may shorten the document by giving that expression a relatively short name or code like “TFEU” and then replacing all but one of the instances of the long expression with its shorter code. This achieves compression of information because, in effect, multiple instances of “Treaty on the Functioning of the European Union” have been reduced to one via matching and unification.
(ii) Retrieval of compressed information. We can reverse the process, and thus decompress the document, by searching for instances of “TFEU” and replacing each one with “Treaty on the Functioning of the European Union”. But to achieve that result, the search pattern “TFEU” needs to be matched and unified with each instance of “TFEU” in the document. And that process of matching and unification is itself a process of compressing information. Hence, decompression of information has been achieved via IC!
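
The two-way process may be illustrated with the C sketch below. For simplicity, the sketch replaces every instance of the chunk rather than all but one, and the helper function replace_all is an invented name, not part of the SP System; the point of the sketch is that compression and decompression both depend on the same operation of finding matches for a search pattern.

#include <stdio.h>
#include <string.h>

/* Copy 'src' to 'dst', replacing every instance of 'from' with 'to'.
   Both directions of chunking-with-codes use this same operation:
   finding matches for a search pattern and unifying with them. */
static void replace_all(const char *src, const char *from,
                        const char *to, char *dst)
{
    size_t flen = strlen(from);
    dst[0] = '\0';
    while (*src != '\0') {
        const char *hit = strstr(src, from);
        if (hit == NULL) { strcat(dst, src); return; }
        strncat(dst, src, (size_t)(hit - src));  /* text before the hit */
        strcat(dst, to);                         /* unify hit with code */
        src = hit + flen;
    }
}

int main(void)
{
    const char *chunk = "Treaty on the Functioning of the European Union";
    const char *code  = "TFEU";
    const char *text  =
        "The Treaty on the Functioning of the European Union says X. "
        "The Treaty on the Functioning of the European Union says Y.";

    char compressed[512], restored[512];
    replace_all(text, chunk, code, compressed);     /* compression   */
    replace_all(compressed, code, chunk, restored); /* decompression */

    printf("compressed: %s\n", compressed);
    printf("restored matches original: %s\n",
           strcmp(restored, text) == 0 ? "yes" : "no");
    return 0;
}

Notice that, in the sketch, the restoration step does exactly the same kind of work as the compression step: it must find each instance of a search pattern before replacing it, which is the point made in item (ii) above.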

C.1.3. How the SP System May Achieve Decompression by Compression

How the SP System may, with appropriate input, achieve decompression by compression is described in [1, Section 3.8] and [6, Section 4.5]. There are two key points: (a) decompression of a body of information I may be achieved by a process which is exactly the same as the process that achieved the original compression of I, with no modification to the program of any kind; and (b) all that is needed to achieve decompression is to ensure that there is some residual redundancy in the compressed version of I, so that the program has something to work on.

Figure 17 shows a simple example. In the SP-multiple-alignment shown in Figure 17(a), the very simple sentence “j o h n r u n s”, in row 0, has been recognised as a sentence comprising a noun followed by a verb.

A “code” for this analysis may be obtained by scanning the SP-multiple-alignment from left to right, picking out the SP-symbols that have not been aligned with any other symbol ([6, Section 4.1], [1, Section 3.5]). The result in this case is the SP-pattern “S s0 n1 v0 #S”. Without worrying about the details of how many bits are required for each SP-symbol (which has nothing to do with the textual size of each SP-symbol—see [6, Section 4.1] and [1, Section 3.5.2.1]), we can see that there has been a moderate compression of information because 8 SP-symbols in the sentence have been encoded with 5 other SP-symbols.

In Figure 17(b), the process is reversed. Now the code SP-pattern “S s0 n1 v0 #S” is supplied to the program as a New SP-pattern. Each of the SP-symbols in that SP-pattern is given extra bits of information to ensure that the program has some redundancy to work on, as mentioned above. The best SP-multiple-alignment that is created in this case contains “j o h n” followed by “r u n s”, which is of course the original sentence, recreated via its code SP-symbols.

In general, the SP Computer Model, which is devoted to the compression of information, can reverse the process without any modification. It achieves “decompression by compression” without any paradox or contradiction.

C.1.4. How the SP System May Create Redundancy via Recursion

The SP Computer Model may also create redundancy via recursion, as illustrated in Figure 18.

In this example, the SP Computer Model is supplied with two Old SP-patterns—”X b X #X 1 #X” and “X a 0 #X”—and a one-symbol New SP-pattern: “0”. The program processes this information like this:
(1) The SP-symbol “0” in the New SP-pattern is matched with, and implicitly unified with, the same SP-symbol in the Old SP-pattern “X a 0 #X”, as shown in rows 0 and 1 in the figure.
(2) The SP-symbols “X” and “#X” at the beginning and end of “X a 0 #X” are matched and unified with the same two symbols at the third and fourth positions in the SP-pattern “X b X #X 1 #X”, as shown in rows 1 and 2 in the figure.
(3) The SP-symbols “X” and “#X” at the beginning and end of “X b X #X 1 #X” are matched and unified with the same two symbols at the third and fourth positions in that same SP-pattern, as shown in rows 2 and 3 in the figure.
(4) After that, the process in step 3 repeats, as shown in rows 3 and 4 and rows 4 and 5 of the figure—and it may carry on like this, producing many SP-multiple-alignments, until the operator stops it, or computer memory is exhausted.

If the matching symbols in Figure 18 are all unified (merging each matching pair into a single symbol), the result is a single sequence like this: “X b X b X b X b X a 0 #X 1 #X 1 #X 1 #X 1 #X”, and likewise for all the many other SP-multiple-alignments that the program may produce. With all but the simplest of those SP-multiple-alignments, there would be redundancy in the repetition of the symbol “1” and likewise for other symbols in the figure. Hence, the SP Computer Model has created redundancy by a process which is devoted to the compression of information.
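As a toy illustration of the redundancy created by this kind of recursion, the following C sketch prints the unified sequence for a given depth of recursion, mirroring the example above. The function unified_sequence is an invented name and the sketch is not part of the SP Computer Model.

#include <stdio.h>

/* Print the unified sequence that results from matching the pattern
   "X b X #X 1 #X" with itself 'depth' times around "X a 0 #X".
   A toy illustration only. */
static void unified_sequence(int depth)
{
    for (int i = 0; i < depth; i++) printf("X b ");
    printf("X a 0 #X ");
    for (int i = 0; i < depth; i++) printf("1 #X ");
    printf("\n");
}

int main(void)
{
    unified_sequence(4);
    /* prints: X b X b X b X b X a 0 #X 1 #X 1 #X 1 #X 1 #X */
    return 0;
}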

C.2. Redundancy Is Often Useful in the Detection and Correction of Errors and in the Storage and Processing of Information

The fact that redundancy—repetition of information—is often useful in the detection and correction of errors and in the storage and processing of information, and the fact that these things are true in biological systems as well as artificial systems, is the second apparent contradiction of ICHLPC and of the SP Theory as a theory of HLPC. Here are some examples:
(i) Backup copies. With any kind of database, it is normal practice to maintain one or more backup copies as a safeguard against catastrophic loss of the data. Each backup copy represents redundancy in the system.
(ii) Mirror copies. With information on the Internet, it is common practice to maintain two or more mirror copies in different places to minimise transmission times and to spread processing loads across two or more sites, thus reducing the chance of overload at any one site. Again, each mirror copy represents redundancy in the system.
(iii) Redundancies as an aid to the correction of errors. Redundancies in natural language can be a very useful aid to the comprehension of speech in noisy conditions.
(iv) Redundancies in electronic messages. It is normal practice to add redundancy to electronic messages, in the form of additional bits of information such as checksums, and also by repeating the transmission of any part of a message that has become corrupted. These things help to safeguard messages against accidental errors caused by such things as birds flying across transmission beams, electronic noise in the system, and so on.

In information processing systems of any kind, uses of redundancy of the kind just described may coexist with ICMUP. For example, “… it is entirely possible for a database to be designed to minimise internal redundancies and, at the same time, for redundancies to be used in backup copies or mirror copies of the database … Paradoxical as it may sound, knowledge can be compressed and redundant at the same time” [1, Section 2.3.7].

As noted in Appendix C.1.3, the SP System, which is dedicated to the compression of information, will not work properly with totally random information containing no redundancy. It needs redundancy in its “New” data in order to achieve such things as the parsing of natural language, pattern recognition, and grammatical inference. Also, for the correction of errors in any incoming batch of New SP-patterns, it needs a repository of Old patterns that represent patterns of redundancy in a previously processed body of New information.

Figure 19 shows two SP-multiple-alignments that illustrate error correction by the SP Computer Model. Figure 19(a) shows, as a reference standard, a parsing of the sentence “t w o k i t t e n s p l a y” in row 0 where that New SP-pattern is free of errors. For comparison, Figure 19(b) shows a parsing in which the New SP-pattern in row 0 contains an error of omission (“t w o” is changed to “t o”), an error of substitution (“k i t t e n s” is changed to “k i t t e m s”), and an error of addition (“p l a y” is changed to “p l a x y”). Despite these three errors, the best SP-multiple-alignment created by the SP Computer Model is what would normally be regarded as correct.

This example illustrates the point, mentioned in Appendix B.2, that the exploitation of redundancy for the correction of errors may on occasion be intimately related to the exploitation of redundancy for the compression of information.

C.3. The Human Mind as a Kluge

As mentioned in Section 22, Gary Marcus has described persuasive evidence that, in many respects, the human mind is a kluge. To illustrate the point, here is a sample of what Marcus says:

“Our memory is both spectacular and a constant source of disappointment: we recognize photos from our high school year-books decades later—yet find it impossible to remember what we had for breakfast yesterday. Our memory is also prone to distortion, conflation, and simple failure. We can know a word but not be able to remember it when we need it … or we can learn something valuable and promptly forget it. The average high school student spends four years memorising dates, names, and places, drill after drill, and yet a significant number of teenagers can’t even identify the century in which World War I took place” [103, p. 18, emphasis as in the original].

Clearly, human memory is, in some respects, much less effective than a computer disk drive or even a book. And it seems likely that at least part of the reason for this and other shortcomings of the human mind is that “Evolution [by natural selection] tends to work with what is already in place, making modifications rather than starting from scratch” and “piling new systems on top of old ones” [103, p. 12].

The evidence that Marcus presents is persuasive: it is difficult to deny that, in certain respects, the human mind is a kluge. And evolution by natural selection provides a plausible explanation for anomalies and inconsistencies in the workings of the human mind.

Broadly in keeping with these ideas, Marvin Minsky has suggested that “each [human] mind is made of many smaller processes” called agents, each one of which “can only do some simple thing that needs no mind or thought at all. Yet when we join these agents in societies—in certain very special ways—this leads to true intelligence” [106, p. 17]. Perhaps errors here and there in a society of agents might explain the anomalies and inconsistencies in human thinking that Marcus has described.

Superficially, evidence and arguments presented by Marcus and Minsky seem to undermine the idea that there is some grand unifying principle—such as IC via SP-multiple-alignment—that governs the organisation and workings of the human mind. But those conclusions are entirely compatible with ICHLPC and the SP Theory as a theory of mind. As Marcus says, “I don’t mean to chuck the baby along with its bath—or even to suggest that kluges outnumber more beneficial adaptations. The biologist Leslie Orgel once wrote that “Mother Nature is smarter than you are,” and most of the time it is” [103, p. 16], although Marcus warns that in comparisons between artificial systems and natural ones, nature does not always come out on top.

In general, it seems that, despite the evidence for kluges in the human mind, there can be powerful organising principles too. Since ICHLPC and the SP Theory are well supported by evidence, they are likely to provide useful insights into the nature of human intelligence, alongside an understanding that there are likely to be kluge-related anomalies and inconsistencies too.

Minsky’s counsel of despair—“The power of intelligence stems from our vast diversity, not from any single, perfect principle” [106, p. 308]—is probably too strong. It is likely that there is at least one unifying principle for human-level intelligence, and there may be more. And it is likely that, with people, any such principle or principles operate alongside the somewhat haphazard influences of evolution by natural selection.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

Thanks are due to Dr. Isabel Forbes for the assignment of grammatical class symbols to words in a given text and for phrase-structure analyses, as described in Section 15.2.