Abstract

A developmental model of algorithmic concepts is proposed here for program comprehension. Unlike traditional approaches that cannot do anything beyond their predesigned representation, this model can develop its internal representation autonomously from chaos into algorithmic concepts by mimicking concept formation in the brain under an uncontrollable environment that consists of program source codes from the Internet. The developed concepts can be employed to identify what algorithm a program performs. The accuracy of such identification reached 97.15% in a given experiment.

1. Introduction

Our idea of autonomous development of algorithmic concepts for program comprehension is inspired by the autonomous development paradigm [1] for constructing developmental robots.

Program comprehension [2, 3], also known as program understanding, is concerned with ways of analyzing source code for purposes such as code reuse [4], code plagiarism detection [5, 6], algorithm recognition [7–11], and programming tutoring [12]. Over the past decades, many approaches have been proposed. Most of them depend on predesigned representations, which include [13] mental models that describe human mental representations of the program to be understood, cognitive models that describe the cognitive processes and temporary information structures in the programmer’s head that are used to form mental models, and programming plans, which are generic fragments of code that represent typical scenarios in programming. In these traditional approaches, a machine cannot do anything beyond the predesigned representation. For example, traditional approaches to algorithm recognition are unable to recognize algorithms whose programming plans or templates [7–10] are not defined in the library of algorithm templates.

In contrast, the autonomous development paradigm, called autonomous mental development (AMD), enables machines to develop their minds autonomously when they interact with their environments [1, 14]. With AMD, a robot can learn any tasks, including those whose representations are not defined before the robot is born. Our previous work shows that applying the AMD theory to program comprehension could avoid predefining templates for algorithm recognition [11].

Unlike predesigned representations that will not change when the machine runs, the representation proposed in this paper will change or develop gradually from randomness to algorithmic concepts when the machine interacts with its environment that consists of program source codes. This is similar to the internal representation in a human brain that develops from no idea of any apple at the birth of the human being to a concept for apples after the brain obtains enough information about apples.

Although the brain has no idea of any apple when the human being encounters an apple at the first time, some information about the apple (i.e., an image of the apple) will reside in the memory of the brain, which makes something change in the memory. When the human being encounters an apple again, the brain may not think this second apple so strange by recalling the image of the first apple. Moreover, the brain may be more familiar with apples after the human being encounters the second apple. This means that the image of the first apple is updated with some information obtained from the second apple, resulting in the fact that the updated image stands for both apples rather than just for the first one. In this way, the apple image in the memory of the brain is updated whenever the human being encounters an apple, leading to the brain being more and more familiar with apples. Finally, the brain is so familiar with apples that almost nothing of the apple image needs updating. At this time, the apple image in the brain represents all the apples in the world, and we think that a concept for apples is formed in the brain (i.e., the apple image becomes the concept for apples).

Our developmental approach, which is based on a computational neural network, mimics the above procedure of concept formation in the brain to develop algorithmic concepts autonomously from algorithmic information extracted from program source codes. This might be an easy task if each of the program source codes implemented one of two simple algorithms. For instance, one might apply a concept-formation neural network [15] to divide these program source codes into two groups, with each group representing a concept for one of the two algorithms. In this case, the task reduces to classifying (or clustering) the program source codes.

On the other hand, it might be a difficult task to mine program source codes that come from online judge systems [16] on the Internet. The algorithmic knowledge underlying these program source codes, which are submitted by college students from all over the world, is very valuable for programming tutoring. However, discovering such algorithmic knowledge is a challenge because the space consisting of these program source codes changes uncontrollably every day. We can hardly know in advance what and how many algorithmic concepts lie in these program source codes, nor can we predict what new algorithmic concepts will emerge as the space changes. In simple words, this changing environmental space is muddy, so that the developmental approach is necessary for such a muddy task [14].

In this paper, we propose a developmental model for concept formation under such an environmental space that is unknown but is changing uncontrollably as described above. The algorithmic concept generated by this developmental model is readable for both machine and human. This feature is desirable in program comprehension. In a concept assignment task [17], for instance, a machine with this developmental model is able to associate its perceived program with one of its developed concepts that stands for the algorithm performed by the program, while at the same time the same machine is able to display the associated concept in a human-readable expression, making it possible for a human to understand what algorithm the program performs.

2. Algorithmic Concept Development

2.1. Algorithmic Concepts

A person understands a program at an algorithmic level when being able to explain the sequence of operations that the program performs; that is, a program could be understood as a sequence of operations.

Figure 1(a) shows a program that can be understood as a sequence of concrete operations, or concrete algorithm, as described by the flowchart in Figure 1(c). The idea behind this specific sequence of concrete operations is to use an array as a function table. The function table describes a function by tabulating all the arguments and their corresponding function values . In Figure 1(a), the array named is used as the function table. When obtaining an argument value (e.g., ) from its input, the program can return the corresponding function value (e.g., 8899) easily, just by outputting the value of (e.g., ).

The same idea of using an array as a function table is also behind the flowchart in Figure 1(d), which expresses the concrete algorithm of the program in Figure 1(b). The main distinction of the program in Figure 1(b) from the one in Figure 1(a) is that the outputs of the former are vectors; for example, the output is the vector . This implies that the two concrete algorithms described in Figures 1(c) and 1(d) could be conceptualized as an abstract algorithm, as expressed by the flowchart in Figure 1(e).

An algorithmic concept is a common idea behind all concrete algorithms that have the same characteristics. The idea behind the two concrete algorithms above, for example, is such an algorithmic concept, which also refers to the abstract algorithm expressed in Figure 1(e). Each of the two programs above is regarded as an instance of this algorithmic concept.

There are many algorithmic concepts in the brains of human programmers. Each algorithmic concept is a common notion shared by programmers to refer to similar concrete algorithms. Thus, an algorithmic concept represents a class of similar concrete algorithms. Understanding similar concrete algorithms as one algorithmic concept makes it easier for programmers to manipulate programs. Such manipulation includes creating, maintaining, explaining, reengineering, and reusing [17]. Algorithmic concepts also make it convenient for programmers to communicate with each other.

Different algorithmic concepts represent different classes of concrete algorithms. Each concrete algorithm in a class is an instance of the algorithmic concept that represents the class.

2.2. The Developmental Model for Concepts

From our developmental standpoint, the internal state $S$ of the brain $B$ depends on the environment $E$ that the agent of the brain explores, formally written as $S = B(E)$. More specifically, the environmental area $E_t$ that the brain has sensed from the very beginning to the current time $t$ determines the internal state $S_{t+1}$ of the brain at the next time $t+1$; that is, $S_{t+1} = B(E_t)$. Biologically, the brain runs by sending signals of discrete spikes, so we assume that the brain works in discrete time (i.e., $t = 0, 1, 2, \ldots$).

In this work, the environment $E$ consists of program source codes that come from online judge systems [16] on the Internet. We assume that at each time instance $t$ the brain senses one and only one program $p_t \in E$. Thus, we have $E_t = (p_0, p_1, \ldots, p_t)$. The distinction of $E_t$ from $E$ is that the elements in $E_t$ are ordered in time whereas those in $E$ are not. In other words, $E_t$ denotes a sequence which consists of elements from $E$. Note that $E_t$ may have duplicate elements; for example, $p_i = p_j$ if the brain encounters the program $p_i$ again at the time instance $j$. Moreover, there are many possible sequences that may form $E_t$ at the time instance $t$ for a given $E$ because the brain may explore the environment randomly.

The internal state $S$ is modeled as a set of elements, called images. Each image $s$ in the set $S$ will be recalled when the brain obtains the algorithmic information of a program $p$ in $E$. This can be formally written as $s = R(p)$, where $R$ denotes the procedure of recalling triggered by $p$. In addition, the image $s$ will be updated into $s'$; that is, $s' = U(s, p)$, where $U$ denotes the procedure of updating $s$ with the algorithmic information of the program $p$. The effect of this updating is equivalent to adding the updated image $s'$ into the set $S$ from which the image $s$ has been removed. Thus, the change of the internal state can be formulated as $S_{t+1} = (S_t \setminus \{s\}) \cup \{s'\}$, where $s = R(p_t)$, $s' = U(s, p_t)$, and $p_t$ denotes the program that the brain encounters at the time instance $t$.

At the very beginning (i.e., $t = 0$), the brain has no idea of any program but may remember some algorithmic information of any program that it encounters. For this reason, we assume that the initial state $S_0$ of the set $S$ consists of some things that will change to remember some algorithmic information of programs. In addition, for computational convenience, we use the same notation $s = R(p)$ for something in $S_0$ that will change to remember some algorithmic information of the program $p$, so that, for each time instance $t$, there always exists one and only one element $s$ in $S_t$ that satisfies $s = R(p)$ for each given program $p$ in $E$. Biologically, the initial state $S_0$ refers to innateness [18], which is the result of evolution.

Being innate, each element $s$ in the initial state $S_0$ contains no algorithmic information at all. After being updated by the procedure $U$, however, the element $s$ turns into $s'$, which contains some algorithmic information of the program $p$. More and more elements in the internal state $S$ will be updated to contain algorithmic information as the brain senses more and more programs, and some of them are updated many times and come to contain rich information about algorithms. When some element $s$ in $S_t$ is updated only slightly by the procedure $U$ at a time instance $t$ (i.e., its updated image $s'$ is almost the same as the element $s$ itself), the updated image $s'$ is regarded as an algorithmic concept. Note that the algorithmic concept is developed from something that initially has nothing to do with any algorithmic information.

This developmental model for concept formation is described formally as the following algorithm, where all developed concepts are collected in a set $C$ which is initialized empty, i.e., $C = \emptyset$.

Step 1.

Step 2.

Step 3. At each time instance

Step 3a.

Step 3b.

Step 3c.

Step 3d.

Step 3e. if is almost the same as .

In summary, this developmental model consists of three elements: a brain $B$, its internal state $S$, and its environment $E$. The behavior of the brain $B$ is characterized by its recalling procedure $R$ and its updating procedure $U$. Each time the brain $B$ senses a program $p$ in the environment $E$, the procedure $R$ recalls an image $s$ in the internal state $S$ that satisfies $s = R(p)$, and then the procedure $U$ updates the image $s$ into the updated image $s' = U(s, p)$. When the difference between the image $s$ and its updated image $s'$ becomes very small after many updates, we regard the updated image $s'$ as an algorithmic concept and put it into the set $C$ of developed concepts.

2.3. The Internal Representation for Development

We apply the theory of lobe component analysis (LCA) to implement our developmental model for concept formation. LCA, which is in line with the idea of AMD, was proposed for developing feature detectors (neurons) in neural networks [19].

The brain $B$ in our developmental model is implemented as a one-layer network of neurons. The internal state $S$ of the brain is composed of all the synaptic vectors that represent the synaptic weights of the neurons in the brain (see Figure 1(f)). Every synaptic vector $s$ in $S$ can contain algorithmic information from programs in the environment $E$. All these synaptic vectors are initialized with random values to represent the initial state $S_0$ of $S$ at the time $t = 0$. This means that each neuron in the brain is “born” with no relation to algorithmic information.

The inputs to the brain $B$ are also vectors, which have the same dimension as the synaptic vectors. Each input vector $x$ represents the algorithmic information of a program source code $p$ in the environment $E$.

The LCA neural network contains two cell-centered mechanisms, called lateral inhibition and Hebbian learning, respectively. The former is used to implement our recalling procedure $R$ and the latter to implement our updating procedure $U$.

The recalling procedure $R$ is implemented as follows. We compute the response $y(s)$ of every synaptic vector $s$ in $S$ by $y(s) = \langle x, s \rangle / \|s\|$, where $\|s\|$ denotes the norm of the vector $s$ and $\langle x, s \rangle$ the inner product of the two vectors $x$ and $s$. The response $y(s)$ is the projection of the input vector $x$ on the synaptic vector $s$. The greater the response is, the closer the synaptic vector $s$ is to the input vector $x$. A synaptic vector $s_i$ is closer to the input vector $x$ than another synaptic vector $s_j$ if its response $y(s_i)$ is greater than the response $y(s_j)$ of the synaptic vector $s_j$. If there is no synaptic vector closer to the input vector $x$ than a synaptic vector $s$, then the synaptic vector $s$ has the maximal response $y_{\max}$, where
$$y_{\max} = \max_{s \in S} y(s). \quad (1)$$
Thus, it is reasonable to think that the synaptic vector $s$ that satisfies $y(s) = y_{\max}$ represents the image that the brain will recall when receiving the input vector $x$; that is, $R(x) = s$. We also say that the procedure $R$ recalls the neuron that the synaptic vector $s$ belongs to.
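The recalling step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the synaptic vectors are stored as the rows of an array `S`, that none of them is the zero vector, and the function names `responses` and `recall` are hypothetical.

```python
import numpy as np

def responses(S, x):
    """Response of every synaptic vector (row of S) to the input x:
    the inner product <x, s> divided by the norm of s."""
    norms = np.linalg.norm(S, axis=1)  # assumes no row of S is all zeros
    return (S @ x) / norms

def recall(S, x):
    """Index of the neuron whose synaptic vector has the maximal response."""
    return int(np.argmax(responses(S, x)))

# Toy data (made up for illustration): three 2-dimensional synaptic vectors.
S = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
x = np.array([3.0, 4.0])

y = responses(S, x)
winner = recall(S, x)
```

In this toy case the third vector points most nearly in the direction of `x`, so it is the one recalled, even though the second vector has the larger norm.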

The updating procedure $U$ is formulated as
$$U(s, x) = w_1(n)\, s + w_2(n)\, y(s)\, x, \quad (2)$$
where $y(s)$ is the response of the synaptic vector $s$ to the input vector $x$, $w_1(n)$ denotes the retention rate and $w_2(n)$ the learning rate, and $w_1(n) + w_2(n) \equiv 1$. Both the retention rate and the learning rate are functions of the age $n$ of the neuron that the synaptic vector $s$ belongs to. The age $n$ records how many times the synaptic vector of the neuron has been updated, counting the current update.

The retention rate $w_1(n)$ is a monotonically increasing function. Initially, the age $n$ is equal to 1 and the retention rate is equal to 0, so that $w_1(1)\, s = 0$, which means that no information in the synaptic vector $s$ will remain in the updated synaptic vector of the neuron at its first update. When the age $n$ increases, the retention rate is no longer equal to 0, so that the first term in (2) is not zero, which means that some information in the synaptic vector $s$ will remain in the updated synaptic vector of the neuron. When the age $n$ becomes a large number, the retention rate approaches 1, so that almost all information in the synaptic vector $s$ will remain in the updated synaptic vector of the neuron.

The learning rate $w_2(n)$ is a monotonically decreasing function. When $n = 1$, the learning rate is equal to 1, which means that the updated synaptic vector of the neuron will accept much information from the input vector $x$. When the age $n$ becomes a large number, the learning rate approaches 0, which means that the updated synaptic vector of the neuron will accept little new information from the current input vector $x$ at the update.

Thus, the effect of the updating procedure $U$ depends on the age $n$ of the neuron that the synaptic vector $s$ belongs to. When the age $n$ becomes infinite, we have
$$\lim_{n \to \infty} U(s, x) = s. \quad (3)$$
This means that the resulting vector of the updating procedure $U$ will be almost the same as the synaptic vector $s$ when the age $n$ becomes a large number. For this reason, we say that the neuron is mature when its age becomes large enough.
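The age dependence described above can be made concrete with a small sketch. It assumes an update of the form "retention rate times the old synaptic vector plus learning rate times response times input", and it uses the simplest rates with the stated properties, namely (n - 1)/n and 1/n. These particular rates and the function names are assumptions made for illustration; the rates used in the experiment are defined in Section 4.

```python
import numpy as np

def retention_rate(n):
    """Assumed retention rate: 0 at age 1, approaching 1 as the age grows."""
    return (n - 1) / n

def learning_rate(n):
    """Assumed learning rate: 1 at age 1, approaching 0 as the age grows."""
    return 1 / n

def update(s, x, n):
    """One Hebbian-style update of the synaptic vector s with input x at age n."""
    y = np.dot(x, s) / np.linalg.norm(s)  # response of s to x
    return retention_rate(n) * s + learning_rate(n) * y * x

s = np.array([1.0, 0.0])
x = np.array([3.0, 4.0])

first = update(s, x, n=1)      # nothing of s is retained at the first update
late = update(s, x, n=10_000)  # a mature neuron barely changes
```

At age 1 the result is just the input scaled by the response, while at age 10,000 the updated vector differs from `s` only in the third decimal place.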

2.4. Formation of Algorithmic Concepts

Each program in the environment $E$ is an instance of some algorithmic concept. For example, the program in Figure 1(a) is an instance of the algorithmic concept expressed in Figure 1(e). Initially, the brain $B$ has no idea of any algorithmic concept, so that the set $C$ of the developed concepts is empty. All the synaptic vectors in the internal state $S$ are initialized with random values to represent the initial state $S_0$ at the time $t = 0$, which are evidently independent of the environment $E$.

Thus, the brain $B$ has no idea of any instance of an algorithmic concept $a$ (e.g., the one expressed in Figure 1(e)) when it encounters the first instance $p_1$ (e.g., in Figure 1(a)) of the algorithmic concept $a$ at the time $\tau_1$. However, the brain will remember some algorithmic information of the instance $p_1$ by updating the synaptic vector $s$ of some neuron with the input vector $x_1$ that represents the algorithmic information of the instance. The synaptic vector $s$ is the closest one to the input vector $x_1$ in comparison with the other synaptic vectors in the internal state $S$, so that the procedure $R$ will recall this neuron; that is, $R(x_1) = s$. Because this is the first time that the neuron is recalled to update its synaptic vector (i.e., its age is $n = 1$, $w_1(1) = 0$, and $w_2(1) = 1$), we have the following result of the updating procedure $U$ by (2):
$$s^{(1)} = U(s, x_1) = y_1 x_1, \quad (4)$$
where $y_1$ is the response of the synaptic vector $s$ to $x_1$. This means that the updated synaptic vector $s^{(1)}$ of the neuron will contain some algorithmic information of the instance $p_1$ from the input vector $x_1$. This process is equivalent to $s' = U(s, p_1)$ in the model of Section 2.2. The age of the neuron is increased by one before the next update of its synaptic vector.

When the second instance $p_2$ (e.g., in Figure 1(b)) of the algorithmic concept $a$ arrives at the time $\tau_2$, the brain will not find the instance $p_2$ so strange, since the synaptic vector $s^{(1)}$ of the neuron contains some algorithmic information of the first instance $p_1$, which is similar to the second instance $p_2$. The brain will recall the same neuron again, because its current synaptic vector $s^{(1)}$ is the most similar to the input vector $x_2$ that represents the algorithmic information of the instance $p_2$; that is, $R(x_2) = s^{(1)}$. Thus, the synaptic vector of the neuron will be updated again, and its age becomes 2. By (2) and (4), we have the following result of the updating procedure $U$:
$$s^{(2)} = w_1(2)\, s^{(1)} + w_2(2)\, y_2 x_2 = w_1(2)\, y_1 x_1 + w_2(2)\, y_2 x_2. \quad (5)$$
Because $w_1(2) \neq 0$ and $w_2(2) \neq 0$ in (5), the newly updated synaptic vector $s^{(2)}$ of the neuron will contain algorithmic information of both instances $p_1$ and $p_2$ from the vectors $x_1$ and $x_2$, respectively. At this time, the neuron stands for both instances $p_1$ and $p_2$ rather than just for the first instance $p_1$. This means that the brain becomes more familiar with the algorithmic concept $a$.

For the reasons above, the same neuron in the brain will be recalled to update its synaptic vector whenever an instance of the algorithmic concept $a$ arrives, so that its synaptic vector contains more and more information about the algorithmic concept $a$. When the $n$th instance $p_n$ of the algorithmic concept $a$ arrives at the time $\tau_n$, the neuron will be recalled to update its synaptic vector for the $n$th time. The number $n$ is therefore the age of the neuron at the time $\tau_n$. By (2), the result of the $n$th update is obtained as follows:
$$s^{(n)} = w_1(n)\, s^{(n-1)} + w_2(n)\, y_n x_n, \quad (6)$$
where $s^{(n)}$ denotes the $n$th-updated synaptic vector of the neuron, $y_n$ the response of $s^{(n-1)}$ to $x_n$, and $x_n$ the input vector that represents the algorithmic information of the $n$th instance $p_n$ of the algorithmic concept $a$. By inspecting (6), we can conclude that the synaptic vector of the neuron will be updated greatly (i.e., great difference between $s^{(n)}$ and $s^{(n-1)}$) when the age $n$ is small and slightly (i.e., slight difference between $s^{(n)}$ and $s^{(n-1)}$) when the age $n$ becomes large, because $w_1(n)$ is an increasing function from 0 to 1 whereas $w_2(n)$ is a decreasing function from 1 to 0. As the age $n$ becomes large enough, the synaptic vector of the neuron will be almost unchanged.

On the other hand, a large age $n$ means that the synaptic vector $s$ contains algorithmic information from a large number of instances of the algorithmic concept $a$. The larger the age $n$ is, the more instances of the concept $a$ the neuron represents. Finally, the neuron represents almost all instances of the algorithmic concept $a$ when its age is greater than a very large number. At this time, we consider that the synaptic vector $s$ has developed into a representation of the algorithmic concept $a$ (i.e., an algorithmic concept is formed in the brain $B$). Thus, the neuron is collected as a developed concept, that is, $C = C \cup \{s\}$, when its age is greater than a threshold, which is a very large number.
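Putting recall, update, and the age threshold together gives a compact developmental loop. The sketch below is an illustration under stated assumptions, not the authors' code: the rates (n - 1)/n and 1/n stand in for the retention and learning rates, inputs are assumed to be nonzero and nonnegative so that no synaptic vector collapses to zero, and the names `develop`, `num_neurons`, and `age_threshold` are hypothetical.

```python
import numpy as np

def develop(inputs, num_neurons, age_threshold, seed=0):
    """Run one pass over the input sequence and return the synaptic vectors,
    the neuron ages, and the indices of neurons regarded as developed concepts."""
    rng = np.random.default_rng(seed)
    S = rng.random((num_neurons, inputs.shape[1]))  # random initial state
    ages = np.ones(num_neurons, dtype=int)          # age 1 at the first update

    for x in inputs:
        y = (S @ x) / np.linalg.norm(S, axis=1)     # responses of all neurons
        j = int(np.argmax(y))                       # recall: maximal response
        n = ages[j]
        # Update only the recalled neuron (assumed rates (n-1)/n and 1/n).
        S[j] = ((n - 1) / n) * S[j] + (1 / n) * y[j] * x
        ages[j] += 1                                # age grows before next update

    mature = [j for j in range(num_neurons) if ages[j] > age_threshold]
    return S, ages, mature

# Toy environment (made up): two kinds of input vectors, seen 50 times each.
a = np.array([1.0, 0.0, 0.0, 0.0])
b = np.array([0.0, 0.0, 0.0, 1.0])
inputs = np.array([a, b] * 50)

S, ages, mature = develop(inputs, num_neurons=3, age_threshold=20)
```

Because only the recalled neuron is updated, every input increases exactly one age by one, and the neurons listed in `mature` are those that have absorbed more inputs than the threshold.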

From the concept formation above, it can be seen that the developed concept stands for an idea shared by a group of programs that have similar concrete algorithms. In other words, the developed concept represents a common idea that programmers have when they are writing similar programs. Thus, it is helpful to apply the developed concept to program understanding. Note that the developmental process of algorithmic concepts is unsupervised. This feature is desirable for understanding what algorithmic concepts are employed by the programs that come from online judge systems on the Internet.

3. Representation for Algorithmic Information

3.1. Algorithmic Signatures

An algorithm is a finite list of instructions for calculating a function [20]. There are many kinds of notation for expressing an algorithm, such as natural languages, flowcharts, pseudocode, and problem analysis diagrams. In Figure 1(c), for example, a flowchart describes a concrete algorithm. From the flowchart, we can extract some algorithmic information, described as an algorithmic signature in Figure 1(g). This algorithmic signature consists of a control flow statement while and three noncontrolling language points array, input, and output. It characterizes the concrete algorithm of the program in Figure 1(a).

Generally, an algorithmic signature consists of several units. Each unit contains two parts: a control flow statement and its relevant language points. The control flow statements are the most important components, because they decide the control flow of an algorithm. There are four control flow statements: while, for, if, and switch. The noncontrolling language points in a control flow statement are regarded as relevant to the control flow statement. The algorithmic signature in Figure 1(g) has only one unit, where there are three relevant language points in the control flow statement while.

There are two relations between the units: sequence relation and nesting relation. The program in Figure 2(a) has four control flow statements, so that its algorithmic signature has four units as shown in Figure 2(b). The first unit consists of a control flow statement while with its three relevant language points equal to, input, and output. This first unit has a nesting relation to each of the three other units. The pair “{” and “}” of the while statement shows the nesting relation. The three units nested in the first unit have sequence relations with each other, arranged from top to bottom. They all have one and the same relevant language point greater than.

The algorithmic signature of each program can be generated from the parse tree of its source code. It is not difficult to identify control flow statements, noncontrolling language points, and their relationships in the parse tree. Every program can be converted into its algorithmic signature by a depth-first traversal of its parse tree.
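The depth-first conversion can be illustrated on a toy tree. The sketch below does not parse C++; it assumes a hypothetical, already simplified parse tree whose nodes are `(label, children)` pairs, and it assumes that each language point belongs to its nearest enclosing control flow statement. The tree itself is hand-built and only loosely modeled on the four-unit example above.

```python
CONTROL = {"while", "for", "if", "switch"}

def extract_units(tree):
    """Depth-first traversal that collects one unit per control flow statement.
    Each unit records the statement, its nesting depth, and its language points."""
    units = []

    def visit(node, depth, unit):
        label, children = node
        if label in CONTROL:
            unit = {"control": label, "depth": depth, "points": set()}
            units.append(unit)
            depth += 1                    # children are nested one level deeper
        elif unit is not None:
            unit["points"].add(label)     # point of the nearest enclosing statement
        for child in children:
            visit(child, depth, unit)

    visit(tree, 0, None)
    return units

# Hypothetical simplified parse tree: a while loop containing three if statements.
tree = ("program", [
    ("while", [
        ("equal to", []),
        ("input", []),
        ("if", [("greater than", [])]),
        ("if", [("greater than", [])]),
        ("if", [("greater than", [])]),
        ("output", []),
    ]),
])

units = extract_units(tree)
```

Units with equal depth that follow one another are in a sequence relation, and a unit whose depth is one greater than that of a preceding unit is nested in it.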

3.2. Matrixes for Algorithmic Signatures

An algorithmic signature can be represented by a matrix. Each row of the matrix may consist of a controlling number followed by noncontrolling numbers to represent a unit in the algorithmic signature. The controlling number denotes the control flow statement in the unit, whereas the noncontrolling numbers indicate whether there are some noncontrolling language points in the unit or not.

The first row of the matrix in Figure 1(h) represents the unit of the algorithmic signature in Figure 1(g). The remaining rows are all filled with zeros, meaning that there are no more units in this algorithmic signature. The first number 80 in the first row is a controlling number, which indicates that the control flow statement of the unit is a while statement. The numbers following the controlling number are all noncontrolling numbers, where the three 1s indicate that there are three noncontrolling language points, array, input, and output, in the control flow statement.

The matrix in Figure 2(c) represents the algorithmic signature in Figure 2(b). The top four rows represent the four units of the algorithmic signature, respectively. Because the second unit is nested in the first unit, all numbers except the last in the second row move one position to the right, whereas the last number moves to the first position. Thus, the controlling number of the second row is in the second position from the left. The number 190 indicates that the control flow statement of the second unit is an if statement. The second row and the third row are the same because the second unit and the third unit are the same and they have a sequence relation to each other.

Each position for a noncontrolling number in a row is associated with a noncontrolling language point. Positions from the second to the sixth in the first row of the two matrixes above are associated with language points greater than, equal to, array, input, and output, respectively. There are only two values 0 and 1 for a noncontrolling number. The value 1 of a noncontrolling number indicates that its associated language point exists in the unit of the algorithmic signature, whereas the value 0 indicates that its associated language point does not exist in that unit.

The controlling number is designed to be greater than the noncontrolling number. For the controlling number, there are four values 80, 110, 190, and 220, which denote the four control flow statements while, for, if, and switch, respectively. Note that the two controlling numbers for while and for have a smaller difference because both of them are iteration statements.
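The encoding rules of this section can be sketched as follows. The layout (one controlling position followed by the five language points greater than, equal to, array, input, and output) and the controlling numbers are taken from the text; the number of rows, the tuple format of a unit, and the function name are assumptions. Nesting is implemented as a circular shift to the right by the nesting depth, which is one reading of the rule described for the second row.

```python
import numpy as np

CONTROL_NUMBER = {"while": 80, "for": 110, "if": 190, "switch": 220}
POINTS = ["greater than", "equal to", "array", "input", "output"]

def signature_to_matrix(units, num_rows):
    """Encode units given as (control statement, nesting depth, language points)."""
    width = 1 + len(POINTS)
    M = np.zeros((num_rows, width))
    for i, (control, depth, points) in enumerate(units):
        row = np.zeros(width)
        row[0] = CONTROL_NUMBER[control]     # controlling number
        for p in points:
            row[1 + POINTS.index(p)] = 1     # 1 marks an existing language point
        M[i] = np.roll(row, depth)           # shift right once per nesting level
    return M

# The four-unit signature described for Figure 2(b); six rows are assumed.
units = [
    ("while", 0, {"equal to", "input", "output"}),
    ("if", 1, {"greater than"}),
    ("if", 1, {"greater than"}),
    ("if", 1, {"greater than"}),
]
M = signature_to_matrix(units, num_rows=6)
```

Rows beyond the last unit stay all zero, matching the convention that zero rows mean there are no more units.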

3.3. Signatures of Developed Concepts

Figure 2(d) shows the procedure from program source codes to signatures of developed concepts. The program source code (e.g., in Figure 1(a)) is converted into its algorithmic signature (e.g., in Figure 1(g)) through its parse tree. The algorithmic signature is converted into a matrix (e.g., in Figure 1(h)). The matrix is converted into the input vector of the brain, which is based on LCA. The outputs of the brain are vectors that stand for developed concepts.

The relationship between a matrix and its corresponding vector is shown in Figure 2(e). The first column of the matrix maps into the first $m$ components of the vector, where $m$ is the number of rows in the matrix. The second column of the matrix maps into the second $m$ components of the vector, and so on. In this way, we can convert the matrix of an algorithmic signature into its corresponding vector as an input to the brain $B$. Conversely, the first $m$ components of the vector map into the first column of the matrix, the second $m$ components of the vector map into the second column of the matrix, and so on. In this way, we can convert the developed vectors from the output of the brain $B$ into their corresponding matrixes.
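Stacking the columns of a matrix one after another is column-major flattening, so both directions of the conversion are one-liners in NumPy. The function names and the small matrix below are made up for illustration.

```python
import numpy as np

def matrix_to_vector(M):
    """Stack the columns of M: first column first, then the second, and so on."""
    return M.flatten(order="F")

def vector_to_matrix(v, num_rows):
    """Inverse conversion: consecutive groups of num_rows components become columns."""
    return v.reshape((num_rows, -1), order="F")

M = np.array([[80, 0, 1],
              [0, 190, 1]])
v = matrix_to_vector(M)
M_back = vector_to_matrix(v, num_rows=2)
```

Using the same `order="F"` in both functions is what guarantees that the round trip returns the original matrix.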

Figure 2(f) shows a matrix which is derived from a developed concept. It is a simplified version of a matrix from our experimental results for convenience in discussion. We can see that the maximum number in the first row of the matrix is 78.86. We suppose that 2 is the threshold to determine whether a row represents a unit or not. Because the maximum number 78.86 is greater than the threshold number 2, we can treat this row as a unit. In the same way, we find four other units, whose maximum numbers are 180.32, 154.49, 160.09, and 154.17, respectively.

In order to get signatures of developed concepts, these maximum numbers should be replaced by their corresponding controlling numbers. We replace each maximum number with its closest controlling number. For example, we replace the maximum number 78.86 by the controlling number 80 because 80 is the controlling number closest to 78.86. For the same reason, each of the maximum numbers 180.32, 154.49, 160.09, and 154.17 is replaced by the controlling number 190.

The next step is to determine the noncontrolling numbers in a unit. Each nonmaximum number (e.g., 0.05 in the first row) in a unit should be replaced by a noncontrolling number. We suppose that 0.5 is the threshold: if a number is greater than 0.5, we replace it with 1, otherwise with 0. Because all five nonmaximum numbers in the first row are less than 0.5, each of them is replaced with 0. We do the same for the other four rows. Figure 2(g) shows the result of the conversion.
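The two thresholding steps can be sketched together. The thresholds 2 and 0.5 and the four controlling numbers come from the text; the function name and the small input matrix are assumptions (only the values 78.86, 0.05, and 180.32 are quoted from the example, the rest are made up).

```python
import numpy as np

CONTROL_NUMBERS = np.array([80, 110, 190, 220])

def decode(M, unit_threshold=2.0, point_threshold=0.5):
    """Turn a developed matrix into controlling and noncontrolling numbers."""
    out = np.zeros_like(M, dtype=int)
    for i, row in enumerate(M):
        k = int(np.argmax(row))
        if row[k] <= unit_threshold:
            continue                                   # row does not represent a unit
        out[i] = (row > point_threshold).astype(int)   # noncontrolling numbers
        nearest = np.argmin(np.abs(CONTROL_NUMBERS - row[k]))
        out[i, k] = CONTROL_NUMBERS[nearest]           # closest controlling number
    return out

M = np.array([
    [78.86, 0.05, 0.10, 0.20, 0.30, 0.40],
    [0.02, 180.32, 0.90, 0.10, 0.00, 0.00],
    [0.10, 0.20, 0.30, 0.10, 0.00, 0.00],
])
decoded = decode(M)
```

The third row never exceeds the unit threshold, so it is left as a zero row and contributes no unit to the signature.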

With these controlling numbers and noncontrolling numbers, we can convert the matrix into a signature (e.g., in Figure 2(h)). This signature is not an algorithmic signature of a program although both of them have the same format. The former characterizes a developed concept which is employed by a group of programs, whereas the latter only represents the concrete algorithm of one program. For example, the signature in Figure 2(i) characterizes the developed concept shared by a group including the two programs in Figures 1(a) and 1(b), whereas the algorithmic signature in Figure 1(g) represents the concrete algorithm of the program in Figure 1(a) only.

4. Experiment and Results

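In this experiment, the brain $B$ is composed of a one-layer network of 400 neurons, each of which has a 900-dimensional synaptic vector. For the updating procedure $U$ in (2), the retention rate is defined as $w_1(n) = (n - 1 - \mu(n))/n$ and the learning rate as $w_2(n) = (1 + \mu(n))/n$, where $\mu(n)$ is an amnesic function as follows [19]:
$$\mu(n) = \begin{cases} 0, & n \le t_1, \\ c\,(n - t_1)/(t_2 - t_1), & t_1 < n \le t_2, \\ c + (n - t_2)/r, & t_2 < n. \end{cases}$$
The parameters in $\mu(n)$ were set as follows: , , , and .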
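For readers who want to experiment with the rates, the sketch below assumes the three-piece amnesic function of the LCA literature [19], with switch points `t1` and `t2` and constants `c` and `r`, and rates of the form (n - 1 - mu(n))/n and (1 + mu(n))/n. The default parameter values are illustrative placeholders only and are not the settings used in this experiment.

```python
def amnesic(n, t1=20, t2=200, c=2.0, r=2000.0):
    """Assumed three-piece amnesic function; parameter values are placeholders."""
    if n <= t1:
        return 0.0                          # plain averaging for young neurons
    if n <= t2:
        return c * (n - t1) / (t2 - t1)     # grows linearly from 0 to c
    return c + (n - t2) / r                 # slow linear growth afterwards

def retention_rate(n, **params):
    return (n - 1 - amnesic(n, **params)) / n

def learning_rate(n, **params):
    return (1 + amnesic(n, **params)) / n

rates = [(n, retention_rate(n), learning_rate(n)) for n in (1, 50, 5000)]
```

With these definitions the two rates always sum to 1, and at age 1 the retention rate is 0 and the learning rate is 1, in agreement with Section 2.3.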

After a developing phase for concept formation, the same brain with all its 400 updated synaptic vectors was applied to a concept assignment task in program comprehension for an evaluation of the developed concepts.

4.1. Experimental Data

The environment is composed of 2341 C++ program source codes from an online judge system submitted by sixty college students for solving sixty simple problems. All of these program source codes were judged correct by the online judge system. The number of control flow statements is not greater than ten in each of these program source codes. The nesting level of control flow statements is not greater than six.

In the developing phase, each of these 2341 program source codes was chosen randomly 300 times to form the environmental sequence $E_t$. This means that every program source code occurs 300 times at random positions in the environmental sequence and that the length of the sequence is 2341 × 300. During the test for the evaluation of the developed concepts, however, each of these 2341 program source codes was chosen randomly only once for the concept assignment task.
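One way to build such a sequence is to repeat every program index 300 times and shuffle the result. This is only one reading of the sampling procedure; the function name and the fixed seed are assumptions.

```python
import random

def environmental_sequence(num_programs, repeats, seed=0):
    """Each program index occurs exactly `repeats` times, in random order."""
    seq = [i for i in range(num_programs) for _ in range(repeats)]
    random.Random(seed).shuffle(seq)  # seeded so the order is reproducible
    return seq

seq = environmental_sequence(num_programs=2341, repeats=300)
```

For the test phase, the same function with `repeats=1` yields each program exactly once in random order.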

Each program source code was converted into a 30 × 30 matrix which represents the algorithmic signature of the program source code. In the algorithmic signature, each unit is composed of a control flow statement and at most 25 relevant language points. Because the control flow statement is more important than its relevant language points, its controlling number was allocated five adjacent positions in a row of the matrix, whereas each noncontrolling number was allocated only one position for its corresponding language point. The first five positions in the first row of a matrix, for example, are all filled with the same value of the controlling number that represents the control flow statement of the first unit in an algorithmic signature. The other 25 positions for noncontrolling language points in the first row are listed in Table 1, where all positions in a row are numbered from left to right.

Controlling numbers were designed to be greater than noncontrolling numbers in order to distinguish the control flow statement from noncontrolling language points. The values of controlling numbers are 80, 110, 190, and 220, which denote the four control flow statements while, for, if, and switch, respectively. In contrast, each noncontrolling number takes one of only two values: 1 if the corresponding language point exists in a unit and 0 if it does not. It should be pointed out that numbers in a nesting unit should move right five positions but numbers in a sequence unit should not (see Section 3.2).

Finally, each matrix was converted into a 900-dimensional vector as an input vector to the brain.
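The encoding described in this subsection can be sketched as below. Several details are assumptions made only for illustration: the function names, the representation of a unit as a (statement, depth, points) triple, the 0-based indices standing in for the language points of Table 1, the row-major flattening, and the choice to drop any position that would fall beyond the 30th column (this subsection does not say how such overflow is handled).

```python
import numpy as np

SIZE = 30            # the matrix is 30 x 30
CONTROL_WIDTH = 5    # a controlling number fills five adjacent positions
CONTROLLING = {"while": 80, "for": 110, "if": 190, "switch": 220}

def encode_signature(units):
    """Encode units into a 30 x 30 matrix, one unit per row.

    Each unit is (statement, depth, points): `depth` is the nesting
    depth (0 for a top-level unit) and `points` holds hypothetical
    0-based indices (0..24) of the language points present in the unit.
    """
    matrix = np.zeros((SIZE, SIZE))
    for row, (statement, depth, points) in enumerate(units[:SIZE]):
        shift = CONTROL_WIDTH * depth  # nested units move right five positions
        matrix[row, shift:min(shift + CONTROL_WIDTH, SIZE)] = CONTROLLING[statement]
        for p in points:
            col = shift + CONTROL_WIDTH + p
            if col < SIZE:  # assumption: overflowing positions are dropped
                matrix[row, col] = 1
    return matrix

def to_input_vector(matrix):
    """Flatten the matrix row by row (an assumption) into a 900-dimensional vector."""
    return matrix.reshape(-1)
```

For instance, `encode_signature([("while", 0, [0, 3]), ("if", 1, [1])])` fills the first five positions of the first row with 80 and positions six to ten of the second row with 190.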

4.2. Developmental Results

Figure 3(a) shows the ages of all the 400 neurons in the brain at the end of the developing phase. The neurons are numbered in descending order of their ages. Those whose ages are greater than or equal to 3000 are regarded as mature enough to represent algorithmic concepts. We found 78 mature neurons and put them into the set of developed concepts. They are numbered in the same order as in Figure 3(a).
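The selection of mature neurons amounts to a sort and a threshold. The sketch below assumes that the ages are available as a plain list indexed by neuron; the function name and the parameter name `maturity_age` are hypothetical.

```python
def select_developed_concepts(ages, maturity_age=3000):
    """Return the indices of neurons whose age is at least `maturity_age`,
    listed in descending order of age."""
    order = sorted(range(len(ages)), key=lambda i: ages[i], reverse=True)
    return [i for i in order if ages[i] >= maturity_age]
```

Applied to the 400 ages plotted in Figure 3(a), such a function would return the 78 mature neurons in the order in which the developed concepts are numbered.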

The matrix of the first developed concept (i.e., the one numbered 1 in the set) is shown partially in Figure 3(b), which presents the top left part of the matrix. This matrix implies a five-unit signature because each of the top five rows has numbers that are greater than the threshold number 2. In each of the top five rows, there are five numbers in five adjacent positions that are greater than 2. This means that the controlling number occupies five adjacent positions in each of these rows. The value of the controlling number is the one closest to the average of these five numbers. For example, the average of the five numbers in the first row is 78.86. Among the four values 80, 110, 190, and 220 for controlling numbers, the value 80 is the closest to the average 78.86. Thus, the controlling number of the first unit is 80, which means that a while statement is the control flow statement of the unit. In a similar way, we can determine that the control flow statement of each of the other four units is an if statement. All these four units are nested in the first unit because, in each of their rows, the first number (from the left) that is greater than the threshold number 2 is in the sixth column. This means that their controlling numbers are shifted five positions to the right relative to that of the first row, indicating their nesting relations with the first unit. Moreover, these four units have sequence relations with each other because their controlling numbers are in the same columns.
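The reading of a row described above can be sketched as a small decoding routine. The function name and the return format are assumptions; the threshold 2, the five-position width, and the four controlling values are taken from the text.

```python
CONTROLLING = {80: "while", 110: "for", 190: "if", 220: "switch"}
THRESHOLD = 2
CONTROL_WIDTH = 5

def decode_row(row):
    """Return (statement, nesting_depth) for one matrix row, or None
    if no number in the row exceeds the threshold."""
    above = [i for i, v in enumerate(row) if v > THRESHOLD]
    if not above:
        return None
    start = above[0]  # leftmost position above the threshold
    block = row[start:start + CONTROL_WIDTH]
    average = sum(block) / len(block)
    # Pick the controlling value closest to the average of the block.
    value = min(CONTROLLING, key=lambda c: abs(c - average))
    # Each nesting level shifts the block five positions to the right.
    return CONTROLLING[value], start // CONTROL_WIDTH
```

A row whose first five numbers average 78.86 decodes to a while statement at depth 0, and a row whose first number above the threshold is in the sixth column and whose block averages close to 190 decodes to an if statement at depth 1.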

Figure 3(c) shows the signature of the first developed concept. The four if statements are presented within the pair “{” and “}” of the while statement, indicating that their units are nested in the first unit. In addition, these four statements are presented from top to bottom, indicating that their units have sequence relations with each other. Following the symbol “” are the noncontrolling language points relevant to their corresponding control flow statement. It can be seen, for example, that the second unit has four relevant language points (logical and, equal to, variable, and decimal integer constant) because they follow the symbol “” within the pair “{” and “}” of its control flow statement if. This signature is readable to some extent: we can see that the first developed concept refers to four actions, each of which may or may not be executed in every loop iteration, according to its corresponding condition.

4.3. Concept Assignment Results

To test the developed concepts for evaluation, we stopped the development of the brain by disabling its updating procedure. When receiving an input vector e that represents a program in the environment, the brain would still activate its recalling procedure and recall the neuron whose synaptic vector best matches e. The recalled neuron would be assigned to (or associated with) the program if the recalled neuron was in the set of developed concepts (i.e., was a developed concept), which means that the program is supposed to perform the abstract algorithm that the developed concept represents.
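The assignment step can be sketched as follows. The exact matching criterion of the recalling procedure is not given in this section, so the sketch assumes cosine similarity between the input vector and each synaptic vector; the function names and the array layout (one synaptic vector per row) are also assumptions.

```python
import numpy as np

def recall(synaptic_vectors, e):
    """Return the index of the neuron whose synaptic vector is most
    similar to e (cosine similarity, an assumption of this sketch)."""
    norms = np.linalg.norm(synaptic_vectors, axis=1) * np.linalg.norm(e)
    norms = np.where(norms == 0, 1.0, norms)  # avoid division by zero
    return int(np.argmax(synaptic_vectors @ e / norms))

def assign_concept(synaptic_vectors, developed, e):
    """Return the recalled neuron if it is a developed concept, else None.
    No synaptic vector is modified, since updating is disabled."""
    winner = recall(synaptic_vectors, e)
    return winner if winner in developed else None
```

A program whose recalled neuron lies outside the set of developed concepts receives no assignment, which is why only part of the environment ends up associated with a concept.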

In this way, the 78 developed concepts in the set were associated with 1229 out of the total 2341 program source codes in the environment, that is, 52.50% of all the program source codes. The continuous curve scaled by the right vertical axis in Figure 3(d) shows the number of program source codes that each developed concept was associated with. Seventy-seven developed concepts (i.e., 98.72% of the 78 developed concepts) have 10 or more recalls (associated program source codes). Only one has fewer than 10 recalls: the 45th developed concept has only one associated program source code. By comparing the matrix of the 45th developed concept with the matrix of the 42nd developed concept, we found that they were almost the same. This probably caused some program source codes that should have been assigned to the 45th developed concept to be wrongly assigned to the 42nd developed concept, so that the 42nd developed concept had a great number of recalls whereas the 45th developed concept had only one.

4.4. Evaluation of Developed Concepts

We found by inspection that program source codes which are assigned to the same developed concept always perform the same abstract algorithm if they solve the same problem. Table 2 shows that the fourth developed concept had twenty-six associated program source codes. Among them, seventeen solve problem 1, eight solve problem 2, and one solves problem 60. By inspecting these programs one by one, we found that all the programs that solve problem 1 perform the same abstract algorithm and that all the programs that solve problem 2 also perform the same abstract algorithm. Moreover, we found that the abstract algorithms of these two groups of programs are the same. However, we also found that the program that solves problem 60 performs a different abstract algorithm. We therefore consider that this program had a wrong concept assignment and that the other twenty-five programs had correct concept assignments. Thus, the fourth developed concept has an assignment accuracy of 25/26 ≈ 96.15%.

Table 3 lists the concept assignment accuracies of the first ten developed concepts as well as their recalls. The series of “” scaled by the left vertical axis in Figure 3(d) shows the assignment accuracy of each developed concept. Out of the 1229 program source codes that were associated with the 78 developed concepts, 1194 had their correct concept assignments. The overall accuracy of these concept assignments is 1194/1229 ≈ 97.15%.
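The percentages reported in Sections 4.3 and 4.4 can be reproduced from the stated counts. The helper name `percentage` is hypothetical; all counts are those given in the text.

```python
def percentage(part, whole):
    """Return part / whole as a percentage."""
    return 100.0 * part / whole

print(f"{percentage(25, 26):.2f}")      # fourth developed concept: 96.15
print(f"{percentage(1194, 1229):.2f}")  # overall assignment accuracy: 97.15
print(f"{percentage(1229, 2341):.2f}")  # programs associated with a concept: 52.50
print(f"{percentage(77, 78):.2f}")      # concepts with 10 or more recalls: 98.72
```

Note that the overall accuracy is computed over the 1229 programs that received an assignment, not over all 2341 programs in the environment.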

5. Conclusion

An algorithmic concept is an idea behind a group of programs that have similar concrete algorithms. Human programmers have many algorithmic concepts in their minds, which guide them in designing programs. Understanding which algorithmic concept was used to design a given program is a muddy task, especially in an uncontrollable environment that consists of program source codes from online judge systems on the Internet.

Our developmental model, inspired by an autonomous development paradigm, aims at tackling this muddy task. Unlike traditional approaches that cannot do anything beyond their predesigned representations, our model is able to develop its internal representation from randomness into algorithmic concepts that are not designed in advance. On the basis of an LCA neural network, the developmental process mimics the recalling procedure and the updating procedure in the human brain. Every program arrival triggers the recalling procedure to recall a neuron in the neural network. The updating procedure then modifies the synaptic weights of the recalled neuron with some algorithmic information of the program that triggered the recalling procedure. The algorithmic information of a program is characterized as an algorithmic signature, which is converted into a vector as an input to the neural network. After a neuron has been recalled and its synaptic weights updated enough times, its synaptic weights become a representation of an algorithmic concept shared by the programs that trigger the recalls of that neuron.

Such developed concepts are applicable to an algorithmic concept assignment task for program comprehension under the uncontrollable environment described above. In our experiment, 78 concepts were developed under an environment consisting of 2341 simple programs that come from an online judge system. During an algorithmic concept assignment task, 1229 out of the 2341 simple programs were associated with the 78 developed concepts with an overall concept assignment accuracy of 97.15%. Moreover, each of the developed concepts can be converted into its signature, which is readable to some extent, so that a human could understand what algorithm its associated programs perform.

Since the initial state of the internal representation has no relation with any algorithmic concept, we believe that the principle of our model can also be applied to the development of concepts in other areas.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research is supported by the National Natural Science Foundation of China (NSFC) under Grant no. 60973121.