Abstract

This paper proposes a general method by which robots can learn motions and the corresponding semantic knowledge simultaneously. A modified ISOMAP algorithm converts the sampled 6D joint-angle vectors into 2D trajectories, and the movements required for writing numbers are learned from this modified ISOMAP-based model. The same algorithm is used to establish the knowledge models. Both the learned motion models and the knowledge models are stored in a 2D latent space and are represented using the Gaussian Process (GP) method. Practical experiments are carried out on a humanoid robot, named ISAC, which learns the semantic representations of numbers and the movements of writing numbers through imitation, verifying the effectiveness of this framework. At the learning stage, ISAC learns not only the dynamics of the movement required to write the numbers, but also the semantic meaning of the numbers related to the writing movements, from the same data set. Given speech commands, ISAC recognizes the words and generates the corresponding motion trajectories to write the numbers. This imitation learning method is implemented on a cognitive architecture to provide robust cognitive information processing.

1. Introduction

Robots are expected to generate human-like behaviors in dynamic environments [1, 2]. However, it is very difficult for robots to develop skills or behaviors entirely from scratch without any initial knowledge. As stated in Sloman's paper, robots should learn both altricial and precocial behaviors after their "birth" [3]. Therefore, it is reasonable for robots to have some basic initial knowledge with motion primitives [4], or some basic initial skills, with which to explore the world and develop the new knowledge and skills needed to survive or complete tasks. Building on this initial knowledge and these skills, humans can teach robots more complex behaviors or skills to complete much more complex tasks.

Imitation learning (also called learning from demonstration or programming by demonstration) is now considered a powerful tool for transferring skills to robots (especially humanoid robots) [5]. Unlike the traditional teaching-executing mode, where robots simply record the trajectory programmed by human operators and move the joints and end effector along that trajectory, researchers have tried since the 1970s to train robots to learn simple motion patterns [6]. In the 1980s, Atkeson trained a robot to learn how to balance an inverted pendulum in an upright position through practice [7]. Since then, many imitation learning methods have been proposed in various areas [8]. In the 2000s, researchers found biological evidence and models of imitation learning in animals [9]. Gradually, imitation learning has been divided into two parts [10]: one is to train robots to learn the dynamics of movements [11], and the other is to train robots to learn the primitives in a behavior sequence [12].

The motivation of this paper is to find a method by which robots can learn motion models and semantic knowledge simultaneously within the current popular imitation learning framework. In the experimental part, a humanoid robot, named ISAC, is trained to learn to write numbers from a human teacher.

The rest of this paper is organized as follows. Section 2 introduces related work; Section 3 explains the system framework, the algorithms used in it, and the implementation on a cognitive architecture; Section 4 presents the experimental setup and results; Section 5 discusses the experimental results; Section 6 concludes the paper and outlines future work.

2. Related Work

2.1. Motion Learning

Demonstrations of motions are given by human teachers or other robots, and a robot student tries to record the demonstrations. There are many different ways of demonstrating motions: learning from observation [13], from joystick operation [14], by manually moving the arm of a robot [10], and from sensors on the human body [15, 16].

Sometimes the dimensionality of the recorded data is reduced by projecting the data from a high-dimensional data space to a low-dimensional space, called the latent space. Correspondingly, the data needs to be reconstructed from the low-dimensional space back to the high-dimensional space. The "dimension reduction" and "reconstruction" steps are not always required in current imitation learning research; they are applied in situations where the dynamics of the demonstrations or certain inner correlations need to be analyzed. Many dimension reduction methods have been proposed to extract features of the data, such as principal component analysis [17], factor analysis [18], ISOMAP [19], local linear embedding [20], and MDS [21]. A typical example of using dimension reduction is [10], in which Calinon and Billard proposed a method that uses dimension reduction to establish a strong coupling between the data in the latent space and the data space, and uses the data distribution in the latent space to ensure that the generated behavior has inner dynamics and constraints similar to the demonstrations.

The learned motion models are stored in the memory (database) of robots, where robots store learned knowledge or skills. Linear Global Models (LGM) [22], Gaussian Processes (GP) [23], Locally Weighted Regression (LWR) [24], Locally Weighted Projection Regression (LWPR) [25], Principal Curves (PC), Gaussian mixture models [26], and Artificial Neural Networks (ANN) [27] have been used to represent the models in memory.

2.2. Semantic Knowledge Learning

Robots need to understand the learned motions, which means that robots need to relate these motion models to corresponding semantic knowledge. This is normally done by labeling the motion models with a semantic name or with a semantic description of related tasks.

2.3. Generation

Given a similar but slightly different situation (where robots need to complete the same type of task with different parameters), a command, or an outside trigger (signal, image, etc.), the required actions are planned and the required motion models are retrieved by searching for their corresponding behavior names in the "labeled behavior models."

If needed, the parameters of the motions are modified to adapt to the similar but slightly different situation. The generated behaviors are described as actions with specified parameters. Dynamic Movement Primitives (DMPs) [16] are widely used for generating motions that have dynamics similar to the demonstrations and can achieve various targets. Calinon et al. proposed a method to minimize the weighted distance between the generated motions and the learned motions both in the latent space and in the original data space [10]. Peters used reinforcement learning [28] to let robots adapt the parameters of motion models and generate similar motions in similar but slightly different situations [29]. Theodorou applied optimal control [30] in a reinforcement learning setting for robots to learn the motion models of demonstrations and generate similar motions using DMPs [31, 32].

If the data is stored in a latent space, the generated trajectories of motions need to be projected from the latent space to the original data space, for example, joint space.

2.4. Motivation

Robots need to learn the motions of the behaviors and the semantic meaning of the behaviors in two stages, as described in the previous sections. One of the learning stages is still like a programming process, in which the behavior names are assigned to the motion models manually by human teachers.

An important problem is that, although robots can learn the motions of the demonstrations and use them in a similar but slightly different situation, how can robots use such learned knowledge in areas beyond executing movements, for example, recognition, semantic understanding, reasoning, and planning? Especially in writing, the learned movements should relate to the semantic meanings of the letters, numbers, or symbols, and robots may use the learned movements to find their higher-level semantic meanings. When we see someone demonstrate how to write a character, we directly form the meaning of that character in our brain, and when we hold the method of writing a character in our brain, we can evaluate the result against a real character. A game familiar to most of us is one where someone writes letters on our back with a finger and we try to guess what he/she is writing. Obviously, we humans can use the sensing information from our back to reconstruct the trajectory of the finger's movement in our brain and compare it with our learned knowledge about the letters. In this paper, the robot uses encoders to sense the movement of the joints in the joint space and tries to match the sensed movement with the learned knowledge about the numbers. Then human teachers do not need to tell the robot what the number is, and the robot can automatically relate the learned writing movements to the corresponding characters.

3. System Design

In this paper, we propose a general framework, shown in Figure 1, with which robots can complete the two learning stages mentioned above simultaneously.

As Figure 1 shows, the contribution of this proposed framework is that the information for both the motion models and the knowledge models comes from a single source. The robot uses the information in the demonstrations to learn both the motion models and the knowledge models. In this paper, ISAC learns how to write numbers and automatically relates the motions of writing numbers to their semantic knowledge models.

3.1. Demonstration

Demonstrations are given by human teachers. In this paper, demonstrations are shown by manually moving the right arm of ISAC.

The recorded data is $\theta_s = \{\theta_v, t\}$, which is an $N \times 7$ matrix. $\theta_v$ records the angles of the six joints of ISAC's right arm, and $t$ is the temporal information.

3.2. Feature Extraction

For most situations, robots need to learn two features: motions and semantic meaning. The extracted information is stored in corresponding models.

3.2.1. Motion Models Learning

As mentioned in the introduction, there are many methods to represent motion data. In this paper, we use a modified ISOMAP algorithm [33] to project the 6-dimensional data space onto a 2-dimensional space. The writing movements take place in a 3-dimensional Cartesian space and are driven by 6 joints. However, the features of the writing are 2-dimensional because the characters are written on a 2-dimensional plane, so it is reasonable to use features on a 2-dimensional plane for further use. Additional motivations for using this modified ISOMAP algorithm are to visualize the sampled data on a 2-dimensional plane so that researchers can find the features of the motions easily, and to ensure that the trajectory on the 2-dimensional plane does not overlap or intersect itself. Using this algorithm, the spatial and temporal characteristics of the sampled data can be visualized on a 2-dimensional plane.

We want to emphasize that dimension reduction is not necessary for all applications. In this paper, dimension reduction is convenient for extracting the features and using them in recognition.

The original ISOMAP algorithm is specified as follows.

(1) Sample the points on the demonstration trajectory:
$$\theta_s = \{\theta_v, t\}. \tag{1}$$

(2) Compute the geodesic distance matrix $D_{Ms}$:
$$D_{Gs}(i,j) = \begin{cases} \left\|\theta_{v_i} - \theta_{v_j}\right\|, & \text{if } \left\|\theta_{v_i} - \theta_{v_j}\right\| \le d, \\ \infty, & \text{otherwise}, \end{cases}$$
$$D_{Ms}(i,j) = \min\left\{D_{Gs}(i,j),\ D_{Gs}(i,k) + D_{Gs}(k,j)\right\}, \quad k = 1, 2, \ldots, N. \tag{2}$$
$D_{Ms}$ is iteratively calculated until the values of its elements converge.

In the original ISOMAP algorithm, $\|\theta_{v_i} - \theta_{v_j}\|$ is the Euclidean spatial distance between two points $\theta_{v_i}$ and $\theta_{v_j}$, and neighbors are defined by this spatial distance. In our modified ISOMAP algorithm, neighbors are instead defined by the temporal distance between the two points. $\theta_v$ records the angles of the six joints: $\theta_v = \{\theta_{v_1}, \theta_{v_2}, \ldots, \theta_{v_N}\}$.

(3) Compute the inner products:
$$\tau(D_{Ms}) = -\frac{1}{2} H S H, \tag{3}$$
where $S_{ij} = D_{Ms}^2(i,j)$ and $H_{ij} = \delta_{ij} - 1/N$ ($\delta_{ij} = 1$ when $i = j$, $\delta_{ij} = 0$ when $i \neq j$).

(4) Compute the new coordinates of the sampled points in the latent space $X$:
$$x_i = \left(\sqrt{\lambda_{s1}}\,\alpha_{s1}, \sqrt{\lambda_{s2}}\,\alpha_{s2}\right)^T, \tag{4}$$
where $\lambda_{s1}$ and $\lambda_{s2}$ are the two largest eigenvalues of $\tau(D_{Ms})$, with corresponding eigenvectors $\alpha_{s1}$ and $\alpha_{s2}$.

The modified ISOMAP method, which reflects both the temporal and spatial relationships between sampled data points on a two-dimensional plane, is used to train robots to learn the motions of writing letters on a two-dimensional plane [33].

The original ISOMAP is an extension of MDS, which constructs the distance matrix by connecting the sampled points through their neighbors. The original ISOMAP describes the distance between neighboring sampled points. In order to capture the temporal information of the sampled trajectory, in our algorithm the neighbors are strictly defined as temporal neighbors. The spatial relationships are not predefined but are calculated by this modified ISOMAP algorithm. The modification is
$$D_{Gt}(i,j) = \begin{cases} \left\|\theta_{v_i} - \theta_{v_j}\right\|, & \text{if } \left|t_i - t_j\right| \le s, \\ \infty, & \text{otherwise}, \end{cases} \tag{5}$$
where $s$ is the temporal threshold value. In (2), $d$ is the spatial threshold value.

Using this method, $D_{Gt}$ and the corresponding $\tau(D_{Mt})$ are calculated. The sampled points in the latent space are represented as $y_i = \left(\sqrt{\lambda_{t1}}\,\alpha_{t1}, \sqrt{\lambda_{t2}}\,\alpha_{t2}\right)^T$, where $\lambda_{t1}$ and $\lambda_{t2}$ are the two largest eigenvalues of $\tau(D_{Mt})$, with corresponding eigenvectors $\alpha_{t1}$ and $\alpha_{t2}$.
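To make the procedure concrete, the following Python sketch (our illustration, not the authors' code) implements the modified ISOMAP described above: joint-space distances are kept only between temporal neighbors, geodesic distances are obtained by shortest paths, and classical MDS (Eqs. (3)-(4)) yields the 2D coordinates. The function name, the threshold argument `s`, and the dense Floyd-Warshall pass are our simplifying assumptions.

```python
import numpy as np

def modified_isomap(theta_v, t, s):
    """Project joint-angle samples into a 2D latent space.

    theta_v : (N, 6) array of joint angles along the demonstration.
    t       : (N,) array of sample times.
    s       : temporal threshold; only points with |t_i - t_j| <= s
              are connected as neighbors (the paper's modification).
    """
    N = len(t)
    # Neighbor graph: Euclidean distance in joint space between
    # temporal neighbors; all other pairs start at infinity (Eq. (5)).
    D = np.full((N, N), np.inf)
    for i in range(N):
        for j in range(N):
            if abs(t[i] - t[j]) <= s:
                D[i, j] = np.linalg.norm(theta_v[i] - theta_v[j])
    # Geodesic distances via Floyd-Warshall shortest paths (Eq. (2)).
    for k in range(N):
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    # Classical MDS: tau(D) = -1/2 H S H (Eq. (3)).
    S = D ** 2
    H = np.eye(N) - np.ones((N, N)) / N
    tau = -0.5 * H @ S @ H
    # Coordinates from the two largest eigenpairs (Eq. (4)).
    w, V = np.linalg.eigh(tau)
    idx = np.argsort(w)[::-1][:2]
    return V[:, idx] * np.sqrt(w[idx])  # (N, 2) latent coordinates
```

Setting `s` large enough to cover the whole demonstration makes every pair of points neighbors, which recovers the behavior used for the knowledge models in Section 3.2.2.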

Jenkins and Matarić proposed the spatio-temporal ISOMAP algorithm in 2004 [34]. Their method is comprehensive and defines the types of neighbors in detail; the construction of the distance matrix differs for different neighbor types. In our method, we simply add temporal constraints to the construction of the distance matrix and strictly assume that all neighborhood relations are temporal. This method is simple and computationally convenient. Both Jenkins's method and ours are effective at describing the spatio-temporal characteristics of the sampled data points.

In current imitation learning, behaviors are specific robotic movements in certain task-related situations. This means we can assume that the sampled data from demonstrations of one behavior always lie on the same manifold in the data space, and that the results of projecting data from the latent space back to the data space must lie on the same manifold as the demonstration. Therefore, it is reasonable to assume that there exists a relationship between the data in the data space and in the latent space, described as a function:
$$\theta_{v_i} = f(x_i, W), \tag{6}$$
where $x_i$ is a data point in the latent space and $\theta_{v_i}$ is the corresponding data point in the original data space.

Accordingly, $f(X, W)$ is designed as a generalized linear regression model:
$$\theta_{v_i} = W \Phi(x_i). \tag{7}$$

$\Phi(x)$ is composed of $R$ basis functions:
$$\Phi_i(x) = \exp\left(-\frac{\left\|x - c_i\right\|^2}{\Sigma_i}\right), \quad i = 1, 2, \ldots, R, \tag{8}$$
where $c_i$ is the center of the $i$th basis function and $\Sigma_i$ is the bandwidth. The centers are uniformly distributed in the latent space, and the bandwidth is chosen so that the basis functions cover the latent space.

$W$ is a $D \times R$ matrix which projects the data from the latent space to the data space. However, Bishop has shown that the number of basis functions must typically grow exponentially with the dimensionality of the input space [35]. This means that the computational and storage advantage of dimension reduction diminishes as the dimensionality increases. In Section 4, comparisons of results using different numbers of basis functions are given.

Assuming the projection matrix $W$ is known, the probability distribution of the points in the data space is
$$p\left(\theta_i \mid x_i, W, \beta\right) = \left(\frac{\beta}{2\pi}\right)^{D/2} \exp\left(-\frac{\beta}{2}\left\|f(x_i, W) - \theta_i\right\|^2\right). \tag{9}$$

The likelihood of the points in the data space is the product of the distribution probabilities of the individual points; its logarithm is
$$L(W, \beta) = \sum_{i=1}^{N} \ln p\left(\theta_i \mid x_i, W, \beta\right). \tag{10}$$

The log likelihood is maximized by differentiating it with respect to $W$ and setting the derivative to zero:
$$\sum_{i=1}^{N}\left(W\Phi(x_i) - \theta_i\right)\Phi(x_i)^T = 0. \tag{11}$$

Rewriting (11):
$$W \sum_{i=1}^{N}\Phi(x_i)\Phi(x_i)^T = \sum_{i=1}^{N}\theta_i \Phi(x_i)^T. \tag{12}$$

The projection matrix $W$ can then be calculated from (12):
$$W = \left(\sum_{i=1}^{N}\theta_i \Phi(x_i)^T\right)\left(\sum_{i=1}^{N}\Phi(x_i)\Phi(x_i)^T\right)^{\dagger}, \tag{13}$$
where $(\cdot)^{\dagger}$ denotes the Moore-Penrose pseudoinverse of $\sum_{i=1}^{N}\Phi(x_i)\Phi(x_i)^T$.
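A minimal sketch of this maximum-likelihood fit, under the assumption that the basis centers and a shared bandwidth are given (the function and argument names below are our own):

```python
import numpy as np

def fit_projection(X_latent, Theta, centers, bandwidth):
    """Fit W in theta = W Phi(x) by least squares (Eq. (13)).

    X_latent : (N, 2) latent coordinates.
    Theta    : (N, 6) joint-angle samples.
    centers  : (R, 2) basis-function centers, uniformly spread
               over the latent space.
    bandwidth: scalar Sigma shared by all basis functions.
    """
    # Phi[i, r] = exp(-||x_i - c_r||^2 / Sigma)   (Eq. (8))
    diff = X_latent[:, None, :] - centers[None, :, :]
    Phi = np.exp(-np.sum(diff ** 2, axis=2) / bandwidth)  # (N, R)
    # W = (sum_i theta_i Phi_i^T) (sum_i Phi_i Phi_i^T)^+  (Eq. (13))
    A = Theta.T @ Phi                 # (6, R)
    B = Phi.T @ Phi                   # (R, R)
    return A @ np.linalg.pinv(B)      # (6, R) projection matrix
```

Reconstruction is then a single matrix product, `Theta_hat = Phi @ W.T`, which is Eq. (7) applied to every latent point at once.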

In the "feature extraction" block, the sampled trajectory is projected from the 6-dimensional data space to a 2-dimensional space using the original ISOMAP algorithm and the modified ISOMAP algorithm.

In the latent space, we have a set of data points $X$ in a two-dimensional space. As stated in previous sections, in "dimension reduction" the temporal information is used only to calculate the neighborhood graph. In the "behavior planning" stage, however, the temporal information should be incorporated into the model and used as the enquiry point.

Data points in the latent space follow
$$x_i = f(t_i), \quad i = 1, 2, \ldots, N. \tag{14}$$

Using a Gaussian process [22], we can obtain a kernel-based model of the demonstration in the latent space. The points on the two-dimensional plane are described as $x_i = (x_{i1}, x_{i2})$, and one GP model is used for each dimension of the latent space. GPs have been widely used [36-39] for representing sampled data points because of their robustness and nonparametric nature.

Assume the $N$ two-dimensional data points in the latent space have the following probabilistic distribution:
$$p(z \mid x) = \mathcal{N}\left(z \mid x, \beta^{-1} I\right). \tag{15}$$
Taking the calculation in the first dimension as an example, $p(z) = \mathcal{N}(z \mid 0, C_N)$, where the covariance matrix is $C(n,m) = k(t_n, t_m) + \beta^{-1}\delta_{nm}$ and $z$ is the vector of target values.

$k(t_n, t_m)$ is the kernel function. Normally,
$$k(t_n, t_m) = \theta_0 \exp\left(-\frac{\theta_1}{2}\left\|t_n - t_m\right\|^2\right) + \theta_2 + \theta_3\, t_n^T t_m, \tag{16}$$
where $t_n$ is the timing step in the demonstration.

In the "Generation" stage, a new time step $t_{\text{enquiry}}$ is given as an enquiry point, and the GP is used to calculate the corresponding data value $z_{\text{enquiry}}$:
$$p\left(z, z_{\text{enquiry}}\right) = \mathcal{N}\left(z, z_{\text{enquiry}} \mid 0, C_{N+1}\right). \tag{17}$$

The covariance matrix is
$$C_{N+1} = \begin{pmatrix} C_N & k \\ k^T & c \end{pmatrix}, \tag{18}$$
where $k_n = k(t_n, t_{\text{enquiry}})$ for $n = 1, 2, \ldots, N$.

Using the Bayesian method, $z_{\text{enquiry}}$ in the first dimension is calculated as
$$z_{\text{enquiry}} = k^T C_N^{-1} z. \tag{19}$$

Using the same method, the value in the second dimension is calculated from its own GP model:
$$z'_{\text{enquiry}} = k^T C_N^{-1} z'. \tag{20}$$
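The following sketch shows GP prediction for one latent dimension using the kernel of Eq. (16); the hyperparameter values and the noise precision `beta` are placeholders, since the paper does not report the values it used:

```python
import numpy as np

def gp_predict(t_train, z_train, t_query,
               theta=(1.0, 1.0, 0.1, 0.1), beta=100.0):
    """Predict z_enquiry = k^T C_N^{-1} z  (Eq. (19))."""
    th0, th1, th2, th3 = theta

    def k(a, b):
        # Eq. (16): th0 exp(-th1/2 (t_n - t_m)^2) + th2 + th3 t_n t_m
        return th0 * np.exp(-0.5 * th1 * (a - b) ** 2) + th2 + th3 * a * b

    # C_N(n, m) = k(t_n, t_m) + beta^{-1} delta_nm
    C = k(t_train[:, None], t_train[None, :]) + np.eye(len(t_train)) / beta
    k_vec = k(t_train, t_query)
    return k_vec @ np.linalg.solve(C, z_train)
```

One such model is fitted per latent dimension, so a full query evaluates `gp_predict` twice, once for each coordinate.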

In this part, the input is a data set of sampled points in the 6-dimensional joint space, and the output is a GP model of the trajectory in the latent space. Given an enquiry point (normally a timing point), the output of this model, via (19) and (20), is the corresponding data point on the trajectory.

Using the following equation, we can project the data from the low-dimensional latent space back to the original data space:
$$\theta_{v_i} = W \Phi(x_i). \tag{21}$$

These data points are used by the robot to learn how to generate the required movement trajectories. In robotic imitation learning, robots need to fit the recorded movement trajectory with the model used in a generator [16, 40]. The fitting process is considered as learning a pattern in a generator.

In our system, we use Dynamic Movement Primitives (DMP) [11], proposed by Ijspeert, as the pattern generator.

The DMP is configured as
$$\tau \dot{z} = \alpha_z\left(\beta_z (g - y) - z\right), \qquad \tau \dot{y} = z + f. \tag{22}$$

$g$ is the goal state, $z$ is the internal state, $f$, an RFWR model, is computed to capture the dynamics of the demonstration and to guarantee convergence of the newly generated trajectories, $y$ is the position generated by the DMP differential equations, and $\dot{y}$ is the correspondingly generated velocity. $\alpha_z$, $\beta_z$, and $\tau$ are constants in this equation.

The fitting (or learning) trains the robot to learn the model
$$f = \frac{\sum_{i=1}^{N}\Psi_i w_i v}{\sum_{i=1}^{N}\Psi_i}, \tag{23}$$

and $v$ satisfies
$$\tau \dot{v} = \alpha_z\left(\beta_z (g - x) - v\right), \qquad \tau \dot{x} = v. \tag{24}$$

$\Psi_i$ is a receptive basis function distributed in the space:
$$\Psi_i = \exp\left(-\frac{1}{2} h_i^2 \left(x - c_i\right)^2\right). \tag{25}$$

$c_i$ is the center of the basis function, which is distributed in the space, and $h_i$ is the bandwidth.

The target is to use the sampled points as $x$ and apply an iterative learning method so that the robot adapts the parameters $w_i$. After learning, the parameters are fixed and do not need to change at the generation stage:
$$\Delta_i^{(n+1)} = \exp\left(-\frac{1}{2}\left(x^{(n+1)} - c_i\right)^T D \left(x^{(n+1)} - c_i\right)\right),$$
$$w_i^{(n+1)} = \lambda w_i^{(n)} + \Delta_i^{(n+1)}. \tag{26}$$
The superscript $(n+1)$ denotes the $(n+1)$th iteration, $x^{(n+1)}$ is the data point used to update the model at that iteration, and $\Delta_i^{(n+1)}$ is the weighted distance between the data point $x^{(n+1)}$ and the center of the basis function, which is used to update the weight $w_i$.
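As an illustration of how the learned forcing term drives the generator, the sketch below integrates the DMP of Eqs. (22)-(25) with simple Euler steps; the gains `alpha_z`, `beta_z`, the step size, and the basis parameterization are common choices for DMPs, not values taken from the paper:

```python
import numpy as np

def dmp_rollout(g, y0, weights, centers, widths,
                tau=1.0, dt=0.01, steps=1000,
                alpha_z=25.0, beta_z=6.25):
    """Generate a 1D trajectory toward goal g from start y0."""
    y, z = y0, 0.0      # transformation system state (Eq. (22))
    x, v = y0, 0.0      # canonical system state (Eq. (24))
    traj = []
    for _ in range(steps):
        psi = np.exp(-0.5 * widths ** 2 * (x - centers) ** 2)  # Eq. (25)
        f = np.sum(psi * weights * v) / (np.sum(psi) + 1e-10)  # Eq. (23)
        v += alpha_z * (beta_z * (g - x) - v) / tau * dt       # Eq. (24)
        x += v / tau * dt
        z += alpha_z * (beta_z * (g - y) - z) / tau * dt       # Eq. (22)
        y += (z + f) / tau * dt
        traj.append(y)
    return np.array(traj)
```

Because the forcing term is scaled by the canonical velocity `v`, which decays as `x` approaches `g`, the linear part of Eq. (22) eventually dominates and the trajectory converges to the goal, which is the convergence property the text refers to.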

3.2.2. Semantic Knowledge Learning

Recording and writing the characters are not enough for robots to interact with humans; robots should also understand the semantic meaning of the motions and correlate the motion models with the semantic knowledge models automatically.

The general algorithm is shown in Figure 2.

The extracted feature of the demonstration is compared with the templates. The classification result automatically assigns the semantic meaning of the template to the corresponding models. The learned motions should be related to their semantically relevant templates; for example, the motions of writing characters should be related to the shapes or topologies of the characters.

In this paper, ISAC is trained to write numbers and to learn the semantic meaning of the numbers automatically. Since the demonstration is writing, the templates should be the shapes or topologies of the numbers. Correspondingly, in this part, the original ISOMAP method, which reflects the spatial topology of the sampled data points, is used to train the robot to learn the semantic meaning of the motions. In order to reflect the overall spatial topology of the sampled data points, all of the points are considered temporal neighbors of each other. For simplicity, the modified ISOMAP algorithm can also be applied in this modeling part by setting the temporal distance threshold $s$ to the size of the sampled data set.

Using the original ISOMAP algorithm, the demonstration of the motions of writing letters is projected onto a two-dimensional plane with the corresponding projection matrices. The recorded trajectories in the latent space are normalized to the same scale; in this paper, the ranges of both the x-axis and the y-axis are [0, 1]. Normalization is necessary because demonstrations given by human teachers can have different scales, and the processed demonstrations must be comparable with the commands given at the generation stage.
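A minimal sketch of this per-axis rescaling (the helper name is our own):

```python
import numpy as np

def normalize_trajectory(traj):
    """Rescale an (N, 2) latent trajectory so that both axes span
    [0, 1], removing scale differences between demonstrations."""
    lo, hi = traj.min(axis=0), traj.max(axis=0)
    return (traj - lo) / (hi - lo)
```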

The techniques for establishing models of the recorded number-writing trajectories and for recognizing the numbers based on templates are not the focus of this paper; interested readers can find many advanced character segmentation and recognition methods in the literature.

In this paper, an Optical Character Recognition (OCR) software tool, Tesseract-OCR (developed by Hewlett-Packard and currently maintained by Google), is used by ISAC to recognize a knowledge model by comparing it with the characters in the database. In practice, the results of recognizing a single number with Tesseract-OCR are poor. Therefore, the picture of a knowledge model is resized to the normal size of a letter and placed behind the sentence "This is." The recognition result is then "This is **," and by discarding "This is," ISAC obtains the semantic knowledge of the recognized picture.
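A sketch of this trick using pytesseract, a Python wrapper for the Tesseract engine (the paper does not state which interface the authors used, and the sizes and positions below are illustrative assumptions):

```python
from PIL import Image
import pytesseract  # Python wrapper for Tesseract-OCR (an assumption)

def recognize_number(model_img, context_img):
    """OCR a knowledge-model picture by embedding it after "This is".

    context_img is assumed to be a pre-rendered image of the text
    "This is" with blank space on its right for the pasted character.
    """
    char = model_img.resize((32, 48))        # "normal size of a letter"
    page = context_img.copy()
    page.paste(char, (page.width - 40, 8))   # placement is illustrative
    text = pytesseract.image_to_string(page).strip()
    return text.replace("This is", "").strip()  # keep only the number
```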

After recognition, the motion models are automatically assigned a semantic meaning and a corresponding template based on the recognition results.

The labeled motion models are stored in the “labeled behavior model” block.

In the "Behavior Modeling" block, the projection matrices are calculated using a typical learning algorithm. The trajectories in the latent space and their corresponding projection matrices are stored in the "Behavior Models" block. Given a command, ISAC parses or recognizes the command and converts it into actions in the "command analysis" block, then retrieves the behavior models from the "Behavior Models" block. The "reconstruction" block projects the trajectory from the latent space to the joint space to generate new behaviors.

3.3. Generation

In the generation stage, a command is sent to the robot, which must analyze the command and convert it into actions with specified parameters. The required motion model is then obtained by searching for the semantic names of the actions with specified parameters. If the motion models are stored in the latent space, a reconstruction is needed to project the motion model from the latent space to the original data space; otherwise, the motion models are used directly. Finally, the robot executes the motions to complete the task.

3.3.1. Command Analysis

In most imitation learning situations, robots are given a task-related situation: the starting states and the goal states are given, and the robot needs to use learned behaviors to complete the task (achieve the goal state). In this paper, speech commands are used for the robot to understand the task.

Using speech, the robot needs to listen to the command from a human operator, recognize the required information, and convert the information into actions with specified parameters [41]:
$$\text{Action(parameter)} + \text{Action(parameter)} + \cdots + \text{Action(parameter)}. \tag{27}$$

Figure 3 displays the general method of analyzing the command.

The robot breaks the commands down into different parts by finding matching words in the lexicons using certain grammar rules. Subjects, actions, objects, targets, and environments are predefined in the lexicons. Using these rules, the commands are converted into actions as shown in (27). This is a typical natural language processing method; readers may refer to the book in [42].

For the example of writing numbers, the lexicon design is

Subject: ISAC
Action: Write
Object: Zero, One, Two, Three, Four, Five, Six, Seven, Eight, Nine, and Ten.

The grammar design is Action + Object.

Upon receiving a command from a human teacher, ISAC extracts the "Object" information from the command and retrieves the corresponding behavior model from the "Labeled Behavior Models" block by searching for the required "Object" among the behavior names. The implementation uses the Microsoft speech recognition library.
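A minimal sketch of this Action + Object analysis over the lexicons listed above; the keyword matching is our illustration, not the internal logic of the Microsoft library:

```python
ACTIONS = {"write"}
OBJECTS = {"zero", "one", "two", "three", "four", "five",
           "six", "seven", "eight", "nine", "ten"}

def parse_command(utterance):
    """Convert a recognized utterance into Action(parameter) form."""
    words = utterance.lower().split()
    action = next((w for w in words if w in ACTIONS), None)
    obj = next((w for w in words if w in OBJECTS), None)
    if action and obj:
        return f"{action.capitalize()}({obj.capitalize()})"
    return None

# parse_command("ISAC write six")  ->  "Write(Six)"
```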

3.3.2. Reconstruction and Execution

If the motion models are stored in the latent space, the models can be projected from the latent space to the original data space using (6).

In this paper, a GP-based model describes the motion models in the latent space. Therefore, (6) is rewritten as
$$\theta = W \Phi(z), \tag{28}$$
where $z$ is the data point obtained from the GP model given an enquiry point.

The required data in the joint space is then obtained, and the robot moves the actuator along the generated trajectory.

Using (28), the required trajectories can be computed as $\theta_d$.

Using forward kinematics, positions and orientations can be computed as
$$X_d = \text{ForwardKinematics}\left(\theta_d\right). \tag{29}$$

The target of generating a similar motion trajectory to complete a task is to minimize the error between the demonstrated trajectory and the generated trajectory.

We define the quadratic cost at each timing step as
$$L_k = \left(X_{d_k} - X_{g_k}\right)^T W_k \left(X_{d_k} - X_{g_k}\right), \tag{30}$$

where $X_d$ is the desired trajectory (a demonstrated trajectory in this paper), $X_g$ is the generated trajectory, and $k$ is the timing step.

$L_k$ represents the weighted error between the demonstrated trajectory and the generated trajectory at timing step $k$. The target is to minimize the overall cost
$$\Phi(N) + \sum_{k=1}^{N-1} L_k, \tag{31}$$
where $\Phi(N)$ is the terminal cost, normally defined as
$$\Phi(N) = \left(X_{d_N} - X_{g_N}\right)^T W_N \left(X_{d_N} - X_{g_N}\right). \tag{32}$$

For simplicity, in our algorithm, $W_k$ and $W_N$ are defined as identity matrices.

The control process is an integration of sensing and planning. In this paper, we do not focus on low-level actuator control. Because the regulators on ISAC are commercial devices that behave like "black boxes," we assume that a regulator can automatically adjust its control output to achieve the control goal when a reference input is given.

The initial position and orientation are computed by
$$X_0 = \text{ForwardKinematics}\left(\theta_0\right). \tag{33}$$

At timing step $k$, $\theta_{s_{k-1}}$ is the sensed joint angle vector at timing step $k-1$, and $\theta_{g_k}$ is planned based on the current sensing information:
$$X_{g_k} - X_{g_{k-1}} = J\left(\theta_{g_k} - \theta_{s_{k-1}}\right), \tag{34}$$
where $X_{g_{k-1}}$ is computed by
$$X_{g_{k-1}} = \text{ForwardKinematics}\left(\theta_{s_{k-1}}\right), \tag{35}$$
and $J$ is the Jacobian matrix.

The target is to minimize
$$L_k = \left(X_{d_k} - X_{g_k}\right)^T W_k \left(X_{d_k} - X_{g_k}\right). \tag{36}$$

Rewriting (36) with $W_k = I$:
$$L_k = \left(X_{d_k} - J\left(\theta_{g_k} - \theta_{s_{k-1}}\right) - X_{g_{k-1}}\right)^T \left(X_{d_k} - J\left(\theta_{g_k} - \theta_{s_{k-1}}\right) - X_{g_{k-1}}\right). \tag{37}$$

Minimizing this cost function by differentiating $L_k$ with respect to $\theta_{g_k}$ and setting the derivative to zero gives
$$\theta_{g_k} = \left(J^T J\right)^{-1} J^T \left(X_{d_k} - X_{g_{k-1}} + J\,\theta_{s_{k-1}}\right). \tag{38}$$

At each timing step, $\theta_{g_k}$ is given to the regulator as the reference input for low-level actuator control.
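A sketch of one planning step under these definitions; `forward_kinematics` and `jacobian` are placeholders for ISAC's arm model, which the paper does not spell out:

```python
import numpy as np

def plan_step(X_d_k, theta_s_prev, forward_kinematics, jacobian):
    """Plan theta_g_k from Eqs. (34)-(38) with W_k = I."""
    X_g_prev = forward_kinematics(theta_s_prev)   # Eq. (35)
    J = jacobian(theta_s_prev)
    # Least-squares solution of J (theta_g - theta_s) = X_d - X_g_prev,
    # equivalent to Eq. (38) when J has full column rank.
    return theta_s_prev + np.linalg.pinv(J) @ (X_d_k - X_g_prev)
```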

3.4. Implementation

This framework is implemented on a cognitive architecture, named ISAC cognitive architecture, developed by the Center for Intelligent Systems of Vanderbilt University [4345].

Figure 4 displays the system design of the ISAC Cognitive Architecture.

3.4.1. Perceptual Agents (PA)

The PA obtains sensory information from the environment. Normally, the encoders on the robot's joints, the cameras on its head, and the force feedback sensor on its wrist are implemented in this agent.

3.4.2. Short-Term Memory (STM)

The obtained information is sent to and stored in the STM. The Sensory Ego Sphere (SES) is implemented in the STM; it performs spatio-temporal coincidence detection, mediates the salience of each percept, and facilitates perceptual binding.

3.4.3. Working Memory System (WMS)

The WMS stores the task-related information in chunks. This component is especially important in the generation stage.

3.4.4. Central Executive Agent (CEA)

The CEA provides central processing, decision making, and control policy generation for the different task goals stored in the Goals Agent (GA). In the hierarchical architecture, this component accesses all of the sensed information and makes decisions for tasks.

3.4.5. Goal Agent (GA)

Correspondingly, the GA stores the motivations or goals of tasks in various situations.

3.4.6. Long-Term Memory (LTM)

The LTM stores memory, especially knowledge, for long-term use. Procedural, semantic, and episodic knowledge are stored in this component. In imitation learning, the learned skills or knowledge are stored as procedural and episodic knowledge using mathematical models.

3.4.7. Internal Rehearsal System (IRS)

The IRS evaluates the results of the decisions made by the CEA through internal rehearsal.

3.4.8. Work Loops

Using this architecture, we can develop three work loops: reactive, routine, and deliberative.

The Reactive Loop is inside the Perception-Action Agent. The Perceptual Agent collects sensory information from the environment. Using the first-order agent, necessary actions are taken by the Actuator Agent to affect the environment and the robot's body. This control loop is used by the robot to handle emergencies or unexpected changes in the environment.

The Routine Loop runs within the Perception-Action Agent, the Filtering and Focusing Agent, the STM, and the WMS. This loop completes routine tasks that are well defined in the WMS. The robot obtains the task-related information from the WMS and sends it to the actuators through the Filtering and Focusing Agent. The actuators are driven by the received information to complete tasks. The Routine Loop also involves the Reactive Loop to handle unexpected changes in the environment.

The Deliberative Loop is used for robots to learn new behaviors or skills through modeling, knowledge coupling, and so forth, and to complete new tasks or select behaviors to complete tasks using reasoning, decision making, and so forth. The CEA is the central component in this loop. It retrieves the stored knowledge from the LTM, receives the environmental information from the STM and the WMS, and uses the IRS to evaluate the current situation in order to make decisions or establish models of the sensed information. When a decision is made, the task-related information is sent to the WMS and the system uses the Routine Loop to complete the task. The Deliberative Loop involves both the Reactive Loop and the Routine Loop. Our system is largely based on the Deliberative Loop.

Figure 5 displays the relationships among the three work loops.

3.4.9. Learning Stage

At the learning stage, as shown in Figure 6, ISAC collects information from the encoders using the PA and sends the sensory information to the CEA. The CEA obtains the original ISOMAP algorithm and the modified ISOMAP algorithm from the LTM and calculates the motion models and the knowledge models. Using the RMS, ISAC establishes the labeled behavior models based on the pre-learned semantic knowledge and stores the models in the LTM.

The corresponding work loop is displayed in Figure 7.

3.4.10. Generation Stage

At the generation stage, as shown in Figure 8, given a speech command, ISAC collects the speech information using the PA and sends it to the CEA through the STM. By analyzing the speech command, the CEA generates the corresponding actions with specified parameters. The required behavior model is obtained from the LTM by searching the RMS. Based on the obtained behavior model, the CEA plans the motions according to the goal and sends the motion information to the WMS. The WMS stores the task-related information and sends the control commands to the AA to execute the motions.

The corresponding work loop is displayed in Figure 9.

4. Experimental Results

A humanoid robot, named ISAC, is used to validate our proposed system (as shown in Figure 10). ISAC is a stationary, pneumatically driven humanoid robot with seven Degrees-of-Freedom (DOFs) on each arm (including one DOF, open and close, for the end effector). In this system, we only used the right arm of ISAC to demonstrate the movement trajectories of writing and to write the required numbers. The pen is always grasped by the end effector, and we use only six DOFs of the right arm. Two cameras mounted on the robot allow ISAC to observe the environment, and we developed an OpenCV-based program to capture and process the images obtained from the cameras. A personal computer with a 1 GHz CPU controls the arm of ISAC, a personal computer with a 2.4 GHz CPU processes the images, and a laptop with a 2.4 GHz CPU stores the semantic knowledge models and the movement trajectory models.

ISAC is shown how to write letters by manually moving its right arm, as shown in Figure 12. Figure 11 displays the letters used in the demonstrations. The topologies of the writing movements in the Cartesian space are the same as the shapes of the letters.

The collected data is projected onto a 2-dimensional plane using the original ISOMAP algorithm and the modified ISOMAP algorithm. Figure 13 displays the obtained model using the original ISOMAP algorithm, and Figure 14 displays the obtained model using the modified ISOMAP algorithm.

In practical application, in order to use the dimension reduction results in the recognition part, the image on the 2-dimensional plane has been dilated.

In Figure 13, the dimension reduction results display the shapes and topologies of the distributions of the sampled joint angles from the demonstrations in the latent space. As the figure shows, the shapes and topologies of the data distributions are similar to the real letters on the paper and to the movement of the end effector for writing letters in the Cartesian space. It is worth emphasizing that, even without the kinematics model that calculates the position of the end effector in the Cartesian space from the joint angles, the dimension reduction results still provide approximate descriptions of the letters on the paper.

In Figure 14, the motion models are represented in the latent space. These models are obtained using the modified ISOMAP algorithm. From the experimental results, we can identify two features of the models produced by this algorithm: (1) each trajectory does not overlap with itself; (2) each trajectory, generated automatically using the modified ISOMAP, always starts from one side and ends at the other side. The second feature is guaranteed by the definition of neighbors in the algorithm. Intuitively, because neighbors are defined as temporal neighbors, the temporal distance between the first point and the last point is the largest in the distance matrix. Therefore, the algorithm always puts the starting point and the ending point on opposite sides of the graph.

Through recognition, the knowledge models in Figure 13 are matched against the pre-learned numbers using Tesseract-OCR.

In practice, the pictures in Figure 13 must be preprocessed to be compatible with Tesseract-OCR. The preprocessing steps are as follows.

(1) Picture i is rotated 0 degrees; the result is picture i_1.
(2) Picture i is rotated 90 degrees; the result is picture i_2.
(3) Picture i is rotated 180 degrees; the result is picture i_3.
(4) Picture i is rotated 270 degrees; the result is picture i_4.
(5) Picture i is flipped horizontally; the result is picture i_5.
(6) Picture i_5 is rotated 90 degrees; the result is picture i_6.
(7) Picture i_5 is rotated 180 degrees; the result is picture i_7.
(8) Picture i_5 is rotated 270 degrees; the result is picture i_8.

The resulting eight pictures are all recognized using Tesseract-OCR. If a recognized result is included in the predefined white list {0, 1, ..., 9}, it is accepted.
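The eight orientations can be generated with a few PIL calls (a sketch of the listed steps, not the authors' implementation):

```python
from PIL import Image

def eight_orientations(picture):
    """Produce pictures i_1 ... i_8 from the preprocessing list above."""
    flipped = picture.transpose(Image.FLIP_LEFT_RIGHT)  # picture i_5
    return ([picture.rotate(a, expand=True) for a in (0, 90, 180, 270)] +
            [flipped.rotate(a, expand=True) for a in (0, 90, 180, 270)])

# Each orientation is passed to Tesseract-OCR, and a result is accepted
# only if it appears in the white list {"0", "1", ..., "9"}.
```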

If the recognized result is 6 or 9, it needs further processing. Our method is to examine the starting point of the writing: if the starting point is near the edge of the image, the number is 6; otherwise, it is 9.

Based on the recognition results, the labeled behavior models are established. Figure 15 shows an example of the models.

Upon receiving a command from a human, ISAC analyzes the command and converts it into actions with specified parameters, for example, Write(Six), as shown in Figure 16. The required motion model is obtained from the labeled behavior models, and ISAC executes the motions.

Figure 17 displays the numbers written by ISAC on the papers.

5. Discussion

In this paper, we proposed a framework for robots to learn motion models and semantic knowledge models simultaneously, using only one dataset from the demonstrations for both learning stages.

In the current imitation learning frameworks, the motion learning has been highlighted for robots to learn to complete some interesting tasks.

Some researchers are working on incorporating concepts and ideas from cognitive science into robotics research. A typical application is using cognitive architectures to implement cognitive processes or cognitive control loops for robotic control and learning [46].

If we consider whole robotic learning frameworks as hierarchical architectures, the emergent problem is that there seems to be a gap between the use of the cognitive architecture and the motion learning. As we know, reasoning and planning in a cognitive architecture are often implemented symbolically, using traditional Artificial Intelligence (AI) methods. The question, therefore, is how to connect the symbolic representations with the mathematical motion models.

In our paper, a framework is proposed in which this connection is based on natural language processing: the motion models are labeled with suitable behavior names, and the commands in the cognitive architecture are analyzed against these behavior names. In this paper, we further propose to train robots to learn the semantic (or symbolic) knowledge automatically, using the same dataset from the demonstrations, which enhances our proposed framework.

As we know, when humans see, hear, and feel the behaviors of other humans, we can relate what we see, hear, and feel to our learned procedural, episodic, and semantic knowledge. The framework proposed in this paper is inspired by this everyday cognitive work of the human brain.

6. Future Study and Conclusion

For the application of "writing," using the methods in this paper (dimension reduction with the original and modified ISOMAP algorithms and a letter recognition technology), robots can learn how to write letters and relate the motion models and knowledge models to their corresponding semantic knowledge models.

In other areas, for example, "music playing," robots could learn how to play music (hitting drums, playing guitar, playing piano) and relate the motions required to play music to their corresponding semantic knowledge models. The ISOMAP algorithms may not accomplish such learning in these areas. However, one can readily see that the tempo of hitting a drum corresponds to the tempo of moving the hands up and down. If the tempo of the drum sound can be extracted as the template and the tempo of the hand movement can be extracted as the knowledge model, this framework can also be used for robots to learn the motions of playing music and the semantic meaning of these motions simultaneously and automatically.

The crucial point in applying this framework is to find features of the motions that correspond to the inner features of the behavior that are strongly related to the semantic models.

This paper proposes a framework for robots to learn motion models and semantic knowledge models simultaneously, using one data set from the demonstrations. A modified ISOMAP algorithm is used for robots to extract semantic information from the demonstrations. The framework is implemented on a cognitive architecture, with several extensions of current algorithms. Semantic analysis of commands is also implemented in this framework. Experiments were carried out on a humanoid robot, and the experimental results demonstrate the effectiveness of this framework.