Abstract

Many studies have been conducted on the application of reinforcement learning (RL) to robots. A general-purpose robot has redundant sensors or actuators because it is difficult to predict in advance the environment the robot will face and the task it must execute. In this case, the Q-space used in RL contains redundancy, so the robot needs much time to learn a given task. In this study, we focus on the importance of sensors with regard to a robot's performance of a particular task. The sensors that are useful for a task differ from task to task. By using the importance of the sensors, we adjust the number of states of each sensor and reduce the size of the Q-space. In this paper, we define the measure of the importance of a sensor for a task as the correlation between the value of the sensor and the reward. The robot calculates the importance of its sensors and reduces the size of the Q-space accordingly. We propose a method that reduces the learning space and construct a learning system by embedding it in RL. We confirm the effectiveness of our proposed system with an experimental robot.

1. Introduction

In recent years, reinforcement learning (RL) [1] has been actively studied, and many studies on its application to robots have been conducted [2–4]. A matter of concern in RL is the learning time. In RL, information from sensors is projected onto a state space. A robot learns the correspondence between each state in the state space and an action and determines the best correspondence. When the state space expands with the number of sensors, the number of correspondences the robot must learn also increases. In addition, the robot needs considerable experience in each state to perform a task. Therefore, learning the best correspondences becomes time-consuming.

To overcome this problem, many studies have investigated accelerating RL [5–15], for which there are two approaches: a multirobot system and autonomous construction of the state space. In the former approach, multiple robots exchange experience information [5–9], so that each robot augments its own knowledge. Therefore, in such a system, robots can find the best correspondence between each state and action faster than an individual robot in a single-robot system. In addition, Nishi et al. [10] proposed a learning method in which a robot learns behavior by observing the behavior of other robots and constructs its own relationships between state and behavior. However, in this approach, a robot needs other robots with which to exchange experience information, and hence, if there are no additional robots in the system, this approach becomes irrelevant. We focus on the state construction of a single robot.

In contrast to the above, in the approach that applies autonomous state space construction [11–16], a single robot is sufficient. The robot constructs a suitable state space based on its experience. Moreover, it can reduce the state space and learn the correspondences faster. However, in studies on this approach, all the sensors installed in the robot were considered equally important, and each sensor had the same number of states. The installed sensors, which influence how well a robot executes a task, can be divided into important and unimportant sensors according to the task to be performed. Nevertheless, in this approach, the robot has to learn using unnecessary inputs because all the sensors are treated as equally important. For example, Takahashi et al. [16] proposed a method in which the state space is constructed autonomously by incremental state segmentation. In this method, states are divided based on the continuity of the sensor data. Ishiguro et al. [12] proposed a state construction method using empirically obtained perceivers (EOPs). These methods do not focus on the importance of each sensor for performing a task. In fact, although the sensors installed on a robot have varying levels of importance in terms of performing a task, few studies have focused on this aspect. We focus on the importance of a sensor for a particular task and propose a novel, efficient learning method.

In this paper, we propose a system in which the robot constructs a temporary Q-space for decision making based on which sensors are considered important for the execution of a particular task, which facilitates high-speed learning. Since very important sensors significantly affect the performance of a task, they should sense the environment in detail; thus, the number of their states is increased. On the other hand, since less important sensors have little effect on the performance of a task, they may sense the environment less precisely; thus, the number of their states is decreased. The number of states is therefore determined based on the importance of the sensors. In this study, the importance of a sensor is defined as the correlation between the sensor value and the reward.

A temporary Q-space is constructed from the Q-space of the robot based on the importance of the sensors. The number of states of the original Q-space is the maximum number that the sensors can describe. The Q-space is reduced by merging Q-values according to the number of states assigned to each sensor. Using the reduced Q-space, the robot can select an action efficiently: because the Q-values along dimensions of low importance are merged, each decision is based on more information. Therefore, the robot can learn the correspondences from fewer experiences.

This system is effective for a variety of tasks. When this method is implemented, the amount of information that the robot requires in order to learn the correspondences is reduced. As a result, when our proposed system is applied, a robot can learn the correspondences faster than when ordinary RL is applied.

2. Concept of Importance of Sensors

To select the sensors that are important for a certain task, a robot needs to measure the importance of each sensor. We focus on the correlation between the sensor value and the reward, which is specific to each task, as the measure of the importance of each sensor for a task. For example, in a garbage collection task, a robot is expected to approach a garbage heap and lift it. For this task, the reward is expressed in terms of the distance between the robot and the garbage heap and increases as the robot moves closer to the heap. This implies that there is a correlation between the reward and the distance between the robot and the garbage heap.

We show in Figure 1 an outline of the determination of the importance of the sensors where two types of sensor are installed on the robot. Via its sensors, the robot recognizes its environment, which is expressed as a group of all the sensor values. The robot collects the sensor value of each sensor and the reward of the task to be performed and then determines the correlation between them. In Figure 1, the robot conducts this determination for sensors 1 and 2. The robot estimates the importance of the sensors according to the two types of correlation between the sensor value and reward: negative and positive.
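To make this idea concrete, the following sketch (an illustrative example, not code from the proposed system) logs synthetic sensor readings and rewards and computes the correlation coefficient between each sensor and the reward; the data and identifiers are hypothetical.

```python
import numpy as np

# Hypothetical log: 200 time steps of two sensor readings and the reward.
# Sensor 1 measures the distance to the goal (it tracks the reward);
# sensor 2 is unrelated noise.
rng = np.random.default_rng(0)
distance_to_goal = rng.uniform(0.0, 1.0, size=200)
unrelated_sensor = rng.uniform(0.0, 1.0, size=200)
reward = 1.0 - distance_to_goal + rng.normal(0.0, 0.05, size=200)

for name, values in [("sensor 1 (distance to goal)", distance_to_goal),
                     ("sensor 2 (unrelated)", unrelated_sensor)]:
    corr = np.corrcoef(values, reward)[0, 1]
    print(f"{name}: correlation with reward = {corr:+.2f}")
```

A sensor whose readings vary with the reward yields a correlation coefficient far from zero, whereas an unrelated sensor yields a coefficient near zero; this is the signal the proposed method exploits.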

Very important sensors strongly affect the performance of a task. Therefore, they should sense the environment in detail, and the number of their states is increased. On the other hand, less important sensors have little effect on the performance of a task; they may therefore sense the environment less precisely, and the number of their states is decreased. The number of states is thus determined based on the importance of the sensors.

3. Decision Making in Reinforcement Learning Using a Q-Space Based on the Importance of Sensors

3.1. Outline of the Proposed System

We show our proposed system in Figure 2. In the figure, the proposed system is divided into two stages. The first stage constitutes the proposed method whereby the robot determines the importance of its sensors. The next stage is RL.

In the first stage, the robot calculates the importance of its sensors for a task based on the correlation between each sensor value and the reward. The robot first collects each sensor value and the reward and then calculates the coefficient of the correlation between them. Finally, it determines the important sensors based on this correlation coefficient.

In RL, the robot learns the actions that are suitable for each state. This stage consists of an action evaluation element, wherein a temporary Q-space is constructed based on the determination of the important sensors, and an action selection element. In the action evaluation element, each state-action pair is evaluated and updated. A state comprises the values of all the sensors. The robot constructs a temporary Q-space by adding to it only those sensors that have been determined to be important. In the action selection element, the robot selects the action for the state recognized by its sensors based on the temporary Q-space.

We show the workflow of our proposed system in Figure 3. This workflow is executed by the robot for each action. We define this flow as one trial.

3.2. Determination of the Importance of Sensors

In the proposed method, the robot determines the importance of its sensors based on the correlation between each sensor value and the reward it experiences. The robot stores each state's sensor values and averaged reward in a list called the knowledge list, an example of which is shown in Figure 4. When the robot identifies a state that is not in the list, it adds the state to the list, calculates the averaged reward, and records it. On the other hand, when a recognized state is already in the list, the robot recalculates the averaged reward and updates the list.

A state $s_i$ is defined as in (1), where $i$ is the state ID in Figure 4, $x_j$ is the value of sensor $j$, and $X_j$ is the group of values that sensor $j$ can take:

$$s_i = (x_1, x_2, \ldots, x_n), \qquad x_j \in X_j. \tag{1}$$

In this study, the rewards for the states experienced by the robot are averaged with weights. Weighted averaging gives a greater weight to more recently obtained rewards. The averaged reward in state $s_i$ is denoted by $\bar{r}_i$ and is updated as in (2), where $r_t$ is the reward obtained by the robot at time $t$.
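A minimal sketch of how such a knowledge list might be maintained is given below; the dictionary layout and the weighting parameter `lam` are assumptions, since (2) is only characterized as giving a greater weight to recently obtained rewards.

```python
# Knowledge list: maps a state (a tuple of sensor values) to its averaged reward.
knowledge_list = {}

def update_knowledge(state, reward, lam=0.3):
    """Store or update the recency-weighted average reward of a state.

    `lam` (assumed) controls how strongly recent rewards dominate the average.
    """
    if state not in knowledge_list:      # state recognized for the first time
        knowledge_list[state] = reward
    else:                                # blend the old average with the new reward
        knowledge_list[state] = (1.0 - lam) * knowledge_list[state] + lam * reward

# Example: a state observed through two distance sensors (values in mm).
update_knowledge((210, 490), reward=0.8)
update_knowledge((210, 490), reward=0.6)
print(knowledge_list)                    # {(210, 490): 0.74}
```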

The robot calculates the importance of each sensor based on the knowledge list. We use a multiple regression equation to calculate each sensor's level of importance. In the multiple regression, each partial regression coefficient represents the importance of the corresponding sensor: a sensor with a larger regression coefficient has a higher importance level.

The multiple regression equation is defined by (3), where $a_1, \ldots, a_n$ are the regression coefficients of the sensors, $\bar{r}_i$ is the averaged reward in state $s_i$, and $a_0$ is the constant term:

$$\bar{r}_i = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n + a_0. \tag{3}$$

The robot needs to calculate the regression coefficient of each sensor to obtain its importance. The regression coefficients $a_1, \ldots, a_n$ are the solution of the simultaneous equations (4). Here, $\sigma_{jk}$, calculated by (5), is the covariance of sensor $j$ and sensor $k$, and $\sigma_{jj}$, calculated by (6), is the variance of sensor $j$. The averages $\bar{x}_j$ and $\bar{x}_k$ of the sensor values are calculated by (7) and are used in (5) and (6). $\sigma_{jr}$, calculated by (8), is the covariance of sensor $j$ and the averaged reward, and $\bar{R}$, calculated by (9), is the mean of the averaged rewards.
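The following sketch obtains the partial regression coefficients by solving the least-squares normal equations with NumPy, which is one standard way to realize the calculation described by (4)-(9); the function and variable names are ours.

```python
import numpy as np

def sensor_importance(states, avg_rewards):
    """Estimate each sensor's importance as its partial regression coefficient.

    states      : (m, n) array, one row per knowledge-list entry, one column per sensor
    avg_rewards : (m,)   array, averaged reward of each entry
    Returns the n regression coefficients (the constant term is omitted).
    """
    states = np.asarray(states, dtype=float)
    avg_rewards = np.asarray(avg_rewards, dtype=float)
    x_centered = states - states.mean(axis=0)      # deviations from the sensor means
    r_centered = avg_rewards - avg_rewards.mean()  # deviations from the mean reward
    cov_xx = x_centered.T @ x_centered             # sensor-sensor covariances (up to scale)
    cov_xr = x_centered.T @ r_centered             # sensor-reward covariances (up to scale)
    return np.linalg.solve(cov_xx, cov_xr)         # normal equations of least squares

# Example with two sensors: only the first one tracks the reward.
states = [(1, 5), (2, 3), (3, 9), (4, 1), (5, 7)]
rewards = [0.1, 0.2, 0.3, 0.4, 0.5]
print(sensor_importance(states, rewards))          # -> approximately [0.1, 0.0]
```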

3.3. Determination of the Number of States of Each Sensor

The number of states of the temporary Q-space is determined by the number of states of each sensor, which is in turn determined from the regression coefficient of that sensor. When the absolute value of the regression coefficient of a sensor is higher, the number of states of the sensor is increased.

We use the variation property shown in Figure 5 to determine the number of states from the regression coefficient. When the absolute value of the regression coefficient of a sensor is less than the threshold $c_{\min}$, the number of states is its minimum value of 1; when it is greater than the threshold $c_{\max}$, the number of states is its maximum value $N_j$; and when it is between $c_{\min}$ and $c_{\max}$, the number of states increases gradually. The parameters $c_{\min}$ and $c_{\max}$ are set by a human. The property is formulated in (10), where $M_j$ is the number of states of sensor $j$.

Here, $N_j$ is determined by the performance of the sensor. We focus on the resolution and measurement range as the performance of the sensor: when the resolution and range of a sensor are high, the robot can describe more states with it. $N_j$ is the number of states that the sensor can describe, calculated from its resolution and range as in (11), where $R^{\max}_j$ is the maximum range of sensor $j$, $R^{\min}_j$ is the minimum range of sensor $j$, and $\rho_j$ is the resolution of sensor $j$; $N_j$ is calculated for each sensor. We define each of these $N_j$ states of a sensor as a "state unit."
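One possible implementation of this mapping is sketched below; the thresholds `c_min` and `c_max`, the linear interpolation between them, and the rounding are assumptions, since (10) and the property in Figure 5 are described only qualitatively here.

```python
import math

def max_state_number(range_max, range_min, resolution):
    """N_j: the number of states a sensor can describe, from its range and resolution."""
    return int((range_max - range_min) / resolution)

def number_of_states(coef, c_min, c_max, n_max):
    """M_j: the number of states assigned to a sensor from its regression coefficient.

    Below c_min the sensor keeps a single state; above c_max it keeps all n_max
    states; in between we assume a linear increase (the exact shape is not
    reproduced from Figure 5).
    """
    c = abs(coef)
    if c < c_min:
        return 1
    if c > c_max:
        return n_max
    return max(1, math.ceil((c - c_min) / (c_max - c_min) * n_max))

# Example: a sensor with an 11-state maximum (e.g., a 770 mm range and 70 mm resolution).
n_max = max_state_number(770, 0, 70)                 # -> 11
print(number_of_states(0.05, 0.1, 0.5, n_max))       # unimportant sensor -> 1 state
print(number_of_states(0.80, 0.1, 0.5, n_max))       # important sensor   -> 11 states
```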

3.4. Construction of a Temporary Q-Space for Action Selection

The robot constructs a temporary Q-space based on the number of states $M_j$ of each sensor. The temporary Q-space consists of merged states, each of which is obtained by merging several state units. We show an example of a merged state space in Figure 6. In this example, $N_j$ is 9, so the number of state units is 9. When $M_j$ is 3, the temporary Q-space is constructed of three states, each obtained by merging three state units. This example focuses on one sensor; the state units of all the installed sensors are merged according to their respective $M_j$.

Here, when the number of states of sensor $j$ is $M_j$, the number of state units in a merged state is calculated by (12).

The Q-values of the state units in each merged target are averaged, as shown in Figure 7, which depicts an example of a temporary Q-space for two sensors, each with six state units. For one sensor, each merged state consists of three state units; for the other sensor, each merged state consists of two state units. The Q-value of a merged state is the average of the Q-values of its state units.

The robot selects an action based on the temporary Q-space for the current state. It recognizes the current state as a state unit $s$. The merged target group of the current state unit is, for each sensor, the set of state units that are merged together with $s$ along that sensor's dimension. The Q-value of a merged target in the temporary Q-space is defined by (13), using the total reward of the merged state units, defined by (14), where $R(s, a)$ is the total reward obtained at state $s$ and action $a$, and the total number of experiences of the merged state-action pairs, defined by (15), where $N(s, a)$ is the number of experiences of the pair $(s, a)$. Merging is performed for all the actions in state $s$.
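The sketch below shows one way to realize this merging for a single sensor dimension, under the assumption that the robot keeps the total reward and the visit count of every state-unit/action pair; the grouping of consecutive state units and all identifiers are illustrative.

```python
from collections import defaultdict

# Assumed bookkeeping per (state_unit, action): total reward and number of experiences.
total_reward = defaultdict(float)
visit_count = defaultdict(int)

def merged_group(state_unit, units_per_state):
    """State units merged together with `state_unit` when each merged state
    spans `units_per_state` consecutive state units."""
    start = (state_unit // units_per_state) * units_per_state
    return range(start, start + units_per_state)

def temporary_q(state_unit, action, units_per_state):
    """Q-value of the merged state containing `state_unit` (assumed form of (13)-(15))."""
    group = merged_group(state_unit, units_per_state)
    r = sum(total_reward[(s, action)] for s in group)
    n = sum(visit_count[(s, action)] for s in group)
    return r / n if n > 0 else 0.0

# Example: 9 state units merged 3 at a time; state unit 4 falls in the group {3, 4, 5}.
for s, rew in [(3, 1.0), (4, 0.5), (5, 0.0)]:
    total_reward[(s, "forward")] += rew
    visit_count[(s, "forward")] += 1
print(temporary_q(4, "forward", units_per_state=3))   # -> 0.5
```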

3.5. Action Selection

The robot selects an action based on the temporary Q-space. We apply the ε-greedy method for action selection. This method selects the action that has the highest Q-value in the current state unit; however, with probability ε it selects an action at random.
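A minimal ε-greedy selector over the temporary Q-space might look as follows; the dictionary-based representation of the Q-values is an assumption.

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """Select the action with the highest Q-value for the current (merged) state,
    but explore with probability epsilon.

    q_values : dict mapping action -> Q-value for the current state unit
    """
    if random.random() < epsilon:
        return random.choice(actions)                            # exploratory action
    return max(actions, key=lambda a: q_values.get(a, 0.0))      # greedy action

# Example
print(epsilon_greedy({"forward": 0.8, "left": 0.2}, ["forward", "left", "right"]))
```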

3.6. Action Evaluation

We apply a weighted averaging method for action evaluation. This method evaluates actions by assigning a greater weight to rewards recently obtained by the robot. When the current state unit of the robot is $s$ and the selected action is $a$, the Q-value $Q(s, a)$ is updated by (16), where $\alpha$ is a step size parameter ($0 < \alpha \le 1$).
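A sketch of such a recency-weighted update is given below; since (16) is not reproduced here, the exponential form with step size `alpha` is an assumption.

```python
def update_q(q_table, state_unit, action, reward, alpha=0.1):
    """Recency-weighted update of Q(state_unit, action): newer rewards receive
    a weight of alpha (0 < alpha <= 1), older ones decay geometrically."""
    old = q_table.get((state_unit, action), 0.0)
    q_table[(state_unit, action)] = old + alpha * (reward - old)

# Example
q = {}
update_q(q, state_unit=4, action="forward", reward=1.0)
update_q(q, state_unit=4, action="forward", reward=0.0)
print(q)                                  # {(4, 'forward'): 0.09}
```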

4. Experiment to Confirm the Effectiveness of the Proposed System

4.1. Outline of the Experiment

In this section, we describe the evaluation of the effectiveness of the proposed system using an experimental robot. The experimental environment is shown in Figure 8. The environment is surrounded by walls of length 1100 mm. We prepared an experimental robot, shown in Figure 9. The robot has two distance sensors, which measure the distance between the current position of the robot and the walls. It recognizes the current reading of a sensor as the state value $d$, as shown in Figure 10. The range of each sensor is divided into 11 states of 70 mm each, and each state is given a state value. Thus, in this experiment, each sensor had 11 states and the total number of states was 121. The robot has omniwheels and can move forward, backward, left, right, and in each diagonal direction, but it cannot turn. It moves 70 mm in one action when not moving diagonally; when it moves in a diagonal direction, it first moves laterally and then longitudinally.

In the experiment, the task of the robot was to move close to wall A. The robot could obtain rewards according to its distance from wall A only. It was placed at the lower right corner of the environment, and an episode was considered complete when it reached wall A. The experiment was concluded after a fixed number of episodes.

This task seems simple at first glance; however, it is difficult for an RL robot. When the robot strays into an area it has not experienced, its action selection becomes random because the Q-values of the states in that area are still at their initial values. The robot therefore strays further by repeatedly selecting random actions and takes time to estimate the Q-values of those states and to leave the area. In addition, the task was performed by a real experimental robot, so it is possible that sensor noise affected the learning, for example, through erroneous state recognition.

In real-world scenarios, a robot is rarely required to move in a maze-like environment with many obstacles; most scenarios involve open spaces, such as a warehouse or a park. Therefore, the environment we adopted is appropriate for this experiment.

We confirmed the effectiveness of the proposed system by comparing it with conventional RL. We therefore prepared two types of robot for comparison: one to which the proposed system was applied and one to which conventional RL was applied. For the robot using conventional RL, the number of states of each sensor was the maximum, 11. We compared these agents in terms of the total reward obtained after the episodes were completed.

4.2. Experimental Setup

In this section, we explain the reward for the task and discuss the parameter settings. In this task, the robot obtains a higher reward as the distance between its current position and wall A decreases, as defined in (17), where $d$ is the state value shown in Figure 10; $d$ is determined from the actual measured distance between the current position and wall A.

A list of the parameter settings for this experiment is given in Table 1. In this experiment, the maximum range, minimum range, and resolution were the same for both sensors; thus, $N_j$ was 11 for each sensor according to (11). When the robot started a new episode, the Q-space and the state knowledge list from the previous episodes were carried over.

4.3. Experimental Results

The experimental results are shown in Figures 11–15. Figure 11 shows the importance of the sensors at the final action of each episode. The importance of the sensors is represented by the regression coefficients. In the first episode, the regression coefficients of the sensors converge in the early phase of learning. The regression coefficient of sensor A is greater than the threshold $c_{\max}$, whereas the regression coefficient of sensor B is smaller than the threshold $c_{\min}$. In this task, only wall A is related to the reward, and its importance is high. It is therefore reasonable that the regression coefficient of the wall A sensor is high and that of the wall B sensor is low.

Figure 12 shows the number of states of each sensor at the final action of each episode. The number of states of sensor A is the maximum, 11, and the number of states of sensor B is the minimum, 1. Using these results, the robot can construct a correct temporary Q-space.

Figures 13 and 14 show the importance of the sensors and the number of states at each action in the first episode, respectively. Until the 30th action, the importance of the sensors is unstable because the robot has insufficient knowledge to calculate the regression coefficients correctly. After the 31st action, the robot has sufficient knowledge and can therefore calculate the regression coefficients correctly. The instability in the number of states of the sensors until the 30th action is likewise caused by the insufficient knowledge available for calculating the importance of the sensors.

Figure 15 shows the total number of actions in each episode. The number of actions of the robot using the proposed method converges faster than that of the robot using conventional RL. This is because the robot using conventional RL strays into areas for which it has no experience and takes time to leave them, whereas this does not occur with the proposed method. When the robot using the proposed method strays into an area for which it has no experience, it can reuse Q-values learned in other states through the temporary Q-space, which is constructed from only the important sensors. Therefore, the robot does not need to spend time learning in the unexperienced area; it focuses only on the important sensors and selects a suitable action using this Q-space. These results show that the proposed method is effective for learning.

5. Conclusion

In this paper, we proposed a method in which a robot selects an action using a temporary Q-space based on the importance of its sensors. The method assumes that there is a correlation between the sensor values and the reward. The robot calculates the regression coefficients using a multiple regression equation relating the sensor values to the reward and determines the importance of its sensors according to these coefficients. The higher the level of importance, the larger the number of states assigned to the sensor. To select an action, the robot constructs a temporary Q-space based on the importance of the sensors and then selects actions based on this temporary Q-space. Thus, the robot is able to learn faster.

We examined the effectiveness of the proposed system using an experimental robot. We investigated a task for which only one sensor of a sensor pair was important and compared the proposed system with conventional RL, in which both sensors use the same number of states.

The results showed that, using the proposed system, the robot could calculate the importance of the sensors correctly. In addition, the convergence speed was higher than that of conventional RL. Thus, we confirmed the effectiveness of the constructed system and the proposed method.

In future studies, we will first examine the effectiveness of the proposed system by comparing it with other autonomous state construction systems; in this study, we compared it only with normal reinforcement learning, and it is necessary to compare it with the methods proposed in related studies. We will then modify the proposed method. Currently, it cannot be applied to delayed-reward tasks, because the regression coefficients are calculated using the immediate reward in each state; when the reward is delayed, the robot cannot calculate the importance of its sensors. Therefore, we will modify the proposed method so that it can be applied to delayed-reward tasks by using information that does not include the reward.