Fig. 1. 1-dimensional generic feature.

Source publication
Conference Paper
Full-text available
In this paper, we apply Reinforcement Learning (RL) to a real-world task. While complex problems have been solved by RL in simulated worlds, the costs of obtaining enough training examples often prohibit the use of plain RL in real-world scenarios. We propose three approaches to reduce training expenses for real-world RL. Firstly, we replace the r...

Contexts in source publication

Context 1
... into uniform intervals with the borders a_0, ..., a_e, where e is the total number of intervals on this dimension (the resolution). In one dimension each receptive field covers two adjacent intervals with its peak exactly on their border. For each pair of intervals there is one feature, plus two extra features at the border of the domain. Fig. 1 shows the generated features for a resolution of e = 4. For each input value there are two active features with a total activation of 1. Each feature contributes to the approximation according to its activation. This can be interpreted as the feature's responsibility for a certain input. The systematic overlapping within a generic ...
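The construction just described can be sketched in a few lines. The snippet below is a minimal illustration under our own naming, not the authors' code: each feature is a triangular hat function peaking at one border and covering the two adjacent intervals, so for any input two activations are non-zero and they sum to 1, which yields the linear interpolation mentioned above.

```python
import numpy as np

def generic_feature_1d(x, lo, hi, e):
    """Activations of a 1-D generic feature with resolution e (e + 1 hat features)."""
    borders = np.linspace(lo, hi, e + 1)   # a_0, ..., a_e
    width = (hi - lo) / e                  # length of one interval
    # each feature is a triangular (hat) function peaking at one border
    # and covering the two adjacent intervals
    return np.clip(1.0 - np.abs(x - borders) / width, 0.0, 1.0)

acts = generic_feature_1d(x=0.3, lo=0.0, hi=1.0, e=4)
print(acts, acts.sum())   # two non-zero activations that sum to 1
```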
Context 2
... whole state space is covered completely by a group of features. We call this group a generic feature. The dimensions are partitioned into uniform intervals with the borders a_0, ..., a_e, where e is the total number of intervals on this dimension (the resolution). In one dimension each receptive field covers two adjacent intervals with its peak exactly on their border. For each pair of intervals there is one feature, plus two extra features at the border of the domain. Fig. 1 shows the generated features for a resolution of e = 4. For each input value there are two active features with a total activation of 1. Each feature contributes to the approximation according to its activation. This can be interpreted as the feature's responsibility for a certain input. The systematic overlapping within a generic feature corresponds to a smooth shift of responsibility between two adjacent features. The resulting approximation is a linear interpolation of the adjacent weights.

Imitation is an important mechanism for the transfer of knowledge and skills between similar agents. An imitating agent can observe the behavior of an experienced agent and thus draw conclusions about the consequences of certain actions. In this paper, the imitating agent has full access to the experiences of a teacher. These experiences are provided as sequences of states and actions in the learning agent's representation, along with the corresponding rewards. Our imitation approach relies on evaluating these sequences according to the agent's own criteria. One possibility to do so is to apply the Q-learning algorithm to the stored sequence of actions. The imitation effect can be increased by processing the sequences in temporally inverted order; this is possible because the whole sequence of transitions and rewards is known in advance. One advantage of storing and evaluating such sequences is that the same algorithm can be applied to the agent's own experiences. This can be seen as a form of episodic memory, which allows the agent to "revive" its own experiences when their utility is easier to assess. If, for example, the agent chooses a very good action by chance but does not immediately see that it was a good action or why it was good, it can be helpful to revive that experience later in the learning process, when increased knowledge allows for a better assessment of this observation.

The evaluation of observations in temporally inverted order is related to another technique for faster information propagation in temporal-difference methods. Eligibility traces [10, 12] use the last local error to update the value not only of the current state but of recently visited states as well, so no individual local error is used to update the previous states of the stored sequence. In contrast, we calculate a new local error for each transition in the stored sequence; thus, we treat each step of the sequence as if it were the actual observation. Convergence proofs for table-based Q-learning (e.g. [13]) require that all states are visited and all actions are chosen infinitely often, but they make no assumptions about the ordering of these observations. Thus, the guarantees on convergence remain unchanged under the appropriate conditions.

The proposed concepts of function approximation, imitation, and memory are evaluated on a humanoid toy robot called RoboSapien, which has been augmented with a camera and a Pocket PC to allow for autonomous behavior, as proposed in [14].
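The imitation and episodic-memory mechanism described above can be sketched as follows. This is a minimal illustration with a tabular Q-function and hypothetical states, actions, and rewards (the paper itself uses linear function approximation with generic features): a stored episode is processed in temporally inverted order, and a fresh local TD error is computed for every transition, in contrast to eligibility traces.

```python
from collections import defaultdict

def replay_reversed(Q, episode, actions, alpha=0.1, gamma=0.99):
    """Q-learning over a stored episode, processed in temporally inverted order.

    episode: list of (state, action, reward, next_state, done) tuples,
    e.g. a teacher demonstration or a revived own experience.
    """
    for s, a, r, s_next, done in reversed(episode):
        target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])   # fresh local error per transition
    return Q

# hypothetical demonstration with made-up states and rewards
Q = defaultdict(float)
demo = [("s0", "fwd", -1, "s1", False),
        ("s1", "fwd", -1, "s2", False),
        ("s2", "fwd", 50, "goal", True)]
replay_reversed(Q, demo, actions=("fwd", "bwd", "left", "right"))
```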
The task of the agent is to dribble the ball from different positions into the empty goal. Note that dribbling is achieved by walking against the ball; there is no explicit movement to kick it. The exact task setup is taken from the scoring test defined in [14]. In this test, the robot stands on the most distant point of the center circle, facing the empty goal. The ball is placed at ten different positions on the other half of the center circle, in the robot's field of view. One advantage of adopting this existing scoring test is the possibility to compare the performance to an existing hard-coded behavior.

According to the given task, suitable state variables have to be chosen that contain all relevant information for learning a good policy. Obviously, the positions of the ball and the goal are necessary to perform the task. While the ball position can be expressed by two variables (e.g. angle and distance), this is not sufficient for the goal position: since the goal has a significant width, an additional variable is required (e.g. left post angle and right post angle instead of a single angle). These five variables unambiguously define the positions of all relevant objects and their orientation to each other.

There are many possibilities to represent this information. The previous paragraph implicitly assumes an egocentric representation, i.e. the relative positions of the ball and the goal from the agent's point of view. Another possibility is the transformation to an allocentric representation, i.e. the absolute positions of ball and robot on the field (including the robot's orientation). Further possibilities are, e.g., egocentric Cartesian coordinates (x and y) instead of polar coordinates (angle and distance). In this work, we use a combination of these representations plus additional variables such as the square roots of the distances to the ball and to the goal. In total we use 27 variables to represent the state space. This redundant representation allows the simultaneous consideration of different aspects of the situation. Each single representation is well suited for detecting some properties and similarities of situations, whereas others can hardly be distinguished; combining several representations allows us to benefit from the advantages of each. The additional state variables do not cause significant additional computational cost, since they merely provide another perspective on the same data. As our function approximation method relies on many partitionings of the state space anyway, there is no difference (concerning the number of features) between using two partitionings of the same representation and using one partitioning of two different representations.

The action space of our robot consists of four possible actions: walk forward, walk backward, turn right, and turn left, each executed for 3.2 s and separated by a short break of 0.4 s. This pause is required to ensure that the actions' effects do not depend on previous actions but only on the current state.

As explained above, we use Q-learning with linear gradient descent to learn the policy. The function approximator consists of multiple generic features (see Section 4), each of which considers only a small subset of the available state variables.
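The redundant state representation described above can be sketched as follows. The variable names are hypothetical examples of ours; the authors' actual set of 27 variables is not listed in this excerpt. The same situation is encoded egocentrically (polar and Cartesian), allocentrically, and through derived quantities such as square roots of distances.

```python
import math

def redundant_state(ball_dist, ball_angle, lpost_angle, rpost_angle,
                    robot_x, robot_y, robot_theta):
    """Combine egocentric polar, egocentric Cartesian, allocentric and
    derived variables into one redundant state dictionary."""
    return {
        # egocentric polar coordinates of ball and goal posts
        "ball_dist": ball_dist, "ball_angle": ball_angle,
        "lpost_angle": lpost_angle, "rpost_angle": rpost_angle,
        # egocentric Cartesian coordinates of the ball
        "ball_x": ball_dist * math.cos(ball_angle),
        "ball_y": ball_dist * math.sin(ball_angle),
        # allocentric pose of the robot on the field
        "robot_x": robot_x, "robot_y": robot_y, "robot_theta": robot_theta,
        # examples of derived variables, e.g. square roots of distances
        "sqrt_ball_dist": math.sqrt(ball_dist),
        "goal_center_angle": 0.5 * (lpost_angle + rpost_angle),
    }
```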
First, a minimal representation is chosen from the redundant state representation, i.e. five independent variables that unambiguously describe the positions of all relevant objects and their orientation to each other. Then a generic feature is created for every combination of two and of three dimensions out of these five. This is done for several minimal representations, resulting in a total of 140 generic features that consider only the state space. These features are important for estimating the value of the current situation; however, they do not distinguish between the different possible next actions. Hence, another 140 features are added that additionally consider the action space.

To reduce the complexity of the search space, we exploit symmetry. This technique is widely used in search and optimization problems [15, 16]. Here, we use mirroring along the horizontal axis of the field: the value of turning left is equivalent to the value of turning right in the mirrored situation, so it is sufficient to learn the value of turning right. Similarly, for walking forward or backward it is sufficient to learn the value for only one half of the state space.

We evaluate the performance of the proposed concepts as follows. The task is described as an episodic RL task. Each action incurs an immediate reward of -1 per time step; this slight punishment of each action is sometimes called the cost of the actions. Each episode has three possible outcomes with the ...
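As a side note on the feature count: each minimal representation of five variables yields C(5,2) + C(5,3) = 10 + 10 = 20 generic features, so 140 state-space features would correspond to seven such minimal representations. The mirroring trick described above can also be sketched briefly. The snippet below uses the hypothetical state fields from the earlier sketch and is not the authors' code: a "turn left" query is answered with the learned "turn right" value of the mirrored situation, so only one of the two turn actions has to be learned.

```python
def mirror_state(s):
    """Reflect a situation about the horizontal axis of the field."""
    m = dict(s)
    for key in ("ball_angle", "ball_y", "goal_center_angle", "robot_y", "robot_theta"):
        m[key] = -m[key]
    # the left and right goal posts swap roles and their angles negate
    m["lpost_angle"], m["rpost_angle"] = -s["rpost_angle"], -s["lpost_angle"]
    return m

def q_value(q_fn, state, action):
    """Reuse the mirrored twin: 'turn left' is 'turn right' in the mirrored state."""
    if action == "turn_left":
        return q_fn(mirror_state(state), "turn_right")
    return q_fn(state, action)
```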

Similar publications

Article
Full-text available
Recently, extremely humanlike robots called “androids” have been developed, some of which are already being used in the field of entertainment. In the context of psychological studies, androids are expected to be used in the future as fully controllable human stimuli to investigate human nature. In this study, we used an android to examine infant d...
Article
Full-text available
Research on humanoid robots has become more and more active in various fields. One of the purposes is to construct a coexistent and cooperative society between humanoid robots and humans. The body of a humanoid robot is similar to that of a human being, enabling it to walk on a pair of legs and to work with its hands. Therefore, humanoid robots have a...

Citations

... It can be used, for example, in basketball, handball, or soccer. Latzke et al. (2007) learned dribbling for soccer with a humanoid toy robot by "walking against the ball". The walking behavior is very simple because it only uses three motors. ...
Preprint
Recent success of machine learning in many domains has been overwhelming, which often leads to false expectations regarding the capabilities of behavior learning in robotics. In this survey, we analyze the current state of machine learning for robotic behaviors. We will give a broad overview of behaviors that have been learned and used on real robots. Our focus is on kinematically or sensorially complex robots. That includes humanoid robots or parts of humanoid robots, for example, legged robots or robotic arms. We will classify presented behaviors according to various categories and we will draw conclusions about what can be learned and what should be learned. Furthermore, we will give an outlook on problems that are challenging today but might be solved by machine learning in the future and argue that classical robotics and other approaches from artificial intelligence should be integrated more with machine learning to form complete, autonomous systems.
... Reinforcement learning has been used to learn some individual behaviors or part of an individual behavior such as pass selection [2], action selection within a setplay [16], basic individual skills [22], [23], high-level cooperation in abstract robotics MAS [24]. ...
... This avoids inconsistencies associated with human users and lets the initial policy be just close enough to the desired task but still imperfect to let the self-learning algorithm improve the model. Similarly, some authors acquired demonstrations from already working systems [28,35]. ...
... However, learning a policy might not always be possible, in which case hybrid approaches, such as obtaining information from already functional robots, can be used. In this regard, Latzke et al. transferred the knowledge from a similar robot to speed up the learning process, thus reducing the amount of own trial and error required to master a fundamental soccer skill [28]. Likewise, Nakanishi et al. leveraged the experience of another robot with successful locomotion in order to learn biped locomotion in a new system. ...
... Under these circumstances, a direct transfer of knowledge might only succeed in limited applications. Acknowledging this fact, letting the real robot practice the newly acquired skill is still essential to master a task [18,28]. In any case, simulation becomes an interesting alternative for narrowing the gap between biological systems and real robots. ...
Thesis
Full-text available
Robots are becoming a vital ingredient in society. Some of their daily tasks require dual-arm manipulation skills in the rapidly changing, dynamic and unpredictable real-world environments where they have to operate. Given the expertise of humans in conducting these activities, it is natural to study human motions and use the resulting knowledge in robotic control. With this in mind, this work leverages human knowledge to formulate a more general, real-time, and less task-specific framework for dual-arm manipulation. In particular, the proposed architecture first learns the dynamics underlying the execution of different primitive skills. These are harvested one at a time from human demonstrations, making dual-arm systems accessible to non-robotics experts. Then, the framework exploits such knowledge simultaneously and sequentially to confront complex and novel scenarios. Current works in the literature deal with the challenges arising from particular dual-arm applications in controlled environments. Thus, the novelty of this work lies in (i) learning a set of primitive skills in a one-at-a-time fashion, and (ii) endowing dual-arm systems with the ability to reuse their knowledge according to the requirements of any commanded task, as well as the surrounding environment. The potential of the proposed framework is demonstrated with several experiments involving synthetic environments and the simulated and real iCub humanoid robot. Apart from evaluating the performance and generalisation capabilities of the different primitive skills, the framework as a whole is tested with a dual-arm pick-and-place task of a parcel in the presence of unexpected obstacles. Results suggest the suitability of the method for robust and generalisable dual-arm manipulation.
... To be able to construct a learning problem and apply it in the real world, we need a robot perception algorithm that is able to calculate the pose of the robot and the position of the ball. Most of the behavior learning approaches in the literature use this level of abstraction in learning [3,10,11,13]. However, this level of abstraction requires solving complex perception problems, such as determining the robot's own pose using the features extracted from the camera image. ...
... One of the early studies that aims to learn to dribble the ball in robot soccer was done by Latzke et al. [10]. They use imitation learning to improve the performance of the reinforcement learning algorithm. ...
Preprint
Full-text available
In imitation learning, behavior learning is generally done using features extracted from the demonstration data. Recent deep learning algorithms enable the development of machine learning methods that can take high-dimensional data as input. In this work, we use imitation learning to teach the robot to dribble the ball to the goal. We use the B-Human robot software to collect demonstration data and a deep convolutional network to represent the policies. We use the robot's top and bottom camera images as input and speed commands as outputs. The CNN policy learns the mapping between the series of images and the speed commands. In experiments in a realistic 3D robotics simulator, we show that the robot is able to learn to search for the ball and dribble it, but it struggles to align with the goal. The best proposed policy model scores 4 goals out of 20 test episodes.
... Ball-dribbling is a complex behavior where a robot player attempts to maneuver the ball in a very controlled way while moving towards a desired target. Very few works have addressed ball-dribbling behavior with humanoid biped robots [5][6][7][8][9]. Furthermore, few details are mentioned in these works concerning specific dribbling modeling [10,11], performance evaluations for ball control, or the accuracy obtained relative to the desired path. ...
Conference Paper
Hierarchical task decomposition strategies allow robots and agents in general to address complex decision-making tasks. Layered learning is a hierarchical machine learning paradigm where a complex behavior is learned from a series of incrementally trained sub-tasks. This paper describes how layered learning can be applied to design individual behaviors in the context of soccer robotics. Three different layered learning strategies are implemented and analyzed using a ball-dribbling behavior as a case study. Performance indices for evaluating dribbling speed and ball-control are defined and measured. Experimental results validate the usefulness of the implemented layered learning strategies showing a trade-off between performance and learning speed.
... Since this work is more focused on the theoretical models and controllers of the gait, it does not include a final performance evaluation of the dribbling engine. On the other hand, [2] presents an approach that uses imitative reinforcement learning for dribbling the ball from different positions into the empty goal, while [3] proposes an approach that uses corrective human demonstration to augment a hand-coded ball-dribbling task performed against stationary defender robots. Since these two works do not explicitly address the dribbling behavior, few details are mentioned about the specific dribbling modeling, performance evaluations for ball control, or accuracy relative to the desired target. ...
... Although several strategies can be used to tackle the dribbling problem, we classify them into three main groups: (i) based on human experience and/or hand-coding [2,3], (ii) based on identification of the system dynamics and/or kinematics and mathematical models [1,10,11], and (iii) based on on-line learning of the system dynamics [6][7][8][9]. For developing the dribbling behavior, each of these alternatives has advantages and disadvantages: (i) is initially faster to implement but vulnerable to errors and difficult to debug and re-tune when parameters change or the system complexity increases; (ii) could be solved completely off-line by analytical or heuristic methods, since robot and ball kinematics are known, but identifying the interaction of the robot's foot, while walking, with a dynamic ball and the floor could be anfractuous; in this respect, strategies from (iii) that are capable of learning this robot-ball-floor interaction while finding an optimal policy for the ball-pushing behavior, such as RL, are a promising and attractive approach. ...
Article
In the context of humanoid robot soccer, ball dribbling is a complex and challenging behavior that requires a proper interaction of the robot with the ball and the floor. We propose a methodology for modeling this behavior by splitting it into two subproblems: alignment and ball pushing. Alignment is achieved using a fuzzy controller in conjunction with an automatic foot selector. Ball pushing is achieved using a reinforcement-learning-based controller, which learns how to keep the robot near the ball while controlling its speed when approaching and pushing the ball. Four different models for the reinforcement learning of the ball-pushing behavior are proposed and compared. The entire dribbling engine is tested using a 3D simulator and real NAO robots. Performance indices for evaluating the dribbling speed and ball control are defined and measured. The obtained results validate the usefulness of the proposed methodology, showing asymptotic convergence in around fifty training episodes and similar performance between simulated and real robots.
... distributions for actions that are possible in a state so that an agent acts according to its policy. The value of a policy is defined as a linear function of the features of the environment, and the learning agent has an estimate of the expert's feature expectations. Their algorithm is proven to attain performance close to that of the expert agent. Latzke et al. (2006) utilized Q-learning that uses the experience of an expert agent as training data. The imitating agent has full access to the experience of the expert, and these experiences are provided as sequences of states and actions along with their rewards. Price and Boutilier (2003) devised implicit imitation to accelerate r ...
Article
Full-text available
Imitation is an example of social learning in which an individual observes and copies another’s actions. This paper presents a new method for using imitation as a way of enhancing the learning speed of individual agents that employ a well-known reinforcement learning algorithm, namely Q-learning. Compared with other research that uses imitation with reinforcement learning, our method uses imitation of purely observed behaviours to enhance learning, with no internal state access or sharing of experiences between agents. The paper evaluates our imitation-enhanced reinforcement learning approach in both simulation and with real robots in continuous space. Both simulation and real robot experimental results show that the learning speed of the group is improved.
... Input from a teacher need not be limited to initial instruction. The instructor may provide additional demonstrations in later learning stages (Latzke et al., 2007; Ross et al., 2011a), which can also be used as differential feedback (Argall et al., 2008). This combination of imitation learning with reinforcement learning is sometimes termed apprenticeship learning (Abbeel and Ng, 2004) to emphasize the need for learning both from a teacher and by practice. ...
... et al. (1996); Bakker et al. (2003); Benbrahim et al. (1992); Benbrahim and Franklin (1997); Birdwell and Livingston (2007); Bitzer et al. (2010); Conn and Peters II (2007); Duan et al. (2007, 2008); Fagg et al. (1998); Gaskett et al. (2000); Gräve et al. (2010); Hafner and Riedmiller (2007); Huang and Weng (2002); Huber and Grupen (1997); Ilg et al. (1999); Katz et al. (2008); Kimura et al. (2001); Kirchner (1997); Konidaris et al. (2011a, 2012); Kroemer et al. (2009, 2010); Kwok and Fox (2004); Latzke et al. (2007); Mahadevan and Connell (1992); Matarić (1997); Morimoto and Doya (2001); Nemec et al. (2009, 2010); Oßwald et al. (2010); Paletta et al. (2007); Pendrith (1999); Platt et al. (2006); Riedmiller et al. (2009); Rottmann et al. (2007); Kaelbling (1998, 2002); Soni and Singh (2006); Tamošiūnaitė et al. (2011); Thrun (1995); Tokic et al. (2009); Touzet (1997); Uchibe et al. (1998); Wang et al. (2006); Willgoss and Iqbal (1999) ...
Article
Full-text available
Reinforcement learning offers to robotics a framework and set of tools for the design of sophisticated and hard-to-engineer behaviors. Conversely, the challenges of robotic problems provide inspiration, impact, and validation for developments in reinforcement learning. The relationship between the disciplines has sufficient promise to be likened to that between physics and mathematics. In this article, we attempt to strengthen the links between the two research communities by providing a survey of work in reinforcement learning for behavior generation in robots. We highlight both key challenges in robot reinforcement learning and notable successes. We discuss how contributions tamed the complexity of the domain and study the role of algorithms, representations, and prior knowledge in achieving these successes. As a result, a particular focus of our paper lies on the choice between model-based and model-free as well as between value-function-based and policy-search methods. By analyzing a simple problem in some detail we demonstrate how reinforcement learning approaches may be profitably applied, and we note throughout open questions and the tremendous potential for future research.
... Their algorithm is proven to attain performance close to that of the expert agent. Latzke et al. [13] utilized Q-learning that uses the experience of an expert as training data. The imitating agent has full access to the experience of the expert, and these experiences are provided as sequences of states and actions along with their rewards. ...
Conference Paper
Full-text available
Imitation, in which an individual observes and copies another's actions, is a powerful means of learning. This paper presents a way of using imitation to enhance the learning capability of individual agents. The agents employ Q-learning and we show that agents with imitation enhanced Q-learning learn faster than those with Q-learning alone.
... For example, an imitation framework based on concept learning is presented by Mobahi et al. (2007). Moreover, the application of imitative learning to soccer-playing robots is discussed in Behnke and Bennewitz (2005) and Latzke et al. (2006). ...
Article
Full-text available
Intelligent control for unidentified systems with unstable equilibria is not always a proper control strategy and results in inferior performance in many cases. Because learning initially proceeds by trial and error, the exploration for appropriate control signals can lead to instability. Although recently proposed emotional controllers are capable of learning swiftly, using these controllers alone is not an efficient solution to the mentioned instability problems. Therefore, a solution is needed to avoid instability during the preliminary phase of learning. The purpose of this paper is to propose a novel approach for controlling unstable systems, or systems with unstable equilibria, with model-free controllers. In a first step, an existing model-based controller with limited performance is used as a mentor for the emotional learning controller. This learning phase prepares the controller to control the plant as well as the mentor does, while preventing any instability. Once the emotional controller can properly imitate the behavior of the model-based one, control is gently switched from the model-based controller to the emotional controller using a Fuzzy Inference System (FIS). Likewise, the emotional stress is softly switched from the mentor-imitator output difference to a combination of the objectives. The emotional stresses are generated once by a nonlinear combination of objectives and once by feeding different stresses to an FIS that modulates them and makes a subset of these objectives salient in the current situation. The proposed model-free controller is employed to control an inverted pendulum system and an oscillator with an unstable equilibrium; it does not use any knowledge about the plant. Two test beds are used to evaluate the proposed controller: a laboratory inverted pendulum system, a well-known system with an unstable equilibrium, and Chua's circuit, an oscillator with two stable and one unstable equilibrium point. The experimental results on these two benchmarks show the superiority of the proposed imitative and emotional controller with fuzzy stress generation over the originally supplied model-based controllers and, for the pendulum system, over an emotional controller with a nonlinear stress generation unit, in all operating conditions. In summary, this paper proposes a novel approach for controlling unstable systems, or systems with unstable equilibria, by model-free controllers, based on imitative learning in the preliminary phase and a soft switch to interactive emotional learning. FISs are used to model the linguistic knowledge of the ascendancy and situated importance of the objectives and to attentionally modulate the stress signals for the emotional controller. The results on the two benchmarks reveal the efficacy of this model-free control strategy.