Fig. 1. 1-dimensional generic feature.

Source publication
Conference Paper
Full-text available
In this paper, we apply Reinforcement Learning (RL) to a real-world task. While complex problems have been solved by RL in simulated worlds, the costs of obtaining enough training examples often prohibit the use of plain RL in real-world scenarios. We propose three approaches to reduce training expenses for real-world RL. Firstly, we replace the r...

Contexts in source publication

Context 1
... into uniform intervals with the borders a_0, ..., a_e, where e is the total number of intervals on this dimension (the resolution). In one dimension each receptive field covers two adjacent intervals with its peak exactly on their border. For each pair of intervals there is one feature, plus two extra features at the border of the domain. Fig. 1 shows the generated features for a resolution of e = 4. For each input value there are two active features with a total activation of 1. Each feature contributes to the approximation according to its activation. This can be interpreted as the feature's responsibility for a certain input. The systematic overlapping within a generic ...
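The construction just described can be sketched in a few lines. The snippet below is a minimal illustration under our own naming, not the authors' code: each feature is a triangular hat function peaking at one border and covering the two adjacent intervals, so for any input two activations are non-zero and they sum to 1, which yields the linear interpolation mentioned above.

```python
import numpy as np

def generic_feature_1d(x, lo, hi, e):
    """Activations of a 1-D generic feature with resolution e (e + 1 hat features)."""
    borders = np.linspace(lo, hi, e + 1)   # a_0, ..., a_e
    width = (hi - lo) / e                  # length of one interval
    # each feature is a triangular (hat) function peaking at one border
    # and covering the two adjacent intervals
    return np.clip(1.0 - np.abs(x - borders) / width, 0.0, 1.0)

acts = generic_feature_1d(x=0.3, lo=0.0, hi=1.0, e=4)
print(acts, acts.sum())   # two non-zero activations that sum to 1
```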
Context 2
... whole state space is covered completely by a group of features. We call this group a generic feature. The dimensions are partitioned into uniform intervals with the borders a_0, ..., a_e, where e is the total number of intervals on this dimension (the resolution). In one dimension each receptive field covers two adjacent intervals with its peak exactly on their border. For each pair of intervals there is one feature, plus two extra features at the border of the domain. Fig. 1 shows the generated features for a resolution of e = 4. For each input value there are two active features with a total activation of 1. Each feature contributes to the approximation according to its activation. This can be interpreted as the feature's responsibility for a certain input. The systematic overlapping within a generic feature corresponds to a smooth shift of responsibility between two adjacent features. The resulting approximation is a linear interpolation of the adjacent weights.

Imitation is an important mechanism for the transfer of knowledge and skills between similar agents. An imitating agent can observe the behavior of an experienced agent and thus draw conclusions about the consequences of certain actions. In this paper, the imitating agent has full access to the experiences of a teacher. These experiences are provided as sequences of states and actions in the learning agent's representation, along with the corresponding rewards. Our imitation approach relies on evaluating these sequences according to the agent's own criteria. One possibility to do so is to apply the Q-learning algorithm to the stored sequence of actions. The imitation effect can be increased by processing the sequences in temporally inverted order; this is possible because the whole sequence of transitions and rewards is known in advance. One advantage of storing and evaluating such sequences is that the same algorithm can be applied to the agent's own experiences. This can be seen as a form of episodic memory, which allows the agent to "revive" its own experiences when their utility is easier to assess. If, for example, the agent chooses a very good action by chance but does not immediately see that it was a good action or why it was good, it can be helpful to revive that experience later in the learning process, when increased knowledge allows for a better assessment of this observation.

The evaluation of observations in temporally inverted order is related to another technique for faster information propagation in temporal-difference methods. Eligibility traces [10, 12] use the last local error to update the value not only of the current state but of recently visited states as well, so no individual local error is used to update the previous states of the stored sequence. In contrast, we calculate a new local error for each transition in the stored sequence; thus, we treat each step of the sequence as if it were the actual observation. Convergence proofs for table-based Q-learning (e.g. [13]) require that all states are visited and all actions are chosen infinitely often, but they make no assumptions about the ordering of these observations. Thus, the guarantees on convergence remain unchanged under the appropriate conditions.

The proposed concepts of function approximation, imitation, and memory are evaluated on a humanoid toy robot called RoboSapien, which has been augmented with a camera and a Pocket PC to allow for autonomous behavior, as proposed in [14].
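The imitation and episodic-memory mechanism described above can be sketched as follows. This is a minimal illustration with a tabular Q-function and hypothetical states, actions, and rewards (the paper itself uses linear function approximation with generic features): a stored episode is processed in temporally inverted order, and a fresh local TD error is computed for every transition, in contrast to eligibility traces.

```python
from collections import defaultdict

def replay_reversed(Q, episode, actions, alpha=0.1, gamma=0.99):
    """Q-learning over a stored episode, processed in temporally inverted order.

    episode: list of (state, action, reward, next_state, done) tuples,
    e.g. a teacher demonstration or a revived own experience.
    """
    for s, a, r, s_next, done in reversed(episode):
        target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])   # fresh local error per transition
    return Q

# hypothetical demonstration with made-up states and rewards
Q = defaultdict(float)
demo = [("s0", "fwd", -1, "s1", False),
        ("s1", "fwd", -1, "s2", False),
        ("s2", "fwd", 50, "goal", True)]
replay_reversed(Q, demo, actions=("fwd", "bwd", "left", "right"))
```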
The task of the agent is to dribble the ball from different positions into the empty goal. Note that dribbling is achieved by walking against the ball; there is no explicit movement to kick it. The exact task setup is taken from the scoring test defined in [14]. In this test, the robot stands on the most distant point of the center circle, facing the empty goal. The ball is placed at ten different positions on the other half of the center circle, in the robot's field of view. One advantage of adopting this existing scoring test is the possibility to compare the performance to an existing hard-coded behavior.

According to the given task, suitable state variables have to be chosen that contain all relevant information for learning a good policy. Obviously, the positions of the ball and the goal are necessary to perform the task. While the ball position can be expressed by two variables (e.g. angle and distance), this is not sufficient for the goal position: since the goal has a significant width, an additional variable is required (e.g. left post angle and right post angle instead of a single angle). These five variables unambiguously define the positions of all relevant objects and their orientation to each other.

There are many possibilities to represent this information. The previous paragraph implicitly assumes an egocentric representation, i.e. the relative positions of the ball and the goal from the agent's point of view. Another possibility is the transformation to an allocentric representation, i.e. the absolute positions of ball and robot on the field (including the robot's orientation). Further possibilities are, e.g., egocentric Cartesian coordinates (x and y) instead of polar coordinates (angle and distance). In this work, we use a combination of these representations plus additional variables such as the square roots of the distances to the ball and to the goal. In total we use 27 variables to represent the state space. This redundant representation allows the simultaneous consideration of different aspects of the situation. Each single representation is well suited for detecting some properties and similarities of situations, whereas others can hardly be distinguished; combining several representations allows us to benefit from the advantages of each. The additional state variables do not cause significant additional computational cost, since they merely provide another perspective on the same data. As our function approximation method relies on many partitionings of the state space anyway, there is no difference (concerning the number of features) between using two partitionings of the same representation and using one partitioning of two different representations.

The action space of our robot consists of four possible actions: walk forward, walk backward, turn right, and turn left, each executed for 3.2 s and separated by a short break of 0.4 s. This pause is required to ensure that the actions' effects do not depend on previous actions but only on the current state.

As explained above, we use Q-learning with linear gradient descent to learn the policy. The function approximator consists of multiple generic features (see Section 4), each of which considers only a small subset of the available state variables.
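The redundant state representation described above can be sketched as follows. The variable names are hypothetical examples of ours; the authors' actual set of 27 variables is not listed in this excerpt. The same situation is encoded egocentrically (polar and Cartesian), allocentrically, and through derived quantities such as square roots of distances.

```python
import math

def redundant_state(ball_dist, ball_angle, lpost_angle, rpost_angle,
                    robot_x, robot_y, robot_theta):
    """Combine egocentric polar, egocentric Cartesian, allocentric and
    derived variables into one redundant state dictionary."""
    return {
        # egocentric polar coordinates of ball and goal posts
        "ball_dist": ball_dist, "ball_angle": ball_angle,
        "lpost_angle": lpost_angle, "rpost_angle": rpost_angle,
        # egocentric Cartesian coordinates of the ball
        "ball_x": ball_dist * math.cos(ball_angle),
        "ball_y": ball_dist * math.sin(ball_angle),
        # allocentric pose of the robot on the field
        "robot_x": robot_x, "robot_y": robot_y, "robot_theta": robot_theta,
        # examples of derived variables, e.g. square roots of distances
        "sqrt_ball_dist": math.sqrt(ball_dist),
        "goal_center_angle": 0.5 * (lpost_angle + rpost_angle),
    }
```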
First, a minimal representation is chosen from the redundant state representation, i.e. five independent variables that unambiguously describe the positions of all relevant objects and their orientation to each other. Then a generic feature is created for every combination of two and of three dimensions out of these five. This is done for several minimal representations, resulting in a total of 140 generic features that consider only the state space. These features are important for estimating the value of the current situation; however, they do not distinguish between the different possible next actions. Hence, another 140 features are added that additionally consider the action space.

To reduce the complexity of the search space, we exploit symmetry. This technique is widely used in search and optimization problems [15, 16]. Here, we use mirroring along the horizontal axis of the field: the value of turning left is equivalent to the value of turning right in the mirrored situation, so it is sufficient to learn the value of turning right. Similarly, for walking forward or backward it is sufficient to learn the value for only one half of the state space.

We evaluate the performance of the proposed concepts as follows. The task is described as an episodic RL task. Each action incurs an immediate reward of -1 per time step; this slight punishment of each action is sometimes called the cost of the actions. Each episode has three possible outcomes with the ...
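As a side note on the feature count: each minimal representation of five variables yields C(5,2) + C(5,3) = 10 + 10 = 20 generic features, so 140 state-space features would correspond to seven such minimal representations. The mirroring trick described above can also be sketched briefly. The snippet below uses the hypothetical state fields from the earlier sketch and is not the authors' code: a "turn left" query is answered with the learned "turn right" value of the mirrored situation, so only one of the two turn actions has to be learned.

```python
def mirror_state(s):
    """Reflect a situation about the horizontal axis of the field."""
    m = dict(s)
    for key in ("ball_angle", "ball_y", "goal_center_angle", "robot_y", "robot_theta"):
        m[key] = -m[key]
    # the left and right goal posts swap roles and their angles negate
    m["lpost_angle"], m["rpost_angle"] = -s["rpost_angle"], -s["lpost_angle"]
    return m

def q_value(q_fn, state, action):
    """Reuse the mirrored twin: 'turn left' is 'turn right' in the mirrored state."""
    if action == "turn_left":
        return q_fn(mirror_state(state), "turn_right")
    return q_fn(state, action)
```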

Similar publications

Article
Full-text available
Recently, extremely humanlike robots called “androids” have been developed, some of which are already being used in the field of entertainment. In the context of psychological studies, androids are expected to be used in the future as fully controllable human stimuli to investigate human nature. In this study, we used an android to examine infant d...
Article
Full-text available
Research on humanoid robots has become more and more active in various fields. One of the purposes is to construct a coexistent and cooperative society between humanoid robots and humans. The body of a humanoid robot is similar to that of a human being, enabling it to walk on a pair of legs and to work with its hands. Therefore, humanoid robots have a...

Citations

... It can be used, for example, in basketball, handball, or soccer. Latzke et al. (2007) learned dribbling for soccer with a humanoid toy robot by "walking against the ball". The walking behavior is very simple because it only uses three motors. ...
Preprint
Recent success of machine learning in many domains has been overwhelming, which often leads to false expectations regarding the capabilities of behavior learning in robotics. In this survey, we analyze the current state of machine learning for robotic behaviors. We will give a broad overview of behaviors that have been learned and used on real robots. Our focus is on kinematically or sensorially complex robots. That includes humanoid robots or parts of humanoid robots, for example, legged robots or robotic arms. We will classify presented behaviors according to various categories and we will draw conclusions about what can be learned and what should be learned. Furthermore, we will give an outlook on problems that are challenging today but might be solved by machine learning in the future and argue that classical robotics and other approaches from artificial intelligence should be integrated more with machine learning to form complete, autonomous systems.
... Reinforcement learning has been used to learn some individual behaviors or part of an individual behavior such as pass selection [2], action selection within a setplay [16], basic individual skills [22], [23], high-level cooperation in abstract robotics MAS [24]. ...
... This avoids inconsistencies associated with human users and lets the initial policy be just close enough to the desired task but still imperfect to let the self-learning algorithm improve the model. Similarly, some authors acquired demonstrations from already working systems [28,35]. ...
... However, learning a policy might not always be possible, in which case hybrid approaches, such as obtaining information from already functional robots, can be used. In this regard, Latzke et al. transferred the knowledge from a similar robot to speed up the learning process, thus reducing the amount of own trial and error required to master a fundamental soccer skill [28]. Likewise, Nakanishi et al. leveraged the experience of another robot with successful locomotion in order to learn biped locomotion in a new system. ...
... Under these circumstances, a direct transfer of knowledge might only succeed in limited applications. Acknowledging this fact, letting the real robot practice the newly acquired skill is still essential to master a task [18,28]. In any case, simulation becomes an interesting alternative for narrowing the gap between biological systems and real robots. ...
Thesis
Full-text available
Robots are becoming a vital ingredient in society. Some of their daily tasks require dual-arm manipulation skills in the rapidly changing, dynamic and unpredictable real-world environments where they have to operate. Given the expertise of humans in conducting these activities, it is natural to study human motions and use the resulting knowledge in robotic control. With this in mind, this work leverages human knowledge to formulate a more general, real-time, and less task-specific framework for dual-arm manipulation. In particular, the proposed architecture first learns the dynamics underlying the execution of different primitive skills. These are harvested one at a time from human demonstrations, making dual-arm systems accessible to non-robotics experts. Then, the framework exploits such knowledge simultaneously and sequentially to confront complex and novel scenarios. Current works in the literature deal with the challenges arising from particular dual-arm applications in controlled environments. Thus, the novelty of this work lies in (i) learning a set of primitive skills in a one-at-a-time fashion, and (ii) endowing dual-arm systems with the ability to reuse their knowledge according to the requirements of any commanded task, as well as the surrounding environment. The potential of the proposed framework is demonstrated with several experiments involving synthetic environments and the simulated and real iCub humanoid robot. Apart from evaluating the performance and generalisation capabilities of the different primitive skills, the framework as a whole is tested with a dual-arm pick-and-place task of a parcel in the presence of unexpected obstacles. Results suggest the suitability of the method for robust and generalisable dual-arm manipulation.
... To be able to construct a learning problem and apply it in the real world, we need a robot perception algorithm that is able to calculate the pose of the robot and the position of the ball. Most of the behavior learning approaches in the literature use this level of abstraction in learning [3,10,11,13]. However, this level of abstraction requires solving complex perception problems, such as determining the robot's own pose using the features extracted from the camera image. ...
... One of the early studies that aims to learn to dribble the ball in robot soccer was done by Latzke et al. [10]. They use imitation learning to improve the performance of the reinforcement learning algorithm. ...
Preprint
Full-text available
In imitation learning, behavior learning is generally done using features extracted from the demonstration data. Recent deep learning algorithms enable the development of machine learning methods that can take high-dimensional data as input. In this work, we use imitation learning to teach the robot to dribble the ball to the goal. We use the B-Human robot software to collect demonstration data and a deep convolutional network to represent the policies. We use the robot's top and bottom camera images as input and speed commands as outputs. The CNN policy learns the mapping between the series of images and the speed commands. In experiments in a realistic 3D robotics simulator, we show that the robot is able to learn to search for the ball and dribble it, but it struggles to align with the goal. The best proposed policy model scores 4 goals out of 20 test episodes.
... Ball-dribbling is a complex behavior where a robot player attempts to maneuver the ball in a very controlled way while moving towards a desired target. Very few works have addressed ball-dribbling behavior with humanoid biped robots [5][6][7][8][9]. Furthermore, few details are mentioned in these works concerning specific dribbling modeling [10,11], performance evaluations for ball control, or the accuracy obtained relative to the desired path. ...
Conference Paper
Hierarchical task decomposition strategies allow robots and agents in general to address complex decision-making tasks. Layered learning is a hierarchical machine learning paradigm where a complex behavior is learned from a series of incrementally trained sub-tasks. This paper describes how layered learning can be applied to design individual behaviors in the context of soccer robotics. Three different layered learning strategies are implemented and analyzed using a ball-dribbling behavior as a case study. Performance indices for evaluating dribbling speed and ball-control are defined and measured. Experimental results validate the usefulness of the implemented layered learning strategies showing a trade-off between performance and learning speed.
... Since this work is more focused on the theoretical models and controllers of the gait, it does not include a final performance evaluation of the dribbling engine. On the other hand, [2] presents an approach that uses imitative reinforcement learning for dribbling the ball from different positions into the empty goal, while [3] proposes an approach that uses corrective human demonstration to augment a hand-coded ball-dribbling task performed against stationary defender robots. Since these two works do not explicitly address the dribbling behavior, few details are mentioned about the specific dribbling modeling, performance evaluations for ball control, or accuracy relative to the desired target. ...
... Although several strategies can be used to tackle the dribbling problem, we classify them into three main groups: (i) based on human experience and/or hand-coding [2,3], (ii) based on identification of the system dynamics and/or kinematics and mathematical models [1,10,11], and (iii) based on on-line learning of the system dynamics [6][7][8][9]. For developing the dribbling behavior, each of these alternatives has advantages and disadvantages: (i) is initially faster to implement but vulnerable to errors and difficult to debug and re-tune when parameters change or the system complexity increases; (ii) could be solved completely off-line by analytical or heuristic methods, since robot and ball kinematics are known, but identifying the interaction of the robot's foot, while walking, with a dynamic ball and the floor could be anfractuous; in this respect, strategies from (iii) that are capable of learning this robot-ball-floor interaction while finding an optimal policy for the ball-pushing behavior, such as RL, are a promising and attractive approach. ...
Article
In the context of humanoid robot soccer, ball dribbling is a complex and challenging behavior that requires a proper interaction of the robot with the ball and the floor. We propose a methodology for modeling this behavior by splitting it into two subproblems: alignment and ball pushing. Alignment is achieved using a fuzzy controller in conjunction with an automatic foot selector. Ball pushing is achieved using a reinforcement-learning-based controller, which learns how to keep the robot near the ball while controlling its speed when approaching and pushing the ball. Four different models for the reinforcement learning of the ball-pushing behavior are proposed and compared. The entire dribbling engine is tested using a 3D simulator and real NAO robots. Performance indices for evaluating the dribbling speed and ball control are defined and measured. The obtained results validate the usefulness of the proposed methodology, showing asymptotic convergence in around fifty training episodes and similar performance between simulated and real robots.
... distributions for actions that are possible in a state so that an agent acts according to its policy. The value of a policy is defined as a linear function of the features of the environment, and the learning agent has an estimate of the expert's feature expectations. Their algorithm is proven to attain performance close to that of the expert agent. Latzke et al. (2006) utilized Q-learning that uses the experience of an expert agent as training data. The imitating agent has full access to the experience of the expert, and these experiences are provided as sequences of states and actions along with their rewards. Price and Boutilier (2003) devised implicit imitation to accelerate r ...
Article
Full-text available
Imitation is an example of social learning in which an individual observes and copies another’s actions. This paper presents a new method for using imitation as a way of enhancing the learning speed of individual agents that employ a well-known reinforcement learning algorithm, namely Q-learning. Compared with other research that uses imitation with reinforcement learning, our method uses imitation of purely observed behaviours to enhance learning, with no internal state access or sharing of experiences between agents. The paper evaluates our imitation-enhanced reinforcement learning approach in both simulation and with real robots in continuous space. Both simulation and real robot experimental results show that the learning speed of the group is improved.
... Input from a teacher need not be limited to initial instruction. The instructor may provide additional demonstrations in later learning stages (Latzke et al., 2007; Ross et al., 2011a), which can also be used as differential feedback (Argall et al., 2008). This combination of imitation learning with reinforcement learning is sometimes termed apprenticeship learning (Abbeel and Ng, 2004) to emphasize the need for learning both from a teacher and by practice. ...
... et al. (1996); Bakker et al. (2003); Benbrahim et al. (1992); Benbrahim and Franklin (1997); Birdwell and Livingston (2007); Bitzer et al. (2010); Conn and Peters II (2007); Duan et al. (2007, 2008); Fagg et al. (1998); Gaskett et al. (2000); Gräve et al. (2010); Hafner and Riedmiller (2007); Huang and Weng (2002); Huber and Grupen (1997); Ilg et al. (1999); Katz et al. (2008); Kimura et al. (2001); Kirchner (1997); Konidaris et al. (2011a, 2012); Kroemer et al. (2009, 2010); Kwok and Fox (2004); Latzke et al. (2007); Mahadevan and Connell (1992); Matarić (1997); Morimoto and Doya (2001); Nemec et al. (2009, 2010); Oßwald et al. (2010); Paletta et al. (2007); Pendrith (1999); Platt et al. (2006); Riedmiller et al. (2009); Rottmann et al. (2007); Kaelbling (1998, 2002); Soni and Singh (2006); Tamošiūnaitė et al. (2011); Thrun (1995); Tokic et al. (2009); Touzet (1997); Uchibe et al. (1998); Wang et al. (2006); Willgoss and Iqbal (1999) ...
Article
Full-text available
Reinforcement learning offers to robotics a framework and set of tools for the design of sophisticated and hard-to-engineer behaviors. Conversely, the challenges of robotic problems provide inspiration, impact, and validation for developments in reinforcement learning. The relationship between the disciplines has sufficient promise to be likened to that between physics and mathematics. In this article, we attempt to strengthen the links between the two research communities by providing a survey of work in reinforcement learning for behavior generation in robots. We highlight both key challenges in robot reinforcement learning and notable successes. We discuss how contributions tamed the complexity of the domain and study the role of algorithms, representations, and prior knowledge in achieving these successes. As a result, a particular focus of our paper lies on the choice between model-based and model-free as well as between value-function-based and policy-search methods. By analyzing a simple problem in some detail we demonstrate how reinforcement learning approaches may be profitably applied, and we note throughout open questions and the tremendous potential for future research.
... Their algorithm is proven to attain performance close to that of the expert agent. Latzke et al. [13] utilized Q-learning that uses the experience of an expert as training data. The imitating agent has full access to the experience of the expert, and these experiences are provided as sequences of states and actions along with their rewards. ...
Conference Paper
Full-text available
Imitation, in which an individual observes and copies another's actions, is a powerful means of learning. This paper presents a way of using imitation to enhance the learning capability of individual agents. The agents employ Q-learning and we show that agents with imitation enhanced Q-learning learn faster than those with Q-learning alone.
... For example, an imitation framework based on concept learning is presented by Mobahi et al. (2007). Moreover, the application of imitative learning to soccer-playing robots is discussed in Behnke and Bennewitz (2005) and Latzke et al. (2006). ...
Article
Full-text available
Intelligent control for unidentified systems with unstable equilibria is not always a proper control strategy and results in inferior performance in many cases. Because learning initially proceeds by trial and error, the exploration for appropriate control signals can lead to instability. Although recently proposed emotional controllers are capable of learning swiftly, using these controllers alone is not an efficient solution to the mentioned instability problems. Therefore, a solution is needed to avoid instability during the preliminary phase of learning. The purpose of this paper is to propose a novel approach for controlling unstable systems, or systems with unstable equilibria, with model-free controllers. In a first step, an existing model-based controller with limited performance is used as a mentor for the emotional learning controller. This learning phase prepares the controller to control the plant as well as the mentor does, while preventing any instability. Once the emotional controller can properly imitate the behavior of the model-based one, control is gently switched from the model-based controller to the emotional controller using a Fuzzy Inference System (FIS). Likewise, the emotional stress is softly switched from the mentor-imitator output difference to a combination of the objectives. The emotional stresses are generated once by a nonlinear combination of objectives and once by feeding different stresses to an FIS that modulates them and makes a subset of these objectives salient in the current situation. The proposed model-free controller is employed to control an inverted pendulum system and an oscillator with an unstable equilibrium; it does not use any knowledge about the plant. Two test beds are used to evaluate the proposed controller: a laboratory inverted pendulum system, a well-known system with an unstable equilibrium, and Chua's circuit, an oscillator with two stable and one unstable equilibrium point. The experimental results on these two benchmarks show the superiority of the proposed imitative and emotional controller with fuzzy stress generation over the originally supplied model-based controllers and, for the pendulum system, over an emotional controller with a nonlinear stress generation unit, in all operating conditions. In summary, this paper proposes a novel approach for controlling unstable systems, or systems with unstable equilibria, by model-free controllers, based on imitative learning in the preliminary phase and a soft switch to interactive emotional learning. FISs are used to model the linguistic knowledge of the ascendancy and situated importance of the objectives and to attentionally modulate the stress signals for the emotional controller. The results on the two benchmarks reveal the efficacy of this model-free control strategy.