Figure 6 - uploaded by Richard Sutton
8-state version of the ring world. On the left is a representation of the ring world. One of the states has an observation bit of 1; all of the others are 0. There are two actions in this world: one that moves the agent clockwise (call it 'right' or just R) and one that moves the agent counter-clockwise ('left' or L). The question network on the right side of this figure is a sparse action-conditional network that can represent a solution to this world. This question network has 8 levels; at each level there is a question about action L and a question about action R.
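To make the dynamics in the caption concrete, here is a minimal Python sketch of the ring world as described; the class name, the choice of state 0 as the observation state, and the action encoding are illustrative, not taken from the paper:

    import random

    class RingWorld:
        """Minimal sketch of the n-state ring world from the caption: n states
        arranged in a ring, an observation bit that is 1 only in one state
        (state 0 here), and two actions, 'R' (clockwise) and 'L'
        (counter-clockwise)."""

        def __init__(self, n=8):
            self.n = n
            self.state = random.randrange(n)

        def step(self, action):
            # 'R' moves one state clockwise, anything else one state counter-clockwise.
            self.state = (self.state + (1 if action == 'R' else -1)) % self.n
            return 1 if self.state == 0 else 0   # the single observation bit

For example, RingWorld(8).step('R') returns the observation bit seen after one clockwise move.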

Source publication
Conference Paper
Full-text available
Temporal-difference (TD) networks have been introduced as a formalism for expressing and learning grounded world knowledge in a predictive form (Sutton & Tanner, 2005). Like conventional TD(0) methods, the learning algorithm for TD networks uses 1-step backups to train prediction units about future events. In conventional TD learning, the TD(...

Contexts in source publication

Context 1
... second experimental domain is the n-state ring world. The general structure of the ring world is shown in Figure 6. This domain is more complex than the cycle world because it has multiple actions. ...
Context 2
... experimentation, we have found a smaller question network that can represent the ring world. This question network is shown in Figure 6 and scales linearly with the number of states in the ring. In this experiment, TD(0) could not find a solution to the 8-state ring world in any of the configurations that we tried. ...
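For illustration, the sparse action-conditional question network described in the caption (two question nodes per level, so the network grows linearly with the ring size) could be enumerated as in the following sketch. The exact link targets are not given in the excerpt, so the 'target' field encodes only one plausible reading of the figure, not the paper's exact construction:

    def build_ring_question_net(n_levels=8):
        """Sketch of a sparse action-conditional question network with one 'L'
        question and one 'R' question per level.  Level-0 questions are taken
        to target the observation bit; deeper questions target a node one
        level shallower, conditioned on the action (an assumption)."""
        nodes = []
        for level in range(n_levels):
            for action in ('L', 'R'):
                target = 'obs' if level == 0 else ('q', level - 1, action)
                nodes.append({'id': ('q', level, action),
                              'condition': action,   # action the prediction is conditioned on
                              'target': target})     # what the node predicts one step ahead
        return nodes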

Citations

... Parallel off-policy learning of value functions has even been proposed as a way of learning general, policy-dependent world knowledge [8], [9], [10]. Finally, numerous ideas in machine learning rely on off-policy learning, including learning temporally abstract world models [11], predictive representations of state [12], [13], auxiliary tasks [14], lifelong learning [9], and learning from historical data [15]. Off-policy learning has also had many use cases beyond the classic online prediction case, such as in control systems and optimization [16], [17], [18]. ...
Article
Full-text available
Off-policy prediction—learning the value function for one policy from data generated while following another policy—is one of the most challenging problems in reinforcement learning. This article makes two main contributions: 1) it empirically studies 11 off-policy prediction learning algorithms with emphasis on their sensitivity to parameters, learning speed, and asymptotic error, and 2) based on the empirical results, it proposes two step-size adaptation methods called and that help the algorithm with the lowest error from the experimental study learn faster. Many off-policy prediction learning algorithms have been proposed in the past decade, but it remains unclear which algorithms learn faster than others. In this article, we empirically compare 11 off-policy prediction learning algorithms with linear function approximation on three small tasks: the Collision task, the task, and the task. The Collision task is a small off-policy problem analogous to that of an autonomous car trying to predict whether it will collide with an obstacle. The and tasks are designed such that learning fast in them is challenging. In the Rooms task, the product of importance sampling ratios can be as large as $2^{14}$. To control the high variance caused by the product of the importance sampling ratios, the step size should be set small, which, in turn, slows down learning. The task is more extreme in that the product of the ratios can become as large as $2^{14} \times 25$. The algorithms considered are Off-policy TD, five Gradient-TD algorithms, two Emphatic-TD algorithms, Vtrace, and variants of Tree Backup and ABQ that are applicable to the prediction setting. We found that the algorithms' performance is highly affected by the variance induced by the importance sampling ratios. Tree Backup, Vtrace, and ABTD are not affected by the high variance as much as other algorithms, but they restrict the effective bootstrapping parameter in a way that is too limiting for tasks where high variance is not present. We observed that Emphatic TD tends to have lower asymptotic error than other algorithms but might learn more slowly in some cases. Based on the empirical results, we propose two step-size adaptation algorithms, which we collectively refer to as the Ratchet algorithms, with the same underlying idea: keep the step-size parameter as large as possible and ratchet it down only when necessary to avoid overshoot. We show that the Ratchet algorithms are effective by comparing them with other popular step-size adaptation algorithms, such as the Adam optimizer.
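The variance the abstract refers to comes from the per-step importance sampling ratios that off-policy methods multiply into their updates. As background, here is a minimal sketch of a linear off-policy TD(0) update with such a ratio; the function and parameter names and the default constants are illustrative, not the article's code:

    import numpy as np

    def off_policy_td0_step(w, x, r, x_next, rho, alpha=0.01, gamma=0.99):
        """One linear off-policy TD(0) update with a per-step importance
        sampling ratio rho = pi(a|s) / b(a|s).  (Standard textbook update,
        shown only to make the source of variance concrete.)"""
        delta = r + gamma * np.dot(w, x_next) - np.dot(w, x)   # TD error
        return w + alpha * rho * delta * x                     # correct toward the target policy

Over multi-step traces these ratios multiply, which is why their product can reach values like $2^{14}$ in the tasks described above.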
... Learning about many different policies in parallel has long been a primary motivation for off-policy learning, and this usage suggested that perhaps prior corrections are not essential. Several approaches require learning many value functions or policies in parallel, including approaches based on option models (Sutton et al., 1999b), predictive representations of state (Littman and Sutton, 2002; Tanner and Sutton, 2005; Sutton et al., 2011), and auxiliary tasks (Jaderberg et al., 2016). In a parallel learning setting, it is natural to estimate the future reward achieved by following each target policy until termination from the states encountered during training: the value of taking excursions from the behavior policy. ...
Preprint
Full-text available
Many reinforcement learning algorithms rely on value estimation. However, the most widely used algorithms -- namely temporal difference algorithms -- can diverge under both off-policy sampling and nonlinear function approximation. Many algorithms have been developed for off-policy value estimation which are sound under linear function approximation, based on the linear mean-squared projected Bellman error (PBE). Extending these methods to the non-linear case has been largely unsuccessful. Recently, several methods have been introduced that approximate a different objective, called the mean-squared Bellman error (BE), which naturally facilitates nonlinear approximation. In this work, we build on these insights and introduce a new generalized PBE that extends the linear PBE to the nonlinear setting. We show how this generalized objective unifies previous work, including previous theory, and we obtain new bounds for the value error of the solutions of the generalized objective. We derive an easy-to-use, but sound, algorithm to minimize the generalized objective that is more stable across runs, less sensitive to hyperparameters, and performs favorably across four control domains with neural network function approximation.
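For reference, the linear PBE that the abstract generalizes is usually written as $\overline{\mathrm{PBE}}(w) = \lVert \Pi (T_\pi v_w - v_w)\rVert_D^2 = (b - Aw)^\top C^{-1} (b - Aw)$, where $A = \mathbb{E}[x_t (x_t - \gamma x_{t+1})^\top]$, $b = \mathbb{E}[R_{t+1} x_t]$, and $C = \mathbb{E}[x_t x_t^\top]$, with expectations taken under the behavior policy's state distribution. This standard linear form is included here only as background; it is not quoted from the preprint.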
... Namely, from this perspective, spiking of $N_{aff}$ should be considered over a wider temporal span and not be limited to closely related spike times [10]. In other words, spikes that participated at some earlier point in the efferent neuron's firing should be entitled to some reward, since there is a possibility that their action influenced the spike; this is called an eligibility trace [31]. Here, causality is tackled in terms of the probability of participation. ...
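For background on the borrowed term, the standard accumulating eligibility trace in TD(λ) is $e_t = \gamma\lambda\, e_{t-1} + \nabla_w \hat v(S_t, w_t)$ with the update $w_{t+1} = w_t + \alpha\, \delta_t\, e_t$, where $\delta_t = R_{t+1} + \gamma \hat v(S_{t+1}, w_t) - \hat v(S_t, w_t)$: every past state keeps a decaying trace that determines how much credit it receives for the current TD error. (This is the textbook reinforcement-learning form, not the spiking-network rule of the cited paper.)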
... First introduced in [4] and later used by Ekker to play the 5x5 Go game in [12], the TD(µ) algorithm tries to separate changes caused by bad play from changes caused by a bad evaluation function. The algorithm is based on TD(λ), presented by Sutton in [26]. Ekker used TD(µ) combined with a Resilient Backpropagation neural network to play Go to great effect. ...
Conference Paper
Full-text available
VCMI is a new, open-source project that could become one of the biggest testing platforms for modern AI algorithms in the future. Its complex environment and turn-based gameplay make it a perfect system for any AI-driven solution. It also has a large community of active players, which improves the testability of target algorithms. This paper explores VCMI's environment and tries to assess its complexity by providing a base solution for the battle handling problem using two global optimization algorithms: Co-Evolution of Genetic Programming Trees and the TD(1) algorithm with a Back Propagation neural network. Both algorithms have been used in VCMI to evolve battle strategies through a fully autonomous learning process. Finally, the obtained strategies have been tested against existing solutions and compared with players' best tactics. The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-19728-9_7
... There are two main categories of predictive state models: Temporal Difference Networks (TDNets) (Sutton and Tanner, 2004; Tanner and Sutton, 2005a) and linear predictive models, which include linear PSRs (Littman et al., 2001) and transformed PSRs (Rosencrantz et al., 2004). The primary difference between the categories is their state update mechanisms. ...
... The state update mechanism defines how the TDNet takes the state at time $t$ and the most recent action/observation (at time $t+1$) and uses them to compute the state at time $t+1$. TDNets are not constrained to use any specific form for updating state. However, in practice, the following form is commonly used (Sutton and Tanner, 2004; Tanner and Sutton, 2005b; Sutton et al., 2005; Tanner and Sutton, 2005a). Let $s_t$ be the state vector at time $t$ and let $s^{pred}_{t+1}$ be the part of the state at time $t+1$ that consists of predictions about tests. ...

... Each $s^{pred}_{t+1}$ is simply the value for the state at time $t$ given the current TDNet parameters $\theta_i$, concatenated with the indicators for the action and observation at time $t+1$. The targets $s^{pred}_{t+1}$ are computed using the TD(λ) algorithm (Algorithm 1) for learning TDNets with λ = 1, which is consistently the best value of λ (Tanner and Sutton, 2005a). ...
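As a rough sketch of the commonly used update form referred to above (my paraphrase of the cited TD-network papers, with illustrative names; not their exact code):

    import numpy as np

    def td_net_state_update(W, s_t, action_onehot, obs_onehot):
        """Sketch of a common TD-network state update: the new prediction
        vector is a logistic function of a learned linear map applied to the
        previous predictions concatenated with indicators of the latest
        action and observation."""
        x = np.concatenate([s_t, action_onehot, obs_onehot, [1.0]])  # feature vector with bias
        return 1.0 / (1.0 + np.exp(-W @ x))                          # logistic answer units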
Article
Developing general purpose algorithms for learning an accurate model of dynamical systems from example traces of the system is still a challenging research problem. Predictive State Representation (PSR) models represent the state of a dynamical system as a set of predictions about future events. Our work focuses on improving Temporal Difference Networks (TD Nets), a general class of predictive state models. We adapt the internal structure of the TD Net and we present an improved algorithm for learning a TD Net model from experience in the environment. The new algorithm accepts a set of known facts about the environment and uses those facts to accelerate the learning. These facts can come from another learning algorithm (as in this study) or from a designer's prior knowledge about the environment. Experiments demonstrate that using the new structure and learning algorithm improves the accuracy of the TD Net models. When tested in an in finite environment, our new algorithm outperforms all of the standard PSR learning algorithms.
... Therefore, the TD algorithm learns to predict the future instantaneous observations or rewards [24]. When the TD method uses a single-step backup for future predictions, it is known as TD(0) [2, 5]. ...
... Alternatively, the notation TD(λ) is used when eligibility traces are present [19], [24]-[26]. ...
Thesis
Full-text available
Reinforcement learning is a method for learning sequential control actions or decisions using an instantaneous reward signal which implicitly defines a long-term value function. It has been proposed to solve complex learning control problems without requiring explicit knowledge of the system's dynamics. Moreover, it has also been used as a model of cognitive learning in humans and applied to systems, such as humanoid robots, to study embodied cognition. However, there are relatively few results which describe the actual performance of such learning algorithms, even on relatively simple problems. In this thesis, simple test problems are used to investigate issues associated with the value function's representation and parametric convergence. In particular, the terminal convergence problem is analyzed with a known optimal (bang-bang) control policy, where the aim is to accurately learn the value function. For certain initial conditions, the closed-form solution for the value function is calculated, and it is shown to have a polynomial form. It is parameterized by terms which are functions of the unknown plant's parameters and the value function's discount factor, and their convergence properties are analyzed. It is shown that the temporal difference error introduces a null space associated with the finite horizon basis function during the experiment. This is only non-singular when the experiment is terminated correctly, and a number of (equivalent) solutions are described. It is also demonstrated that, in general, the test problem's dynamics are chaotic for random initial states, and this causes a digital offset in the value function. Methods for estimating the offset are described, and a dead-zone is proposed to switch off learning in the chaotic region. Another value function estimation test problem is then proposed which uses a saturated piecewise linear control signal. This is a more realistic control scenario, and it is also shown to address the chaotic dynamics problem. It is shown that the conditioning of the learning problem depends on both the saturation threshold and the value function's discount factor, and that a badly conditioned learning problem may result. Moreover, it is proved that the temporal difference error introduces a trajectory null space associated with the differenced higher-order bases up to the saturation threshold of the saturated piecewise linear control signal. These results are then used to explain the behaviour of reinforcement learning algorithms when higher-order systems are used, and the impact of function approximation algorithms and exploration noise is discussed. Finally, a central pattern generator based reinforcement learning algorithm is applied to a single leg of a robot, where the goal is to generate appropriate control signals for each joint.
... Given a trace initialized at time step $t-k$ whose action conditions over the last $k$ time steps have been satisfied, the weights $W$ are updated at time step $t$ using the temporal-difference information $y_t - y_{t-1}$ and the past input vector $x_{t-k}$ to improve the past prediction $y_{t-k}$, with the update scaled appropriately by $\lambda^{t-k-1}$. We have only presented essential notation and intuition here, and refer the reader to Tanner and Sutton (2005) for the full details of the algorithm. In the following section we present our modified TD(λ) algorithm, which allows for continuous observations and actions. ...
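One possible reading of that update, written as a sketch (the data layout, the handling of the λ scaling, and all names are assumptions on my part; see Tanner and Sutton (2005) for the actual algorithm):

    import numpy as np

    def update_open_traces(W, traces, y_t, y_prev, alpha=0.1, lam=0.9):
        """Illustrative reading of the quoted update: every open trace stores
        (age k, past input x, index i of the prediction it was opened for)
        and, while its action conditions remain satisfied, receives the
        current temporal-difference information y_t[i] - y_prev[i], scaled
        by lam raised to the trace's age minus one.  Not the paper's code."""
        for k, x_past, i in traces:
            td = y_t[i] - y_prev[i]                        # temporal-difference information
            W[i] += alpha * (lam ** max(k - 1, 0)) * td * x_past
        return W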
... [Pseudo-code fragment from Figure 3 of the citing paper omitted.] Figure 3: Pseudo-code for the TD(λ) network learning algorithm for continuous dynamical systems. Modified from Tanner and Sutton (2005). ...
... Each observation emitted by the system was corrupted by mean-zero Gaussian noise with standard deviation 0.05. This is essentially a noisy, continuous analog of the cycle world presented in Tanner and Sutton (2005). We let n = 4, and σ = 0.3 for this experiment. ...
Article
Temporal-difference (TD) networks are a class of predictive state representations that use well-established TD methods to learn models of partially observable dynamical systems. Previous research with TD networks has dealt only with dynamical systems with finite sets of observations and actions. We present an algorithm for learning TD network representations of dynamical systems with continuous observations and actions. Our results show that the algorithm is capable of learning accurate and robust models of several noisy continuous dynamical systems. The algorithm presented here is the first fully incremental method for learning a predictive representation of a continuous dynamical system.
... Generating the learning examples: We build learning examples for the probability tree very similarly to the TD(1) learning scheme described in [3]. At a very high level, TD(1) generates target values for predictions by looking as far into the future as possible. ...
... TD(1) will avoid the second source of noise when possible. Experiments reported in [3] show that TD(1) gives the best learning performance for a logistic regression approach too. ...
... We compare the original logistic regression approach with the probability trees approach. To this end, we implemented our own version of the original TD(λ) learning algorithm as described in [3]. We performed experiments in two different environments: a ring world and a simple grid world. ...
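As a reminder of what TD(1)-style targets look like (with λ = 1 the target is the eventually observed outcome rather than the next bootstrapped prediction), here is a minimal sketch under that standard interpretation; it is not the scheme from [3] itself:

    def build_targets(predictions, final_outcome, lam=1.0):
        """Sketch contrasting TD(1) and TD(0) targets for a sequence of
        predictions about a single eventual outcome: with lam = 1 every time
        step is paired with the outcome actually observed at the end (a
        Monte-Carlo-style target); with lam = 0 each step bootstraps from the
        next prediction."""
        targets = []
        for t in range(len(predictions)):
            if lam == 1.0 or t == len(predictions) - 1:
                targets.append(final_outcome)
            else:
                targets.append(predictions[t + 1])
        return targets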
Conference Paper
Full-text available
State representation for intelligent agents is a continuous challenge as the need for abstraction is unavoidable in large state spaces. Predictive representations offer one way to obtain state abstraction by replacing a state with a set of predictions about future interactions with the world. One such formalism is the Temporal-Difference Networks framework (2). It splits the representation of knowledge into the question network and the answer network. The question network defines which questions (interactions) about future experience are of interest. It contains nodes, each corresponding to a single scalar prediction about a future observation given a certain sequence of interactions with the environment. The nodes are connected by links, annotated with action-labels, which represent temporal relationships between the predictions made by the nodes, conditioned on the action-labels on the links (more details in (2)). The answer network provides the predictive models to update the answers to the defined questions, which are expected values of the scalar quantities in the nodes. These values can be seen as estimates of probabilities. With each executed action of the agent, the predictions are updated using the answer network models to obtain a description of the new state. In classical TD-networks, logistic regression models are used, whose weight vector is obtained using a gradient learning approach. We propose the use of probability-valued decision trees (1) in the answer network of TD-Nets. We believe that decision trees are a particularly good choice to investigate, as they offer a different yet powerful form of generalization. Moreover, this aids in a better understanding of the strengths and weaknesses of TD-Nets and represents an important first step towards using them in worlds with more extensive observations. Furthermore, decision tree induction can be regarded as a prototypical example of a non-gradient learning approach.
... The proposed temporally extended concepts in this paper can be utilized to build the structure of temporal-difference (TD) networks [9]. TD networks are a formalism for expressing and learning grounded knowledge about dynamical systems [12]. A TD network is a network of nodes, each representing a single scalar prediction. ...
Conference Paper
In this paper, we propose a novel approach whereby a reinforcement learning agent attempts to understand its environment via meaningful temporally extended concepts in an unsupervised way. Our approach is inspired by findings in neuroscience on the role of mirror neurons in action-based abstraction. Since there are many cases in which the best decision cannot be made just by using instant sensory data, in this study we seek to develop a framework for learning temporally extended concepts from sequences of sensory-action data. To direct the agent to gather fertile information for concept learning, a reinforcement learning mechanism utilizing the agent's experience is proposed. Experimental results demonstrate the capability of the proposed approach in retrieving meaningful concepts from the environment. The concepts, and the way they are defined, are designed so that they can not only be applied to ease decision making but also be utilized in other applications, as elaborated in the paper. Keywords: Concept learning, Markov decision processes, mirror neurons, reinforcement learning.
Preprint
Learning continually and online from a continuous stream of data is challenging, especially for a reinforcement learning agent with sequential data. When the environment only provides observations giving partial information about the state of the environment, the agent must learn the agent state based on the data stream of experience. We refer to the state learned directly from the data stream of experience as the agent state. Recurrent neural networks can learn the agent state, but the training methods are computationally expensive and sensitive to the hyper-parameters, making them ill-suited to online learning. This work introduces methods based on the generate-and-test approach to learn the agent state. A generate-and-test algorithm searches for state features by generating features and testing their usefulness. In this process, features useful for the agent's performance on the task are preserved, and the least useful features are replaced with newly generated features. We study the effectiveness of our methods on two online multi-step prediction problems. The first problem, trace conditioning, focuses on the agent's ability to remember a cue for a prediction multiple steps into the future. In the second problem, trace patterning, the agent needs to learn patterns in the observation signals and remember them for future predictions. We show that our proposed methods can effectively learn the agent state online and produce accurate predictions.
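A minimal sketch of one generate-and-test maintenance step as described in the abstract (all names and the replacement fraction are illustrative assumptions, not the paper's API):

    def generate_and_test_step(features, utility, make_feature, replace_frac=0.05):
        """One maintenance step of the generate-and-test idea: keep features
        that have proven useful, replace the least useful fraction with newly
        generated candidates."""
        n_replace = max(1, int(replace_frac * len(features)))
        worst = sorted(range(len(features)), key=lambda i: utility[i])[:n_replace]
        for i in worst:
            features[i] = make_feature()   # freshly generated candidate feature
            utility[i] = 0.0               # reset its usefulness estimate
        return features, utility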