Figure 6 - uploaded by Richard Sutton
8-state version of the ring world. On the left is a representation of the ring world. One of the states has an observation bit of 1; all of the others are 0. There are two actions in this world: one that moves the agent clockwise (call it 'right' or just R) and one that moves the agent counter-clockwise ('left' or L). The question network on the right side of this figure is a sparse action-conditional network that can represent a solution to this world. This question network has 8 levels; at each level there is a question about action L and a question about action R.
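To make the dynamics in the caption concrete, here is a minimal Python sketch of the ring world as described; the class name, the choice of state 0 as the observation state, and the action encoding are illustrative, not taken from the paper:

    import random

    class RingWorld:
        """Minimal sketch of the n-state ring world from the caption: n states
        arranged in a ring, an observation bit that is 1 only in one state
        (state 0 here), and two actions, 'R' (clockwise) and 'L'
        (counter-clockwise)."""

        def __init__(self, n=8):
            self.n = n
            self.state = random.randrange(n)

        def step(self, action):
            # 'R' moves one state clockwise, anything else one state counter-clockwise.
            self.state = (self.state + (1 if action == 'R' else -1)) % self.n
            return 1 if self.state == 0 else 0   # the single observation bit

For example, RingWorld(8).step('R') returns the observation bit seen after one clockwise move.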

Source publication
Conference Paper
Full-text available
Temporal-difference (TD) networks have been introduced as a formalism for expressing and learning grounded world knowledge in a predictive form (Sutton & Tanner, 2005). Like conventional TD(0) methods, the learning algorithm for TD networks uses 1-step backups to train prediction units about future events. In conventional TD learning, the TD(...

Contexts in source publication

Context 1
... second experimental domain is the n-state ring world. The general structure of the ring world is shown in Figure 6. This domain is more complex than the cycle world because it has multiple actions. ...
Context 2
... experimentation, we have found a smaller question network that can represent the ring world. This question network is shown in Figure 6 and scales linearly with the number of states in the ring. In this experiment, TD(0) could not find a solution to the 8-state ring world in any of the configurations that we tried. ...
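For illustration, the sparse action-conditional question network described in the caption (two question nodes per level, so the network grows linearly with the ring size) could be enumerated as in the following sketch. The exact link targets are not given in the excerpt, so the 'target' field encodes only one plausible reading of the figure, not the paper's exact construction:

    def build_ring_question_net(n_levels=8):
        """Sketch of a sparse action-conditional question network with one 'L'
        question and one 'R' question per level.  Level-0 questions are taken
        to target the observation bit; deeper questions target a node one
        level shallower, conditioned on the action (an assumption)."""
        nodes = []
        for level in range(n_levels):
            for action in ('L', 'R'):
                target = 'obs' if level == 0 else ('q', level - 1, action)
                nodes.append({'id': ('q', level, action),
                              'condition': action,   # action the prediction is conditioned on
                              'target': target})     # what the node predicts one step ahead
        return nodes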

Citations

... Parallel off-policy learning of value functions has even been proposed as a way of learning general, policy-dependent world knowledge [8], [9], [10]. Finally, numerous ideas in machine learning rely on off-policy learning, including learning temporally abstract world models [11], predictive representations of state [12], [13], auxiliary tasks [14], lifelong learning [9], and learning from historical data [15]. Off-policy learning has also had many use cases beyond the classic online prediction case, such as in control systems and optimization [16], [17], [18]. ...
Article
Full-text available
Off-policy prediction—learning the value function for one policy from data generated while following another policy—is one of the most challenging problems in reinforcement learning. This article makes two main contributions: 1) it empirically studies 11 off-policy prediction learning algorithms with emphasis on their sensitivity to parameters, learning speed, and asymptotic error, and 2) based on the empirical results, it proposes two step-size adaptation methods called and that help the algorithm with the lowest error from the experimental study learn faster. Many off-policy prediction learning algorithms have been proposed in the past decade, but it remains unclear which algorithms learn faster than others. In this article, we empirically compare 11 off-policy prediction learning algorithms with linear function approximation on three small tasks: the Collision task, the task, and the task. The Collision task is a small off-policy problem analogous to that of an autonomous car trying to predict whether it will collide with an obstacle. The and tasks are designed such that learning fast in them is challenging. In the Rooms task, the product of importance sampling ratios can be as large as $2^{14}$. To control the high variance caused by the product of the importance sampling ratios, the step size should be set small, which, in turn, slows down learning. The task is more extreme in that the product of the ratios can become as large as $2^{14} \times 25$. The algorithms considered are Off-policy TD, five Gradient-TD algorithms, two Emphatic-TD algorithms, Vtrace, and variants of Tree Backup and ABQ that are applicable to the prediction setting. We found that the algorithms' performance is highly affected by the variance induced by the importance sampling ratios. Tree Backup, Vtrace, and ABTD are not affected by the high variance as much as other algorithms, but they restrict the effective bootstrapping parameter in a way that is too limiting for tasks where high variance is not present. We observed that Emphatic TD tends to have lower asymptotic error than other algorithms but might learn more slowly in some cases. Based on the empirical results, we propose two step-size adaptation algorithms, which we collectively refer to as the Ratchet algorithms, with the same underlying idea: keep the step-size parameter as large as possible and ratchet it down only when necessary to avoid overshoot. We show that the Ratchet algorithms are effective by comparing them with other popular step-size adaptation algorithms, such as the Adam optimizer.
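The variance the abstract refers to comes from the per-step importance sampling ratios that off-policy methods multiply into their updates. As background, here is a minimal sketch of a linear off-policy TD(0) update with such a ratio; the function and parameter names and the default constants are illustrative, not the article's code:

    import numpy as np

    def off_policy_td0_step(w, x, r, x_next, rho, alpha=0.01, gamma=0.99):
        """One linear off-policy TD(0) update with a per-step importance
        sampling ratio rho = pi(a|s) / b(a|s).  (Standard textbook update,
        shown only to make the source of variance concrete.)"""
        delta = r + gamma * np.dot(w, x_next) - np.dot(w, x)   # TD error
        return w + alpha * rho * delta * x                     # correct toward the target policy

Over multi-step traces these ratios multiply, which is why their product can reach values like $2^{14}$ in the tasks described above.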
... Learning about many different policies in parallel has long been a primary motivation for off-policy learning, and this usage suggested that perhaps prior corrections are not essential. Several approaches require learning many value functions or policies in parallel, including approaches based on option models (Sutton et al., 1999b), predictive representations of state (Littman and Sutton, 2002; Tanner and Sutton, 2005; Sutton et al., 2011), and auxiliary tasks (Jaderberg et al., 2016). In a parallel learning setting, it is natural to estimate the future reward achieved by following each target policy until termination from the states encountered during training: the value of taking excursions from the behavior policy. ...
Preprint
Full-text available
Many reinforcement learning algorithms rely on value estimation. However, the most widely used algorithms -- namely temporal difference algorithms -- can diverge under both off-policy sampling and nonlinear function approximation. Many algorithms have been developed for off-policy value estimation which are sound under linear function approximation, based on the linear mean-squared projected Bellman error (PBE). Extending these methods to the non-linear case has been largely unsuccessful. Recently, several methods have been introduced that approximate a different objective, called the mean-squared Bellman error (BE), which naturally facilitates nonlinear approximation. In this work, we build on these insights and introduce a new generalized PBE that extends the linear PBE to the nonlinear setting. We show how this generalized objective unifies previous work, including previous theory, and we obtain new bounds for the value error of the solutions of the generalized objective. We derive an easy-to-use, but sound, algorithm to minimize the generalized objective that is more stable across runs, less sensitive to hyperparameters, and performs favorably across four control domains with neural network function approximation.
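For reference, the linear PBE that the abstract generalizes is usually written as $\overline{\mathrm{PBE}}(w) = \lVert \Pi (T_\pi v_w - v_w)\rVert_D^2 = (b - Aw)^\top C^{-1} (b - Aw)$, where $A = \mathbb{E}[x_t (x_t - \gamma x_{t+1})^\top]$, $b = \mathbb{E}[R_{t+1} x_t]$, and $C = \mathbb{E}[x_t x_t^\top]$, with expectations taken under the behavior policy's state distribution. This standard linear form is included here only as background; it is not quoted from the preprint.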
... Namely, from this perspective, spiking of $N_{aff}$ should be considered over a wider temporal span and not be limited to closely related spike times [10]. In other words, spikes that participated at some earlier point in the efferent neuron's firing should be entitled to some reward, since there is a possibility that their action influenced the spike; this is called an eligibility trace [31]. Here, causality is tackled in terms of the probability of participation. ...
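For background on the borrowed term, the standard accumulating eligibility trace in TD(λ) is $e_t = \gamma\lambda\, e_{t-1} + \nabla_w \hat v(S_t, w_t)$ with the update $w_{t+1} = w_t + \alpha\, \delta_t\, e_t$, where $\delta_t = R_{t+1} + \gamma \hat v(S_{t+1}, w_t) - \hat v(S_t, w_t)$: every past state keeps a decaying trace that determines how much credit it receives for the current TD error. (This is the textbook reinforcement-learning form, not the spiking-network rule of the cited paper.)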
... First introduced in [4] and later used by Ekker to play the 5x5 Go game in [12], the TD(µ) algorithm tries to separate changes caused by bad play from changes caused by a bad evaluation function. The algorithm is based on TD(λ), presented by Sutton in [26]. Ekker used TD(µ) combined with a Resilient Backpropagation neural network to play Go to great effect. ...
Conference Paper
Full-text available
VCMI is a new, open-source project that could become one of the biggest testing platforms for modern AI algorithms in the future. Its complex environment and turn-based gameplay make it a perfect system for any AI-driven solution. It also has a large community of active players, which improves the testability of target algorithms. This paper explores VCMI's environment and tries to assess its complexity by providing a base solution for the battle handling problem using two global optimization algorithms: Co-Evolution of Genetic Programming Trees and the TD(1) algorithm with a Back Propagation neural network. Both algorithms have been used in VCMI to evolve battle strategies through a fully autonomous learning process. Finally, the obtained strategies have been tested against existing solutions and compared with players' best tactics. The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-19728-9_7
... There are two main categories of predictive state models: Temporal Difference Networks (TDNets) (Sutton and Tanner, 2004; Tanner and Sutton, 2005a) and linear predictive models, which include linear PSRs (Littman et al., 2001) and transformed PSRs (Rosencrantz et al., 2004). The primary difference between the categories is their state update mechanisms. ...
... The state update mechanism defines how the TDNet takes the state at time $t$ and the most recent action/observation (at time $t+1$) and uses them to compute the state at time $t+1$. TDNets are not constrained to use any specific form for updating state. However, in practice, the following form is commonly used (Sutton and Tanner, 2004; Tanner and Sutton, 2005b; Sutton et al., 2005; Tanner and Sutton, 2005a). Let $s_t$ be the state vector at time $t$ and let $s^{pred}_{t+1}$ be the part of the state at time $t+1$ that consists of predictions about tests. ...

... Each $s^{pred}_{t+1}$ is simply the value for the state at time $t$ given the current TDNet parameters $\theta_i$, concatenated with the indicators for the action and observation at time $t+1$. The targets $s^{pred}_{t+1}$ are computed using the TD(λ) algorithm (Algorithm 1) for learning TDNets with λ = 1, which is consistently the best value of λ (Tanner and Sutton, 2005a). ...
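As a rough sketch of the commonly used update form referred to above (my paraphrase of the cited TD-network papers, with illustrative names; not their exact code):

    import numpy as np

    def td_net_state_update(W, s_t, action_onehot, obs_onehot):
        """Sketch of a common TD-network state update: the new prediction
        vector is a logistic function of a learned linear map applied to the
        previous predictions concatenated with indicators of the latest
        action and observation."""
        x = np.concatenate([s_t, action_onehot, obs_onehot, [1.0]])  # feature vector with bias
        return 1.0 / (1.0 + np.exp(-W @ x))                          # logistic answer units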
Article
Developing general purpose algorithms for learning an accurate model of dynamical systems from example traces of the system is still a challenging research problem. Predictive State Representation (PSR) models represent the state of a dynamical system as a set of predictions about future events. Our work focuses on improving Temporal Difference Networks (TD Nets), a general class of predictive state models. We adapt the internal structure of the TD Net and we present an improved algorithm for learning a TD Net model from experience in the environment. The new algorithm accepts a set of known facts about the environment and uses those facts to accelerate the learning. These facts can come from another learning algorithm (as in this study) or from a designer's prior knowledge about the environment. Experiments demonstrate that using the new structure and learning algorithm improves the accuracy of the TD Net models. When tested in an in finite environment, our new algorithm outperforms all of the standard PSR learning algorithms.
... Therefore, the TD algorithm learns to predict the future instantaneous observations or rewards [24]. When the TD method uses a single-step backup for future predictions, it is known as TD(0) [2, 5]. ...
... Alternatively, the notation TD(λ) is used when eligibility traces are present [19], [24]-[26]. ...
Thesis
Full-text available
Reinforcement learning is a method for learning sequential control actions or decisions using an instantaneous reward signal which implicitly defines a long-term value function. It has been proposed to solve complex learning control problems without requiring explicit knowledge of the system's dynamics. Moreover, it has also been used as a model of cognitive learning in humans and applied to systems, such as humanoid robots, to study embodied cognition. However, there are relatively few results which describe the actual performance of such learning algorithms, even on relatively simple problems. In this thesis, simple test problems are used to investigate issues associated with the value function's representation and parametric convergence. In particular, the terminal convergence problem is analyzed with a known optimal (bang-bang) control policy, where the aim is to accurately learn the value function. For certain initial conditions, the closed-form solution for the value function is calculated, and it is shown to have a polynomial form. It is parameterized by terms which are functions of the unknown plant's parameters and the value function's discount factor, and their convergence properties are analyzed. It is shown that the temporal difference error introduces a null space associated with the finite horizon basis function during the experiment. This is only non-singular when the experiment is terminated correctly, and a number of (equivalent) solutions are described. It is also demonstrated that, in general, the test problem's dynamics are chaotic for random initial states, and this causes a digital offset in the value function. Methods for estimating the offset are described, and a dead-zone is proposed to switch off learning in the chaotic region. Another value function estimation test problem is then proposed which uses a saturated piecewise linear control signal. This is a more realistic control scenario, and it is also shown to address the chaotic dynamics problem. It is shown that the conditioning of the learning problem depends on both the saturation threshold and the value function's discount factor, and that a badly conditioned learning problem may result. Moreover, it is proved that the temporal difference error introduces a trajectory null space associated with the differenced higher-order bases up to the saturation threshold of the saturated piecewise linear control signal. These results are then used to explain the behaviour of reinforcement learning algorithms when higher-order systems are used, and the impact of function approximation algorithms and exploration noise is discussed. Finally, a central pattern generator based reinforcement learning algorithm is applied to a single leg of a robot, where the goal is to generate appropriate control signals for each joint.
... Given a trace initialized at time step $t-k$ whose action conditions over the last $k$ time steps have been satisfied, the weights $W$ are updated at time step $t$ using the temporal-difference information $y_t - y_{t-1}$ and the past input vector $x_{t-k}$ to improve the past prediction $y_{t-k}$, with the update scaled appropriately by $\lambda^{t-k-1}$. We have only presented essential notation and intuition here, and refer the reader to Tanner and Sutton (2005) for the full details of the algorithm. In the following section we present our modified TD(λ) algorithm, which allows for continuous observations and actions. ...
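One possible reading of that update, written as a sketch (the data layout, the handling of the λ scaling, and all names are assumptions on my part; see Tanner and Sutton (2005) for the actual algorithm):

    import numpy as np

    def update_open_traces(W, traces, y_t, y_prev, alpha=0.1, lam=0.9):
        """Illustrative reading of the quoted update: every open trace stores
        (age k, past input x, index i of the prediction it was opened for)
        and, while its action conditions remain satisfied, receives the
        current temporal-difference information y_t[i] - y_prev[i], scaled
        by lam raised to the trace's age minus one.  Not the paper's code."""
        for k, x_past, i in traces:
            td = y_t[i] - y_prev[i]                        # temporal-difference information
            W[i] += alpha * (lam ** max(k - 1, 0)) * td * x_past
        return W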
... [Pseudo-code fragment from Figure 3 of the citing paper omitted.] Figure 3: Pseudo-code for the TD(λ) network learning algorithm for continuous dynamical systems. Modified from Tanner and Sutton (2005). ...
... Each observation emitted by the system was corrupted by mean-zero Gaussian noise with standard deviation 0.05. This is essentially a noisy, continuous analog of the cycle world presented in Tanner and Sutton (2005). We let n = 4, and σ = 0.3 for this experiment. ...
Article
Temporal-difference (TD) networks are a class of predictive state representations that use well-established TD methods to learn models of partially observable dynamical systems. Previous research with TD networks has dealt only with dynamical systems with finite sets of observations and actions. We present an algorithm for learning TD network representations of dynamical systems with continuous observations and actions. Our results show that the algorithm is capable of learning accurate and robust models of several noisy continuous dynamical systems. The algorithm presented here is the first fully incremental method for learning a predictive representation of a continuous dynamical system.
... Generating the learning examples: We build learning examples for the probability tree very similarly to the TD(1) learning scheme described in [3]. At a very high level, TD(1) generates target values for predictions by looking as far into the future as possible. ...
... TD(1) will avoid the second source of noise when possible. Experiments reported in [3] show that TD(1) gives the best learning performance for a logistic regression approach too. ...
... We compare the original logistic regression approach with the probability trees approach. To this end, we implemented our own version of the original TD(λ) learning algorithm as described in [3]. We performed experiments in two different environments: a ring world and a simple grid world. ...
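As a reminder of what TD(1)-style targets look like (with λ = 1 the target is the eventually observed outcome rather than the next bootstrapped prediction), here is a minimal sketch under that standard interpretation; it is not the scheme from [3] itself:

    def build_targets(predictions, final_outcome, lam=1.0):
        """Sketch contrasting TD(1) and TD(0) targets for a sequence of
        predictions about a single eventual outcome: with lam = 1 every time
        step is paired with the outcome actually observed at the end (a
        Monte-Carlo-style target); with lam = 0 each step bootstraps from the
        next prediction."""
        targets = []
        for t in range(len(predictions)):
            if lam == 1.0 or t == len(predictions) - 1:
                targets.append(final_outcome)
            else:
                targets.append(predictions[t + 1])
        return targets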
Conference Paper
Full-text available
State representation for intelligent agents is a continuous challenge as the need for abstraction is unavoidable in large state spaces. Predictive representations offer one way to obtain state abstraction by replacing a state with a set of predictions about future interactions with the world. One such formalism is the Temporal-Difference Networks framework (2). It splits the representation of knowledge into the question network and the answer network. The question network defines which questions (interactions) about future experience are of interest. It contains nodes, each corresponding to a single scalar prediction about a future observation given a certain sequence of interactions with the environment. The nodes are connected by links, annotated with action-labels, which represent temporal relationships between the predictions made by the nodes, conditioned on the action-labels on the links (more details in (2)). The answer network provides the predictive models to update the answers to the defined questions, which are expected values of the scalar quantities in the nodes. These values can be seen as estimates of probabilities. With each executed action of the agent, the predictions are updated using the answer network models to obtain a description of the new state. In classical TD-networks, logistic regression models are used, whose weight vector is obtained using a gradient learning approach. We propose the use of probability-valued decision trees (1) in the answer network of TD-Nets. We believe that decision trees are a particularly good choice to investigate, as they offer a different yet powerful form of generalization. Moreover, this aids in a better understanding of the strengths and weaknesses of TD-Nets and represents an important first step towards using them in worlds with more extensive observations. Furthermore, decision tree induction can be regarded as a prototypical example of a non-gradient learning approach.
... The proposed temporally extended concepts in this paper can be utilized to build the structure of temporal-difference (TD) networks [9]. TD networks are a formalism for expressing and learning grounded knowledge about dynamical systems [12]. A TD network is a network of nodes, each representing a single scalar prediction. ...
Conference Paper
In this paper, we propose a novel approach whereby a reinforcement learning agent attempts to understand its environment via meaningful temporally extended concepts in an unsupervised way. Our approach is inspired by findings in neuroscience on the role of mirror neurons in action-based abstraction. Since there are many cases in which the best decision cannot be made just by using instant sensory data, in this study we seek to develop a framework for learning temporally extended concepts from sequences of sensory-action data. To direct the agent to gather fertile information for concept learning, a reinforcement learning mechanism utilizing the agent's experience is proposed. Experimental results demonstrate the capability of the proposed approach in retrieving meaningful concepts from the environment. The concepts, and the way they are defined, are designed so that they can not only be applied to ease decision making but also be utilized in other applications, as elaborated in the paper. Keywords: Concept learning, Markov decision processes, mirror neurons, reinforcement learning.
Preprint
Learning continually and online from a continuous stream of data is challenging, especially for a reinforcement learning agent with sequential data. When the environment only provides observations giving partial information about the state of the environment, the agent must learn the agent state based on the data stream of experience. We refer to the state learned directly from the data stream of experience as the agent state. Recurrent neural networks can learn the agent state, but the training methods are computationally expensive and sensitive to the hyper-parameters, making them ill-suited to online learning. This work introduces methods based on the generate-and-test approach to learn the agent state. A generate-and-test algorithm searches for state features by generating features and testing their usefulness. In this process, features useful for the agent's performance on the task are preserved, and the least useful features are replaced with newly generated features. We study the effectiveness of our methods on two online multi-step prediction problems. The first problem, trace conditioning, focuses on the agent's ability to remember a cue for a prediction multiple steps into the future. In the second problem, trace patterning, the agent needs to learn patterns in the observation signals and remember them for future predictions. We show that our proposed methods can effectively learn the agent state online and produce accurate predictions.
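A minimal sketch of one generate-and-test maintenance step as described in the abstract (all names and the replacement fraction are illustrative assumptions, not the paper's API):

    def generate_and_test_step(features, utility, make_feature, replace_frac=0.05):
        """One maintenance step of the generate-and-test idea: keep features
        that have proven useful, replace the least useful fraction with newly
        generated candidates."""
        n_replace = max(1, int(replace_frac * len(features)))
        worst = sorted(range(len(features)), key=lambda i: utility[i])[:n_replace]
        for i in worst:
            features[i] = make_feature()   # freshly generated candidate feature
            utility[i] = 0.0               # reset its usefulness estimate
        return features, utility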