Fig. 3
The model-free approach. The dashed box occurs off-line, while the solid boxes occur on-line, during actual test episodes.


Source publication
Article
Full-text available
This article presents an extended case study in the application of neuroevolution to generalized simulated helicopter hovering, an important challenge problem for reinforcement learning. While neuroevolution is well suited to coping with the domain's complex transition dynamics and high-dimensional state and action spaces, the need to explore effic...

Contexts in source publication

Context 1
... MDPs were evolved using the procedure described in Sect. 3.1. Then, for each test MDP of the competition, the first 10 episodes were spent evaluating each of these specialized policies in that test MDP. Finally, whichever specialized policy performed the best was used for the remaining 990 episodes of that test MDP. This strategy, depicted in Fig. 3, allows the agent to adapt on-line to each test MDP in a sample-efficient way, without needing an accurate model. Figure 4 (left) shows the results of the generalized helicopter hovering event at the 2008 RL Competition, in which this model-free approach won first place. Of the six entries that successfully completed test runs, only ...
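To make the selection strategy above concrete, here is a minimal Python sketch, assuming hypothetical helpers: a list of off-line evolved specialized_policies and a run_episode(policy, mdp) function that returns one episode's return.

```python
# Minimal sketch of the on-line selection strategy described above.
# `specialized_policies` (evolved off-line) and `run_episode(policy, mdp)`
# are hypothetical helpers, not part of the original source.

def adapt_to_test_mdp(specialized_policies, test_mdp, run_episode,
                      eval_episodes=10, total_episodes=1000):
    """Evaluate each off-line evolved policy briefly, then run the best one."""
    returns = []
    scores = {i: [] for i in range(len(specialized_policies))}
    # Spend the first episodes cycling through the specialized policies.
    for ep in range(eval_episodes):
        i = ep % len(specialized_policies)
        ret = run_episode(specialized_policies[i], test_mdp)
        scores[i].append(ret)
        returns.append(ret)
    # Keep whichever specialized policy performed best during evaluation.
    evaluated = [i for i in scores if scores[i]]
    best = max(evaluated, key=lambda i: sum(scores[i]) / len(scores[i]))
    # Use it for the remaining episodes (990 of 1000 in the competition setting).
    for _ in range(total_episodes - eval_episodes):
        returns.append(run_episode(specialized_policies[best], test_mdp))
    return returns
```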
Context 2
... quality of learning. Lower values of k make faster progress at the beginning because they can evaluate policies more quickly; higher values of k plateau higher because they can more accurately make fine distinctions between policies. While values of k > 1 perform best in the long run, good performance is still possible with k = 1. In contrast, Fig. 13 shows results when r = 0.3. In this case, k = 1 performs poorly, as a single fitness evaluation is not enough to guide evolution. When k = 2, evolution makes significant progress but plateaus early. To achieve good performance, k ≥ 5 is ...
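As an illustration of the k-evaluation trade-off described above (not the paper's implementation), the sketch below averages k noisy fitness evaluations per candidate inside a simple generational loop; evaluate and mutate are hypothetical domain-specific hooks.

```python
def noisy_fitness(policy, evaluate, k):
    """Average k independent, noisy evaluations of a single policy."""
    return sum(evaluate(policy) for _ in range(k)) / k

def evolve(population, evaluate, mutate, k, n_generations):
    for _ in range(n_generations):
        # Larger k -> less noisy fitness signal, but k times the evaluation cost.
        ranked = sorted(population, key=lambda p: noisy_fitness(p, evaluate, k),
                        reverse=True)
        parents = ranked[: max(1, len(ranked) // 2)]   # truncation selection
        population = parents + [mutate(p) for p in parents]
    return population
```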

Similar publications

Article
Full-text available
Face recognition (FR) methods report significant performance by adopting the convolutional neural network (CNN) based learning methods. Although CNNs are mostly trained by optimizing the softmax loss, the recent trend shows an improvement of accuracy with different strategies, such as task-specific CNN learning with different loss functions, fine-t...

Citations

... LITERATURE REVIEW Important work in this field [9][10][11] made it possible to develop an Advanced Neuroevolutionary Genetic Algorithm (ANGA) for improved protection. Because the defense landscape changes constantly, Smith et al. argue that new ways are needed to make AI models more adaptable and useful. ...
Article
Full-text available
Different methods have been developed to make optimization processes more efficient and effective, a significant advance in the area of evolutionary optimization. This abstract discusses four well-known methods: Neuroevolution of Augmenting Topologies (NEAT), Genetic Algorithms (GAs), Genetic Programming (GP), and the Advanced Neuroevolutionary Genetic Algorithm (ANGA), focusing on key performance indicators: Fitness Metrics, Generalization, Efficiency and Speed, and Overall Performance. With scores of 90% in Fitness Metrics and 88% in Generalization, NEAT, a neuroevolutionary method, performs strongly on competitive tasks, though its Efficiency and Speed score is lower at 80%. GAs, known for their population-based approach, score highest in Efficiency and Speed (90%) but somewhat lower in Fitness Metrics and Generalization (89% and 85%, respectively). GP, which focuses on evolving programs, scores 88%, 85%, and 85% in Fitness Metrics, Generalization, and Efficiency and Speed, respectively. The new ANGA algorithm stands out as the strongest performer across all tests, with 93% in Fitness Metrics, 94% in Generalization, and 93% in Efficiency and Speed; its Overall Performance score of 97.78% reflects its effectiveness as a whole, making ANGA a promising method for genetic optimization.
... Approaches to safe RL broadly fall into two categories: (i) modification of the exploration process, e.g., with risk metrics, and (ii) modification of the optimality criterion with a safety factor, e.g., penalties. The first category of RL methods generally requires external knowledge, e.g., to leverage imitation learning and/or engineer risk metrics for exploration (Abbeel et al., 2010;Koppejan and Whiteson, 2011;Marchesini et al., 2022). Even though some of these methods are not originally developed with safety in mind, the added guidance can often empirically reduce the number of catastrophic events during the learning process (Garcia and Fernández, 2012). ...
... It comprises RL techniques which, in addition to maximizing reward, satisfy certain criteria regarding the performance (safety) of the system during the learning and/or deployment processes. This is especially important for real-world environments with critical safety requirements, such as helicopter flight [102] and gas turbine control [70]. One possible approach to safe RL is to transform the reward function so that it includes some notion of risk. ...
Preprint
Full-text available
The field of Sequential Decision Making (SDM) provides tools for solving Sequential Decision Processes (SDPs), where an agent must make a series of decisions in order to complete a task or achieve a goal. Historically, two competing SDM paradigms have vied for supremacy. Automated Planning (AP) proposes to solve SDPs by performing a reasoning process over a model of the world, often represented symbolically. Conversely, Reinforcement Learning (RL) proposes to learn the solution of the SDP from data, without a world model, and represent the learned knowledge subsymbolically. In the spirit of reconciliation, we provide a review of symbolic, subsymbolic and hybrid methods for SDM. We cover both methods for solving SDPs (e.g., AP, RL and techniques that learn to plan) and for learning aspects of their structure (e.g., world models, state invariants and landmarks). To the best of our knowledge, no other review in the field provides the same scope. As an additional contribution, we discuss what properties an ideal method for SDM should exhibit and argue that neurosymbolic AI is the current approach which most closely resembles this ideal method. Finally, we outline several proposals to advance the field of SDM via the integration of symbolic and subsymbolic AI.
... One approach that has been proposed to address this problem has been to include an explicit safety factor in the stage costs of the sequential decision making problem, that is a term that estimates the probability the state will transition into an unsafe state given the current control action (Sato et al., 2001; Gaskett, 2003; Geibel and Wysotzki, 2005). In addition to modifications of the cost function, a different stream of literature has considered explicitly modifying the exploration process of states to mitigate the risks of reaching an unsafe state due to random exploration (Martín H and Lope, 2009; Koppejan and Whiteson, 2011; Garcia and Fernández, 2012). Concepts of safety have also been applied in bandit problems (Sui et al., 2015; Sun et al., 2017), where the notion of risk comes from a set of constraints set on the rewards of the arms and less from a predefined unsafe set. ...
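A minimal sketch of the "safety factor in the stage cost" idea described above, assuming hypothetical base_cost and p_unsafe functions; lam weights the estimated probability of reaching an unsafe state.

```python
# Illustrative sketch: the estimated probability of transitioning to an unsafe
# state is added to the stage cost as a weighted penalty.
# `base_cost` and `p_unsafe` are hypothetical, problem-specific functions.

def safe_stage_cost(state, action, base_cost, p_unsafe, lam=10.0):
    # c_safe(s, a) = c(s, a) + lam * P(next state is unsafe | s, a)
    return base_cost(state, action) + lam * p_unsafe(state, action)
```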
Preprint
A key challenge in sequential decision making is optimizing systems safely under partial information. While much of the literature has focused on the cases of either partially known states or partially known dynamics, the challenge is further exacerbated when both states and dynamics are partially known. Computing heparin doses for patients fits this paradigm, since the concentration of heparin in the patient cannot be measured directly and the rates at which patients metabolize heparin vary greatly between individuals. While many proposed solutions are model-free, they require complex models and have difficulty ensuring safety. However, if some of the structure of the dynamics is known, a model-based approach can be leveraged to provide safe policies. In this paper we propose such a framework to address the challenge of optimizing personalized heparin doses. We use a predictive model parameterized individually by patient to predict future therapeutic effects. We then leverage this model using a scenario-generation-based approach that is capable of ensuring patient safety. We validate our models with numerical experiments by comparing the predictive capabilities of our model against existing machine learning techniques and demonstrating how our dosing algorithm can treat patients in a simulated ICU environment.
... More recent work has introduced methods to directly optimize the reward function subject to constraints [7], or to satisfy constraints with high probability [8,9,10]. Alternatively, some safe RL techniques use external knowledge, e.g., imitation learning, and/or risk metrics during exploration [11,12,13]. Even though some of these methods were not originally developed with safety in mind, the added guidance can often empirically reduce the number of catastrophic events during the learning process [14]. ...
Preprint
Full-text available
Enabling reinforcement learning (RL) to explicitly consider constraints is important for safe deployment in real-world process systems. This work exploits recent developments in deep RL and optimization over trained neural networks to introduce algorithms for safe training and deployment of RL agents. We show how optimization over trained neural-network state-action value functions (i.e., a critic function) can explicitly incorporate constraints and describe two corresponding RL algorithms: the first uses constrained optimization of the critic to give optimal actions for training an actor, while the second guarantees constraint satisfaction by directly implementing actions from optimizing a trained critic model. The two algorithms are tested on a supply chain case study from OR-Gym and are compared against state-of-the-art algorithms TRPO, CPO, and RCPO.
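The second algorithm described above selects actions by optimizing a trained critic subject to constraints. The sketch below approximates that idea with a simple search over sampled candidate actions rather than a full optimization solver; q_value, constraint, and sample_action are hypothetical stand-ins.

```python
# Illustrative sketch (not the paper's algorithm): pick the sampled action with
# the highest critic value among those that satisfy the constraint g(s, a) <= 0.

def constrained_greedy_action(state, q_value, constraint, sample_action,
                              n_candidates=256):
    best_a, best_q = None, float("-inf")
    for _ in range(n_candidates):
        a = sample_action()
        if constraint(state, a) > 0:        # constraint violated, skip candidate
            continue
        q = q_value(state, a)
        if q > best_q:
            best_a, best_q = a, q
    return best_a                           # None if no sampled action was feasible
```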
... Changes to the exploration process are made by risk-aware heuristics [6] or by initial or on-the-fly knowledge demonstrations [7,8]. Risk-aware heuristics are computationally cheap, but cannot guarantee strong confidence bounds around the estimated safety. ...
... Interestingly, neuroevolutionary methods achieve promising results using simple evolutionary algorithms [7], and proposing new strategies for these simple algorithms can significantly improve the convergence of training a neural network and its prediction performance. Although neuroevolutionary methods are generally computationally expensive, especially in deep learning [8], they are still preferred because they also offer flexibility in choosing the cost function (e.g., reward maximization [9]). ...
Article
Full-text available
Neural networks have demonstrated their usefulness for solving complex regression problems in circumstances where alternative methods do not provide satisfactory results. Finding a good neural network model is a time-consuming task that involves searching through a complex multidimensional hyperparameter and weight space in order to find the values that provide optimal convergence. We propose a novel neural network optimizer that leverages the advantages of both an improved evolutionary competitive algorithm and gradient-based backpropagation. The method consists of a modified, hybrid variant of the Imperialist Competitive Algorithm (ICA). We analyze multiple strategies for initialization, assimilation, revolution, and competition, in order to find the combination of ICA steps that provides optimal convergence and enhance the algorithm by incorporating a backpropagation step in the ICA loop, which, together with a self-adaptive hyperparameter adjustment strategy, significantly improves on the original algorithm. The resulting hybrid method is used to optimize a neural network to solve a complex problem in the field of chemical engineering: the synthesis and swelling behavior of the semi- and interpenetrated multicomponent crosslinked structures of hydrogels, with the goal of predicting the yield in a crosslinked polymer and the swelling degree based on several reaction-related input parameters. We show that our approach has better performance than other biologically inspired optimization algorithms and generates regression models capable of making predictions that are better correlated with the desired outputs.
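A simplified sketch of a hybrid ICA loop with a gradient step, in the spirit of the method described above but not the authors' implementation; loss(w) and grad(w) are hypothetical functions returning the network's training loss and its gradient for a flat weight vector, and the competition step is omitted for brevity.

```python
import numpy as np

def hybrid_ica(loss, grad, dim, n_countries=20, n_imperialists=4,
               n_iters=100, assimilation=0.5, revolution=0.1, lr=1e-2,
               rng=np.random.default_rng(0)):
    countries = rng.normal(size=(n_countries, dim))
    for _ in range(n_iters):
        # Rank countries; the best become imperialists, the rest colonies.
        order = np.argsort([loss(c) for c in countries])
        countries = countries[order]
        imperialists = countries[:n_imperialists]
        colonies = countries[n_imperialists:]
        for i in range(len(colonies)):
            colony = colonies[i]
            imp = imperialists[i % n_imperialists]
            # Assimilation: move the colony toward its imperialist.
            colony = colony + assimilation * rng.random() * (imp - colony)
            # Revolution: occasional random perturbation to keep diversity.
            if rng.random() < revolution:
                colony = colony + rng.normal(scale=0.1, size=dim)
            # Backpropagation-style step: local gradient refinement of the colony.
            colonies[i] = colony - lr * grad(colony)
        countries = np.vstack([imperialists, colonies])
    return min(countries, key=loss)
```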
... Conversely, the applications in real-world robotic domains also pose important new challenges for RL research. In particular, many real-world robotic environments and tasks, such as human-related robotic environments [6], helicopter manipulation [7,8], autonomous vehicle [9], and aerial delivery [10], have very low tolerance for violations of safety constraints, as such violation can cause severe consequences. This raises a substantial demand for safe reinforcement learning techniques. ...
... H et al. [7] and Koppejan et al. [8] studied RL applications to helicopter control under strict safety assumptions. Wen et al. developed a safe RL approach called Parallel Constrained Policy Optimization (PCPO) specifically to enhance the safety of autonomous vehicles [9]. ...
Preprint
As safety violations can lead to severe consequences in real-world robotic applications, the increasing deployment of Reinforcement Learning (RL) in robotic domains has propelled the study of safe exploration for reinforcement learning (safe RL). In this work, we propose a risk preventive training method for safe RL, which learns a statistical contrastive classifier to predict the probability of a state-action pair leading to unsafe states. Based on the predicted risk probabilities, we can collect risk preventive trajectories and reshape the reward function with risk penalties to induce safe RL policies. We conduct experiments in robotic simulation environments. The results show the proposed approach has comparable performance with the state-of-the-art model-based methods and outperforms conventional model-free safe RL approaches.
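A minimal sketch of the reward-reshaping step described above, assuming a hypothetical risk_model whose predict_proba(state, action) returns the estimated probability that the pair leads to an unsafe state; the contrastive training of that classifier is abstracted away.

```python
# Illustrative sketch: subtract a risk penalty, weighted by the classifier's
# predicted probability of an unsafe outcome, from the environment reward.
# `risk_model` and `penalty_weight` are hypothetical names.

def reshape_reward(reward, state, action, risk_model, penalty_weight=5.0):
    p_risk = risk_model.predict_proba(state, action)   # P(unsafe outcome | s, a)
    return reward - penalty_weight * p_risk
```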
... Table 6: Different types of robot applications using different methods.
Method | Experiment type | Robot / platform
A genetic algorithm [277] | Simulation experiments | Webots mobile robots [292]
A Particle Swarm Optimization algorithm [200] | Simulation experiments | NAO robots [174]
A learning algorithm [184] | Real-world experiments | The Aldebaran Nao humanoid robots [311]
An iterative optimization algorithm [82] | Real-world experiments | The Aldebaran Nao humanoid robots [311]
A kinetics teaching method [203] | Real-world experiments | NAO robots [174]
A policy gradient method [138] | Simulation experiments | The Webots simulation package [185]
A Deep RL method [74] | Simulation experiments | MuJoCo robots [276]
A Neuroevolutionary RL method [140] | Simulation experiments | Helicopter control [19]
A policy search method [136] | Real-world experiments | A Barrett robot arm ...
Preprint
Full-text available
Reinforcement learning has achieved tremendous success in many complex decision making tasks. When it comes to deploying RL in the real world, safety concerns are usually raised, leading to a growing demand for safe reinforcement learning algorithms, such as in autonomous driving and robotics scenarios. While safety control has a long history, the study of safe RL algorithms is still in the early stages. To establish a good foundation for future research in this thread, in this paper, we provide a review for safe RL from the perspectives of methods, theory and applications. Firstly, we review the progress of safe RL from five dimensions and come up with five problems that are crucial for safe RL being deployed in real-world applications, coined as "2H3W". Secondly, we analyze the theory and algorithm progress from the perspectives of answering the "2H3W" problems. Then, the sample complexity of safe RL methods is reviewed and discussed, followed by an introduction of the applications and benchmarks of safe RL algorithms. Finally, we open the discussion of the challenging problems in safe RL, hoping to inspire more future research on this thread. To advance the study of safe RL algorithms, we release a benchmark suite, an open-sourced repository containing the implementations of major safe RL algorithms, along with tutorials at the link: https://github.com/chauncygu/Safe-Reinforcement-Learning-Baselines.git.
... In reinforcement learning (RL), an agent perceives consecutive states of its environment and acts after each observation to maximize long-term cumulative expected reward [1]. One key challenge to the widespread deployment of RL in safetycritical systems is the difficulty of ensuring that an RL agent's policies are safe, especially when the system model and its surrounding environment are both unknown and subject to noise [2], [3], [4]. In this work, we consider the case of RL for guaranteed-safe navigation of mobile robots, such as autonomous cars or delivery drones, where safety means collision avoidance. ...
... We ensure safety by computing our robot's forward reachable set (FRS) for a given motion plan, then adjusting the ...

Algorithm 1: Safe RL with BRSL
initialize the RL agent with a random policy π_θ, environment model µ_ϕ, an empty replay buffer B, a max number of time steps n_iter, and a safe plan p_0
for each episode do
    initialize task with reward function ρ
    x_1 ← observe initial environment state ...
Preprint
Full-text available
Reinforcement learning (RL) is capable of sophisticated motion planning and control for robots in uncertain environments. However, state-of-the-art deep RL approaches typically lack safety guarantees, especially when the robot and environment models are unknown. To justify widespread deployment, robots must respect safety constraints without sacrificing performance. Thus, we propose a Black-box Reachability-based Safety Layer (BRSL) with three main components: (1) data-driven reachability analysis for a black-box robot model, (2) a trajectory rollout planner that predicts future actions and observations using an ensemble of neural networks trained online, and (3) a differentiable polytope collision check between the reachable set and obstacles that enables correcting unsafe actions. In simulation, BRSL outperforms other state-of-the-art safe RL methods on a Turtlebot 3, a quadrotor, and a trajectory-tracking point mass with an unsafe set adjacent to the area of highest reward.
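A highly simplified sketch of a reachability-based safety layer in the spirit of the abstract above, not BRSL itself: the reachable set is approximated by an axis-aligned box around a learned one-step prediction, and a proposed action is replaced by a fallback if that box overlaps an obstacle box; predict_next_state, the box obstacle representation, and the disturbance bound are assumptions for illustration.

```python
import numpy as np

def box_reachable_set(state, action, predict_next_state, disturbance=0.1):
    # Approximate the one-step reachable set as a box around the predicted
    # next state, inflated by a disturbance bound (an assumption of this sketch).
    center = predict_next_state(state, action)
    return center - disturbance, center + disturbance        # (lower, upper)

def boxes_overlap(lo_a, hi_a, lo_b, hi_b):
    return bool(np.all(hi_a >= lo_b) and np.all(hi_b >= lo_a))

def safe_action(state, proposed_action, fallback_action,
                predict_next_state, obstacles):
    lo, hi = box_reachable_set(state, proposed_action, predict_next_state)
    for obs_lo, obs_hi in obstacles:
        if boxes_overlap(lo, hi, obs_lo, obs_hi):
            return fallback_action          # correct the unsafe proposed action
    return proposed_action
```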