Figure 1 - uploaded by Gabriel Victor de la Cruz
Atari 2600 game screenshots of Pong, Freeway, and Beamrider, from left to right.

Source publication
Poster
Full-text available
Deep reinforcement learning (deep RL) has achieved superior performance in complex sequential tasks by using a deep neural network as its function approximator and by learning directly from raw images. A drawback of using raw images is that deep RL must learn the state feature representation from the raw images in addition to learning a policy. As...

Context in source publication

Context 1
... test our approach in three Atari games: Pong, Freeway, and Beamrider, as shown in Figure 1. The games have 6, 3, and 9 actions, respectively. ...

Similar publications

Article
Full-text available
Current work is limited in using GDL to represent domain knowledge for Automated Negotiations, because GDL does not support imperfect-information games in negotiation scenarios. In this paper, we extend GDL and improve the automated negotiation model so that the framework can describe negotiation scenarios with imperfect information, and e...

Citations

... The latter does not take into account when or how often the advice is used. [Comparison table of the cited transfer-learning approaches, listing the knowledge transferred (actions, Q-values, policies, or RL-tuples) and whether it is fixed or dynamic; EF-OnTL transfers RL-tuples.] Our work differs from the above as it replaces the need for an expert agent by dynamically selecting a learning agent as teacher, to exploit the different knowledge gained in different parts of a multi-agent system. Furthermore, while most of the related work continuously overrides the target policy by advising actions, Q-values, or policies, EF-OnTL transfers an experience batch less frequently, designed around the gap between source and target beliefs. ...
Preprint
Full-text available
Transfer learning in Reinforcement Learning (RL) has been widely studied to overcome training issues of Deep-RL, i.e., exploration cost, data availability and convergence time, by introducing a way to enhance the training phase with external knowledge. Generally, knowledge is transferred from expert agents to novices. While this fixes the issue for a novice agent, a good understanding of the task by the expert agent is required for such transfer to be effective. As an alternative, in this paper we propose Expert-Free Online Transfer Learning (EF-OnTL), an algorithm that enables expert-free real-time dynamic transfer learning in multi-agent systems. No dedicated expert exists, and the transfer source agent and the knowledge to be transferred are dynamically selected at each transfer step based on agents' performance and uncertainty. To improve uncertainty estimation, we also propose State Action Reward Next-State Random Network Distillation (sars-RND), an extension of RND that estimates uncertainty from the RL agent-environment interaction. We demonstrate EF-OnTL's effectiveness against a no-transfer scenario and advice-based baselines, with and without expert agents, in three benchmark tasks: Cart-Pole, a grid-based Multi-Team Predator-Prey (mt-pp) and Half Field Offense (HFO). Our results show that EF-OnTL achieves overall comparable performance when compared against advice-based baselines, while not requiring any external input or threshold tuning. EF-OnTL outperforms no-transfer with an improvement related to the complexity of the task addressed.
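
To make the sars-RND idea above concrete, here is a minimal sketch of random-network-distillation uncertainty computed over full (state, action, reward, next-state) tuples; the architecture, dimensions, and training loop are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of RND-style uncertainty over (s, a, r, s') tuples, in the
# spirit of sars-RND; architecture, dimensions and training loop are assumptions.
import torch
import torch.nn as nn

def make_net(in_dim, out_dim=64):
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

class SarsRND(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        in_dim = state_dim + action_dim + 1 + state_dim  # s, a, r, s'
        self.target = make_net(in_dim)      # fixed, randomly initialised network
        self.predictor = make_net(in_dim)   # trained to match the target
        for p in self.target.parameters():
            p.requires_grad = False

    def uncertainty(self, s, a, r, s_next):
        x = torch.cat([s, a, r, s_next], dim=-1)
        # Prediction error is large for rarely seen transitions -> uncertainty signal.
        return (self.predictor(x) - self.target(x)).pow(2).mean(dim=-1)

# Minimise the prediction error on visited transitions (toy batch shown here).
model = SarsRND(state_dim=8, action_dim=2)
opt = torch.optim.Adam(model.predictor.parameters(), lr=1e-4)
s, a = torch.randn(32, 8), torch.randn(32, 2)
r, s_next = torch.randn(32, 1), torch.randn(32, 8)
loss = model.uncertainty(s, a, r, s_next).mean()
opt.zero_grad()
loss.backward()
opt.step()
```
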
... Early techniques of RLfD typically extract prior knowledge by pre-training expert policies with supervised learning methods [15,16,17,18,19] or by pushing them into a memory replay buffer for value evaluation [20]. Once an expert policy is trained, it is used in the subsequent RL as is, without differentiating how helpful it is in assisting the agent to learn or adapting how it can assist in different states. ...
Preprint
Full-text available
The current work on reinforcement learning (RL) from demonstrations often assumes the demonstrations are samples from an optimal policy, an unrealistic assumption in practice. When demonstrations are generated by sub-optimal policies or have sparse state-action pairs, a policy learned from sub-optimal demonstrations may mislead an agent with incorrect or non-local action decisions. We propose a new method called Local Ensemble and Reparameterization with Split and Merge of expert policies (LEARN-SAM) to improve efficiency and make better use of sub-optimal demonstrations. First, LEARN-SAM employs a new concept, the lambda-function, based on a discrepancy measure between the current state and demonstrated states, to "localize" the weights of the expert policies during learning. Second, LEARN-SAM employs a split-and-merge (SAM) mechanism that separates the helpful parts of each expert demonstration and regroups them into new expert policies, so that the demonstrations are used selectively. Both the lambda-function and the SAM mechanism help boost the learning speed. Theoretically, we prove the invariance property of the reparameterized policy before and after the SAM mechanism, providing theoretical guarantees for the convergence of the employed policy gradient method. We demonstrate the superiority of the LEARN-SAM method and its robustness with varying demonstration quality and sparsity in six experiments on complex continuous control problems of low to high dimensions, compared to existing methods for RL from demonstration.
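
As a rough illustration of the lambda-function idea, the sketch below weights expert advice by the distance from the current state to the nearest demonstrated state; the Gaussian kernel and the bandwidth `sigma` are assumptions for illustration, not the paper's exact definition.

```python
# Sketch of a state-locality weight for expert policies, in the spirit of the
# lambda-function: the weight decays with distance to the nearest demonstrated
# state. The Gaussian kernel and bandwidth `sigma` are illustrative assumptions.
import numpy as np

def locality_weight(state, demo_states, sigma=1.0):
    d2 = np.min(np.sum((demo_states - state) ** 2, axis=1))  # squared distance to nearest demo state
    return np.exp(-d2 / (2.0 * sigma ** 2))                  # ~1 near demonstrations, ~0 far away

demo_states = np.random.randn(100, 4)    # demonstrated states (toy data)
state = np.random.randn(4)               # current state
w = locality_weight(state, demo_states)  # scales the expert policy's influence at this state
```
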
... The RL fine-tuning phase extends the NPG gradient to optimize by following g_aug = Σ_{(s,a)∈ρ_π} ∇_θ ln π_θ(a|s) A^π(s,a) + Σ_{(s,a)∈ρ_D} ∇_θ ln π_θ(a|s) w(s,a), where ρ_π is the trajectory obtained by following policy π_θ, ρ_D is the demonstration data, and w(s,a) is the weighting function w(s,a) = λ_0 λ_1^k max_{(s',a')∈ρ_π} A^π(s',a'), where λ_0 and λ_1 are hyper-parameters, and k is the iteration counter. Cruz et al. (2018) used a multiclass-classification deep neural network trained from human demonstrations in the pre-training phase; the neural network implicitly learns the underlying features by only using the cross-entropy loss function. The network uses raw images of the domain as input, and it has a single output for each valid action. ...
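
A minimal sketch of the pre-training step described above, assuming a standard Atari-style convolutional network and preprocessing (84x84 frames, 4-frame stack) rather than the cited work's exact architecture, might look like this:

```python
# Sketch: pre-train a policy network on human demonstrations with cross-entropy.
# Architecture, frame preprocessing (84x84, 4 stacked frames) and optimiser are assumptions.
import torch
import torch.nn as nn

n_actions = 6  # e.g. Pong

policy = nn.Sequential(
    nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
    nn.Linear(512, n_actions),                 # one output per valid action
)

opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

frames = torch.rand(32, 4, 84, 84)             # demonstration frames (toy batch)
actions = torch.randint(0, n_actions, (32,))   # demonstrator's actions as labels

logits = policy(frames)
loss = loss_fn(logits, actions)                # cross-entropy implicitly learns the features
opt.zero_grad()
loss.backward()
opt.step()
# The learned weights then initialise the network used by the deep RL agent.
```
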
... Human-guided RL is not limited to approaches like IRL, Batch RL, or RLED. Other approaches can provide information to guide the RL process, as mentioned by Taylor (2018). Other forms of human-in-the-loop guidance can also be used for RL, where the knowledge is made available by a human during the online learning process and can be seen as feedback. ...
Article
Full-text available
Reinforcement learning from expert demonstrations (RLED) is the intersection of imitation learning with reinforcement learning that seeks to take advantage of both learning approaches. RLED uses demonstration trajectories to improve sample efficiency in high-dimensional spaces. RLED is a promising new approach to behavioral learning through demonstrations from an expert teacher. RLED considers two possible knowledge sources to guide the reinforcement learning process: prior knowledge and online knowledge. This survey focuses on novel methods for model-free reinforcement learning guided through demonstrations, commonly but not necessarily provided by humans. The methods are analyzed and classified according to the impact of the demonstrations. Challenges, applications, and promising directions for improving the discussed methods are also presented.
... Computational tools, architectures, and implementations, as well as optimization techniques, are improving rapidly. For example, pretrained models [42][43][44] (e.g., AlexNet-T [45], GoogLeNet-T [46]) perform measurably better than untrained models (e.g., AlexNet-U [47]). This has been successful in developing AI classifiers for the adult retina, most prominently for diabetic retinopathy. ...
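
As a hedged illustration of the pretrained-versus-untrained comparison, the sketch below builds an ImageNet-pretrained AlexNet and a randomly initialised one for fine-tuning on a small classification task; the class count and the torchvision version (0.13 or later for the `weights` argument) are assumptions.

```python
# Sketch: build an ImageNet-pretrained AlexNet ("AlexNet-T"-style) and an untrained
# one ("AlexNet-U"-style) for fine-tuning on a small classification task.
# Requires torchvision >= 0.13 for the `weights` argument; the class count is illustrative.
import torch.nn as nn
from torchvision import models

def build_classifier(pretrained: bool, num_classes: int = 2):
    weights = models.AlexNet_Weights.DEFAULT if pretrained else None
    net = models.alexnet(weights=weights)
    net.classifier[6] = nn.Linear(4096, num_classes)  # replace the final ImageNet layer
    return net

model_t = build_classifier(pretrained=True)   # transferred (pretrained) features
model_u = build_classifier(pretrained=False)  # randomly initialised baseline
```
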
Article
Artificial intelligence (AI) applications are diverse and serve varied functions in clinical practice. The most successful products today are clinical decision tools used by physicians, but autonomous AI is gaining traction. Widespread use of AI is limited in part because of concerns about bias, fault-tolerance, and specificity. Adoption of AI often depends on removing cost and complexity in clinical workflow integration, providing clear incentives for use, and providing clear demonstration of clinical outcome. Existing wide-angle photographic screening could be integrated into the clinical workflow based on prior implementations for premature babies and linked with AI interpretation with existing technology. Incidence of retinal abnormality, clinical considerations, AI performance, grading variation for AI-augmented human grading, and cost and policy aspects play a significant role. Improved outcomes for newborns and a relatively high estimated incidence of abnormality have been named as benefits that counterbalance costs in the long term. [Ophthalmic Surg Lasers Imaging Retina. 2021;52:S17–S22.]
... [6] and [7] propose to use demonstrations to initialize the replay buffer so that the training of a deep Q-network ([6]) or a deep deterministic policy gradient (DDPG) network ([7]) can benefit. [12], [9], [13] propose to directly use demonstrations to initialize neural network parameters by pretraining them with an imitation learning objective. Besides pretraining a DRL model, imitation learning can also be used to construct an auxiliary imitation learning loss ([7], [14]). ...
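
A generic sketch of the replay-buffer initialisation idea, assuming a simple buffer layout and toy demonstration data rather than the cited implementations, is shown below.

```python
# Sketch: seed a DQN/DDPG-style replay buffer with demonstration transitions before
# RL training starts; the buffer layout and the toy demonstration data are assumptions.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

# Toy placeholder for recorded demonstrations (state, action, reward, next_state, done).
demo_transitions = [((0.0, 0.0), 1, 0.5, (0.1, 0.0), False) for _ in range(1000)]

buffer = ReplayBuffer()
for transition in demo_transitions:        # 1) pre-fill the buffer with demonstrations
    buffer.add(*transition)

batch = buffer.sample(32)                  # 2) RL training then samples minibatches that
                                           #    mix demonstration and self-generated data
```
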
Preprint
Full-text available
Our work aims at efficiently leveraging ambiguous demonstrations for the training of a reinforcement learning (RL) agent. An ambiguous demonstration can usually be interpreted in multiple ways, which severely hinders the RL agent from learning stably and efficiently. Since an optimal demonstration may also suffer from being ambiguous, previous works that combine RL and learning from demonstration (RLfD) may not work well. Inspired by how humans handle such situations, we propose to use self-explanation (an agent generates explanations for itself) to recognize valuable high-level relational features as an interpretation of why a successful trajectory is successful. In this way, the agent can provide some guidance for its own RL learning. Our main contribution is the Self-Explanation for RL from Demonstrations (SERLfD) framework, which can overcome the limitations of traditional RLfD works. Our experimental results show that an RLfD model can be improved by using our SERLfD framework in terms of training stability and performance.
... goal task. de la Cruz Jr et al. (2018) and de la Cruz Jr et al. (2019) collect labeled datasets from human demonstrations on the target task in order to pretrain the network. ...
Preprint
Full-text available
Pretraining is a common technique in deep learning for increasing performance and reducing training time, with promising experimental results in deep reinforcement learning (RL). However, pretraining requires a relevant dataset for training. In this work, we evaluate the effectiveness of pretraining for RL tasks, with and without distracting backgrounds, using both large, publicly available datasets with minimal relevance, as well as case-by-case generated datasets labeled via self-supervision. Results suggest filters learned during training on less relevant datasets render pretraining ineffective, while filters learned during training on the in-distribution datasets reliably reduce RL training time and improve performance after 80k RL training steps. We further investigate, given a limited number of environment steps, how to optimally divide the available steps into pretraining and RL training to maximize RL performance. Our code is available on GitHub
... To reduce the training time of deep reinforcement learning, a lot of work has been carried out, such as combining transfer learning (Barrett et al., 2010) and pre-training neural networks (Cruz et al., 2017). Among them, there is a kind of human-guided method, which aims to improve the sample efficiency of deep reinforcement learning. ...
Article
Full-text available
Purpose
This paper aims to realize fully distributed multi-UAV collision detection and avoidance based on deep reinforcement learning (DRL), to deal with the problem of low sample efficiency in DRL and speed up training, and to improve the applicability and reliability of the DRL-based approach in multi-UAV control problems.
Design/methodology/approach
In this paper, a fully distributed collision detection and avoidance approach for multi-UAV based on DRL is proposed. A method that integrates human experience into policy training via a human-experience-based adviser is proposed. The authors propose a hybrid control method which combines the learning-based policy with traditional model-based control. Extensive experiments, including simulations, real flights and comparative experiments, are conducted to evaluate the performance of the approach.
Findings
A fully distributed multi-UAV collision detection and avoidance method based on DRL is realized. The reward curve shows that the training process is significantly accelerated when human experience is integrated, and the mean episode reward is higher than with the pure DRL method. The experimental results show that the DRL method with human experience integration yields a significant improvement over the pure DRL method for multi-UAV collision detection and avoidance. Moreover, the safer flight brought by the hybrid control method has also been validated.
Originality/value
The fully distributed architecture is suitable for large-scale unmanned aerial vehicle (UAV) swarms and real applications. The DRL method with human experience integration significantly accelerates training compared to the pure DRL method. The proposed hybrid control strategy makes up for the shortcomings of two-dimensional light detection and ranging and other puzzles in applications.
... A major challenge of deep RL is that it often requires a prohibitively large amount of training time and data to reach a reasonable accuracy, making it inapplicable in realworld settings [3]. Leveraging humans to provide demonstrations (known as learning from demonstration (LfD)) in RL has recently gained traction as a possible way of speeding up deep RL [4], [5], [6]. ...
... In LfD, actions demonstrated by the human are considered as the ground truth labels for a given input game/image frame. An agent closely simulates the demonstrator's policy at the start, and later on, learns to surpass the demonstrator [3]. However, LfD holds a distinct challenge, in the sense that it often requires the agent to acquire skills from only a few demonstrations and interactions due to the time and expense of acquiring them [7]. ...
... We propose the technique of pre-training the underlying deep neural network to speed up training in deep RL. It enables the RL agent to learn better features leading to better performance without changing the policy learning strategies [3]. In supervised methods, pre-training helps regularisation and enables faster convergence compared to randomly initialised networks [8]. ...
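
One way to picture this, as a sketch under assumed shapes and architecture rather than the cited method's exact code, is to copy the pre-trained network's weights into the RL agent's network and then run the unchanged RL algorithm:

```python
# Sketch: initialise the RL agent's network from a pre-trained network and then run
# the unchanged RL algorithm; shapes and architecture are illustrative assumptions.
import copy
import torch.nn as nn

def make_net(n_actions=6):
    return nn.Sequential(
        nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        nn.Linear(512, n_actions),
    )

pretrained = make_net()                                # assume this was trained on demonstrations
q_network = make_net()
q_network.load_state_dict(pretrained.state_dict())     # transfer the learned features
target_network = copy.deepcopy(q_network)
# From here, standard DQN (or any other deep RL algorithm) proceeds unchanged.
```
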
Preprint
Reinforcement Learning (RL) is a semi-supervised learning paradigm in which an agent learns by interacting with an environment. Deep learning in combination with RL provides an efficient method to learn how to interact with the environment and is called Deep Reinforcement Learning (deep RL). Deep RL has achieved tremendous success in gaming, such as AlphaGo, but its potential has rarely been explored for challenging tasks like Speech Emotion Recognition (SER). Deep RL applied to SER can potentially improve the performance of an automated call-centre agent by dynamically learning emotion-aware responses to customer queries. While the policy employed by the RL agent plays a major role in action selection, there is currently no RL policy tailored for SER. In addition, an extended learning period is a general challenge for deep RL, which can impact the speed of learning for SER. Therefore, in this paper, we introduce a novel policy, the "Zeta policy", which is tailored for SER, and apply pre-training in deep RL to achieve a faster learning rate. Pre-training with a cross dataset was also studied to explore the feasibility of pre-training the RL agent with a similar dataset in a scenario where no real environmental data is available. The IEMOCAP and SAVEE datasets were used for the evaluation, with the problem being to recognize four emotions (happy, sad, angry and neutral) in the utterances provided. Experimental results show that the proposed Zeta policy performs better than existing policies. The results also support that pre-training can reduce the training time by reducing the warm-up period and is robust to the cross-corpus scenario.
... Works in (Cruz et al., 2017) and (Nair et al., 2017) apply ideas of imitation learning to fit the demonstration dataset in a supervised manner. (Xiang and Su, 2019) use the mean episode rewards in demonstration data as a baseline to amplify the incentive in high reward zone. ...
Conference Paper
Full-text available
Using direct reinforcement learning (RL) to accomplish a task can be very inefficient, especially in robotic configurations where interactions with the environment are lengthy and costly. Instead, learning from expert demonstration (LfD) is an alternative approach to gain better performance in an RL setting, which also greatly improves sample efficiency. We propose a novel demonstration learning framework for actor-critic based algorithms. Firstly, we put forward an environment pre-training paradigm to initialize the model parameters without interacting with the target environment, which effectively avoids the cold-start problem in deep RL scenarios. Secondly, we design a general-purpose LfD framework for most of the mainstream actor-critic RL algorithms that include a policy network and a value function, such as PPO, SAC, TRPO and A3C. Thirdly, we build a dedicated model training platform to perform the human-robot interaction and numerical experimentation. We evaluate the method in six MuJoCo simulated locomotion environments and on our robot control simulation platform. Results show that several epochs of pre-training can improve the agent's performance over the early stage of training, and the final converged performance of the RL algorithm is also boosted by external demonstrations. In general, sample efficiency is improved by 30% with the proposed method. Our demonstration pipeline makes full use of the exploration property of the RL algorithm, which is feasible for fast teaching of robots in dynamic environments.
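
A generic sketch of such an LfD hook for actor-critic methods, assuming toy networks, toy data, and an illustrative behaviour-cloning coefficient rather than the paper's implementation, is to add an auxiliary imitation term to the usual actor-critic loss:

```python
# Sketch: add an auxiliary behaviour-cloning term to a generic actor-critic loss.
# Networks, toy data and the coefficient `bc_coef` are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, bc_coef = 8, 4, 0.1
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

# Toy batches standing in for on-policy rollouts and for expert demonstrations.
obs, actions = torch.randn(32, obs_dim), torch.randint(0, n_actions, (32,))
advantages, returns = torch.randn(32), torch.randn(32, 1)
demo_obs, demo_actions = torch.randn(32, obs_dim), torch.randint(0, n_actions, (32,))

log_probs = torch.log_softmax(actor(obs), dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
policy_loss = -(log_probs * advantages).mean()            # vanilla policy-gradient term
value_loss = (critic(obs) - returns).pow(2).mean()        # critic regression
bc_loss = F.cross_entropy(actor(demo_obs), demo_actions)  # imitate the demonstrated actions

loss = policy_loss + 0.5 * value_loss + bc_coef * bc_loss
opt.zero_grad()
loss.backward()
opt.step()
```
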
... A major challenge of deep RL is that it often requires a prohibitively large amount of training time and data to reach a reasonable accuracy, making it inapplicable in real-world settings [8]. Leveraging humans to provide demonstrations (known as learning from demonstration (LfD)) in RL has recently gained traction as a possible way of speeding up deep RL [9,10,11]. ...
... In LfD, actions demonstrated by the human are considered as the ground truth labels for a given input game/image frame. An agent closely simulates the demonstrator's policy at the start, and later on, learns to surpass the demonstrator [8]. However, LfD holds a distinct challenge, in the sense that it often requires the agent to acquire skills from only a few demonstrations and interactions due to the time and expense of acquiring them [12]. ...
... Pre-training the underlying deep neural network is another approach to speed up training in deep RL. It enables the RL agent to learn better features which leads to better performance without changing the policy learning strategies [8]. In supervised methods, pre-training helps regularisation and enables faster convergence compared to randomly initialised networks [13]. ...
Preprint
Full-text available
Deep reinforcement learning (deep RL) is a combination of deep learning with reinforcement learning principles to create efficient methods that can learn by interacting with their environment. This has led to breakthroughs in many complex tasks, such as playing the game "Go", that were previously difficult to solve. However, deep RL requires significant training time, making it difficult to use in various real-life applications such as Human-Computer Interaction (HCI). In this paper, we study pre-training in deep RL to reduce the training time and improve the performance of Speech Recognition, a popular application of HCI. To evaluate the performance improvement in training, we use the publicly available "Speech Command" dataset, which contains utterances of 30 command keywords spoken by 2,618 speakers. Results show that pre-training with deep RL offers faster convergence compared to non-pre-trained RL while achieving improved speech recognition accuracy.