Coline Devin's research while affiliated with the University of California, Berkeley, and other places

What is this page?


This page lists the scientific contributions of an author who either does not have a ResearchGate profile or has not yet added these contributions to their profile.

It was automatically created by ResearchGate to provide a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.

If you're a ResearchGate member, you can follow this page to keep up with this author's work.

If you are this author and you don't want us to display this page anymore, please let us know.

Publications (22)


Figure 10 | RoboCat-lim trained with additional demonstrations vs with additional demonstrations and self-generated data.
Figure 11 | RoboCat compared with the performance of the data-generating agents, or with the combined performance of these agents and the demonstrations; the latter combination is used for training RoboCat.
Figure 14 | Reconstructions from our RoboCat-lim VQ-GAN on the training datasets. From right to left: Panda sim, Sawyer real red-on-blue, Sawyer sim (with visual domain randomisation), MetaWorld, DM Control, and ImageNet.
Figure 15 | Reconstructions of tasks not included in the RoboCat-lim VQ-GAN training: YCB fruit lifting and vegetable lifting. Although the reconstructions are inaccurate, they contain enough information for the agent to learn the task.
Figure 17 | Reconstructions of tasks not included in the v1.0 VQ-GAN training: YCB fruit insert/remove, and the 3-fingered gripper. While the vegetables and YCB fruit were seen in the agent play data, the bowl was not seen at all in the training data.


RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation
  • Preprint
  • File available

June 2023 · 240 Reads · Giulia Vezzani · Dushyant Rao · [...] · Nicolas Heess

The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a foundation agent for robotic manipulation. This agent, named RoboCat, is a visual goal-conditioned decision transformer capable of consuming multi-embodiment action-labelled visual experience. This data spans a large repertoire of motor control skills from simulated and real robotic arms with varying sets of observations and actions. With RoboCat, we demonstrate the ability to generalise to new tasks and robots, both zero-shot as well as through adaptation using only 100--1000 examples for the target task. We also show how a trained model itself can be used to generate data for subsequent training iterations, thus providing a basic building block for an autonomous improvement loop. We investigate the agent's capabilities, with large-scale evaluations both in simulation and on three different real robot embodiments. We find that as we grow and diversify its training data, RoboCat not only shows signs of cross-task transfer, but also becomes more efficient at adapting to new tasks.
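
The self-improvement loop described above can be summarised as a small training schedule. The sketch below is only an illustration of that loop, not the authors' implementation; fine_tune, collect_rollouts, and keep_successes are hypothetical callables supplied by the user.

```python
from typing import Any, Callable, List

def self_improvement_loop(
    agent: Any,
    demos: List[Any],
    fine_tune: Callable[[Any, List[Any]], Any],
    collect_rollouts: Callable[[Any, int], List[Any]],
    keep_successes: Callable[[List[Any]], List[Any]],
    n_iterations: int = 3,
    rollouts_per_iter: int = 1000,
) -> Any:
    """Alternate between fine-tuning on a small demonstration set and
    re-training on self-generated data, mirroring the improvement loop
    sketched in the abstract (helper callables are illustrative)."""
    dataset = list(demos)                      # 100-1000 target-task demonstrations
    for _ in range(n_iterations):
        agent = fine_tune(agent, dataset)      # specialise the generalist agent to the task
        episodes = collect_rollouts(agent, rollouts_per_iter)
        dataset += keep_successes(episodes)    # keep successful self-generated episodes only
    return agent
```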



How to Spend Your Robot Time: Bridging Kickstarting and Offline Reinforcement Learning for Vision-based Robotic Manipulation

May 2022 · 27 Reads

Reinforcement learning (RL) has been shown to be effective at learning control from experience. However, RL typically requires a large amount of online interaction with the environment. This limits its applicability to real-world settings, such as in robotics, where such interaction is expensive. In this work we investigate ways to minimize online interactions in a target task, by reusing a suboptimal policy we might have access to, for example from training on related prior tasks, or in simulation. To this end, we develop two RL algorithms that can speed up training by using not only the action distributions of teacher policies, but also data collected by such policies on the task at hand. We conduct a thorough experimental study of how to use suboptimal teachers on a challenging robotic manipulation benchmark on vision-based stacking with diverse objects. We compare our methods to offline, online, offline-to-online, and kickstarting RL algorithms. By doing so, we find that training on data from both the teacher and student enables the best performance for limited data budgets. We examine how to best allocate a limited data budget -- on the target task -- between the teacher and the student policy, and report experiments using varying budgets, two teachers with different degrees of suboptimality, and five stacking tasks that require a diverse set of behaviors. Our analysis, both in simulation and in the real world, shows that our approach is the best across data budgets, while standard offline RL from teacher rollouts is surprisingly effective when enough data is given.
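
As an illustration of the general recipe (not the paper's exact algorithm), the sketch below combines an offline-RL actor objective computed on mixed teacher/student data with a kickstarting-style distillation term. It assumes hypothetical policy modules that map observations to torch distributions and a critic callable; all names are illustrative.

```python
import torch

def bridged_actor_loss(student, teacher, critic, batch, distill_weight=1.0):
    """Hypothetical sketch: offline-RL actor loss on data from both teacher
    and student rollouts, plus a kickstarting term that pulls the student
    towards the teacher's action distribution."""
    obs = batch["obs"]                       # observations from teacher AND student episodes
    student_dist = student(obs)              # assumed to return a torch.distributions object
    teacher_dist = teacher(obs)

    # Offline-RL term: prefer actions the learned critic scores highly.
    rl_loss = -critic(obs, student_dist.rsample()).mean()

    # Kickstarting term: KL(teacher || student) averaged over the batch.
    distill_loss = torch.distributions.kl_divergence(teacher_dist, student_dist).mean()

    return rl_loss + distill_weight * distill_loss
```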


Beyond Pick-and-Place: Tackling Robotic Stacking of Diverse Shapes

October 2021 · 274 Reads

We study the problem of robotic stacking with objects of complex geometry. We propose a challenging and diverse set of such objects that was carefully designed to require strategies beyond a simple "pick-and-place" solution. Our method is a reinforcement learning (RL) approach combined with vision-based interactive policy distillation and simulation-to-reality transfer. Our learned policies can efficiently handle multiple object combinations in the real world and exhibit a large variety of stacking skills. In a large experimental study, we investigate what choices matter for learning such general vision-based agents in simulation, and what affects optimal transfer to the real robot. We then leverage data collected by such policies and improve upon them with offline RL. A video and a blog post of our work are provided as supplementary material.
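
The abstract mentions vision-based interactive policy distillation. One common instantiation of that idea is a DAgger-style loop in which the vision-based student collects data and a privileged teacher relabels the visited states; the sketch below shows only that generic pattern, with a hypothetical environment interface, and is not the paper's exact procedure.

```python
def interactive_distillation(env, teacher_act, student_act, train_step,
                             n_rounds=10, horizon=200):
    """DAgger-style sketch: the student drives data collection from images,
    the privileged teacher provides action labels on the visited states.
    The reset/step interface returning both an image and a privileged state
    is an assumption for illustration."""
    buffer = []
    for _ in range(n_rounds):
        obs, state = env.reset()                       # obs: image, state: privileged sim state
        for _ in range(horizon):
            buffer.append((obs, teacher_act(state)))   # teacher labels the student's states
            obs, state, done = env.step(student_act(obs))
            if done:
                break
        train_step(buffer)                             # supervised update of the student policy
    return student_act
```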


Figure 4: Left: Ablation of different number of grasp pretraining samples (Npt) and navigation reward relabeling (described in subsection 4.2) in simulation. Increased pretraining of the grasping policy increases overall mobile manipulation performance. Center: Analysis to find the best parameters for the autonomous curriculum compared to the stationary curriculum. With the proper settings (Nstart|Nstop|Nmax) the autonomous curriculum can be almost as efficient as the stationary, without requiring as much manual effort. Right: Ablation of ReLMM, all use the stationary curriculum and no relabeling except for Ours (AutoCurr). The Single Policy does not use our grasp/navigation decomposition and fails to learn in a reasonable time. Pretrain Only freezes the mediocre grasping policy from the stationary phase and only trains the navigation. Removing the grasp policy's Uncertainty Bonus reduces performance due to lack of exploration. Learned perturbation using RND [47] like [2] does not improve performance in this simulation environment.
Figure 5: Simulation task is to collect the green balls.
Figure 9: Evaluation of 5 different checkpoints of the ReLMM training in the obst+rugs room, training shows nearly linear improvement (the stationary pretraining phase is marked with a dashed line). Scripted policy does not improve over time.
ReLMM: Practical RL for Learning Mobile Manipulation Skills Using Only Onboard Sensors

July 2021 · 34 Reads

In this paper, we study how robots can autonomously learn skills that require a combination of navigation and grasping. Learning robotic skills in the real world remains challenging without large-scale data collection and supervision. Our aim is to devise a robotic reinforcement learning system for learning navigation and manipulation together, in an autonomous way without human intervention, enabling continual learning under realistic assumptions. Specifically, our system, ReLMM, can learn continuously on a real-world platform without any environment instrumentation, without human intervention, and without access to privileged information, such as maps, object positions, or a global view of the environment. Our method employs a modularized policy with components for manipulation and navigation, where uncertainty over the manipulation success drives exploration for the navigation controller, and the manipulation module provides rewards for navigation. We evaluate our method on a room cleanup task, where the robot must navigate to and pick up items scattered on the floor. After a grasp curriculum training phase, ReLMM can learn navigation and grasping together fully automatically, in around 40 hours of real-world training.
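
The coupling between the two modules can be made concrete with a toy sketch: grasp success supplies the navigation reward, and disagreement among grasp-success predictors supplies an exploration bonus. The ensemble-based uncertainty estimate and the numbers below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def navigation_reward(grasp_succeeded: bool) -> float:
    """The manipulation module provides the navigation reward: a location
    is worth visiting if a grasp attempted there succeeds."""
    return 1.0 if grasp_succeeded else 0.0

def uncertainty_bonus(ensemble_success_probs: np.ndarray) -> float:
    """Illustrative uncertainty bonus: an ensemble of grasp-success
    predictors disagrees where the robot has little grasping experience,
    and that disagreement is added to the navigation objective."""
    return float(np.std(ensemble_success_probs))

# High disagreement (little grasping experience) yields a larger bonus.
print(uncertainty_bonus(np.array([0.1, 0.9, 0.5])))   # ~0.33
print(uncertainty_bonus(np.array([0.8, 0.8, 0.8])))   # 0.0
```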



Modularity Improves Out-of-Domain Instruction Following

October 2020 · 9 Reads

We propose a modular architecture for following natural language instructions that describe sequences of diverse subgoals, such as navigating to landmarks or picking up objects. Standard, non-modular, architectures used in instruction following do not exploit subgoal compositionality and often struggle on out-of-distribution tasks and environments. In our approach, subgoal modules each carry out natural language instructions for a specific subgoal type. A sequence of modules to execute is chosen by learning to segment the instructions and predicting a subgoal type for each segment. When compared to standard sequence-to-sequence approaches on ALFRED, a challenging instruction following benchmark, we find that modularization improves generalization to environments unseen in training and to novel tasks.
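
The control flow described above, segmenting the instruction, predicting a subgoal type per segment, and dispatching each segment to the corresponding module, can be sketched as follows; the function names and module types are illustrative placeholders, not the paper's API.

```python
from typing import Callable, Dict, List, Tuple

def execute_instruction(
    instruction: str,
    segment_and_label: Callable[[str], List[Tuple[str, str]]],  # learned segmenter + subgoal-type predictor
    modules: Dict[str, Callable[[str], None]],                  # one policy module per subgoal type
) -> None:
    """Dispatch each instruction segment to the module for its predicted
    subgoal type, e.g. modules["navigate"]("go to the kitchen table")."""
    for segment, subgoal_type in segment_and_label(instruction):
        modules[subgoal_type](segment)
```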


Self-Supervised Goal-Conditioned Pick and Place

August 2020 · 16 Reads

Robots have the capability to collect large amounts of data autonomously by interacting with objects in the world. However, it is often not obvious how to learn from autonomously collected data without human-labeled supervision. In this work we learn pixel-wise object representations from unsupervised pick and place data that generalize to new objects. We introduce a novel framework for using these representations to predict where to pick and where to place in order to match a goal image. Finally, we demonstrate the utility of our approach in a simulated grasping environment.
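
One simple way to read "predict where to pick and where to place to match a goal image" is to compare pixel-wise descriptors of the current and goal images. The sketch below is an illustrative nearest-descriptor heuristic under that reading, not the paper's actual framework; the changed-pixel mask and descriptor arrays are assumed inputs.

```python
import numpy as np

def predict_pick_and_place(curr_feats: np.ndarray,   # (H, W, D) descriptors of the current image
                           goal_feats: np.ndarray,   # (H, W, D) descriptors of the goal image
                           changed_mask: np.ndarray  # (H, W) bool, pixels that differ between the two
                           ):
    """Illustrative heuristic: place at the centre of the changed region in
    the goal image, and pick at the current-image pixel whose descriptor
    best matches the object sitting at that goal location."""
    ys, xs = np.nonzero(changed_mask)
    place_yx = (int(ys.mean()), int(xs.mean()))
    target = goal_feats[place_yx]                               # descriptor of the object at its goal pose
    dists = np.linalg.norm(curr_feats - target, axis=-1)        # locate that object in the current image
    pick_yx = np.unravel_index(int(np.argmin(dists)), dists.shape)
    return pick_yx, place_yx
```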


Adapting Deep Visuomotor Representations with Weak Pairwise Constraints

May 2020 · 66 Reads · 96 Citations

Real-world robotics problems often occur in domains that differ significantly from the robot’s prior training environment. For many robotic control tasks, real world experience is expensive to obtain, but data is easy to collect in either an instrumented environment or in simulation. We propose a novel domain adaptation approach for robot perception that adapts visual representations learned on a large easy-to-obtain source dataset (e.g. synthetic images) to a target real-world domain, without requiring expensive manual data annotation of real world data before policy search. Supervised domain adaptation methods minimize cross-domain differences using pairs of aligned images that contain the same object or scene in both the source and target domains, thus learning a domain-invariant representation. However, they require manual alignment of such image pairs. Fully unsupervised adaptation methods rely on minimizing the discrepancy between the feature distributions across domains. We propose a novel, more powerful combination of both distribution and pairwise image alignment, and remove the requirement for expensive annotation by using weakly aligned pairs of images in the source and target domains. Focusing on adapting from simulation to real world data using a PR2 robot, we evaluate our approach on a manipulation task and show that by using weakly paired images, our method compensates for domain shift more effectively than previous techniques, enabling better robot performance in the real world.
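
The combination described above, a distribution-alignment term plus a pairwise term on weakly aligned image pairs, can be written as a two-term loss. The sketch below uses a simple mean-feature discrepancy as a placeholder for the distribution term; the paper's specific alignment losses may differ.

```python
import torch

def adaptation_loss(source_feats: torch.Tensor,   # (B, D) features from unpaired source images
                    target_feats: torch.Tensor,   # (B, D) features from unpaired target images
                    paired_source: torch.Tensor,  # (P, D) features of weakly aligned source images
                    paired_target: torch.Tensor,  # (P, D) features of their weakly aligned target partners
                    pair_weight: float = 1.0) -> torch.Tensor:
    """Two-term sketch: align the feature distributions of the two domains,
    and additionally pull together the features of weakly aligned pairs."""
    distribution_term = (source_feats.mean(0) - target_feats.mean(0)).pow(2).sum()
    pairwise_term = (paired_source - paired_target).pow(2).sum(dim=1).mean()
    return distribution_term + pair_weight * pairwise_term
```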


SMiRL: Surprise Minimizing RL in Dynamic Environments

December 2019 · 36 Reads

All living organisms struggle against the forces of nature to carve out niches where they can maintain homeostasis. We propose that such a search for order amidst chaos might offer a unifying principle for the emergence of useful behaviors in artificial agents. We formalize this idea into an unsupervised reinforcement learning method called surprise minimizing RL (SMiRL). SMiRL trains an agent with the objective of maximizing the probability of observed states under a model trained on previously seen states. The resulting agents can acquire proactive behaviors that seek out and maintain stable conditions, such as balancing and damage avoidance, that are closely tied to an environment's prevailing sources of entropy, such as wind, earthquakes, and other agents. We demonstrate that our surprise minimizing agents can successfully play Tetris and Doom, control a humanoid to avoid falls, and navigate to escape enemy agents, without any task-specific reward supervision. We further show that SMiRL can be used together with a standard task reward to accelerate reward-driven learning.
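
The SMiRL objective, rewarding the agent with the log-likelihood of the current state under a model fit to previously seen states, can be illustrated with a diagonal-Gaussian state model. This is a minimal sketch of that idea under that modelling assumption, not the paper's full method, which also covers other density models and image observations.

```python
import numpy as np

class GaussianSMiRLReward:
    """Minimal sketch of the SMiRL reward with an independent-Gaussian
    state model: the reward is the log-likelihood of the current state
    under a model fit to all previously seen states."""

    def __init__(self, dim: int):
        self.states = [np.zeros(dim)]            # buffer of previously seen states

    def reward(self, state: np.ndarray) -> float:
        data = np.stack(self.states)
        mean, std = data.mean(0), data.std(0) + 1e-3
        # log p(state) under a diagonal Gaussian fit to the visited states
        logp = -0.5 * np.sum(((state - mean) / std) ** 2 + np.log(2 * np.pi * std ** 2))
        return float(logp)

    def update(self, state: np.ndarray) -> None:
        self.states.append(state)                # grow the model's dataset with the new state
```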


Citations (11)


... Additionally, [23], [24] perform model-based planning in a learned latent space to tackle block construction tasks from pixels. Recently, [25] proposed a new manipulation benchmark involving picking and placing objects of complex geometries, adopted by several [26], [27]. However, in this work, we consider block construction tasks of regular shapes, taken from the BulletArm benchmark [28]. ...

Reference:

Learning from Pixels with Expert Observations
How to Spend Your Robot Time: Bridging Kickstarting and Offline Reinforcement Learning for Vision-based Robotic Manipulation
  • Citing Conference Paper
  • October 2022

... For compositional generalization tasks, it is a good practice to enhance the primary task by addressing the sub-tasks (Corona et al., 2021; Bursztyn et al., 2022). In the context of dish name prediction, there are three sub-tasks: recognizing the food / action / flavor components. ...

Modular Networks for Compositional Instruction Following
  • Citing Conference Paper
  • January 2021

... The central concept is to modify simulated environments using real-world samples [24]. In [25], a novel domain adaptation approach was proposed for robot perception to close the reality gap between simulation and the real world by searching common features of synthetic and real data. In [17], an end-to-end pipeline was developed to generate realistic depth data from 3D simulation models by accurately modelling vital real-world factors such as sensor noise and surface geometry. ...

Adapting Deep Visuomotor Representations with Weak Pairwise Constraints
  • Citing Chapter
  • May 2020

... There are textual-based explanation methods [15], [16], [17], [18] and visual-based explanation methods [19], [20], [21], [22], [23], [24]: the former generate natural language to explain why the driving model performs a specific driving action, while the latter use visual information, i.e., images, to offer intuitive explanations. (Fig. 1: (i) is the explanation generated by a traditional E2EDM, which suffers from a low objectification degree, i.e., the extent to which driving-related object features are utilized; (ii) is the explanation generated by a ROB-integrated E2EDM [29], which can have a high objectification degree but suffers from a low simplification degree, i.e., the simplicity of the explanation, and overly complex explanations tend to confuse and deceive humans, who are then unable to recognize the precise cause responsible for the prediction; (iii) is the explanation generated by our proposed SOB-integrated E2EDM, which has higher simplification and objectification degrees that could improve the explainability of E2EDMs.) Compared to textual-based explanations, visual-based explanations have the advantage in time-critical tasks, such as driving; thus, in this paper, we focus on visual-based explanations. ...

Monocular Plan View Networks for Autonomous Driving
  • Citing Conference Paper
  • November 2019

... The concept of object-centric representation has long been recognized for its potential to enhance robotic perception and manipulation by focusing on the objects within a scene. Prior works have shown the effectiveness of such representations in downstream manipulation tasks by factorizing visual scenes into disentangled object concepts [46,47,48,49,50], but these works are typically confined to known object categories or instances. Recent developments in foundation models allow robots to access the open-world object concepts through pre-trained vision models [13,14], enabling a wide range of abilities such as imitation of long-horizon tabletop manipulation [5,51] or mobile manipulation in the wild [52]. ...

Deep Object-Centric Policies for Autonomous Driving
  • Citing Conference Paper
  • May 2019

... Detection Transformer. Taking into consideration the object-centric nature of an interaction process [13], we introduce a detection query q_det ∈ R^d to locate the interaction object, as highlighted in purple in Fig. 2. To expedite training convergence, we initialize the detection query with the aggregated visual embedding ṽ. It is noteworthy that the detection query is designed to regress the object location mainly based on the information inferred from the prediction queries q_pred, which we expect to contain comprehensive information about the target state. ...

Deep Object-Centric Representations for Generalizable Robot Learning
  • Citing Conference Paper
  • May 2018

... Luo et al. [37] go even further and propose a RFID-based localization framework that allows to accurately track objects. Object-centric representations, learned from visual data, can serve as a powerful means to understand physical agent and object interactions [62,10]. As a recent example, Janner et al. [22] propose an object-oriented prediction and planning approach to model physical object interactions for stacking tasks. ...

Deep Object-Centric Representations for Generalizable Robot Learning
  • Citing Article
  • August 2017

... Robot Transfer Learning Transfer learning has long been considered as a primary challenge in robotics [78,79]. In the context of reinforcement learning, many previous work transfer different components such as policies [12,46,16], parameters [18,40,13], features [3,22], experience samples [48], value functions [53,89,80], and reward functions [45]. In imitation learning, additional studies have made progress via domain adaptation [44,86], querying unlabeled datasets [14], abstracting and transferring concepts [49,75], or conditioning on other information such as language instructions [77,54,37] and goal images [64]. ...

Learning modular neural network policies for multi-task and multi-robot transfer
  • Citing Conference Paper
  • May 2017

... From a more robotic perspective, skill acquisition is an explicit procedure for grounding task-specific domain-agnostic features for easy transfer. With paired data from both domains, agents learn multiple skills and transfer knowledge by training in invariant feature spaces, upon which target domain agents can acquire new skills mastered by source domain agents (Gupta et al., 2017). In a hierarchical scheme, STAR (Pertsch et al., 2022) pre-trains a low-level policy to decode actions from learned high-level semantic skill policies that select a transferable skill in target task learning. ...

Learning Invariant Feature Spaces to Transfer Skills with Reinforcement Learning
  • Citing Article
  • March 2017

... ii) transferring: the learned MTRL policy could serve as a starting point to be transferred to a new task. While the first point has been extensively explored in MTRL literature [3], [8], [9], [10], [11], [7], [12], [13], the second one is less investigated [14], [15], [16], potentially due to many practical challenges in MTRL itself, e.g., conflicts between tasks [9] and training stability [12], [6], limiting its effectiveness on transferring. ...

Learning Modular Neural Network Policies for Multi-Task and Multi-Robot Transfer
  • Citing Article
  • September 2016