We propose a two-phase method for learning safe control in fully controllable robotic manipulation tasks. In the first phase, the state-space support of supervisor demonstrations is estimated to infer implicit constraints. In the second phase, we present a switching policy to prevent the robot from leaving safe states. The policy switches between the robot’s learned policy and a novel failure-avoidance policy depending on the distance to the boundary of the support. We prove that inferred constraints are guaranteed to be enforced if the support is well-estimated. A simulated pushing task suggests that support estimation and failure avoidance control can reduce failures by 87% while sacrificing only 40% of performance. On a line tracking task using a da Vinci surgical robot, failure avoidance control reduced failures by 84%.
We identify abstractions of algorithmic components that are common in successful reinforcement learning (RL) algorithms. We argue that these components form a convenient standard modularization of RL workflows, which encapsulates parallelism and resource requirements within short-running compute tasks, and provides interfaces for easy development of new algorithms and reuse of existing implementations. We offer RLlib, a library of scalable RL primitives built on Ray, as demonstrated by its state-of-the-art results for popular and cutting-edge RL algorithms.
We propose and evaluate a two-phase calibration process for cable-driven robots. In Phase I, the robot performs a set of open-loop trajectories and learns a nonlinear transformation from camera pixels to joint values. In Phase II, the robot uses Phase I to move systematically to target points in a printed array, where a human operator manually adjusts the end-effector position. Phase I can reduce average error from 4.55mm to 2.14mm, and together with Phase II to 1.08mm. We apply this combination to clear away raisins and pumpkin seeds with success rate up to 99.2%, exceeding prior results and more than twice faster.
We explore how the Toyota HSR mobile manipulator robot can autonomously adjust the screw-cap on drip emitters for precision irrigation. As the built-in sensors do not provide sufficient accuracy for gripper alignment, we designed a lightweight, modular Emitter Localization Device (ELD) with cameras and LEDs that can be non-invasively mounted on the gripper. This paper presents details on the design, algorithms, and experiments in bringing the gripper into alignment with a sequence of 9 emitters using a two-phase procedure: 1) aligning the robot base using the built-in hand camera, and 2) aligning the gripper axis with the emitter axis using the ELD. Experiments suggest that each emitter can be reliably adjusted in under 20 seconds.
The main challenges that set algorithmic domains apart from other imitation learning domains are the need for high accuracy, the involvement of specific structures of data, and the extremely limited observability. To address these challenges, we propose to model programs as Parametrized Hierarchical Procedures (PHPs). A PHP is a sequence of conditional operations, that uses a program counter, along with the observation, to select between taking an elementary action, invoking another PHP as a sub-procedure, and returning to the caller. We develop an algorithm for training PHPs from a mixture of annotated and unannotated demonstrations, and apply it to efficient level-wise training of multi-level PHPs. We show in two benchmarks, NanoCraft and long-hand addition, that PHPs can learn neural programs more accurately from smaller amounts of strong and weak supervision.
Discovery of Deep Continuous Options (DDCO) learns from demonstrations low-level continuous control skills parametrized by deep neural networks. A hybrid categorical–continuous distribution model parametrizes high-level policies that can invoke discrete options as well continuous control actions, and a cross-validation method tunes the number of options to be discovered. We evaluate DDCO in simulation of a 3-link robot, and in two physical experiments on the da Vinci surgical robot.
* Equal contribution
DART injects noise during supervisor demonstrations, with a noise level optimized to approximate the error of the robot’s trained policy. This forces the supervisor to demonstrate how to recover from errors, and doesn’t suffer from supervisor burden, computational cost, and training risk as much as on-policy methods. DART learns better policies faster and from fewer demonstrations.
Bilateral Iterated Best Response (BIBR) reduces supervisor burden when training multiple manipulators, by rolling out an estimated robot policy for one arm while the human demonstrates for the other, iteratively updating the estimated policy. BIBR learns policies with increased success rate, and shorter and smoother trajectories.
We explore how characterizing supervisor inconsistency and correcting for this noise can improve task performance with a limited budget of data. In a planar part extraction task where human operators provide demonstrations by teleoperating a 2DOF robot, CNN models perform better when trained after error corrections.
We consider the problem of controlling a linear system with Gaussian noise and quadratic cost (LQG), using a memoryless controller that has limited capacity of the channel connecting its sensor to its actuator. We formulate this setting as a sequential rate-distortion (SRD) problem, where we minimize the rate of information required for the controller’s operation, under a constraint on its external cost. We present the optimality principle, and study the interesting and useful phenomenology of the optimal controller, such as the principled reduction of its order.
We consider the case where the controller is retentive (memory-utilizing). We can view the memory reader as one more sensor, and the memory writer as one more actuator. We can then formulate the problem of control under communication limitations, again as a sequential rate-distortion (SRD) problem of minimizing the rate of information required for the controller’s operation, under a constraint on its external cost. We show that this problem can be reduced to the memoryless case, studied in Part I. We then further investigate the form of the resulting optimal solution, and demonstrate its interesting phenomenology.
We characterize a good set of prior options as the centroids of clusters of control options that are optimized for a set of subtasks. We formulate this insight as an optimization problem and derive an optimization algorithm that alternates between planning given the set of prior options and clustering the set of control options. We illustrate this approach in a simple two-room simulation.
* Equal contribution
We identify a shortcoming of off-policy reinforcement learning algorithms, in which the optimization over noisy estimates introduces bias during updates. We propose G-learning, a new off-policy learning algorithm that regularizes the updates by introducing an informational cost. We show how these soft updates reduce the accumulation of bias and lead to faster convergence. We discuss additional benefits of G-learning, such as the ability to naturally incorporate any available prior domain knowledge, and to avoid some exploration costs. We illustrate its benefits in several examples where G-learning results in significant improvements of the learning rate and the learning cost.
* Equal contribution
We model the process by which the brain and the computer in a brain-computer interface (BCI) co-adapt to one another. We show in this simplified Linear-Quadratic-Gaussian (LQG) model how the brain’s neural encoding can adapt to the task of controlling the computer, at the same time that the computer’s adaptive decoder can adapt to the task of estimating the intention signal, leading to improvement in the system’s performance. We then propose an encoder-aware decoder adaptation scheme, which allows the computer to drive improvement forward faster by anticipating the brain’s adaptation.
* Equal contribution
We define Passive POMDPs, where actions do not affect the world state, but still incur costs. We present a variational principle for the problem of maintaining in memory the information state that is most useful for minimizing the cost, leading to a trade-off between memory and sensing, similar to multi-terminal source coding. We analyze the problem as an equivalent joint-state MDP, and introduce an efficient and simple algorithm for finding an optimum.
We define Only-Costly-Observable Markov Decision Processes (OCOMDPs), an interesting subclass of POMDPs, where the state is unobservable, except through a costly action that completely reveals the state. This is an extention of Unobservable MDPs, where planning is known to be NP-complete. Despite this computational complexity, we give an algorithm for PAC-learning OCOMDPs with polynomial interaction complexity, given a planning oracle.
Symposia and Workshops
We propose to pre-train perceptual embeddings for reinforcement learning, on the task of predicting future rewards for various action sequences. Unlike autoencoders and state-predictive embeddings, the proposed embedding is not concerned with reconstructing irrelevant perceptual features, such as raw pixels. We demonstrate that the learned embedding can reconstruct task-relevant state features and improve learning performance.
We argue for building composable RL components by encapsulating parallelism and resource requirements within individual components, which can be achieved by building on top of a flexible task-based programming model. We demonstrate this principle by building Ray RLlib on top of Ray and show that we can implement a wide range of state-of-the-art algorithms by composing and reusing a handful of standard components. This composability does not come at the cost of performance — in our experiments, RLlib matches or exceeds the performance of highly optimized reference implementations.
* Equal contribution
We formulate the problem of optimizing an agent under both extrinsic and intrinsic constraints on its operation in a dynamical system and develop the main tools for solving it. We identify the challenging convergence properties of the optimization algorithm, such as the bifurcation structure of the update operator near phase transitions. We study the special case of linear-Gaussian dynamics and quadratic cost (LQG), where the optimal solution has a particularly simple and solvable form. We also explore the learning task, where the model of the world dynamics is unknown and sample-based updates are used instead.
We study the interaction complexity of several partially observable decision processes. We show a simple subclass of POMDPs which are learnable, but require exponential interaction complexity. We discuss the previously studied OPOMDPs, and show they can be PAC-learnable in polynomial interaction complexity. We then define a more general and complex subclass of POMDPs, which we call OCOMDPs, and give an algorithm for PAC-learning them with polynomial interaction complexity, given a planning oracle.
Existing techniques for automated option discovery do not scale to multi-level hierarchies and to expressive representations such as deep networks. We present Discovery of Deep Options (DDO), a policy-gradient algorithm that discovers parametrized options from a set of demonstration trajectories, and can be used recursively to discover additional levels of the hierarchy. We show that DDO is effective in adding options that accelerate learning in 4 out of 5 Atari RAM environments chosen in our experiments. We also show that DDO can discover structure in robot-assisted surgical videos and kinematics that match expert annotation with 72% accuracy.
* Equal contribution
We present the minimum-information principle for selective attention in reactive agents. We motivate this approach by reducing the general problem of optimal control in POMDPs, to reactive control with complex observations. We introduce a forward-backward algorithm for finding optimal selective-attention policies, and illustrate it with several examples. Finally, we analyze and explore the newly discovered phenomenon of period doubling bifurcations in this optimization process.