We argue for building composable RL components by encapsulating parallelism and resource requirements within individual components, which can be achieved by building on top of a flexible task-based programming model. We demonstrate this principle by building Ray RLlib on top of Ray and show that we can implement a wide range of state-of-the-art algorithms by composing and reusing a handful of standard components. This composability does not come at the cost of performance — in our experiments, RLlib matches or exceeds the performance of highly optimized reference implementations.
Eric Liang*, Richard Liaw*, Robert Nishihara, Philipp Moritz, Roy Fox, Joseph Gonzalez, Ken Goldberg, and Ion Stoica, DeepRL Symposium, NIPS 2017 * Equal contribution
We characterize a good set of prior options as the centroids of clusters of control options that are optimized for a set of subtasks. We formulate this insight as an optimization problem and derive an optimization algorithm that alternates between planning given the set of prior options and clustering the set of control options. We illustrate this approach in a simple two-room simulation.
Roy Fox*, Michal Moshkovitz*, and Naftali Tishby, EWRL 2016 * Equal contribution
We identify a shortcoming of off-policy reinforcement learning algorithms, in which the optimization over noisy estimates introduces bias during updates. We propose G-learning, a new off-policy learning algorithm that regularizes the updates by introducing an informational cost. We show how these soft updates reduce the accumulation of bias and lead to faster convergence. We discuss additional benefits of G-learning, such as the ability to naturally incorporate any available prior domain knowledge, and to avoid some exploration costs. We illustrate its benefits in several examples where G-learning results in significant improvements of the learning rate and the learning cost.
Roy Fox*, Ari Pakman*, and Naftali Tishby, UAI 2016 * Equal contribution
We define Passive POMDPs, where actions do not affect the world state, but still incur costs. We present a variational principle for the problem of maintaining in memory the information state that is most useful for minimizing the cost, leading to a trade-off between memory and sensing, similar to multi-terminal source coding. We analyze the problem as an equivalent joint-state MDP, and introduce an efficient and simple algorithm for finding an optimum.
We define Only-Costly-Observable Markov Decision Processes (OCOMDPs), an interesting subclass of POMDPs, where the state is unobservable, except through a costly action that completely reveals the state. This is an extention of Unobservable MDPs, where planning is known to be NP-complete. Despite this computational complexity, we give an algorithm for PAC-learning OCOMDPs with polynomial interaction complexity, given a planning oracle.
We formulate the problem of optimizing an agent under both extrinsic and intrinsic constraints on its operation in a dynamical system and develop the main tools for solving it. We identify the challenging convergence properties of the optimization algorithm, such as the bifurcation structure of the update operator near phase transitions. We study the special case of linear-Gaussian dynamics and quadratic cost (LQG), where the optimal solution has a particularly simple and solvable form. We also explore the learning task, where the model of the world dynamics is unknown and sample-based updates are used instead.
We study the interaction complexity of several partially observable decision processes. We show a simple subclass of POMDPs which are learnable, but require exponential interaction complexity. We discuss the previously studied OPOMDPs, and show they can be PAC-learnable in polynomial interaction complexity. We then define a more general and complex subclass of POMDPs, which we call OCOMDPs, and give an algorithm for PAC-learning them with polynomial interaction complexity, given a planning oracle.
Existing techniques for automated option discovery do not scale to multi-level hierarchies and to expressive representations such as deep networks. We present Discovery of Deep Options (DDO), a policy-gradient algorithm that discovers parametrized options from a set of demonstration trajectories, and can be used recursively to discover additional levels of the hierarchy. We show that DDO is effective in adding options that accelerate learning in 4 out of 5 Atari RAM environments chosen in our experiments. We also show that DDO can discover structure in robot-assisted surgical videos and kinematics that match expert annotation with 72% accuracy.
Roy Fox*, Sanjay Krishnan*, Ion Stoica, and Ken Goldberg, arXiv, 2017 * Equal contribution
We present the minimum-information principle for selective attention in reactive agents. We motivate this approach by reducing the general problem of optimal control in POMDPs, to reactive control with complex observations. We introduce a forward-backward algorithm for finding optimal selective-attention policies, and illustrate it with several examples. Finally, we analyze and explore the newly discovered phenomenon of period doubling bifurcations in this optimization process.