We identify a shortcoming of off-policy reinforcement learning algorithms, in which the optimization over noisy estimates introduces bias during updates. We propose G-learning, a new off-policy learning algorithm that regularizes the updates by introducing an informational cost. We show how these soft updates reduce the accumulation of bias and lead to faster convergence. We discuss additional benefits of G-learning, such as the ability to naturally incorporate any available prior domain knowledge, and to avoid some exploration costs. We illustrate its benefits in several examples where G-learning results in significant improvements of the learning rate and the learning cost.
Roy Fox*, Ari Pakman*, and Naftali Tishby, UAI 2016 * Equal contribution
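The soft update described in the G-learning abstract above can be illustrated with a minimal tabular sketch. It assumes a cost-based formulation (lower is better), a fixed prior policy over actions, and a fixed inverse-temperature `beta` that sets how strongly the informational cost is weighted. The function names, the table layout, and the fixed `beta` are illustrative assumptions and are not taken from the abstract. The idea is that the hard minimum over next-state action values is replaced by a prior-weighted log-sum-exp, which approaches the prior-weighted average for small `beta` and the hard minimum for large `beta`.

```python
import math


def soft_value(g_row, prior_row, beta):
    """Soft cost-to-go: -(1/beta) * log(sum_a prior(a) * exp(-beta * G(a)))."""
    # Shift by the minimum so the exponentials cannot overflow.
    g_min = min(g_row)
    total = sum(p * math.exp(-beta * (g - g_min))
                for g, p in zip(g_row, prior_row))
    return g_min - math.log(total) / beta


def g_update(G, prior, s, a, cost, s_next, alpha, beta, gamma):
    """One soft temporal-difference update of the table entry G[s][a].

    G and prior are lists of per-state lists indexed by action (hypothetical
    layout); alpha is the learning rate and gamma the discount factor.
    """
    target = cost + gamma * soft_value(G[s_next], prior[s_next], beta)
    G[s][a] = (1 - alpha) * G[s][a] + alpha * target
    return G[s][a]


if __name__ == "__main__":
    G = [[0.0, 0.0], [1.0, 2.0]]
    prior = [[0.5, 0.5], [0.5, 0.5]]
    # Small beta: close to the prior-weighted average of the next-state values.
    print(soft_value(G[1], prior[1], beta=1e-6))
    # Large beta: close to the hard minimum used by a Q-learning style update.
    print(soft_value(G[1], prior[1], beta=1000.0))
    print(g_update(G, prior, s=0, a=1, cost=1.0, s_next=1,
                   alpha=0.5, beta=1000.0, gamma=0.9))
```

Keeping `beta` small early in learning, when the value estimates are noisiest, keeps the update close to an average rather than a minimum over noisy entries, which is the kind of bias reduction the abstract refers to. How `beta` should change over time is left open in this sketch.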
We model the process by which the brain and the computer in a brain-computer interface (BCI) co-adapt to one another. We show in this simplified Linear-Quadratic-Gaussian (LQG) model how the brain’s neural encoding can adapt to the task of controlling the computer, at the same time that the computer’s adaptive decoder can adapt to the task of estimating the intention signal, leading to improvement in the system’s performance. We then propose an encoder-aware decoder adaptation scheme, which allows the computer to drive improvement forward faster by anticipating the brain’s adaptation.
Josh Merel*, Roy Fox*, Tony Jebara, and Liam Paninski, NIPS 2013 * Equal contribution
We define Only-Costly-Observable Markov Decision Processes (OCOMDPs), an interesting subclass of POMDPs, where the state is unobservable, except through a costly action that completely reveals the state. This is an extension of Unobservable MDPs, where planning is known to be NP-complete. Despite this computational complexity, we give an algorithm for PAC-learning OCOMDPs with polynomial interaction complexity, given a planning oracle.
We formulate the problem of optimizing an agent under both extrinsic and intrinsic constraints on its operation in a dynamical system and develop the main tools for solving it. We identify the challenging convergence properties of the optimization algorithm, such as the bifurcation structure of the update operator near phase transitions. We study the special case of linear-Gaussian dynamics and quadratic cost (LQG), where the optimal solution has a particularly simple and solvable form. We also explore the learning task, where the model of the world dynamics is unknown and sample-based updates are used instead.
We study the interaction complexity of several partially observable decision processes. We present a simple subclass of POMDPs that is learnable but requires exponential interaction complexity. We discuss the previously studied OPOMDPs, and show that they can be PAC-learned with polynomial interaction complexity. We then define a more general and complex subclass of POMDPs, which we call OCOMDPs, and give an algorithm for PAC-learning them with polynomial interaction complexity, given a planning oracle.