The main challenges that set algorithmic domains apart from other imitation learning domains are the need for high accuracy, the involvement of specific structures of data, and the extremely limited observability. To address these challenges, we propose to model programs as Parametrized Hierarchical Procedures (PHPs). A PHP is a sequence of conditional operations that uses a program counter, together with the observation, to choose among taking an elementary action, invoking another PHP as a sub-procedure, and returning to the caller. We develop an algorithm for training PHPs from a mixture of annotated and unannotated demonstrations, and apply it to efficient level-wise training of multi-level PHPs. We show on two benchmarks, NanoCraft and long-hand addition, that PHPs can learn neural programs more accurately from smaller amounts of strong and weak supervision.
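
The control flow of a PHP described above can be illustrated with a small interpreter. This is only a sketch under stated assumptions: the procedures here are hand-written functions rather than learned, parametrized ones, and the names `Act`, `Call`, `Return`, and `run` are hypothetical and do not come from the paper. It also assumes that the program counter of a procedure advances by one after each of its operations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class Act:
    action: str  # elementary action to emit


@dataclass
class Call:
    procedure: str  # name of the sub-procedure to invoke


@dataclass
class Return:
    pass  # hand control back to the caller


Op = Union[Act, Call, Return]
# Assumption: a procedure maps (program counter, observation) to one operation.
Procedure = Callable[[int, int], Op]


def run(procedures: Dict[str, Procedure], root: str,
        observe: Callable[[], int], max_steps: int = 1000) -> List[str]:
    """Execute `root` with an explicit call stack; return the emitted actions."""
    stack = [[root, 0]]  # frames of [procedure name, program counter]
    trace: List[str] = []
    for _ in range(max_steps):
        if not stack:
            break  # the root procedure has returned
        name, pc = stack[-1]
        op = procedures[name](pc, observe())
        stack[-1][1] += 1  # advance the current frame's counter
        if isinstance(op, Act):
            trace.append(op.action)
        elif isinstance(op, Call):
            stack.append([op.procedure, 0])  # callee starts at counter 0
        else:
            stack.pop()  # Return: resume the caller
    return trace


# Toy two-level hierarchy (illustrative only).
def main_proc(pc: int, obs: int) -> Op:
    return Call("step") if pc < 2 else Return()


def step_proc(pc: int, obs: int) -> Op:
    if pc == 0:
        return Act("move" if obs == 0 else "turn")
    return Return()


demo_trace = run({"main": main_proc, "step": step_proc}, "main", lambda: 0)
```

In this toy run, `main` invokes `step` twice and each invocation emits one action before returning, so the hierarchy is visible in the call stack rather than in the flat action trace.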

# Publications tagged "Partial observability"

### Conferences

## Minimum-Information LQG Control — Part I: Memoryless Controllers

We consider the problem of controlling a linear system with Gaussian noise and quadratic cost (LQG), using a memoryless controller whose sensor-to-actuator channel has limited capacity. We formulate this setting as a sequential rate-distortion (SRD) problem, where we minimize the rate of information required for the controller’s operation, under a constraint on its external cost. We present the optimality principle, and study the interesting and useful phenomenology of the optimal controller, such as the principled reduction of its order.
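
One way to write such a problem down, as a rough sketch only, is shown here. The notation is assumed for illustration and is not taken from the paper: $x_t$ is the state, $o_t$ the sensor reading, $u_t$ the control, $A, B, C$ the system matrices, $Q, R$ the cost matrices, and $c$ the allowed cost level. The paper's exact rate and cost definitions may differ.

$$
\begin{aligned}
x_{t+1} &= A x_t + B u_t + w_t, \qquad o_t = C x_t + v_t, \qquad w_t, v_t \ \text{Gaussian},\\
\min_{\pi(u_t \mid o_t)} \ & I(o_t; u_t)
\quad \text{subject to} \quad
\mathbb{E}\!\left[ x_t^\top Q x_t + u_t^\top R u_t \right] \le c .
\end{aligned}
$$

Here the controller is memoryless in the sense that $u_t$ depends only on the current $o_t$, and the mutual information $I(o_t; u_t)$ stands in for the rate of the sensor-to-actuator channel.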

## Minimum-Information LQG Control — Part II: Retentive Controllers

We consider the case where the controller is retentive (memory-utilizing). We can view the memory reader as one more sensor, and the memory writer as one more actuator. We can then formulate the problem of control under communication limitations, again as a sequential rate-distortion (SRD) problem of minimizing the rate of information required for the controller’s operation, under a constraint on its external cost. We show that this problem can be reduced to the memoryless case, studied in Part I. We then further investigate the form of the resulting optimal solution, and demonstrate its interesting phenomenology.

## A Multi-Agent Control Framework for Co-Adaptation in Brain-Computer Interfaces

We model the process by which the brain and the computer in a brain-computer interface (BCI) co-adapt to one another. We show in this simplified Linear-Quadratic-Gaussian (LQG) model how the brain’s neural encoding can adapt to the task of controlling the computer, at the same time that the computer’s adaptive decoder can adapt to the task of estimating the intention signal, leading to improvement in the system’s performance. We then propose an encoder-aware decoder adaptation scheme, which allows the computer to drive improvement forward faster by anticipating the brain’s adaptation.

* Equal contribution

## Bounded Planning in Passive POMDPs

We define Passive POMDPs, where actions do not affect the world state, but still incur costs. We present a variational principle for the problem of maintaining in memory the information state that is most useful for minimizing the cost, leading to a trade-off between memory and sensing, similar to multi-terminal source coding. We analyze the problem as an equivalent joint-state MDP, and introduce an efficient and simple algorithm for finding an optimum.

## A Reinforcement Learning Algorithm with Polynomial Interaction Complexity for Only-Costly-Observable MDPs

We define Only-Costly-Observable Markov Decision Processes (OCOMDPs), an interesting subclass of POMDPs, where the state is unobservable, except through a costly action that completely reveals the state. This is an extension of Unobservable MDPs, where planning is known to be NP-complete. Despite this computational complexity, we give an algorithm for PAC-learning OCOMDPs with polynomial interaction complexity, given a planning oracle.
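
The observation structure of an OCOMDP can be made concrete with a toy environment. This is a minimal sketch, not the paper's formalism: the class `ToyOCOMDP`, the `OBSERVE` constant, and the choice that observing leaves the state unchanged are all assumptions made for illustration, and costs are accumulated inside the environment so that the sketch takes no position on whether the agent sees them.

```python
import random
from typing import Dict, List, Optional, Tuple

OBSERVE = "observe"  # hypothetical name for the costly, state-revealing action


class ToyOCOMDP:
    """Toy MDP whose state is hidden unless the agent pays to observe it."""

    def __init__(self,
                 transitions: Dict[Tuple[int, str], List[Tuple[int, float]]],
                 costs: Dict[Tuple[int, str], float],
                 observe_cost: float,
                 start: int,
                 seed: int = 0) -> None:
        self.transitions = transitions  # (state, action) -> [(next_state, prob)]
        self.costs = costs              # (state, action) -> cost
        self.observe_cost = observe_cost
        self.state = start
        self.total_cost = 0.0
        self.rng = random.Random(seed)

    def step(self, action: str) -> Optional[int]:
        """Apply an action; return the state only for OBSERVE, else None."""
        if action == OBSERVE:
            # Assumption: observing reveals the state and does not change it.
            self.total_cost += self.observe_cost
            return self.state
        self.total_cost += self.costs[(self.state, action)]
        next_states, probs = zip(*self.transitions[(self.state, action)])
        self.state = self.rng.choices(next_states, weights=probs)[0]
        return None  # ordinary actions yield no observation


# Two states; the single action "go" deterministically flips the state.
env = ToyOCOMDP(
    transitions={(0, "go"): [(1, 1.0)], (1, "go"): [(0, 1.0)]},
    costs={(0, "go"): 1.0, (1, "go"): 2.0},
    observe_cost=0.5,
    start=0,
)
blind = env.step("go")        # no observation is returned
revealed = env.step(OBSERVE)  # pays observe_cost and reveals the state
```

A learner interacting with such an environment has to decide when the information from `OBSERVE` is worth its cost, which is the trade-off the definition above introduces.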

### Symposia and Workshops

## Task-Relevant Embeddings for Robust Perception in Reinforcement Learning

We propose to pre-train perceptual embeddings for reinforcement learning, on the task of predicting future rewards for various action sequences. Unlike autoencoders and state-predictive embeddings, the proposed embedding is not concerned with reconstructing irrelevant perceptual features, such as raw pixels. We demonstrate that the learned embedding can reconstruct task-relevant state features and improve learning performance.

### Theses

## Information-Theoretic Methods for Planning and Learning in Partially Observable Markov Decision Processes

We formulate the problem of optimizing an agent under both extrinsic and intrinsic constraints on its operation in a dynamical system and develop the main tools for solving it. We identify the challenging convergence properties of the optimization algorithm, such as the bifurcation structure of the update operator near phase transitions. We study the special case of linear-Gaussian dynamics and quadratic cost (LQG), where the optimal solution has a particularly simple and solvable form. We also explore the learning task, where the model of the world dynamics is unknown and sample-based updates are used instead.

## Reinforcement Learning in Partially Observable Decision Processes

We study the interaction complexity of several partially observable decision processes. We show a simple subclass of POMDPs that is learnable, but requires exponential interaction complexity. We discuss the previously studied OPOMDPs, and show that they are PAC-learnable with polynomial interaction complexity. We then define a more general and complex subclass of POMDPs, which we call OCOMDPs, and give an algorithm for PAC-learning them with polynomial interaction complexity, given a planning oracle.

### Preprints

## Optimal Selective Attention in Reactive Agents

We present the minimum-information principle for selective attention in reactive agents. We motivate this approach by reducing the general problem of optimal control in POMDPs to reactive control with complex observations. We introduce a forward-backward algorithm for finding optimal selective-attention policies, and illustrate it with several examples. Finally, we analyze and explore the newly discovered phenomenon of period-doubling bifurcations in this optimization process.