Multilateral demonstrations can be difficult for human supervisors to provide because they require divided attention. We propose Bilateral Iterated Best Response (BIBR), a new algorithm that reduces supervisor burden by iteratively demonstrating each manipulator unilaterally while rolling out an estimated robot policy for the other manipulator. We present a web-based user study of a two-agent gridworld domain. We confirm prior findings that bilateral demonstrations are noisier and longer when the task is asymmetric, and show that BIBR improves success rate in the asymmetric task while learning policies that have shorter and smoother trajectories.
Human demonstrators of deep learning tasks are prone to inconsistencies and errors that can delay or degrade learning. We improve task performance by characterizing supervisor inconsistency and correcting for it using data cleaning techniques. In human demonstrations of a planar part extraction task on a 2DOF robot, trained CNN models show an improvement of 11.2% in mean absolute success rate after data cleaning.
Discovery of Deep Options (DDO) discovers parametrized options from demonstrations, and can recursively discover additional levels of the hierarchy. It scales to multi-level hierarchies by decoupling low-level option discovery from high-level meta-control policy learning, facilitated by under-parametrization of the high level. We demonstrate that augmenting the action space of Deep Q-Networks with the discovered options accelerates learning by guiding exploration in tasks where random actions are unlikely to reach valuable states.
* Equal contribution
Imitation Learning algorithms suffer from covariate shift when the agent visits different states than the supervisor. Active approaches such as DAgger collect data from the current agent distribution, which can also change in each iteration with very large batch sizes. We propose injecting artificial noise into the supervisor’s policy, prove an improved bound on the loss due to covariate shift, and introduce an algorithm to estimate the level of ε-greedy noise to inject. Our algorithm, Dart, achieves better performance than DAgger in a driving simulator.
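
The noise-injection idea in the last abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 1-D environment, the function names (`collect_noisy_demo`, `supervisor`, `step`), and the fixed value of ε are all assumptions made for illustration, and the abstract's procedure for estimating ε is not reproduced here. The sketch assumes that with probability ε a random action is executed in place of the supervisor's, while the supervisor's intended action is what gets recorded as the training label.

```python
import random

ACTIONS = [-1, 1]  # hypothetical action set: step left or right on a line


def supervisor(state):
    """Toy supervisor (an assumption): always move toward position 0."""
    return -1 if state > 0 else 1


def step(state, action):
    """Toy deterministic dynamics (an assumption): add the action to the state."""
    return state + action


def collect_noisy_demo(start, horizon, epsilon, rng):
    """Roll out the supervisor with epsilon-greedy noise on the executed action.

    Each recorded pair is (state, intended supervisor action), so the labels
    stay clean even when the executed action was random.
    """
    data = []
    state = start
    for _ in range(horizon):
        intended = supervisor(state)
        if rng.random() < epsilon:
            executed = rng.choice(ACTIONS)  # injected noise
        else:
            executed = intended
        data.append((state, intended))
        state = step(state, executed)
    return data


clean = collect_noisy_demo(start=3, horizon=3, epsilon=0.0, rng=random.Random(0))
noisy = collect_noisy_demo(start=3, horizon=20, epsilon=0.5, rng=random.Random(0))
```

With `epsilon=0.0` the rollout follows the supervisor exactly. With a nonzero ε the demonstration drifts into states the supervisor would not normally visit, and each of those states still carries the supervisor's corrective action as its label.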