George Tucker

I am interested in modeling sequences and sequential decision-making problems. Before joining Google, I was a research scientist on the Amazon Speech team in Boston. My focus was on designing deep learning models for small-footprint keyword spotting. Before joining Amazon, I was a Postdoctoral Research Fellow in the Price lab at the Harvard School of Public Health. I worked on methods for risk prediction and association testing in studies with related individuals. I conducted my PhD research in the MIT Mathematics department in Professor Bonnie Berger's research group.
Authored Publications
    Offline reinforcement learning (RL) on large, heterogeneous datasets with high-capacity models can, in principle, lead to agents that generalize broadly, analogously to similar advances in vision and NLP. However, recent works argue that offline RL methods encounter unique challenges to scaling up model capacity. Drawing on the learnings from these works, we re-examine previous design choices and find that with appropriate choices: wider ResNets, cross-entropy based distributional backups, and feature normalization, offline Q-learning algorithms exhibit strong performance that scales with model capacity. Using multitask Atari as a test-bed for scaling and generalization, we train a single policy on 40 games with near-human performance using up to 100M parameter networks, finding that model performance scales favorably with capacity. In contrast to prior work, we substantially extrapolate beyond dataset performance even when trained entirely on a large (400M transitions) but highly suboptimal dataset (51% human-normalized score). Compared to supervised approaches, offline RL scales similarly with model capacity and has better performance, especially when the dataset is suboptimal. Finally, we show that such offline Q-functions learn powerful representations that facilitate rapid transfer to novel games and fast online learning on new variations of a training game, improving over existing representation learning approaches.
    Despite overparameterization, deep networks trained via supervised learning are easy to optimize and exhibit excellent generalization. One hypothesis to explain this is that overparameterized deep networks enjoy the benefits of implicit regularization induced by stochastic gradient descent, which favors parsimonious solutions that generalize well on test inputs. It is reasonable to surmise that deep reinforcement learning (RL) methods could also benefit from this effect. In this paper, we discuss how the implicit regularization effect of SGD seen in supervised learning could in fact be harmful in the offline deep RL setting, leading to poor generalization and degenerate feature representations. Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the resulting derived regularizer favors degenerate solutions with excessive "aliasing", in stark contrast to the supervised learning case. We back up these findings empirically, showing that feature representations learned by a deep network value function trained via bootstrapping can indeed become degenerate, aliasing the representations for state-action pairs that appear on either side of the Bellman backup. To address this issue, we derive the form of this implicit regularizer and, inspired by this derivation, propose a simple and effective explicit regularizer, called DR3, that counteracts the undesirable effects of this implicit regularizer. When combined with existing offline RL methods, DR3 substantially improves performance and stability, alleviating unlearning in Atari 2600 games, D4RL domains and robotic manipulation from images.
    Model-Based Reinforcement Learning for Atari
    Blazej Osinski
    Chelsea Finn
    Henryk Michalewski
    Konrad Czechowski
    Lukasz Mieczyslaw Kaiser
    Mohammad Babaeizadeh
    Piotr Kozakowski
    Piotr Milos
    Roy H Campbell
    Ryan Sepassi
    Sergey Levine
    NIPS'18 (2020)
    Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction -- substantially more, in fact, than a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people can learn how the game works and predict which actions will lead to desirable outcomes. In this paper, we explore how video prediction models can similarly enable agents to solve Atari games with orders of magnitude fewer interactions than model-free methods. We describe Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models and present a comparison of several model architectures, including a novel architecture that yields the best results in our setting. Our experiments evaluate SimPLe on a range of Atari games and achieve competitive results with only 100K interactions between the agent and the environment (400K frames), which corresponds to about two hours of real-time play.
    Doubly Reparameterized Gradient Estimators for Monte Carlo Objectives
    Dieterich Lawson
    Shixiang Gu
    Christopher Maddison
    ICLR (2019)
    Deep latent variable models have become a popular model choice due to the scalable learning algorithms introduced by (Kingma & Welling, 2013; Rezende et al., 2014). These approaches maximize a variational lower bound on the intractable log likelihood of the observed data. Burda et al. (2015) introduced a multi-sample variational bound, IWAE, that is at least as tight as the standard variational lower bound and becomes increasingly tight as the number of samples increases. Counterintuitively, the typical inference network gradient estimator for the IWAE bound performs poorly as the number of samples increases (Rainforth et al., 2018; Le et al., 2018). Roeder et al. (2017) propose an improved gradient estimator, however, are unable to show it is unbiased. We show that it is in fact biased and that the bias can be estimated efficiently with a second application of the reparameterization trick. The doubly reparameterized gradient (DReG) estimator does not suffer as the number of samples increases, resolving the previously raised issues. The same idea can be used to improve many recently introduced training techniques for latent variable models. In particular, we show that this estimator reduces the variance of the IWAE gradient, the reweighted wake-sleep update (RWS) (Bornschein & Bengio, 2014), and the jackknife variational inference (JVI) gradient (Nowozin, 2018). Finally, we show that this computationally efficient, unbiased drop-in gradient estimator translates to improved performance for all three objectives on several modeling tasks.
    The smallest eigenvectors of the graph Laplacian are well-known to provide a succinct representation of the geometry of a weighted graph. In reinforcement learning (RL), where the weighted graph may be interpreted as the state transition process induced by a behavior policy acting on the environment, approximating the eigenvectors of the Laplacian provides a promising approach to state representation learning. However, existing methods for performing this approximation are ill-suited in general RL settings for two main reasons: First, they are computationally expensive, often requiring operations on large matrices. Second, these methods lack adequate justification beyond simple, tabular, finite-state settings. In this paper, we present a fully general and scalable method for approximating the eigenvectors of the Laplacian in a model-free RL context. We systematically evaluate our approach and empirically show that it generalizes beyond the tabular, finite-state setting. Even in tabular, finite-state settings, its ability to approximate the eigenvectors outperforms previous proposals. Finally, we show the potential benefits of using a Laplacian representation learned using our method in goal-achieving RL tasks, providing evidence that our technique can be used to significantly improve the performance of an RL agent.
    Estimating and optimizing Mutual Information (MI) is core to many problems in machine learning; however, bounding MI in high dimensions is challenging. To establish tractable and scalable objectives, recent work has turned to variational bounds parameterized by neural networks, but the relationships and tradeoffs between these bounds remains unclear. In this work, we unify these recent developments in a single framework. We find that the existing variational lower bounds degrade when the MI is large, exhibiting either high bias or high variance. To address this problem, we introduce a continuum of lower bounds that encompasses previous bounds and flexibly trades off bias and variance. On high-dimensional, controlled problems, we empirically characterize the bias and variance of the bounds and their gradients and demonstrate the effectiveness of our new bounds for estimation and representation learning.
    The Mirage of Action-Dependent Baselines in Reinforcement Learning
    Surya Bhupatiraju
    Shane Gu
    Richard E. Turner
    Zoubin Ghahramani
    Sergey Levine
    ICML (2018)
    Model-free reinforcement learning with flexible function approximators has shown recent success for solving goal-directed sequential decision-making problems. Policy gradient methods are a promising class of model-free algorithms, but they have high variance, which necessitates large batches resulting in low sample efficiency. Typically, a state-dependent control variate is used to reduce variance. Recently, several papers have introduced the idea of state and action-dependent control variates and showed that they significantly reduce variance and improve sample efficiency on continuous control tasks. We theoretically and numerically evaluate biases and variances of these policy gradient methods, and show that action-dependent control variates do not appreciably reduce variance in the tested domains. We show that seemingly insignificant implementation details enable these prior methods to achieve good empirical improvements, but at the cost of introducing further bias to the gradient. Our analysis indicates that biased methods tend to improve the performance significantly more than unbiased ones.
    Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion
    Jacob Buckman
    Danijar Hafner
    Eugene Brevdo
    Honglak Lee
    NeurIPS (2018)
    Integrating model-free and model-based approaches in reinforcement learning has the potential to achieve the high performance of model-free algorithms with low sample complexity. However, this is difficult because an imperfect dynamics model can degrade the performance of the learning algorithm, and in sufficiently complex environments, the dynamics model will almost always be imperfect. As a result, a key challenge is to combine model-based approaches with model-free learning in such a way that errors in the model do not degrade performance. We propose stochastic ensemble value expansion (STEVE), a novel model-based technique that addresses this issue. By dynamically interpolating between model rollouts of various horizon lengths for each individual example, STEVE ensures that the model is only utilized when doing so does not introduce significant errors. Our approach outperforms model-free baselines on challenging continuous control benchmarks with an order-of-magnitude increase in sample efficiency, and in contrast to previous model-based approaches, performance does not degrade in complex environments.
    Guided evolutionary strategies: augmenting random search with surrogate gradients
    Niru Maheswaranathan
    Luke Metz
    Dami Choi
    Jascha Sohl-Dickstein
    ICML (2018)
    Many applications in machine learning require optimizing a function whose true gradient is unknown or computationally expensive, but where surrogate gradient information, directions that may be correlated with the true gradient, is cheaply available. For example, this occurs when an approximate gradient is easier to compute than the full gradient (e.g. in meta-learning or unrolled optimization), or when a true gradient is intractable and is replaced with a surrogate (e.g. in reinforcement learning or training networks with discrete variables). We propose Guided Evolutionary Strategies (GES), a method for optimally using surrogate gradient directions to accelerate random search. GES defines a search distribution for evolutionary strategies that is elongated along a subspace spanned by the surrogate gradients and estimates a descent direction which can then be passed to a first-order optimizer. We analytically and numerically characterize the tradeoffs that result from tuning how strongly the search distribution is stretched along the guiding subspace and use this to derive a setting of the hyperparameters that works well across problems. We evaluate GES on several example problems, demonstrating an improvement over both standard evolutionary strategies and first-order methods that directly follow the surrogate gradient.
    We do an empirical comparison of a variety of recent methods for decision making based on deep Bayesian Neural Networks with Thompson Sampling.
    State-action value functions (i.e., Q-values) are ubiquitous in reinforcement learning (RL), giving rise to popular algorithms such as SARSA and Q-learning. We propose a new notion of action value defined by a Gaussian smoothed version of the expected Q-value. We show that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment. Moreover, the gradients of expected reward with respect to the mean and covariance of a parameterized Gaussian policy can be recovered from the gradient and Hessian of the smoothed Q-value function. Based on these relationships we develop new algorithms for training a Gaussian policy directly from a learned smoothed Q-value approximator. The approach is additionally amenable to proximal optimization by augmenting the objective with a penalty on KL-divergence from a previous policy. We find that the ability to learn both a mean and covariance during training leads to significantly improved results on standard continuous control benchmarks.
    Filtering Variational Objectives
    Chris J Maddison
    Dieterich Lawson
    Nicolas Heess
    Mohammad Norouzi
    Andriy Mnih
    Arnaud Doucet
    Yee Whye Teh
    NIPS (2017)
    When used as a surrogate objective for maximum likelihood estimation in latent variable models, the evidence lower bound (ELBO) produces state-of-the-art results. Inspired by this, we consider the extension of the ELBO to a family of lower bounds defined by a particle filter's estimator of the marginal likelihood, the filtering variational objectives (FIVOs). FIVOs take the same arguments as the ELBO, but can exploit a model's sequential structure to form tighter bounds. We present results that relate the tightness of FIVO's bound to the variance of the particle filter's estimator by considering the generic case of bounds defined as log-transformed likelihood estimators. Experimentally, we show that training with FIVO results in substantial improvements over training with ELBO on sequential data.
    REBAR: Low-variance, unbiased gradient estimates for discrete variable models
    Andriy Mnih
    Chris J. Maddison
    Dieterich Lawson
    Jascha Sohl-Dickstein
    NIPS (2017)
    Learning in models with discrete latent variables is challenging due to high variance gradient estimators. Generally, approaches have relied on control variates to reduce the variance of the REINFORCE estimator. Recent work (Jang et al. 2016; Maddison et al. 2016) has taken a different approach, introducing a continuous relaxation of discrete variables to produce low-variance, but biased, gradient estimates. In this work, we combine the two approaches through a novel control variate that produces low-variance, unbiased gradient estimates. Then, we introduce a novel continuous relaxation and show that the tightness of the relaxation can be adapted online, removing it as a hyperparameter. We show state-of-the-art variance reduction on several benchmark generative modeling tasks, generally leading to faster convergence to a better final log likelihood.
    Regularizing Neural Networks by Penalizing Confident Output Distributions
    Gabriel Pereyra
    Jan Chorowski
    Łukasz Kaiser
    Geoffrey Hinton
    ICLR Workshop (2017)
    We systematically explore regularizing neural networks by penalizing low entropy output distributions. We show that penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning. Furthermore, we connect a maximum entropy based confidence penalty to label smoothing through the direction of the KL divergence. We exhaustively evaluate the proposed confidence penalty and label smoothing on 6 common benchmarks: image classification (MNIST and Cifar-10), language modeling (Penn Treebank), machine translation (WMT'14 English-to-German), and speech recognition (TIMIT and WSJ). We find that both label smoothing and the confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyperparameters, suggesting the wide applicability of these regularizers.
    Learning Hard Alignments with Variational Inference
    Dieterich Lawson
    Chung-Cheng Chiu
    Colin Raffel
    Navdeep Jaitly
    ICASSP (2017)
    There has recently been significant interest in hard attention models for tasks such as object recognition, visual captioning and speech recognition. Hard attention can offer benefits over soft attention such as decreased computational cost, but training hard attention models can be difficult because of the discrete latent variables they introduce. Previous work has used REINFORCE and Q-learning to approach these issues, but those methods can provide high-variance gradient estimates and be slow to train. In this paper, we tackle the problem of learning hard attention for a 1-d temporal task using variational inference methods, specifically the recently introduced VIMCO and NVIL. Furthermore, we propose novel baselines that adapt VIMCO to this setting. We demonstrate our method on a phoneme recognition task in clean and noisy environments and show that our method outperforms REINFORCE with the difference being greater for a more complicated task.
    Particle Value Functions
    Chris Maddison
    Dieterich Lawson
    Nicolas Heess
    Arnaud Doucet
    Andriy Mnih
    Yee Whye Teh
    ICLR Workshop (2017)
    The policy gradients of the expected return objective can react slowly to rare rewards. Yet, in some cases agents may wish to emphasize the low or high returns regardless of their probability. Borrowing from the economics and control literature, we review the risk-sensitive value function that arises from an exponential utility and illustrate its effects on an example. This risk-sensitive value function is not always applicable to reinforcement learning problems, so we introduce the particle value function defined by a particle filter over the distributions of an agent’s experience, which bounds the risk-sensitive one. We illustrate the benefit of the policy gradients of this objective in Cliffworld.