ICLR2021

November 8, 2025

Conference Paper List

This conference has 860 papers in total.

No.	Title	Abstract	Authors
1	What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study	In recent years, reinforcement learning (RL) has been successfully applied to many different continuous control tasks. While RL algorithms are often conceptually simple, their state-of-the-art implementations take numerous low- and high-level design decisions that strongly affect the performance of the resulting agents. Those choices are usually not extensively discussed in the literature, leading to discrepancy between published descriptions of algorithms and their implementations. This makes it hard to attribute progress in RL and...	Anton Raichuk, Léonard Hussenot, Manu Orsini, Marcin Andrychowicz, Marcin Michalski, Matthieu Geist, Olivier Bachem, Olivier Pietquin, Piotr Stanczyk, Raphaël Marinier, Sertan Girgin, Sylvain Gelly
2	Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data	Self-training algorithms, which train a model to fit pseudolabels predicted by another previously-learned model, have been very successful for learning with unlabeled data using neural networks. However, the current theoretical understanding of self-training only applies to linear models. This work provides a unified theoretical analysis of self-training with deep networks for semi-supervised learning, unsupervised domain adaptation, and unsupervised learning. At the core of our analysis is a simple but realistic “expansion”...	Colin Wei, Kendrick Shen, Tengyu Ma, Yining Chen
3	Learning to Reach Goals via Iterated Supervised Learning	Current reinforcement learning (RL) algorithms can be brittle and difficult to use, especially when learning goal-reaching behaviors from sparse rewards. Although supervised imitation learning provides a simple and stable alternative, it requires access to demonstrations from a human supervisor. In this paper, we study RL algorithms that use imitation learning to acquire goal reaching policies from scratch, without the need for expert demonstrations or a value function. In lieu of demonstrations, we leverage the property that any...	Abhishek Gupta, Ashwin Reddy, Benjamin Eysenbach, Coline Manon Devin, Dibya Ghosh, Justin Fu, Sergey Levine
4	Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients	Discovering the underlying mathematical expressions describing a dataset is a core challenge for artificial intelligence. This is the problem of $\textit{symbolic regression}$. Despite recent advances in training neural networks to solve complex tasks, deep learning approaches to symbolic regression are underexplored. We propose a framework that leverages deep learning for symbolic regression via a simple idea: use a large model to search the space of small models. Specifically, we use a recurrent neural network to emit a distribution...	Brenden K. Petersen, Cláudio Prata Santiago, Joanne Taery Kim, Mikel Landajuela, Sookyung Kim, T. Nathan Mundhenk
5	Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime	We analyze the convergence of the averaged stochastic gradient descent for overparameterized two-layer neural networks for regression problems. It was recently found that a neural tangent kernel (NTK) plays an important role in showing the global convergence of gradient-based methods under the NTK regime, where the learning dynamics for overparameterized neural networks can be almost characterized by that for the associated reproducing kernel Hilbert space (RKHS). However, there is still room for a convergence rate analysis in the NTK...	Atsushi Nitanda, Taiji Suzuki
6	Free Lunch for Few-shot Learning: Distribution Calibration	Learning from a limited number of samples is challenging since the learned model can easily become overfitted based on the biased distribution formed by only a few training examples. In this paper, we calibrate the distribution of these few-sample classes by transferring statistics from the classes with sufficient examples. Then an adequate number of examples can be sampled from the calibrated distribution to expand the inputs to the classifier. We assume every dimension in the feature representation follows a Gaussian distribution so...	Lu Liu, Min Xu, Shuo Yang
7	Scalable Learning and MAP Inference for Nonsymmetric Determinantal Point Processes	Determinantal point processes (DPPs) have attracted significant attention in machine learning for their ability to model subsets drawn from a large item collection. Recent work shows that nonsymmetric DPP (NDPP) kernels have significant advantages over symmetric kernels in terms of modeling power and predictive performance. However, for an item collection of size $M$, existing NDPP learning and inference algorithms require memory quadratic in $M$ and runtime cubic (for learning) or quadratic (for inference) in $M$, making them...	Elvis Dohmatob, Insu Han, Jennifer Gillenwater, Mike Gartrell, Victor-Emmanuel Brunel
8	Randomized Automatic Differentiation	The successes of deep learning, variational inference, and many other fields have been aided by specialized implementations of reverse-mode automatic differentiation (AD) to compute gradients of mega-dimensional objectives. The AD techniques underlying these tools were designed to compute exact gradients to numerical precision, but modern machine learning models are almost always trained with stochastic gradient descent. Why spend computation and memory on exact (minibatch) gradients only to use them for stochastic optimization? We...	Alex Beatson, Deniz Oktay, Joshua Aduol, Nick McGreivy, Ryan P. Adams
9	Learning Generalizable Visual Representations via Interactive Gameplay	A growing body of research suggests that embodied gameplay, prevalent not just in human cultures but across a variety of animal species including turtles and ravens, is critical in developing the neural flexibility for creative problem solving, decision making, and socialization. Comparatively little is known regarding the impact of embodied gameplay upon artificial agents. While recent work has produced agents proficient in abstract games, these environments are far removed the real world and thus these agents can provide little...	Ali Farhadi, Alvaro Herrasti, Aniruddha Kembhavi, Dustin Schwenk, Eric Kolve, Kiana Ehsani, Luca Weihs, Roozbeh Mottaghi, Sarah M. Pratt, Winson Han
10	Global Convergence of Three-layer Neural Networks in the Mean Field Regime	In the mean field regime, neural networks are appropriately scaled so that as the width tends to infinity, the learning dynamics tends to a nonlinear and nontrivial dynamical limit, known as the mean field limit. This lends a way to study large-width neural networks via analyzing the mean field limit. Recent works have successfully applied such analysis to two-layer networks and provided global convergence guarantees. The extension to multilayer ones however has been a highly challenging puzzle, and little is known about the...	Huy Tuan Pham, Phan-Minh Nguyen
11	Rao-Blackwellizing the Straight-Through Gumbel-Softmax Gradient Estimator	Gradient estimation in models with discrete latent variables is a challenging problem, because the simplest unbiased estimators tend to have high variance. To counteract this, modern estimators either introduce bias, rely on multiple function evaluations, or use learned, input-dependent baselines. Thus, there is a need for estimators that require minimal tuning, are computationally cheap, and have low mean squared error. In this paper, we show that the variance of the straight-through variant of the popular Gumbel-Softmax estimator can...	Andreas Krause, Chris J. Maddison, Max B. Paulus
12	Rethinking Attention with Performers	We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model...	Adrian Weller, Afroz Mohiuddin, Andreea Gane, David Benjamin Belanger, David Dohan, Jared Quincy Davis, Krzysztof Marcin Choromanski, Lucy J. Colwell, Lukasz Kaiser, Peter Hawkins, Tamás Sarlós, Valerii Likhosherstov, Xingyou Song
13	Getting a CLUE: A Method for Explaining Uncertainty Estimates	Both uncertainty estimation and interpretability are important factors for trustworthy machine learning systems. However, there is little work at the intersection of these two areas. We address this gap by proposing a novel method for interpreting uncertainty estimates from differentiable probabilistic models, like Bayesian Neural Networks (BNNs). Our method, Counterfactual Latent Uncertainty Explanations (CLUE), indicates how to change an input, while keeping it on the data manifold, such that a BNN becomes more confident about the...	Adrian Weller, Javier Antorán, José Miguel Hernández-Lobato, Tameem Adel, Umang Bhatt
14	When Do Curricula Work?	Inspired by human learning, researchers have proposed ordering examples during training based on their difficulty. Both curriculum learning, exposing a network to easier examples early in training, and anti-curriculum learning, showing the most difficult examples first, have been suggested as improvements to the standard i.i.d. training. In this work, we set out to investigate the relative benefits of ordered learning. We first investigate the implicit curricula resulting from architectural and optimization bias and find that samples...	Behnam Neyshabur, Ethan Dyer, Xiaoxia Wu
15	Federated Learning Based on Dynamic Regularization	We propose a novel federated learning method for distributively training neural network models, where the server orchestrates cooperation between a subset of randomly chosen devices in each round. We view Federated Learning problem primarily from a communication perspective and allow more device level computations to save transmission costs. We point out a fundamental dilemma, in that the minima of the local-device level empirical loss are inconsistent with those of the global empirical loss. Different from recent prior works, that...	Durmus Alp Emre Acar, Matthew Mattina, Paul N. Whatmough, Ramon Matas Navarro, Venkatesh Saligrama, Yue Zhao
16	Geometry-aware Instance-reweighted Adversarial Training	In adversarial machine learning, there was a common belief that robustness and accuracy hurt each other. The belief was challenged by recent studies where we can maintain the robustness and improve the accuracy. However, the other direction, whether we can keep the accuracy and improve the robustness, is conceptually and practically more interesting, since robust accuracy should be lower than standard accuracy for any model. In this paper, we show this direction is also promising. Firstly, we find even over-parameterized deep networks...	Bo Han, Gang Niu, Jianing Zhu, Jingfeng Zhang, Masashi Sugiyama, Mohan S. Kankanhalli
17	Co-Mixup: Saliency Guided Joint Mixup with Supermodular Diversity	While deep neural networks show great performance on fitting to the training distribution, improving the networks' generalization performance to the test distribution and robustness to the sensitivity to input perturbations still remain as a challenge. Although a number of mixup based augmentation strategies have been proposed to partially address them, it remains unclear as to how to best utilize the supervisory signal within each input data for mixup from the optimization perspective. We propose a new perspective on batch mixup and...	Hosan Jeong, Hyun Oh Song, Jang-Hyun Kim, Wonho Choo
18	SenSeI: Sensitive Set Invariance for Enforcing Individual Fairness	In this paper, we cast fair machine learning as invariant machine learning. We first formulate a version of individual fairness that enforces invariance on certain sensitive sets. We then design a transport-based regularizer that enforces this version of individual fairness and develop an algorithm to minimize the regularizer efficiently. Our theoretical results guarantee the proposed approach trains certifiably fair ML models. Finally, in the experimental studies we demonstrate improved fairness metrics in comparison to several recent...	Mikhail Yurochkin, Yuekai Sun
19	End-to-end Adversarial Text-to-Speech	Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs. Our proposed generator is feed-forward and thus efficient for both training and inference, using a differentiable alignment scheme...	Erich Elsen, Jeff Donahue, Karen Simonyan, Mikolaj Binkowski, Sander Dieleman
20	Dataset Condensation with Gradient Matching	As the state-of-the-art machine learning methods in many fields rely on larger datasets, storing datasets and training models on them become significantly more expensive. This paper proposes a training set synthesis technique for data-efficient learning, called Dataset Condensation, that learns to condense large dataset into a small set of informative synthetic samples for training deep neural networks from scratch. We formulate this goal as a gradient matching problem between the gradients of deep neural network weights that are...	Bo Zhao, Hakan Bilen, Konda Reddy Mopuri
21	Rethinking Architecture Selection in Differentiable NAS	Differentiable Neural Architecture Search is one of the most popular Neural Architecture Search (NAS) methods for its search efficiency and simplicity, accomplished by jointly optimizing the model weight and architecture parameters in a weight-sharing supernet via gradient-based algorithms. At the end of the search phase, the operations with the largest architecture parameters will be selected to form the final architecture, with the implicit assumption that the values of architecture parameters reflect the operation strength. While...	Cho-Jui Hsieh, Minhao Cheng, Ruochen Wang, Xiangning Chen, Xiaocheng Tang
22	A Distributional Approach to Controlled Text Generation	We propose a Distributional Approach for addressing Controlled Text Generation from pre-trained Language Models (LM). This approach permits to specify, in a single formal framework, both “pointwise” and “distributional” constraints over the target LM (to our knowledge, the first model with such generality) while minimizing KL divergence from the initial LM distribution. The optimal target distribution is then uniquely determined as an explicit EBM (Energy-Based Model) representation. From that optimal representation, we then train a...	Hady Elsahar, Marc Dymetman, Muhammad Khalifa
23	Learning Cross-Domain Correspondence for Control with Dynamics Cycle-Consistency	At the heart of many robotics problems is the challenge of learning correspondences across domains. For instance, imitation learning requires obtaining correspondence between humans and robots; sim-to-real requires correspondence between physics simulators and real hardware; transfer learning requires correspondences between different robot environments. In this paper, we propose to learn correspondence across such domains emphasizing on differing modalities (vision and internal state), physics parameters (mass and friction), and...	Alexei A. Efros, Lerrel Pinto, Qiang Zhang, Tete Xiao, Xiaolong Wang
24	Human-Level Performance in No-Press Diplomacy via Equilibrium Search	Prior AI breakthroughs in complex games have focused on either the purely adversarial or purely cooperative settings. In contrast, Diplomacy is a game of shifting alliances that involves both cooperation and competition. For this reason, Diplomacy has proven to be a formidable research challenge. In this paper we describe an agent for the no-press variant of Diplomacy that combines supervised learning on human data with one-step lookahead search via regret minimization. Regret minimization techniques have been behind previous AI...	Adam Lerer, Anton Bakhtin, Jonathan Gray, Noam Brown
25	Parrot: Data-Driven Behavioral Priors for Reinforcement Learning	Reinforcement learning provides a general framework for flexible decision making and control, but requires extensive data collection for each new task that an agent needs to learn. In other machine learning fields, such as natural language processing or computer vision, pre-training on large, previously collected datasets to bootstrap learning for new tasks has emerged as a powerful paradigm to reduce data requirements when learning a new task. In this paper, we ask the following question: how can we enable similarly useful...	Albert Yu, Avi Singh, Gaoyue Zhou, Huihan Liu, Nicholas Rhinehart, Sergey Levine
26	Learning Invariant Representations for Reinforcement Learning without Reconstruction	We study how representation learning can accelerate reinforcement learning from rich observations, such as images, without relying either on domain knowledge or pixel-reconstruction. Our goal is to learn representations that provide for effective downstream control and invariance to task-irrelevant details. Bisimulation metrics quantify behavioral similarity between states in continuous MDPs, which we propose using to learn robust latent representations which encode only the task-relevant information from observations. Our method...	Amy Zhang, Roberto Calandra, Rowan Thomas McAllister, Sergey Levine, Yarin Gal
27	Do 2D GANs Know 3D Shape? Unsupervised 3D Shape Reconstruction from 2D Image GANs	Natural images are projections of 3D objects on a 2D image plane. While state-of-the-art 2D generative models like GANs show unprecedented quality in modeling the natural image manifold, it is unclear whether they implicitly capture the underlying 3D object structures. And if so, how could we exploit such knowledge to recover the 3D shapes of objects in the images? To answer these questions, in this work, we present the first attempt to directly mine 3D geometric cues from an off-the-shelf 2D GAN that is trained on RGB images only....	Bo Dai, Chen Change Loy, Ping Luo, Xingang Pan, Ziwei Liu
28	VCNet and Functional Targeted Regularization For Learning Causal Effects of Continuous Treatments	Motivated by the rising abundance of observational data with continuous treatments, we investigate the problem of estimating the average dose-response curve (ADRF). Available parametric methods are limited in their model space, and previous attempts in leveraging neural network to enhance model expressiveness relied on partitioning continuous treatment into blocks and using separate heads for each block; this however produces in practice discontinuous ADRFs. Therefore, the question of how to adapt the structure and training of neural...	Dan Nicolae, Lizhen Nie, Mao Ye, Qiang Liu
29	Rethinking the Role of Gradient-based Attribution Methods for Model Interpretability	Current methods for the interpretability of discriminative deep neural networks commonly rely on the model's input-gradients, i.e., the gradients of the output logits w.r.t. the inputs. The common assumption is that these input-gradients contain information regarding $p_{\theta}(y \mid \mathbf{x})$, the model's discriminative capabilities, thus justifying their use for interpretability. However, in this work, we show that these input-gradients can be arbitrarily manipulated as a consequence of the shift-invariance of softmax without...	François Fleuret, Suraj Srinivas
30	Neural Synthesis of Binaural Speech From Mono Audio	We present a neural rendering approach for binaural sound synthesis that can produce realistic and spatially accurate binaural sound in realtime. The network takes, as input, a single-channel audio source and synthesizes, as output, two-channel binaural sound, conditioned on the relative position and orientation of the listener with respect to the source. We investigate deficiencies of the l2-loss on raw waveforms in a theoretical analysis and introduce an improved loss that overcomes these limitations. In an empirical evaluation, we...	Alexander Richard, Dejan Markovic, Fernando De la Torre, Gladstone Alexander Butler, Israel D. Gebru, Steven Krenn, Yaser Sheikh
31	DiffWave: A Versatile Diffusion Model for Audio Synthesis	In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audios in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram,...	Bryan Catanzaro, Jiaji Huang, Kexin Zhao, Wei Ping, Zhifeng Kong
32	An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale	While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When...	Alexander Kolesnikov, Alexey Dosovitskiy, Dirk Weissenborn, Georg Heigold, Jakob Uszkoreit, Lucas Beyer, Matthias Minderer, Mostafa Dehghani, Neil Houlsby, Sylvain Gelly, Thomas Unterthiner, Xiaohua Zhai
33	On the mapping between Hopfield networks and Restricted Boltzmann Machines	Hopfield networks (HNs) and Restricted Boltzmann Machines (RBMs) are two important models at the interface of statistical physics, machine learning, and neuroscience. Recently, there has been interest in the relationship between HNs and RBMs, due to their similarity under the statistical mechanics formalism. An exact mapping between HNs and RBMs has been previously noted for the special case of orthogonal (“uncorrelated”) encoded patterns. We present here an exact mapping in the case of correlated pattern HNs, which are more broadly...	Anton Zilman, Matthew Smart
34	SMiRL: Surprise Minimizing Reinforcement Learning in Unstable Environments	Every living organism struggles against disruptive environmental forces to carve out and maintain an orderly niche. We propose that such a struggle to achieve and preserve order might offer a principle for the emergence of useful behaviors in artificial agents. We formalize this idea into an unsupervised reinforcement learning method called surprise minimizing reinforcement learning (SMiRL). SMiRL alternates between learning a density model to evaluate the surprise of a stimulus, and improving the policy to seek more predictable...	Chelsea Finn, Coline Manon Devin, Daniel Geng, Dinesh Jayaraman, Glen Berseth, Nicholas Rhinehart, Sergey Levine
35	Evolving Reinforcement Learning Algorithms	We propose a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs which compute the loss function for a value-based model-free RL agent to optimize. The learned algorithms are domain-agnostic and can generalize to new environments not seen during training. Our method can both learn from scratch and bootstrap off known existing algorithms, like DQN, enabling interpretable modifications which improve performance. Learning from scratch on simple classical control and gridworld...	Aleksandra Faust, Daiyi Peng, Esteban Real, Honglak Lee, John D. Co-Reyes, Quoc V. Le, Sergey Levine, Yingjie Miao
36	Growing Efficient Deep Networks by Structured Continuous Sparsification	We develop an approach to growing deep network architectures over the course of training, driven by a principled combination of accuracy and sparsity objectives. Unlike existing pruning or architecture search techniques that operate on full-sized models or supernet architectures, our method can start from a small, simple seed architecture and dynamically grow and prune both layers and filters. By combining a continuous relaxation of discrete network structure optimization with a scheme for sampling sparse subnetworks, we produce...	Michael Maire, Pedro Henrique Pamplona Savarese, Xin Yuan
37	Deformable DETR: Deformable Transformers for End-to-End Object Detection	DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we proposed Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on...	Bin Li, Jifeng Dai, Lewei Lu, Weijie Su, Xiaogang Wang, Xizhou Zhu
38	EigenGame: PCA as a Nash Equilibrium	We present a novel view on principal components analysis as a competitive game in which each approximate eigenvector is controlled by a player whose goal is to maximize their own utility function. We analyze the properties of this PCA game and the behavior of its gradient based updates. The resulting algorithm---which combines elements from Oja's rule with a generalized Gram-Schmidt orthogonalization---is naturally decentralized and hence parallelizable through message passing. We demonstrate the scalability of the algorithm with...	Brian McWilliams, Claire Vernade, Ian Gemp, Thore Graepel
39	Augmenting Physical Models with Deep Networks for Complex Dynamics Forecasting	Forecasting complex dynamical phenomena in settings where only partial knowledge of their dynamics is available is a prevalent problem across various scientific fields. While purely data-driven approaches are arguably insufficient in this context, standard physical modeling based approaches tend to be over-simplistic, inducing non-negligible errors. In this work, we introduce the APHYNITY framework, a principled approach for augmenting incomplete physical dynamics described by differential equations with deep data-driven models. It...	Emmanuel de Bézenac, Ibrahim Ayed, Jérémie Donà, Nicolas Thome, Patrick Gallinari, Vincent Le Guen, Yuan Yin
40	Complex Query Answering with Neural Link Predictors	Neural link predictors are immensely useful for identifying missing edges in large scale Knowledge Graphs. However, it is still not clear how to use these models for answering more complex queries that arise in a number of domains, such as queries using logical conjunctions ($\land$), disjunctions ($\lor$) and existential quantifiers ($\exists$), while accounting for missing edges. In this work, we propose a framework for efficiently answering complex queries on incomplete Knowledge Graphs. We translate each query into an end-to-end...	Daniel Daza, Erik Arakelyan, Michael Cochez, Pasquale Minervini
41	Towards Nonlinear Disentanglement in Natural Data with Temporal Sparse Coding	Disentangling the underlying generative factors from complex data has so far been limited to carefully constructed scenarios. We propose a path towards natural data by first showing that the statistics of natural data provide enough structure to enable disentanglement, both theoretically and empirically. Specifically, we provide evidence that objects in natural movies undergo transitions that are typically small in magnitude with occasional large jumps, which is characteristic of a temporally sparse distribution. To address this...	David A. Klindt, Dylan M. Paiton, Ivan Ustyuzhaninov, Lukas Schott, Matthias Bethge, Wieland Brendel, Yash Sharma
42	Self-training For Few-shot Transfer Across Extreme Task Differences	Most few-shot learning techniques are pre-trained on a large, labeled “base dataset”. In problem domains where such large labeled datasets are not available for pre-training (e.g., X-ray, satellite images), one must resort to pre-training in a different “source” problem domain (e.g., ImageNet), which can be very different from the desired target task. Traditional few-shot and transfer learning techniques fail in the presence of such extreme differences between the source and target tasks. In this paper, we present a simple and...	Bharath Hariharan, Cheng Perng Phoo
43	Score-Based Generative Modeling through Stochastic Differential Equations	Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (a.k.a., score) of the perturbed data distribution. By leveraging advances...	Abhishek Kumar, Ben Poole, Diederik P. Kingma, Jascha Sohl-Dickstein, Stefano Ermon, Yang Song
44	Share or Not? Learning to Schedule Language-Specific Capacity for Multilingual Translation	Using a mix of shared and language-specific (LS) parameters has shown promise in multilingual neural machine translation (MNMT), but the question of when and where LS capacity matters most is still under-studied. We offer such a study by proposing conditional language-specific routing (CLSR). CLSR employs hard binary gates conditioned on token representations to dynamically select LS or shared paths. By manipulating these gates, it can schedule LS capacity across sub-layers in MNMT subject to the guidance of translation signals and...	Ankur Bapna, Biao Zhang, Orhan Firat, Rico Sennrich
45	Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering	Differentiable rendering has paved the way to training neural networks to perform “inverse graphics” tasks such as predicting 3D geometry from monocular photographs. To train high performing models, most of the current approaches rely on multi-view imagery which are not readily available in practice. Recent Generative Adversarial Networks (GANs) that synthesize images, in contrast, seem to acquire 3D knowledge implicitly during training: object viewpoints can be manipulated by simply manipulating the latent codes. However, these latent...	Antonio Torralba, Huan Ling, Jun Gao, Sanja Fidler, Wenzheng Chen, Yinan Zhang, Yuxuan Zhang
46	How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks	We study how neural networks trained by gradient descent extrapolate, i.e., what they learn outside the support of the training distribution. Previous works report mixed empirical results when extrapolating with neural networks: while feedforward neural networks, a.k.a. multilayer perceptrons (MLPs), do not extrapolate well in certain simple tasks, Graph Neural Networks (GNNs) -- structured networks with MLP modules -- have shown some success in more complex tasks. Working towards a theoretical explanation, we identify conditions under...	Jingling Li, Ken-ichi Kawarabayashi, Keyulu Xu, Mozhi Zhang, Simon Shaolei Du, Stefanie Jegelka
47	Contrastive Explanations for Reinforcement Learning via Embedded Self Predictions	We investigate a deep reinforcement learning (RL) architecture that supports explaining why a learned agent prefers one action over another. The key idea is to learn action-values that are directly represented via human-understandable properties of expected futures. This is realized via the embedded self-prediction (ESP) model, which learns said properties in terms of human provided features. Action preferences can then be explained by contrasting the future properties predicted for each action. To address cases where there are a large...	Alan Fern, Kin-Ho Lam, Zhengxian Lin
48	Improved Autoregressive Modeling with Distribution Smoothing	While autoregressive models excel at image compression, their sample quality is often lacking. Although not realistic, generated images often have high likelihood according to the model, resembling the case of adversarial examples. Inspired by a successful adversarial defense method, we incorporate randomized smoothing into autoregressive generative modeling. We first model a smoothed version of the data distribution, and then reverse the smoothing process to recover the original data distribution. This procedure drastically improves...	Chenlin Meng, Jiaming Song, Shengjia Zhao, Stefano Ermon, Yang Song
49	MONGOOSE: A Learnable LSH Framework for Efficient Neural Network Training	Recent advances by practitioners in the deep learning community have breathed new life into Locality Sensitive Hashing (LSH), using it to reduce memory and time bottlenecks in neural network (NN) training. However, while LSH has sub-linear guarantees for approximate near-neighbor search in theory, it is known to have inefficient query time in practice due to its use of random hash functions. Moreover, when model parameters are changing, LSH suffers from update overhead. This work is motivated by an observation that model parameters...	Anshumali Shrivastava, Beidi Chen, Binghui Peng, Christopher Ré, Jonathan Lingjie Li, Tri Dao, Zhao Song, Zhaozhuo Xu, Zichang Liu
50	Gradient Projection Memory for Continual Learning	The ability to learn continually without forgetting the past tasks is a desired attribute for artificial learning systems. Existing approaches to enable such learning in artificial neural networks usually rely on network growth, importance based weight update or replay of old data from the memory. In contrast, we propose a novel approach where a neural network learns new tasks by taking gradient steps in the orthogonal direction to the gradient subspaces deemed important for the past tasks. We find the bases of these subspaces by...	Gobinda Saha, Isha Garg, Kaushik Roy
51	Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?	Convolutional neural networks often dominate fully-connected counterparts in generalization performance, especially on image classification tasks. This is often explained in terms of “better inductive bias.” However, this has not been made mathematically rigorous, and the hurdle is that the sufficiently wide fully-connected net can always simulate the convolutional net. Thus the training algorithm plays a role. The current work describes a natural task on which a provable sample complexity gap can be...	Sanjeev Arora, Yi Zhang, Zhiyuan Li
52	Iterated learning for emergent systematicity in VQA	Although neural module networks have an architectural bias towards compositionality, they require gold standard layouts to generalize systematically in practice. When instead learning layouts and modules jointly, compositionality does not arise automatically and an explicit pressure is necessary for the emergence of layouts exhibiting the right structure. We propose to address this problem using iterated learning, a cognitive science theory of the emergence of compositional languages in nature that has primarily been applied to simple...	Aaron C. Courville, Ankit Vani, Eeshan Dhekane, Max Schwarzer, Yuchen Lu
53	Coupled Oscillatory Recurrent Neural Network (coRNN): An accurate and (gradient) stable architecture for learning long time dependencies	Circuits of biological neurons, such as in the functional parts of the brain can be modeled as networks of coupled oscillators. Inspired by the ability of these systems to express a rich set of outputs while keeping (gradients of) state variables bounded, we propose a novel architecture for recurrent neural networks. Our proposed RNN is based on a time-discretization of a system of second-order ordinary differential equations, modeling networks of controlled nonlinear oscillators. We prove precise bounds on the gradients of the hidden...	Siddhartha Mishra, T. Konstantin Rusch
54	Sparse Quantized Spectral Clustering	Given a large data matrix, sparsifying, quantizing, and/or performing other entry-wise nonlinear operations can have numerous benefits, ranging from speeding up iterative algorithms for core numerical linear algebra problems to providing nonlinear filters to design state-of-the-art neural network models. Here, we exploit tools from random matrix theory to make precise statements about how the eigenspectrum of a matrix changes under such nonlinear transformations. In particular, we show that very little change occurs in the informative...	Michael W. Mahoney, Romain Couillet, Zhenyu Liao
55	Graph-Based Continual Learning	Despite significant advances, continual learning models still suffer from catastrophic forgetting when exposed to incrementally available data from non-stationary distributions. Rehearsal approaches alleviate the problem by maintaining and replaying a small episodic memory of previous samples, often implemented as an array of independent memory slots. In this work, we propose to augment such an array with a learnable random graph that captures pairwise similarities between its samples, and use it not only to learn new tasks but also to...	Binh Tang, David S. Matteson
56	Dynamic Tensor Rematerialization	Checkpointing enables the training of deep learning models under restricted memory budgets by freeing intermediate activations from memory and recomputing them on demand. Current checkpointing techniques statically plan these recomputations offline and assume static computation graphs. We demonstrate that a simple online algorithm can achieve comparable performance by introducing Dynamic Tensor Rematerialization (DTR), a greedy online algorithm for checkpointing that is extensible and general, is parameterized by eviction policy, and...	Altan Haan, Jared Roesch, Jennifer Brennan, Marisa Kirisame, Mike He, Steven Lyubomirsky, Tianqi Chen, Zachary Tatlock
57	Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models	Massively multilingual models subsuming tens or even hundreds of languages pose great challenges to multi-task optimization. While it is a common practice to apply a language-agnostic procedure optimizing a joint multilingual task objective, how to properly characterize and take advantage of its underlying problem structure for improving optimization efficiency remains under-explored. In this paper, we attempt to peek into the black-box of multilingual optimization through the lens of loss function geometry. We find that gradient...	Orhan Firat, Yuan Cao, Yulia Tsvetkov, Zirui Wang
58	CPT: Efficient Deep Neural Network Training via Cyclic Precision	Low-precision deep neural network (DNN) training has gained tremendous attention as reducing precision is one of the most effective knobs for boosting DNNs' training time/energy efficiency. In this paper, we attempt to explore low-precision training from a new perspective as inspired by recent findings in understanding DNN training: we conjecture that DNNs' precision might have a similar effect as the learning rate during DNN training, and advocate dynamic precision along the training trajectory for further boosting the time/energy...	Han Guo, Meng Li, Vikas Chandra, Xin Yang, Yingyan Lin, Yining Ding, Yonggan Fu
59	Learning a Latent Simplex in Input Sparsity Time	We consider the problem of learning a latent $k$-vertex simplex $K\in\mathbb{R}^d$, given $\mathbf{A}\in\mathbb{R}^{d\times n}$, which can be viewed as $n$ data points that are formed by randomly perturbing some latent points in $K$, possibly beyond $K$. A large class of latent variable models, such as adversarial clustering, mixed membership stochastic block models, and topic models can be cast in this view of learning a latent simplex. Bhattacharyya and Kannan (SODA 2020) give an algorithm for learning such a $k$-vertex latent...	Ainesh Bakshi, Chiranjib Bhattacharyya, David P. Woodruff, Ravi Kannan, Samson Zhou
60	Expressive Power of Invariant and Equivariant Graph Neural Networks	Various classes of Graph Neural Networks (GNN) have been proposed and shown to be successful in a wide range of applications with graph structured data. In this paper, we propose a theoretical framework able to compare the expressive power of these GNN architectures. The current universality theorems only apply to intractable classes of GNNs. Here, we prove the first approximation guarantees for practical GNNs, paving the way for a better understanding of their generalization. Our theoretical results are proved for invariant GNNs...	Marc Lelarge, Waïss Azizian
61	Discovering a set of policies for the worst case reward	We study the problem of how to construct a set of policies that can be composed together to solve a collection of reinforcement learning tasks. Each task is a different reward function defined as a linear combination of known features. We consider a specific class of policy compositions which we call set improving policies (SIPs): given a set of policies and a set of tasks, a SIP is any composition of the former whose performance is at least as good as that of its constituents across all the tasks. We focus on the most conservative...	André Barreto, Brendan O'Donoghue, Daniel J. Mankowitz, Iurii Kemaev, Satinder Singh, Shaobo Hou, Tom Zahavy
62	Model-Based Visual Planning with Self-Supervised Functional Distances	A generalist robot must be able to complete a variety of tasks in its environment. One appealing way to specify each task is in terms of a goal observation. However, learning goal-reaching policies with reinforcement learning remains a challenging problem, particularly when hand-engineered reward functions are not available. Learned dynamics models are a promising approach for learning about the environment without rewards or task-directed data, but planning to reach goals with such a model requires a notion of functional similarity...	Benjamin Eysenbach, Chelsea Finn, Frederik Ebert, Sergey Levine, Stephen Tian, Sudeep Dasari, Suraj Nair
63	Noise against noise: stochastic label noise helps combat inherent label noise	The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect, previously studied in optimization by analyzing the dynamics of parameter updates. In this paper, we are interested in learning with noisy labels, where we have a collection of samples with potential mislabeling. We show that a previously rarely discussed SGD noise, induced by stochastic label noise (SLN), mitigates the effects of inherent label noise. In contrast, the common SGD noise directly applied to model parameters does not. We...	Guangyong Chen, Jingwei Zhao, Junjie Ye, Pengfei Chen, PhengAnn Heng
64	Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning	Reinforcement learning methods trained on few environments rarely learn policies that generalize to unseen environments. To improve generalization, we incorporate the inherent sequential structure in reinforcement learning into the representation learning process. This approach is orthogonal to recent approaches, which rarely exploit this structure explicitly. Specifically, we introduce a theoretically motivated policy similarity metric (PSM) for measuring behavioral similarity between states. PSM assigns high similarity to states for...	Marc G. Bellemare, Marlos C. Machado, Pablo Samuel Castro, Rishabh Agarwal
65	VAEBM: A Symbiosis between Variational Autoencoders and Energy-based Models	Energy-based models (EBMs) have recently been successful in representing complex distributions of small images. However, sampling from them requires expensive Markov chain Monte Carlo (MCMC) iterations that mix slowly in high dimensional pixel space. Unlike EBMs, variational autoencoders (VAEs) generate samples quickly and are equipped with a latent space that enables fast traversal of the data manifold. However, VAEs tend to assign high probability density to regions in data space outside the actual data distribution and often fail at...	Arash Vahdat, Jan Kautz, Karsten Kreis, Zhisheng Xiao
66	Geometry-Aware Gradient Algorithms for Neural Architecture Search	Recent state-of-the-art methods for neural architecture search (NAS) exploit gradient-based optimization by relaxing the problem into continuous optimization over architectures and shared-weights, a noisy process that remains poorly understood. We argue for the study of single-level empirical risk minimization to understand NAS with weight-sharing, reducing the design of NAS methods to devising optimizers and regularizers that can quickly obtain high-quality solutions to this problem. Invoking the theory of mirror descent, we present a...	Ameet Talwalkar, Liam Li, Mikhail Khodak, Nina Balcan
67	Learning-based Support Estimation in Sublinear Time	We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to $\pm \varepsilon n$ from a sample of size $O(\log^2(1/\varepsilon) \cdot n/\log n)$, where $n$ is the data set size....	Piotr Indyk, Ronitt Rubinfeld, Sandeep Silwal, Shyam Narayanan, Tal Wagner, Talya Eden
68	Deciphering and Optimizing Multi-Task Learning: a Random Matrix Approach	This article provides theoretical insights into the inner workings of multi-task and transfer learning methods, by studying the tractable least-square support vector machine multi-task learning (LS-SVM MTL) method, in the limit of large ($p$) and numerous ($n$) data. By a random matrix analysis applied to a Gaussian mixture data model, the performance of MTL LS-SVM is shown to converge, as $n,p\to\infty$, to a deterministic limit involving simple (small-dimensional) statistics of the data. We prove (i) that the standard MTL LS-SVM...	Hafiz Tiomoko Ali, Malik Tiomoko, Romain Couillet
69	Autoregressive Entity Retrieval	Entities are at the center of how we represent and aggregate knowledge. For instance, Encyclopedias such as Wikipedia are structured by entities (e.g., one per Wikipedia article). The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering. One way to understand current approaches is as classifiers among atomic labels, one for each entity. Their weight vectors are dense entity representations produced by encoding entity meta information such...	Fabio Petroni, Gautier Izacard, Nicola De Cao, Sebastian Riedel
70	Systematic generalisation with group invariant predictions	We consider situations where the presence of dominant simpler correlations with the target variable in a training set can cause an SGD-trained neural network to be less reliant on more persistently correlating complex features. When the non-persistent, simpler correlations correspond to non-semantic background factors, a neural network trained on this data can exhibit dramatic failure upon encountering systematic distributional shift, where the correlating background features are recombined with different objects. We perform an...	Aaron C. Courville, Faruk Ahmed, Harm van Seijen, Yoshua Bengio
71	Iterative Empirical Game Solving via Single Policy Best Response	Policy-Space Response Oracles (PSRO) is a general algorithmic framework for learning policies in multiagent systems by interleaving empirical game analysis with deep reinforcement learning (DRL). At each iteration, DRL is invoked to train a best response to a mixture of opponent policies. The repeated application of DRL poses an expensive computational burden as we look to apply this algorithm to more complex domains. We introduce two variations of PSRO designed to reduce the amount of simulation required during DRL training. Both...	Max Olan Smith, Michael P. Wellman, Thomas Anthony
72	Understanding the role of importance weighting for deep learning	The recent paper by Byrd & Lipton (2019), based on empirical observations, raises a major concern on the impact of importance weighting for the over-parameterized deep learning models. They observe that as long as the model can separate the training data, the impact of importance weighting diminishes as the training proceeds. Nevertheless, there lacks a rigorous characterization of this phenomenon. In this paper, we provide formal characterizations and theoretical justifications on the role of importance weighting with respect to the...	Chuanwei Ruan, Da Xu, Yuting Ye
73	Long-tail learning via logit adjustment	Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels have only a few associated samples. This poses a challenge for generalisation on such labels, and also makes naive learning biased towards dominant labels. In this paper, we present a statistical framework that unifies and generalises several recent proposals to cope with these challenges. Our framework revisits the classic idea of logit adjustment based on the label frequencies, which encourages a large relative...	Aditya Krishna Menon, Andreas Veit, Ankit Singh Rawat, Himanshu Jain, Sadeep Jayasumana, Sanjiv Kumar
74	DDPNOpt: Differential Dynamic Programming Neural Optimizer	Interpretation of Deep Neural Networks (DNNs) training as an optimal control problem with nonlinear dynamical systems has received considerable attention recently, yet the algorithmic development remains relatively limited. In this work, we make an attempt along this line by reformulating the training procedure from the trajectory optimization perspective. We first show that most widely-used algorithms for training DNNs can be linked to the Differential Dynamic Programming (DDP), a celebrated second-order method rooted in the...	Evangelos A. Theodorou, GuanHorng Liu, Tianrong Chen
75	Learning with Feature-Dependent Label Noise: A Progressive Approach	Label noise is frequently observed in real-world large-scale datasets. The noise is introduced due to a variety of reasons; it is heterogeneous and feature-dependent. Most existing approaches to handling noisy labels fall into two categories: they either assume an ideal feature-independent noise, or remain heuristic without theoretical guarantees. In this paper, we propose to target a new family of feature-dependent label noise, which is much more general than commonly used i.i.d. label noise and encompasses a broad spectrum of noise...	Chao Chen, Mayank Goswami, Pengxiang Wu, Songzhu Zheng, Yikai Zhang
76	Information Laundering for Model Privacy	In this work, we propose information laundering, a novel framework for enhancing model privacy. Unlike data privacy that concerns the protection of raw data information, model privacy aims to protect an already-learned model that is to be deployed for public use. The private model can be obtained from general learning methods, and its deployment means that it will return a deterministic or random response for a given input query. An information-laundered model consists of probabilistic components that deliberately maneuver the intended...	Jie Ding, Jun Gao, Xinran Wang, Yu Xiang
77	Mutual Information State Intrinsic Control	Reinforcement learning has been shown to be highly successful at many challenging tasks. However, success heavily relies on well-shaped rewards. Intrinsically motivated RL attempts to remove this constraint by defining an intrinsic reward function. Motivated by the self-consciousness concept in psychology, we make a natural assumption that the agent knows what constitutes itself, and propose a new intrinsic objective that encourages the agent to have maximum control on the environment. We mathematically formalize this reward as the...	Pieter Abbeel, Rui Zhao, Volker Tresp, Wei Xu, Yang Gao
78	Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods	Establishing a theoretical analysis that explains why deep learning can outperform shallow learning such as kernel methods is one of the biggest issues in the deep learning literature. Towards answering this question, we evaluate excess risk of a deep learning estimator trained by a noisy gradient descent with ridge regularization on a mildly overparameterized neural network, and discuss its superiority to a class of linear estimators that includes neural tangent kernel approach, random feature model, other kernel methods, $k$-NN...	Shunta Akiyama, Taiji Suzuki
79	How Does Mixup Help With Robustness and Generalization?	Mixup is a popular data augmentation technique based on convex combinations of pairs of examples and their labels. This simple technique has been shown to substantially improve both the model's robustness as well as the generalization of the trained model. However, it is not well-understood why such improvement occurs. In this paper, we provide theoretical analysis to demonstrate how using Mixup in training helps model robustness and generalization. For robustness, we show that minimizing the Mixup loss corresponds to approximately...	Amirata Ghorbani, James Zou, Kenji Kawaguchi, Linjun Zhang, Zhun Deng
80	Dataset Inference: Ownership Resolution in Machine Learning	With increasingly more data and computation involved in their training, machine learning models constitute valuable intellectual property. This has spurred interest in model stealing, which is made more practical by advances in learning with partial, little, or no supervision. Existing defenses focus on inserting unique watermarks in a model's decision surface, but this is insufficient: the watermarks are not sampled from the training distribution and thus are not always preserved during model stealing. In this paper, we make the key...	Mohammad Yaghini, Nicolas Papernot, Pratyush Maini
81	Individually Fair Gradient Boosting	We consider the task of enforcing individual fairness in gradient boosting. Gradient boosting is a popular method for machine learning from tabular data, which arise often in applications where algorithmic fairness is a concern. At a high level, our approach is a functional gradient descent on a (distributionally) robust loss function that encodes our intuition of algorithmic fairness for the ML task at hand. Unlike prior approaches to individual fairness that only work with smooth ML models, our approach also works with non-smooth...	Alexander Vargo, Fan Zhang, Mikhail Yurochkin, Yuekai Sun
82	Large Scale Image Completion via Co-Modulated Generative Adversarial Networks	Numerous task-specific variants of conditional generative adversarial networks have been developed for image completion. Yet, a serious limitation remains that all existing algorithms tend to fail when handling large-scale missing regions. To overcome this challenge, we propose a generic new approach that bridges the gap between image-conditional and recent modulated unconditional generative architectures via co-modulation of both conditional and stochastic style representations. Also, due to the lack of good quantitative metrics for...	Eric IChao Chang, Jonathan Cui, Shengyu Zhao, Xiao Liang, Yan Xu, Yilun Sheng, Yue Dong
83	Self-Supervised Policy Adaptation during Deployment	In most real world scenarios, a policy trained by reinforcement learning in one environment needs to be deployed in another, potentially quite different environment. However, generalization across different environments is known to be hard. A natural solution would be to keep training after deployment in the new environment, but this cannot be done if the new environment offers no reward signal. Our work explores the use of self-supervision to allow the policy to continue training after deployment without using any rewards. While...	Alexei A. Efros, Guillem Alenyà, Lerrel Pinto, Nicklas Hansen, Pieter Abbeel, Rishabh Jangir, Xiaolong Wang, Yu Sun
84	Sharpness-aware Minimization for Efficiently Improving Generalization	In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by the connection between geometry of the loss landscape and generalization---including a generalization bound that we prove here---we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure,...	Ariel Kleiner, Behnam Neyshabur, Hossein Mobahi, Pierre Foret
85	PMI-Masking: Principled masking of correlated spans	Masking tokens uniformly at random constitutes a common flaw in the pretraining of Masked Language Models (MLMs) such as BERT. We show that such uniform masking allows an MLM to minimize its training objective by latching onto shallow local signals, leading to pretraining inefficiency and suboptimal downstream performance. To address this flaw, we propose PMI-Masking, a principled masking strategy based on the concept of Pointwise Mutual Information (PMI), which jointly masks a token n-gram if it exhibits high collocation over the...	Barak Lenz, Kevin LeytonBrown, Moshe Tennenholtz, Omri Abend, Opher Lieber, Yoav Levine, Yoav Shoham
86	Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images	We present a hierarchical VAE that, for the first time, generates samples quickly and outperforms the PixelCNN in log-likelihood on all natural image benchmarks. We begin by observing that, in theory, VAEs can actually represent autoregressive models, as well as faster, better models if they exist, when made sufficiently deep. Despite this, autoregressive models have historically outperformed VAEs in log-likelihood. We test if insufficient depth explains why by scaling a VAE to greater stochastic depth than previously...	Rewon Child
87	Data-Efficient Reinforcement Learning with Self-Predictive Representations	While deep reinforcement learning excels at solving tasks where large amounts of data can be collected through virtually unlimited interaction with the environment, learning from limited interaction remains a key challenge. We posit that an agent can learn more efficiently if we augment reward maximization with self-supervised objectives based on structure in its visual input and sequential interaction with the environment. Our method, Self-Predictive Representations (SPR), trains an agent to predict its own latent state...	Aaron C. Courville, Ankesh Anand, Max Schwarzer, Philip Bachman, R. Devon Hjelm, Rishab Goel
88	Watch-And-Help: A Challenge for Social Perception and Human-AI Collaboration	In this paper, we introduce Watch-And-Help (WAH), a challenge for testing social intelligence in agents. In WAH, an AI agent needs to help a human-like agent perform a complex household task efficiently. To succeed, the AI agent needs to i) understand the underlying goal of the task by watching a single demonstration of the human-like agent performing the same task (social perception), and ii) coordinate with the human-like agent to solve the task in an unseen environment as fast as possible (human-AI collaboration). For this...	Antonio Torralba, Joshua B. Tenenbaum, Sanja Fidler, Shuang Li, Tianmin Shu, Xavier Puig, YuanHong Liao, Zilin Wang
89	A Good Image Generator Is What You Need for High-Resolution Video Synthesis	Image and video synthesis are closely related areas aiming at generating content from noise. While rapid progress has been demonstrated in improving image-based models to handle large resolutions, high-quality renderings, and wide variations in image content, achieving comparable video generation results remains problematic. We present a framework that leverages contemporary image generators to render high-resolution videos. We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed...	Dimitris N. Metaxas, Jian Ren, Kyle Olszewski, Menglei Chai, Sergey Tulyakov, Xi Peng, Yu Tian
90	UPDeT: Universal Multi-agent RL via Policy Decoupling with Transformers	Recent advances in multi-agent reinforcement learning have been largely limited in training one model from scratch for every new task. The limitation is due to the restricted model architecture related to fixed input and output dimensions. This hinders the experience accumulation and transfer of the learned agent over tasks with diverse levels of difficulty (e.g. 3 vs 3 or 5 vs 6 multi-agent games). In this paper, we make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing one single...	Fengda Zhu, Siyi Hu, Xiaodan Liang, Xiaojun Chang
91	BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration	Program synthesis is challenging largely because of the difficulty of search in a large space of programs. Human programmers routinely tackle the task of writing complex programs by writing sub-programs and then analyzing their intermediate results to compose them in appropriate ways. Motivated by this intuition, we present a new synthesis approach that leverages learning to guide a bottom-up search over programs. In particular, we train a model to prioritize compositions of intermediate values during search conditioned on a given set...	Augustus Odena, Charles Sutton, David Bieber, Hanjun Dai, Kensen Shi, Rishabh Singh
92	Improving Adversarial Robustness via Channel-wise Activation Suppressing	The study of adversarial examples and their activations has attracted significant attention for secure and robust learning with deep neural networks (DNNs). Different from existing works, in this paper, we highlight two new characteristics of adversarial examples from the channel-wise activation perspective: 1) the activation magnitudes of adversarial examples are higher than those of natural examples; and 2) the channels are activated more uniformly by adversarial examples than by natural examples. We find that, while the...	ShuTao Xia, Xingjun Ma, Yang Bai, Yisen Wang, Yong Jiang, Yuyuan Zeng
93	What are the Statistical Limits of Offline RL with Linear Function Approximation?	Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies. The hope is that offline reinforcement learning coupled with function approximation methods (to deal with the curse of dimensionality) can provide a means to help alleviate the excessive sample complexity burden in modern sequential decision making problems. However, the extent to which this broader approach can be effective is not well understood, where the literature largely consists of...	Dean P. Foster, Ruosong Wang, Sham M. Kakade
94	Unlearnable Examples: Making Personal Data Unexploitable	The volume of "free" data on the internet has been key to the current success of deep learning. However, it also raises privacy concerns about the unauthorized exploitation of personal data for training commercial models. It is thus crucial to develop methods to prevent unauthorized data exploitation. This paper raises the question: can data be made unlearnable for deep learning models? We present a type of error-minimizing noise that can indeed make training examples unlearnable. Error-minimizing noise is intentionally generated to...	Hanxun Huang, James Bailey, Sarah Monazam Erfani, Xingjun Ma, Yisen Wang
95	Learning Mesh-Based Simulation with Graph Networks	Mesh-based simulations are central to modeling complex physical systems in many disciplines across science and engineering. Mesh representations support powerful numerical integration methods and their resolution can be adapted to strike favorable trade-offs between accuracy and efficiency. However, high-dimensional scientific simulations are very expensive to run, and solvers and parameters must often be tuned individually to each system studied. Here we introduce MeshGraphNets, a framework for learning mesh-based simulations using...	Alvaro SanchezGonzalez, Meire Fortunato, Peter W. Battaglia, Tobias Pfaff
96	Locally Free Weight Sharing for Network Width Search	Searching for network width is an effective way to slim deep neural networks with hardware budgets. With this aim, a one-shot supernet is usually leveraged as a performance evaluator to rank the performance with respect to different widths. Nevertheless, current methods mainly follow a manually fixed weight sharing pattern, which is limited to distinguish the performance gap of different width. In this paper, to better evaluate each width, we propose a locally free weight sharing strategy (CafeNet) accordingly. In CafeNet, weights are more freely...	Chang Xu, Changshui Zhang, Chen Qian, Fei Wang, Shan You, Tao Huang, Xiu Su
97	Graph Convolution with Low-rank Learnable Local Filters	Geometric variations like rotation, scaling, and viewpoint changes pose a significant challenge to visual understanding. One common solution is to directly model certain intrinsic structures, e.g., using landmarks. However, it then becomes non-trivial to build effective deep models, especially when the underlying non-Euclidean grid is irregular and coarse. Recent deep models using graph convolutions provide an appropriate framework to handle such non-Euclidean data, but many of them, particularly those based on global graph Laplacians,...	Qiang Qiu, Xiuyuan Cheng, Zichen Miao
98	Regularized Inverse Reinforcement Learning	Inverse Reinforcement Learning (IRL) aims to facilitate a learner’s ability to imitate expert behavior by acquiring reward functions that explain the expert’s decisions. Regularized IRL applies strongly convex regularizers to the learner’s policy in order to avoid the expert’s behavior being rationalized by arbitrary constant rewards, also known as degenerate solutions. We propose tractable solutions, and practical methods to obtain them, for regularized IRL. Current methods are restricted to the maximum-entropy IRL framework, limiting...	ChenYang Su, Derek Nowrouzezahrai, Joelle Pineau, Paul Barde, Thang Doan, Wonseok Jeon
99	Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking	Graph neural networks (GNNs) have become a popular approach to integrating structural inductive biases into NLP models. However, there has been little work on interpreting them, and specifically on understanding which parts of the graphs (e.g. syntactic trees or co-reference structures) contribute to a prediction. In this work, we introduce a post-hoc method for interpreting the predictions of GNNs which identifies unnecessary edges. Given a trained GNN model, we learn a simple classifier that, for every edge in every layer, predicts...	Ivan Titov, Michael Sejr Schlichtkrull, Nicola De Cao
100	Deep Neural Network Fingerprinting by Conferrable Adversarial Examples	In Machine Learning as a Service, a provider trains a deep neural network and gives many users access. The hosted (source) model is susceptible to model stealing attacks, where an adversary derives a surrogate model from API access to the source model. For post hoc detection of such attacks, the provider needs a robust method to determine whether a suspect model is a surrogate of their model. We propose a fingerprinting method for deep neural network classifiers that extracts a set of inputs from the source model so that only...	Florian Kerschbaum, Nils Lukas, Yuxuan Zhang
101	Tent: Fully Test-Time Adaptation by Entropy Minimization	A model must adapt itself to generalize to new and different data during testing. In this setting of fully test-time adaptation the model has only the test data and its own parameters. We propose to adapt by test entropy minimization (tent): we optimize the model for confidence as measured by the entropy of its predictions. Our method estimates normalization statistics and optimizes channel-wise affine transformations to update online on each batch. Tent reduces generalization error for image classification on corrupted ImageNet and...	Bruno A. Olshausen, Dequan Wang, Evan Shelhamer, Shaoteng Liu, Trevor Darrell
102	GAN "Steerability" without optimization	Recent research has shown remarkable success in revealing "steering" directions in the latent spaces of pre-trained GANs. These directions correspond to semantically meaningful image transformations (e.g., shift, zoom, color manipulations), and have the same interpretable effect across all categories that the GAN can generate. Some methods focus on user-specified transformations, while others discover transformations in an unsupervised manner. However, all existing techniques rely on an optimization procedure to expose those...	Nurit Spingarn, Ron Banner, Tomer Michaeli
103	Contrastive Divergence Learning is a Time Reversal Adversarial Game	Contrastive divergence (CD) learning is a classical method for fitting unnormalized statistical models to data samples. Despite its wide-spread use, the convergence properties of this algorithm are still not well understood. The main source of difficulty is an unjustified approximation which has been used to derive the gradient of the loss. In this paper, we present an alternative derivation of CD that does not require any approximation and sheds new light on the objective that is actually being optimized by the algorithm....	Omer Yair, Tomer Michaeli
104	Topology-Aware Segmentation Using Discrete Morse Theory	In the segmentation of fine-scale structures from natural and biomedical images, per-pixel accuracy is not the only metric of concern. Topological correctness, such as vessel connectivity and membrane closure, is crucial for downstream analysis tasks. In this paper, we propose a new approach to train deep image segmentation networks for better topological accuracy. In particular, leveraging the power of discrete Morse theory (DMT), we identify global structures, including 1D skeletons and 2D patches, which are important for topological...	Chao Chen, Dimitris Samaras, Fuxin Li, Xiaoling Hu, Yusu Wang
105	Are Neural Rankers still Outperformed by Gradient Boosted Decision Trees?	Despite the success of neural models on many major machine learning problems, their effectiveness on traditional Learning-to-Rank (LTR) problems is still not widely acknowledged. We first validate this concern by showing that most recent neural LTR models are, by a large margin, inferior to the best publicly available Gradient Boosted Decision Trees (GBDT) in terms of their reported ranking accuracy on benchmark datasets. This unfortunately was somehow overlooked in recent neural LTR papers. We then investigate why existing neural LTR...	Honglei Zhuang, Le Yan, Marc Najork, Michael Bendersky, Rama Kumar Pasumarthi, Xuanhui Wang, Yi Tay, Zhen Qin
106	Predicting Infectiousness for Proactive Contact Tracing	The COVID-19 pandemic has spread rapidly worldwide, overwhelming manual contact tracing in many countries and resulting in widespread lockdowns for emergency containment. Large-scale digital contact tracing (DCT) has emerged as a potential solution to resume economic and social activity while minimizing spread of the virus. Various DCT methods have been proposed, each making trade-offs between privacy, mobility restrictions, and public health. The most common approach, binary contact tracing (BCT), models infection as a binary event,...	Abhinav Sharma, Andrew Williams, Bernhard Schölkopf, Christopher J. Pal, David L. Buckeridge, Eilif Benjamin Müller, Gaétan MarceauCaron, Hannah Alsdurf, Irina Rish, Jian Tang, Joumana Ghosn, Martin Weiss, Meng Qu, Nasim Rahaman, Olexa Bilaniuk, Pierre Luc Carrier, PierreLuc StCharles, Prateek Gupta, Satya OrtizGagne, Tegan Maharaj, Tristan Deleu, Victor Schmidt, Yoshua Bengio
107	Regularization Matters in Policy Optimization - An Empirical Study on Continuous Control	Deep Reinforcement Learning (Deep RL) has been receiving increasingly more attention thanks to its encouraging performance on a variety of control tasks. Yet, conventional regularization techniques in training neural networks (e.g., $L_2$ regularization, dropout) have been largely ignored in RL methods, possibly because agents are typically trained and evaluated in the same environment, and because the deep RL community focuses more on high-level algorithm designs. In this work, we present the first comprehensive study of...	Bingyi Kang, Trevor Darrell, Xuanlin Li, Zhuang Liu
108	Minimum Width for Universal Approximation	The universal approximation property of width-bounded networks has been studied as a dual of classical universal approximation results on depth-bounded networks. However, the critical width enabling the universal approximation has not been exactly characterized in terms of the input dimension $d_x$ and the output dimension $d_y$. In this work, we provide the first definitive result in this direction for networks using the ReLU activation functions: The minimum width required for the universal approximation of the $L^p$ functions is...	Chulhee Yun, Jaeho Lee, Jinwoo Shin, Sejun Park
109	Towards Robustness Against Natural Language Word Substitutions	Robustness against word substitutions has a well-defined and widely acceptable form, i.e., using semantically similar words as substitutions, and thus it is considered as a fundamental stepping-stone towards broader robustness in natural language processing. Previous defense methods capture word substitutions in vector space by using either $l_2$-ball or hyper-rectangle, which results in perturbation sets that are not inclusive enough or unnecessarily large, and thus impedes mimicry of worst cases for robust training. In this paper, we...	Anh Tuan Luu, Hong Liu, Rongrong Ji, Xinshuai Dong
110	On the Theory of Implicit Deep Learning: Global Convergence with Implicit Layers	A deep equilibrium model uses implicit layers, which are implicitly defined through an equilibrium point of an infinite sequence of computation. It avoids any explicit computation of the infinite sequence by finding an equilibrium point directly via root-finding and by computing gradients via implicit differentiation. In this paper, we analyze the gradient dynamics of deep equilibrium models with nonlinearity only on weight matrices and non-convex objective functions of weights for regression and classification. Despite non-convexity,...	Kenji Kawaguchi
111	Structured Prediction as Translation between Augmented Natural Languages	We propose a new framework, Translation between Augmented Natural Languages (TANL), to solve many structured prediction language tasks including joint entity and relation extraction, nested named entity recognition, relation classification, semantic role labeling, event extraction, coreference resolution, and dialogue state tracking. Instead of tackling the problem by training task-specific discriminative classifiers, we frame it as a translation task between augmented natural languages, from which the task-relevant information can be...	Alessandro Achille, Ben Athiwaratkun, Bing Xiang, Cícero Nogueira dos Santos, Giovanni Paolini, Jason Krone, Jie Ma, Rishita Anubhai, Stefano Soatto
112	How Benign is Benign Overfitting?	We investigate two causes for adversarial vulnerability in deep neural networks: bad data and (poorly) trained models. When trained with SGD, deep neural networks essentially achieve zero training error, even in the presence of label noise, while also exhibiting good generalization on natural test data, something referred to as benign overfitting (Bartlett et al., 2020; Chatterji & Long, 2020). However, these models are vulnerable to adversarial attacks. We identify label noise as one of the causes for adversarial vulnerability, and...	Amartya Sanyal, Philip H. S. Torr, Puneet K. Dokania, Varun Kanade
113	Correcting experience replay for multi-agent communication	We consider the problem of learning to communicate using multi-agent reinforcement learning (MARL). A common approach is to learn off-policy, using data sampled from a replay buffer. However, messages received in the past may not accurately reflect the current communication policy of each agent, and this complicates learning. We therefore introduce a 'communication correction' which accounts for the non-stationarity of observed communication induced by multi-agent learning. It works by relabelling the received message to make it likely...	Peter Dayan, Sanjeevan Ahilan
114	Emergent Symbols through Binding in External Memory	A key aspect of human intelligence is the ability to infer abstract rules directly from high-dimensional sensory data, and to do so given only a limited amount of training experience. Deep neural network algorithms have proven to be a powerful tool for learning directly from high-dimensional data, but currently lack this capacity for data-efficient induction of abstract rules, leading some to argue that symbol-processing mechanisms will be necessary to account for this capacity. In this work, we take a step toward bridging this gap by...	Ishan Sinha, Jonathan D. Cohen, Taylor Whittington Webb
115	Influence Estimation for Generative Adversarial Networks	Identifying harmful instances, whose absence in a training dataset improves model performance, is important for building better machine learning models. Although previous studies have succeeded in estimating harmful instances under supervised settings, they cannot be trivially extended to generative adversarial networks (GANs). This is because previous approaches require that (i) the absence of a training instance directly affects the loss value and that (ii) the change in the loss directly measures the harmfulness of the instance for...	Hiroki Ohashi, Naoyuki Terashita, Takashi Kanemaru, Yuichi Nonaka
116	PlasticineLab: A Soft-Body Manipulation Benchmark with Differentiable Physics	Simulated virtual environments serve as one of the main driving forces behind developing and evaluating skill learning algorithms. However, existing environments typically only simulate rigid body physics. Additionally, the simulation process usually does not provide gradients that might be useful for planning and control optimizations. We introduce a new differentiable physics benchmark called PlasticineLab, which includes a diverse collection of soft body manipulation tasks. In each task, the agent uses manipulators to deform the...	Chuang Gan, Hao Su, Joshua B. Tenenbaum, Siyuan Zhou, Tao Du, Yuanming Hu, Zhiao Huang
117	Implicit Normalizing Flows	Normalizing flows define a probability distribution by an explicit invertible transformation $\mathbf{z}=f(\mathbf{x})$. In this work, we present implicit normalizing flows (ImpFlows), which generalize normalizing flows by allowing the mapping to be implicitly defined by the roots of an equation $F(\mathbf{z}, \mathbf{x})= \mathbf{0}$. ImpFlows build on residual flows (ResFlows) with a proper balance between expressiveness and tractability. Through theoretical analysis,...	Cheng Lu, Chongxuan Li, Jianfei Chen, Jun Zhu, Qiuhao Wang
118	Support-set bottlenecks for video-text representation learning	The dominant paradigm for learning video-text representations – noise contrastive learning – increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs. We posit that this last behaviour is too strict, enforcing dissimilar representations even for samples that are semantically-related – for example, visually similar videos or ones that share the same depicted action. In this paper, we propose a...	Alexander G. Hauptmann, Andrea Vedaldi, Florian Metze, João F. Henriques, Mandela Patrick, PoYao Huang, Yuki Markus Asano
119	Winning the L2RPN Challenge: Power Grid Management via Semi-Markov Afterstate Actor-Critic	Safe and reliable electricity transmission in power grids is crucial for modern society. It is thus quite natural that there has been a growing interest in the automatic management of power grids, exemplified by the Learning to Run a Power Network Challenge (L2RPN), modeling the problem as a reinforcement learning (RL) task. However, it is highly challenging to manage a real-world scale power grid, mostly due to the massive scale of its state and action space. In this paper, we present an off-policy actor-critic approach that...	ByungJun Lee, Deunsol Yoon, KeeEung Kim, Sunghoon Hong
120	Learning Incompressible Fluid Dynamics from Scratch - Towards Fast, Differentiable Fluid Models that Generalize	Fast and stable fluid simulations are an essential prerequisite for applications ranging from computer-generated imagery to computer-aided design in research and development. However, solving the partial differential equations of incompressible fluids is a challenging task and traditional numerical approximation schemes come at high computational costs. Recent deep learning based approaches promise vast speed-ups but do not generalize to new fluid domains, require fluid simulation data for training, or rely on complex pipelines that...	Michael Weinmann, Nils Wandel, Reinhard Klein
121	The Traveling Observer Model: Multi-task Learning Through Spatial Variable Embeddings	This paper frames a general prediction system as an observer traveling around a continuous space, measuring values at some locations, and predicting them at others. The observer is completely agnostic about any particular task being solved; it cares only about measurement locations and their values. This perspective leads to a machine learning framework in which seemingly unrelated tasks can be solved by a single model, by embedding their input and output variables into a shared space. An implementation of the framework is developed in...	Elliot Meyerson, Risto Miikkulainen
122	Grounded Language Learning Fast and Slow	Recent work has shown that large text-based neural language models acquire a surprising propensity for one-shot learning. Here, we show that an agent situated in a simulated 3D world, and endowed with a novel dual-coding external memory, can exhibit similar one-shot word learning when trained with conventional RL algorithms. After a single introduction to a novel object via visual perception and language ("This is a dax"), the agent can manipulate the object as instructed ("Put the dax on the bed"), combining short-term, within-episode...	Felix Hill, Hamza Merzic, Nathaniel Wong, Olivier Tieleman, Stephen Clark, Tamara von Glehn
123	Long-tailed Recognition by Routing Diverse Distribution-Aware Experts	Natural data are often long-tail distributed over semantic classes. Existing recognition methods tackle this imbalanced classification by placing more emphasis on the tail data, through class re-balancing/re-weighting or ensembling over different data groups, resulting in increased tail accuracies but reduced head accuracies. We take a dynamic view of the training data and provide a principled model bias and variance analysis as the training data fluctuates: Existing long-tail classifiers invariably increase the model variance and the...	Long Lian, Stella X. Yu, Xudong Wang, Zhongqi Miao, Ziwei Liu
124	Differentially Private Learning Needs Better Features (or Much More Data)	We demonstrate that differentially private machine learning has not yet reached its "AlexNet moment" on many canonical vision tasks: linear models trained on handcrafted features significantly outperform end-to-end deep neural networks for moderate privacy budgets. To exceed the performance of handcrafted features, we show that private learning requires either much more private data, or access to features learned on public data from a similar domain. Our work introduces simple yet strong baselines for differentially private learning...	Dan Boneh, Florian Tramèr
125	Unsupervised Object Keypoint Learning using Local Spatial Predictability	We propose PermaKey, a novel approach to representation learning based on object keypoints. It leverages the predictability of local image regions from spatial neighborhoods to identify salient regions that correspond to object parts, which are then converted to keypoints. Unlike prior approaches, it utilizes predictability to discover object keypoints, an intrinsic property of objects. This ensures that it does not overly bias keypoints to focus on characteristics that are not unique to objects, such as movement, shape, colour etc. We...	Anand Gopalakrishnan, Jürgen Schmidhuber, Sjoerd van Steenkiste
126	On Statistical Bias In Active Learning: How and When to Fix It	Active learning is a powerful tool when labelling data is expensive, but it introduces a bias because the training data no longer follows the population distribution. We formalize this bias and investigate the situations in which it can be harmful and sometimes even helpful. We further introduce novel corrective weights to remove bias when doing so is beneficial. Through this, our work not only provides a useful mechanism that can improve the active learning approach, but also an explanation for the empirical successes of various...	Sebastian Farquhar, Tom Rainforth, Yarin Gal
127	Implicit Convex Regularizers of CNN Architectures: Convex Optimization of Two- and Three-Layer Networks in Polynomial Time	We study training of Convolutional Neural Networks (CNNs) with ReLU activations and introduce exact convex optimization formulations with a polynomial complexity with respect to the number of data samples, the number of neurons, and data dimension. More specifically, we develop a convex analytic framework utilizing semi-infinite duality to obtain equivalent convex optimization problems for several two- and three-layer CNN architectures. We first prove that two-layer CNNs can be globally optimized via an $\ell_2$ norm regularized convex...	Mert Pilanci, Tolga Ergen
128	Generalization in data-driven models of primary visual cortex	Deep neural networks (DNN) have set new standards at predicting responses of neural populations to visual input. Most such DNNs consist of a convolutional network (core) shared across all neurons which learns a representation of neural computation in visual cortex and a neuron-specific readout that linearly combines the relevant features in this representation. The goal of this paper is to test whether such a representation is indeed generally characteristic for visual cortex, i.e. generalizes between animals of a species, and what...	Akshay Kumar Jagadish, Alexander S. Ecker, Andreas S. Tolias, Edgar Y. Walker, Eric Wang, Erick Cobos, Fabian H. Sinz, Konstantin Willeke, KonstantinKlemens Lurz, Mohammad Bashiri, Santiago A. Cadena, Taliah Muhammad
129	Mathematical Reasoning via Self-supervised Skip-tree Training	We demonstrate that self-supervised language modeling applied to mathematical formulas enables logical reasoning. To measure the logical reasoning abilities of language models, we formulate several evaluation (downstream) tasks, such as inferring types, suggesting missing assumptions and completing equalities. For training language models for formal mathematics, we propose a novel skip-tree task. We find that models trained on the skip-tree task show surprisingly strong mathematical reasoning abilities, and outperform models trained on...	Christian Szegedy, Dennis Lee, Kshitij Bansal, Markus Norman Rabe
130	Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with 1/n Parameters	Recent works have demonstrated reasonable success of representation learning in hypercomplex space. Specifically, “fully-connected layers with quaternions” (quaternions are 4D hypercomplex numbers), which replace real-valued matrix multiplications in fully-connected layers with Hamilton products of quaternions, both enjoy parameter savings with only 1/4 learnable parameters and achieve comparable performance in various applications. However, one key caveat is that hypercomplex space only exists at very few predefined dimensions (4D,...	Alvin Chan, Anh Tuan Luu, Aston Zhang, Jie Fu, Shuai Zhang, Siu Cheung Hui, Yi Tay
131	Distributional Sliced-Wasserstein and Applications to Generative Modeling	Sliced-Wasserstein distance (SW) and its variant, Max Sliced-Wasserstein distance (Max-SW), have been used widely in the recent years due to their fast computation and scalability even when the probability measures lie in a very high dimensional space. However, SW requires many unnecessary projection samples to approximate its value while Max-SW only uses the most important projection, which ignores the information of other useful directions. In order to account for these weaknesses, we propose a novel distance, named Distributional...	Hung Bui, Khai Nguyen, Nhat Ho, Tung Pham
132	Async-RED: A Provably Convergent Asynchronous Block Parallel Stochastic Method using Deep Denoising Priors	Regularization by denoising (RED) is a recently developed framework for solving inverse problems by integrating advanced denoisers as image priors. Recent work has shown its state-of-the-art performance when combined with pre-trained deep denoisers. However, current RED algorithms are inadequate for parallel processing on multicore systems. We address this issue by proposing a new asynchronous RED (Async-RED) algorithm that enables asynchronous parallel processing of data, making it significantly faster than its serial counterparts for...	Brendt Wohlberg, Jiaming Liu, Ulugbek Kamilov, Yiran Sun, Yu Sun
133	DeepAveragers: Offline Reinforcement Learning By Solving Derived Non-Parametric MDPs	We study an approach to offline reinforcement learning (RL) based on optimally solving finitely-represented MDPs derived from a static dataset of experience. This approach can be applied on top of any learned representation and has the potential to easily support multiple solution objectives as well as zero-shot adjustment to changing environments and goals. Our main contribution is to introduce the Deep Averagers with Costs MDP (DAC-MDP) and to investigate its solutions for offline RL. DAC-MDPs are a non-parametric model that can...	Aayam Kumar Shrestha, Alan Fern, Prasad Tadepalli, Stefan Lee
134	Learning from Protein Structure with Geometric Vector Perceptrons	Learning on 3D structures of large biomolecules is emerging as a distinct area in machine learning, but there has yet to emerge a unifying network architecture that simultaneously leverages the geometric and relational aspects of the problem domain. To address this gap, we introduce geometric vector perceptrons, which extend standard dense layers to operate on collections of Euclidean vectors. Graph neural networks equipped with such layers are able to perform both geometric and relational reasoning on efficient representations of...	Bowen Jing, Patricia Suriana, Raphael John Lamarre Townshend, Ron O. Dror, Stephan Eismann
135	Behavioral Cloning from Noisy Demonstrations	We consider the problem of learning an optimal expert behavior policy given noisy demonstrations that contain observations from both optimal and non-optimal expert behaviors. Popular imitation learning algorithms, such as generative adversarial imitation learning, assume that (clear) demonstrations are given from optimal expert policies but not the non-optimal ones, and thus often fail to imitate the optimal expert behaviors given the noisy demonstrations. Prior works that address the problem require (1) learning policies through...	Fumihiro Sasaki, Ryota Yamashina
136	Undistillable: Making A Nasty Teacher That CANNOT teach students	Knowledge Distillation (KD) is a widely used technique to transfer knowledge from pre-trained teacher models to (usually more lightweight) student models. However, in certain situations, this technique is more of a curse than a blessing. For instance, KD poses a potential risk of exposing intellectual properties (IPs): even if a trained machine learning model is released in "black boxes" (e.g., as executable software or APIs without open-sourcing code), it can still be replicated by KD through imitating input-output behaviors. To...	Chenyu You, Haoyu Ma, Tianlong Chen, TingKuei Hu, Xiaohui Xie, Zhangyang Wang
137	Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows	Time series forecasting is often fundamental to scientific and engineering problems and enables decision making. With ever increasing data set sizes, a trivial solution to scale up predictions is to assume independence between interacting time series. However, modeling statistical dependencies can improve accuracy and enable analysis of interaction effects. Deep learning methods are well suited for this problem, but multi-variate models often assume a simple parametric distribution and do not scale to high dimensions. In this work we...	AbdulSaboor Sheikh, Ingmar Schuster, Kashif Rasul, Roland Vollgraf, Urs M. Bergmann
138	Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels	We propose a simple data augmentation technique that can be applied to standard model-free reinforcement learning algorithms, enabling robust learning directly from pixels without the need for auxiliary losses or pre-training. The approach leverages input perturbations commonly used in computer vision tasks to transform input examples, as well as regularizing the value function and policy. Existing model-free approaches, such as Soft Actor-Critic (SAC), are not able to train deep networks effectively from image pixels. However, the...	Denis Yarats, Ilya Kostrikov, Rob Fergus
139	HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark	HardWare-aware Neural Architecture Search (HW-NAS) has recently gained tremendous attention by automating the design of deep neural networks deployed in more resource-constrained daily life devices. Despite its promising performance, developing optimal HW-NAS solutions can be prohibitively challenging as it requires cross-disciplinary knowledge in the algorithm, micro-architecture, and device-specific compilation. First, to determine the hardware-cost to be incorporated into the NAS process, existing works mostly adopt either...	Chaojian Li, Cong Hao, Haoran You, Qixuan Yu, Yang Zhao, Yingyan Lin, Yongan Zhang, Yonggan Fu, Yue Wang, Zhongzhi Yu
140	Practical Real Time Recurrent Learning with a Sparse Approximation	Recurrent neural networks are usually trained with backpropagation through time, which requires storing a complete history of network states, and prohibits updating the weights "online" (after every timestep). Real Time Recurrent Learning (RTRL) eliminates the need for history storage and allows for online weight updates, but does so at the expense of computational costs that are quartic in the state size. This renders RTRL training intractable for all but the smallest networks, even ones that are made highly sparse. We introduce the...	Alex Graves, Erich Elsen, Jacob Menick, Karen Simonyan, Simon Osindero, Utku Evci
141	Random Feature Attention	Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the sequence length. We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can be used as a drop-in...	Dani Yogatama, Hao Peng, Lingpeng Kong, Nikolaos Pappas, Noah A. Smith, Roy Schwartz
142	A Gradient Flow Framework For Analyzing Network Pruning	Recent network pruning methods focus on pruning models early-on in training. To estimate the impact of removing a parameter, these methods use importance measures that were originally designed to prune trained models. Despite lacking justification for their use early-on in training, such measures result in surprisingly low accuracy loss. To better explain this behavior, we develop a general framework that uses gradient flow to unify state-of-the-art importance measures through the norm of model parameters. We use this framework to...	Ekdeep Singh Lubana, Robert P. Dick
143	Recurrent Independent Mechanisms	We explore the hypothesis that learning modular structures which reflect the dynamics of the environment can lead to better generalization and robustness to changes that only affect a few of the underlying causes. We propose Recurrent Independent Mechanisms (RIMs), a new recurrent architecture in which multiple groups of recurrent cells operate with nearly independent transition dynamics, communicate only sparingly through the bottleneck of attention, and compete with each other so they are updated only at time steps where they are...	Alex Lamb, Anirudh Goyal, Bernhard Schölkopf, Jordan Hoffmann, Sergey Levine, Shagun Sodhani, Yoshua Bengio
144	The Intrinsic Dimension of Images and Its Impact on Learning	It is widely believed that natural image data exhibits low-dimensional structure despite the high dimensionality of conventional pixel representations. This idea underlies a common intuition for the remarkable success of deep learning in computer vision. In this work, we apply dimension estimation tools to popular datasets and investigate the role of low-dimensional structure in deep learning. We find that common natural image datasets indeed have very low intrinsic dimension relative to the high number of pixels in the images....	Ahmed Abdelkader, Chen Zhu, Micah Goldblum, Phillip Pope, Tom Goldstein
145	Uncertainty Sets for Image Classifiers using Conformal Prediction	Convolutional image classifiers can achieve high predictive accuracy, but quantifying their uncertainty remains an unresolved challenge, hindering their deployment in consequential settings. Existing uncertainty quantification techniques, such as Platt scaling, attempt to calibrate the network’s probability estimates, but they do not have formal guarantees. We present an algorithm that modifies any classifier to output a predictive set containing the true label with a user-specified probability, such as 90%. The algorithm is simple...	Anastasios Nikolas Angelopoulos, Jitendra Malik, Michael I. Jordan, Stephen Bates
146	Sequential Density Ratio Estimation for Simultaneous Optimization of Speed and Accuracy	Classifying sequential data as early and as accurately as possible is a challenging yet critical problem, especially when a sampling cost is high. One algorithm that achieves this goal is the sequential probability ratio test (SPRT), which is known as Bayes-optimal: it can keep the expected number of data samples as small as possible, given the desired error upper-bound. However, the original SPRT makes two critical assumptions that limit its application in real-world scenarios: (i) samples are independently and identically...	Akinori F. Ebihara, Hitoshi Imaoka, Kazuyuki Sakurai, Taiki Miyagawa
147	Disentangled Recurrent Wasserstein Autoencoder	Learning disentangled representations leads to interpretable models and facilitates data generation with style transfer, which has been extensively studied on static data such as images in an unsupervised learning framework. However, only a few works have explored unsupervised disentangled sequential representation learning due to challenges of generating sequential data. In this paper, we propose recurrent Wasserstein Autoencoder (R-WAE), a new framework for generative modeling of sequential data. R-WAE disentangles the representation...	Jun Han, Li Erran Li, Ligong Han, Martin Renqiang Min, Xuan Zhang
148	Generalization bounds via distillation	This paper theoretically investigates the following empirical phenomenon: given a high-complexity network with poor generalization bounds, one can distill it into a network with nearly identical predictions but low complexity and vastly smaller generalization bounds. The main contribution is an analysis showing that the original network inherits this good generalization bound from its distillation, assuming the use of well-behaved data augmentation. This bound is presented both in an abstract and in a concrete form, the latter...	Daniel Hsu, Lan Wang, Matus Telgarsky, Ziwei Ji
149	Neural Approximate Sufficient Statistics for Implicit Models	We consider the fundamental problem of how to automatically construct summary statistics for implicit generative models where the evaluation of the likelihood function is intractable but sampling data from the model is possible. The idea is to frame the task of constructing sufficient statistics as learning mutual information maximizing representations of the data with the help of deep neural networks. The infomax learning procedure does not need to estimate any density or density ratio. We apply our approach to both traditional...	Aaron C. Courville, Dinghuai Zhang, Michael U. Gutmann, Yanzhi Chen, Zhanxing Zhu
150	A Panda? No, It's a Sloth: Slowdown Attacks on Adaptive Multi-Exit Neural Network Inference	Recent increases in the computational demands of deep neural networks (DNNs), combined with the observation that most input samples require only simple models, have sparked interest in input-adaptive multi-exit architectures, such as MSDNets or Shallow-Deep Networks. These architectures enable faster inferences and could bring DNNs to low-power devices, e.g., in the Internet of Things (IoT). However, it is unknown if the computational savings provided by this approach are robust against adversarial pressure. In particular, an adversary...	Ionut-Vlad Modoranu, Sanghyun Hong, Tudor Dumitras, Yigitcan Kaya
151	Orthogonalizing Convolutional Layers with the Cayley Transform	Recent work has highlighted several advantages of enforcing orthogonality in the weight layers of deep networks, such as maintaining the stability of activations, preserving gradient norms, and enhancing adversarial robustness by enforcing low Lipschitz constants. Although numerous methods exist for enforcing the orthogonality of fully-connected layers, those for convolutional layers are more heuristic in nature, often focusing on penalty methods or limited classes of convolutions. In this work, we propose and evaluate an alternative...	Asher Trockman, J. Zico Kolter
152	LambdaNetworks: Modeling long-range Interactions without Attention	We present lambda layers -- an alternative framework to self-attention -- for capturing long-range interactions between an input and structured contextual information (e.g. a pixel surrounded by other pixels). Lambda layers capture such interactions by transforming available contexts into linear functions, termed lambdas, and applying these linear functions to each input separately. Similar to linear attention, lambda layers bypass expensive attention maps, but in contrast, they model both content and position-based interactions which...	Irwan Bello
153	Mind the Pad - CNNs Can Develop Blind Spots	We show how feature maps in convolutional networks are susceptible to spatial bias. Due to a combination of architectural choices, the activation at certain locations is systematically elevated or weakened. The major source of this bias is the padding mechanism. Depending on several aspects of convolution arithmetic, this mechanism can apply the padding unevenly, leading to asymmetries in the learned weights. We demonstrate how such bias can be detrimental to certain tasks such as small object detection: the activation is suppressed if...	Bilal Alsallakh, Jun Yuan, Narine Kokhlikyan, Orion Reblitz-Richardson, Vivek Miglani
154	Meta-GMVAE: Mixture of Gaussian VAE for Unsupervised Meta-Learning	Unsupervised learning aims to learn meaningful representations from unlabeled data which can captures its intrinsic structure, that can be transferred to downstream tasks. Meta-learning, whose objective is to learn to generalize across tasks such that the learned model can rapidly adapt to a novel task, shares the spirit of unsupervised learning in that the both seek to learn more effective and efficient learning procedure than learning from scratch. The fundamental difference of the two is that the most meta-learning approaches are...	Dong Bok Lee, Dongchan Min, Seanie Lee, Sung Ju Hwang
155	Fast Geometric Projections for Local Robustness Certification	Local robustness ensures that a model classifies all inputs within an $\ell_p$-ball consistently, which precludes various forms of adversarial inputs. In this paper, we present a fast procedure for checking local robustness in feed-forward neural networks with piecewise-linear activation functions. Such networks partition the input space into a set of convex polyhedral regions in which the network’s behavior is linear; hence, a systematic search for decision boundaries within the regions around a given input is sufficient for assessing...	Aymeric Fromherz, Bryan Parno, Corina S. Pasareanu, Klas Leino, Matt Fredrikson
156	Fidelity-based Deep Adiabatic Scheduling	Adiabatic quantum computation is a form of computation that acts by slowly interpolating a quantum system between an easy to prepare initial state and a final state that represents a solution to a given computational problem. The choice of the interpolation schedule is critical to the performance: if at a certain time point, the evolution is too rapid, the system has a high probability to transfer to a higher energy state, which does not represent a solution to the problem. On the other hand, an evolution that is too slow leads to a...	Eli Ovits, Lior Wolf
157	On Self-Supervised Image Representations for GAN Evaluation	The embeddings from CNNs pretrained on Imagenet classification are de-facto standard image representations for assessing GANs via FID, Precision and Recall measures. Despite broad previous criticism of their usage for non-Imagenet domains, these embeddings are still the top choice in most of the GAN literature. In this paper, we advocate the usage of the state-of-the-art self-supervised representations to evaluate GANs on the established non-Imagenet benchmarks. These representations, typically obtained via contrastive learning, are...	Andrey Voynov, Artem Babenko, Stanislav Morozov
158	Retrieval-Augmented Generation for Code Summarization via Hybrid GNN	Source code summarization aims to generate natural language summaries from structured code snippets for better understanding code functionalities. However, automatic code summarization is challenging due to the complexity of the source code and the language gap between the source code and natural language summaries. Most previous approaches either rely on retrieval-based (which can take advantage of similar examples seen from the retrieval database, but have low generalization performance) or generation-based methods (which have better...	Jing Kai Siow, Shangqing Liu, Xiaofei Xie, Yang Liu, Yu Chen
159	Self-supervised Visual Reinforcement Learning with Object-centric Representations	Autonomous agents need large repertoires of skills to act reasonably on new tasks that they have not seen before. However, acquiring these skills using only a stream of high-dimensional, unstructured, and unlabeled observations is a tricky challenge for any autonomous agent. Previous methods have used variational autoencoders to encode a scene into a low-dimensional vector that can be used as a goal for an agent to discover new skills. Nevertheless, in compositional/multi-object environments it is difficult to disentangle all the...	Andrii Zadaianchuk, Georg Martius, Maximilian Seitzer
160	Identifying nonlinear dynamical systems with multiple time scales and long-range dependencies	A main theoretical interest in biology and physics is to identify the nonlinear dynamical system (DS) that generated observed time series. Recurrent Neural Networks (RNN) are, in principle, powerful enough to approximate any underlying DS, but in their vanilla form suffer from the exploding vs. vanishing gradients problem. Previous attempts to alleviate this problem resulted either in more complicated, mathematically less tractable RNN architectures, or strongly limited the dynamical expressiveness of the RNN. Here we address this...	Daniel Durstewitz, Dominik Schmidt, Georgia Koppe, Max Beutelspacher, Zahra Monfared
161	Neural Topic Model via Optimal Transport	Recently, Neural Topic Models (NTMs) inspired by variational autoencoders have obtained increasingly research interest due to their promising results on text analysis. However, it is usually hard for existing NTMs to achieve good document representation and coherent/diverse topics at the same time. Moreover, they often degrade their performance severely on short documents. The requirement of reparameterisation could also comprise their training quality and model flexibility. To address these shortcomings, we present a new neural topic...	Dinh Phung, He Zhao, Trung Le, Viet Huynh, Wray L. Buntine
162	Memory Optimization for Deep Networks	Deep learning is slowly, but steadily, hitting a memory bottleneck. While the tensor computation in top-of-the-line GPUs increased by $32\times$ over the last five years, the total available memory only grew by $2.5\times$. This prevents researchers from exploring larger architectures, as training large networks requires more memory for storing intermediate outputs. In this paper, we present MONeT, an automatic framework that minimizes both the memory footprint and computational overhead of deep networks. MONeT jointly optimizes the...	Aashaka Shah, Chao-Yuan Wu, Jayashree Mohan, Philipp Krähenbühl, Vijay Chidambaram
163	Stabilized Medical Image Attacks	Convolutional Neural Networks (CNNs) have advanced existing medical systems for automatic disease diagnosis. However, a threat to these systems arises that adversarial attacks make CNNs vulnerable. Inaccurate diagnosis results make a negative influence on human healthcare. There is a need to investigate potential adversarial attacks to robustify deep medical diagnosis systems. On the other side, there are several modalities of medical images (e.g., CT, fundus, and endoscopic image) of which each type is significantly different from...	Gege Qi, Kai Ma, Lijun Gong, Yefeng Zheng, Yibing Song
164	Quantifying Differences in Reward Functions	For many tasks, the reward function is inaccessible to introspection or too complex to be specified procedurally, and must instead be learned from user data. Prior work has evaluated learned reward functions by evaluating policies optimized for the learned reward. However, this method cannot distinguish between the learned reward function failing to reflect user preferences and the policy optimization process failing to optimize the learned reward. Moreover, this method can only tell us about behavior in the evaluation environment, but...	Adam Gleave, Jan Leike, Michael Dennis, Shane Legg, Stuart Russell
165	MARS: Markov Molecular Sampling for Multi-objective Drug Discovery	Searching for novel molecules with desired chemical properties is crucial in drug discovery. Existing work focuses on developing neural models to generate either molecular sequences or chemical graphs. However, it remains a big challenge to find novel and diverse compounds satisfying several properties. In this paper, we propose MARS, a method for multi-objective drug molecule discovery. MARS is based on the idea of generating the chemical candidates by iteratively editing fragments of molecular graphs. To search for high-quality...	Chence Shi, Hao Zhou, Lei Li, Weinan Zhang, Yong Yu, Yutong Xie, Yuwei Yang
166	Gauge Equivariant Mesh CNNs: Anisotropic convolutions on geometric graphs	A common approach to define convolutions on meshes is to interpret them as a graph and apply graph convolutional networks (GCNs). Such GCNs utilize isotropic kernels and are therefore insensitive to the relative orientation of vertices and thus to the geometry of the mesh as a whole. We propose Gauge Equivariant Mesh CNNs which generalize GCNs to apply anisotropic gauge equivariant kernels. Since the resulting features carry orientation information, we introduce a geometric message passing scheme defined by parallel transporting...	Maurice Weiler, Max Welling, Pim de Haan, Taco Cohen
167	RMSprop converges with proper hyper-parameter	Despite the existence of divergence examples, RMSprop remains one of the most popular algorithms in machine learning. Towards closing the gap between theory and practice, we prove that RMSprop converges with proper choice of hyper-parameters under certain conditions. More specifically, we prove that when the hyper-parameter $\beta_2$ is close enough to $1$, RMSprop and its random shuffling version converge to a bounded region in general, and to critical points in the interpolation regime. It is worth mentioning that our results do not...	Dawei Li, Mingyi Hong, Naichen Shi, Ruoyu Sun
168	Revisiting Dynamic Convolution via Matrix Decomposition	Recent research in dynamic convolution shows substantial performance boost for efficient CNNs, due to the adaptive aggregation of K static convolution kernels. It has two limitations: (a) it increases the number of convolutional weights by K-times, and (b) the joint optimization of dynamic attention and static convolution kernels is challenging. In this paper, we revisit it from a new perspective of matrix decomposition and reveal the key issue is that dynamic convolution applies dynamic attention over channel groups after projecting...	Dongdong Chen, Lu Yuan, Mei Chen, Mengchen Liu, Nuno Vasconcelos, Xiyang Dai, Ye Yu, Yinpeng Chen, Yunsheng Li, Zicheng Liu
169	Explainable Deep One-Class Classification	Deep one-class classification variants for anomaly detection learn a mapping that concentrates nominal samples in feature space causing anomalies to be mapped away. Because this transformation is highly non-linear, finding interpretations poses a significant challenge. In this paper we present an explainable deep one-class classification method, Fully Convolutional Data Description (FCDD), where the mapped samples are themselves also an explanation heatmap. FCDD yields competitive detection performance and provides reasonable...	Billy Joe Franks, Klaus-Robert Müller, Lukas Ruff, Marius Kloft, Philipp Liznerski, Robert A. Vandermeulen
170	Taking Notes on the Fly Helps Language Pre-Training	How to make unsupervised language pre-training more efficient and less resource-intensive is an important research direction in NLP. In this paper, we focus on improving the efficiency of language pre-training methods through providing better data utilization. It is well-known that in language data corpus, words follow a heavy-tail distribution. A large proportion of words appear only very few times and the embeddings of rare words are usually poorly optimized. We argue that such embeddings carry inadequate semantic signals, which...	Chen Xing, Di He, Guolin Ke, Qiyu Wu, Tie-Yan Liu, Yatao Li
171	Mixed-Features Vectors and Subspace Splitting	Motivated by metagenomics, recommender systems, dictionary learning, and related problems, this paper introduces subspace splitting (SS): the task of clustering the entries of what we call a mixed-features vector, that is, a vector whose subsets of coordinates agree with a collection of subspaces. We derive precise identifiability conditions under which SS is well-posed, thus providing the first fundamental theory for this problem. We also propose the first three practical SS algorithms, each with advantages and disadvantages: a random...	Alejandro Pimentel-Alarcón, Daniel L. Pimentel-Alarcón
172	Neural Pruning via Growing Regularization	Regularization has long been utilized to learn sparsity in deep neural network pruning. However, its role is mainly explored in the small penalty strength regime. In this work, we extend its application to a new scenario where the regularization grows large gradually to tackle two central problems of pruning: pruning schedule and weight importance scoring. (1) The former topic is newly brought up in this work, which we find critical to the pruning performance while receives little research attention. Specifically, we propose an L2...	Can Qin, Huan Wang, Yulun Zhang, Yun Fu
173	Practical Massively Parallel Monte-Carlo Tree Search Applied to Molecular Design	It is common practice to use large computational resources to train neural networks, known from many examples, such as reinforcement learning applications. However, while massively parallel computing is often used for training models, it is rarely used to search solutions for combinatorial optimization problems. This paper proposes a novel massively parallel Monte-Carlo Tree Search (MP-MCTS) algorithm that works efficiently for a 1,000 worker scale on a distributed memory environment using multiple compute nodes and applies it to...	Kazuki Yoshizoe, Tanuj Kr Aasawat, Xiufeng Yang
174	Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition	In many scenarios, named entity recognition (NER) models severely suffer from unlabeled entity problem, where the entities of a sentence may not be fully annotated. Through empirical studies performed on synthetic datasets, we find two causes of performance degradation. One is the reduction of annotated entities and the other is treating unlabeled entities as negative instances. The first cause has less impact than the second one and can be mitigated by adopting pretraining language models. The second cause seriously misguides a model...	Lemao Liu, Shuming Shi, Yangming Li
175	Deep Networks and the Multiple Manifold Problem	We study the multiple manifold problem, a binary classification task modeled on applications in machine vision, in which a deep fully-connected neural network is trained to separate two low-dimensional submanifolds of the unit sphere. We provide an analysis of the one-dimensional case, proving for a simple manifold configuration that when the network depth $L$ is large relative to certain geometric and statistical properties of the data, the network width $n$ grows as a sufficiently large polynomial in $L$, and the number of i.i.d....	Dar Gilboa, John Wright, Sam Buchanan
176	Knowledge distillation via softmax regression representation learning	This paper addresses the problem of model compression via knowledge distillation. We advocate for a method that optimizes the output feature of the penultimate layer of the student network and hence is directly related to representation learning. Previous distillation methods which typically impose direct feature matching between the student and the teacher do not take into account the classification problem at hand. On the contrary, our distillation method decouples representation learning and classification and utilizes the teacher's...	Adrian Bulat, Brais Martínez, Georgios Tzimiropoulos, Jing Yang
177	Nearest Neighbor Machine Translation	We introduce $k$-nearest-neighbor machine translation ($k$NN-MT), which predicts tokens with a nearest-neighbor classifier over a large datastore of cached examples, using representations from a neural translation model for similarity search. This approach requires no additional training and scales to give the decoder direct access to billions of examples at test time, resulting in a highly expressive model that consistently improves performance across many settings. Simply adding nearest-neighbor search improves a state-of-the-art...	Angela Fan, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis, Urvashi Khandelwal
178	WrapNet: Neural Net Inference with Ultra-Low-Precision Arithmetic	Low-precision neural networks represent both weights and activations with few bits, drastically reducing the cost of multiplications. Meanwhile, these products are accumulated using high-precision (typically 32-bit) additions. Additions dominate the arithmetic complexity of inference in quantized (e.g., binary) nets, and high precision is needed to avoid overflow. To further optimize inference, we propose WrapNet, an architecture that adapts neural networks to use low-precision (8-bit) additions while achieving classification accuracy...	Christoph Studer, Hong-Min Chu, Oscar Castañeda, Pingyeh Chiang, Renkun Ni, Tom Goldstein
179	Wandering within a world: Online contextualized few-shot learning	We aim to bridge the gap between typical human and machine-learning environments by extending the standard framework of few-shot learning to an online, continual setting. In this setting, episodes do not have separate training and testing phases, and instead models are evaluated online while learning novel classes. As in the real world, where the presence of spatiotemporal context helps us retrieve learned skills in the past, our online few-shot learning setting also features an underlying context that changes throughout time. Object...	Mengye Ren, Michael Curtis Mozer, Michael Louis Iuzzolino, Richard S. Zemel
180	Few-Shot Learning via Learning the Representation, Provably	This paper studies few-shot learning via representation learning, where one uses $T$ source tasks with $n_1$ data per task to learn a representation in order to reduce the sample complexity of a target task for which there is only $n_2 (\ll n_1)$ data. Specifically, we focus on the setting where there exists a good common representation between source and target, and our goal is to understand how much a sample size reduction is possible. First, we study the setting where this common representation is low-dimensional and provide a risk...	Jason D. Lee, Qi Lei, Sham M. Kakade, Simon Shaolei Du, Wei Hu
181	AdaGCN: Adaboosting Graph Convolutional Networks into Deep Models	The design of deep graph models still remains to be investigated and the crucial part is how to explore and exploit the knowledge from different hops of neighbors in an efficient way. In this paper, we propose a novel RNN-like deep graph neural network architecture by incorporating AdaBoost into the computation of network; and the proposed graph convolutional network called AdaGCN (Adaboosting Graph Convolutional Network) has the ability to efficiently extract knowledge from high-order neighbors of current nodes and then integrates...	Ke Sun, Zhanxing Zhu, Zhouchen Lin
182	MultiModalQA: complex question answering over text, tables and images	When answering complex questions, people can seamlessly combine information from visual, textual and tabular sources. While interest in models that reason over multiple pieces of evidence has surged in recent years, there has been relatively little work on question answering models that reason across multiple modalities. In this paper, we present MultiModalQA (MMQA): a challenging question answering dataset that requires joint reasoning over text, tables and images. We create MMQA using a new framework for generating complex...	Akari Asai, Alon Talmor, Amnon Catav, Dan Lahav, Gabriel Ilharco, Hannaneh Hajishirzi, Jonathan Berant, Ori Yoran, Yizhong Wang
183	Net-DNF: Effective Deep Modeling of Tabular Data	A challenging open question in deep learning is how to handle tabular data. Unlike domains such as image and natural language processing, where deep architectures prevail, there is still no widely accepted neural architecture that dominates tabular data. As a step toward bridging this gap, we present Net-DNF a novel generic architecture whose inductive bias elicits models whose structure corresponds to logical Boolean formulas in disjunctive normal form (DNF) over affine soft-threshold decision terms. Net-DNFs also promote localized...	Gal Elidan, Liran Katzir, Ran El-Yaniv
184	Optimal Regularization can Mitigate Double Descent	Recent empirical and theoretical studies have shown that many learning algorithms -- from linear regression to neural networks -- can have test performance that is non-monotonic in quantities such the sample size and model size. This striking phenomenon, often referred to as "double descent", has raised questions of if we need to re-think our current understanding of generalization. In this work, we study whether the double-descent phenomenon can be avoided by using optimal regularization. Theoretically, we prove that for certain...	Prayaag Venkat, Preetum Nakkiran, Sham M. Kakade, Tengyu Ma
185	Meta Back-Translation	Back-translation is an effective strategy to improve the performance of Neural Machine Translation (NMT) by generating pseudo-parallel data. However, several recent works have found that better translation quality in the pseudo-parallel data does not necessarily lead to a better final translation model, while lower-quality but diverse data often yields stronger results instead. In this paper we propose a new way to generate pseudo-parallel data for back-translation that directly optimizes the final model performance. Specifically, we...	Graham Neubig, Hieu Pham, Xinyi Wang, Yiming Yang
186	Learning A Minimax Optimizer: A Pilot Study	Solving continuous minimax optimization is of extensive practical interest, yet notoriously unstable and difficult. This paper introduces the learning to optimize (L2O) methodology to the minimax problems for the first time and addresses its accompanying unique challenges. We first present Twin-L2O, the first dedicated minimax L2O method consisting of two LSTMs for updating min and max variables separately. The decoupled design is found to facilitate learning, particularly when the min and max variables are highly asymmetric. Empirical...	Howard Heaton, Jialin Liu, Jiayi Shen, Tianlong Chen, Wotao Yin, Xiaohan Chen, Zhangyang Wang
187	A Wigner-Eckart Theorem for Group Equivariant Convolution Kernels	Group equivariant convolutional networks (GCNNs) endow classical convolutional networks with additional symmetry priors, which can lead to a considerably improved performance. Recent advances in the theoretical description of GCNNs revealed that such models can generally be understood as performing convolutions with $G$-steerable kernels, that is, kernels that satisfy an equivariance constraint themselves. While the $G$-steerability constraint has been derived, it has to date only been solved for specific use cases - a general...	Leon Lang, Maurice Weiler
188	Viewmaker Networks: Learning Views for Unsupervised Representation Learning	Many recent methods for unsupervised representation learning train models to be invariant to different "views," or distorted versions of an input. However, designing these views requires considerable trial and error by human experts, hindering widespread adoption of unsupervised representation learning methods across domains and modalities. To address this, we propose viewmaker networks: generative models that learn to produce useful views from a given input. Viewmakers are stochastic bounded adversaries: they produce views by...	Alex Tamkin, Mike Wu, Noah D. Goodman
189	Scalable Transfer Learning with Expert Models	Transfer of pre-trained representations can improve sample efficiency and reduce computational requirements for new tasks. However, representations used for transfer are usually generic, and are not tailored to a particular distribution of downstream tasks. We explore the use of expert representations for transfer with a simple, yet effective, strategy. We train a diverse set of experts by exploiting existing label structures, and use cheap-to-compute performance proxies to select the relevant expert for each target task. This strategy...	André Susano Pinto, Basil Mustafa, Carlos Riquelme Ruiz, Cédric Renggli, Daniel Keysers, Joan Puigcerver, Neil Houlsby, Sylvain Gelly
190	Negative Data Augmentation	Data augmentation is often used to enlarge datasets with synthetic samples generated in accordance with the underlying data distribution. To enable a wider range of augmentations, we explore negative data augmentation strategies (NDA) that intentionally create out-of-distribution samples. We show that such negative out-of-distribution samples provide information on the support of the data distribution, and can be leveraged for generative modeling and representation learning. We introduce a new GAN training objective where we use NDA as...	Abhishek Sinha, Burak Uzkent, Hongxia Jin, Jiaming Song, Kumar Ayush, Stefano Ermon
191	Fantastic Four: Differentiable and Efficient Bounds on Singular Values of Convolution Layers	In deep neural networks, the spectral norm of the Jacobian of a layer bounds the factor by which the norm of a signal changes during forward/backward propagation. Spectral norm regularizations have been shown to improve generalization, robustness and optimization of deep learning methods. Existing methods to compute the spectral norm of convolution layers either rely on heuristics that are efficient in computation but lack guarantees or are theoretically-sound but computationally expensive. In this work, we obtain the best of both...	Sahil Singla, Soheil Feizi
192	CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding	Data augmentation has been demonstrated as an effective strategy for improving model generalization and data efficiency. However, due to the discrete nature of natural language, designing label-preserving transformations for text data tends to be more challenging. In this paper, we propose a novel data augmentation framework dubbed CoDA, which synthesizes diverse and informative augmented examples by integrating multiple transformations organically. Moreover, a contrastive regularization is introduced to capture the global...	Dinghan Shen, Jiawei Han, Sandra Sajeev, Weizhu Chen, Yanru Qu, Yelong Shen
193	Teaching with Commentaries	Effective training of deep neural networks can be challenging, and there remain many open questions on how to best learn these models. Recently developed methods to improve neural network training examine teaching: providing learned information during the training process to improve downstream model performance. In this paper, we take steps towards extending the scope of teaching. We propose a flexible teaching framework using commentaries, learned meta-information helpful for training on a particular task. We present gradient-based...	Aniruddh Raghu, David Duvenaud, Geoffrey E. Hinton, Maithra Raghu, Simon Kornblith
194	MixKD: Towards Efficient Distillation of Large-scale Language Models	Large-scale language models have recently demonstrated impressive empirical performance. Nevertheless, the improved results are attained at the price of bigger models, more power consumption, and slower inference, which hinder their applicability to low-resource (both memory and computation) platforms. Knowledge distillation (KD) has been demonstrated as an effective framework for compressing such big models. However, large-scale neural network systems are prone to memorize training instances, and thus tend to make inconsistent...	Changyou Chen, Dinghan Shen, Kevin J. Liang, Lawrence Carin, Weituo Hao, Weizhu Chen, Yufan Zhou
195	FairFil: Contrastive Neural Debiasing Method for Pretrained Text Encoders	Pretrained text encoders, such as BERT, have been applied increasingly in various natural language processing (NLP) tasks, and have recently demonstrated significant performance gains. However, recent studies have demonstrated the existence of social bias in these pretrained NLP models. Although prior works have made progress on word-level debiasing, improved sentence-level fairness of pretrained encoders still lacks exploration. In this paper, we propose the first neural debiasing method for a pretrained sentence encoder, which...	Lawrence Carin, Pengyu Cheng, Shijing Si, Siyang Yuan, Weituo Hao
196	Probabilistic Numeric Convolutional Neural Networks	Continuous input signals like images and time series that are irregularly sampled or have missing values are challenging for existing deep learning methods. Coherently defined feature representations must depend on the values in unobserved regions of the input. Drawing from the work in probabilistic numerics, we propose Probabilistic Numeric Convolutional Neural Networks which represent features as Gaussian processes, providing a probabilistic description of discretization error. We then define a convolutional layer as the evolution of...	Marc Anton Finzi, Max Welling, Roberto Bondesan
197	Computational Separation Between Convolutional and Fully-Connected Networks	Convolutional neural networks (CNN) exhibit unmatched performance in a multitude of computer vision tasks. However, the advantage of using convolutional networks over fully-connected networks is not understood from a theoretical perspective. In this work, we show how convolutional networks can leverage locality in the data, and thus achieve a computational advantage over fully-connected networks. Specifically, we show a class of problems that can be efficiently solved using convolutional networks trained with gradient-descent, but at...	Eran Malach, Shai Shalev-Shwartz
198	On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines	Fine-tuning pre-trained transformer-based language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks. Despite the strong empirical performance of fine-tuned models, fine-tuning is an unstable process: training the same model with multiple random seeds can result in a large variance of the task performance. Previous literature (Devlin et al., 2019; Lee et al., 2020; Dodge et al., 2020) identified two potential reasons for the observed instability: catastrophic forgetting and small...	Dietrich Klakow, Maksym Andriushchenko, Marius Mosbach
199	Variational Information Bottleneck for Effective Low-Resource Fine-Tuning	While large-scale pretrained language models have obtained impressive results when fine-tuned on a wide variety of tasks, they still often suffer from overfitting in low-resource scenarios. Since such models are general-purpose feature extractors, many of these features are inevitably irrelevant for a given target task. We propose to use Variational Information Bottleneck (VIB) to suppress irrelevant features when fine-tuning on low-resource target tasks, and show that our method successfully reduces overfitting. Moreover, we show that...	James Henderson, Rabeeh Karimi Mahabadi, Yonatan Belinkov
200	Witches' Brew: Industrial Scale Data Poisoning via Gradient Matching	Data Poisoning attacks modify training data to maliciously control a model trained on such data. In this work, we focus on targeted poisoning attacks which cause a reclassification of an unmodified test image and as such breach model integrity. We consider a particularly malicious poisoning attack that is both "from scratch" and "clean label", meaning we analyze an attack that successfully works against new, randomly initialized models, and is nearly imperceptible to humans, all while perturbing only a small fraction of the...	Gavin Taylor, Jonas Geiping, Liam H. Fowl, Michael Moeller, Tom Goldstein, W. Ronny Huang, Wojciech Czaja
201	DeBERTa: Decoding-enhanced BERT with Disentangled Attention	Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using...	Jianfeng Gao, Pengcheng He, Weizhu Chen, Xiaodong Liu
202	Optimism in Reinforcement Learning with Generalized Linear Function Approximation	We design a new provably efficient algorithm for episodic reinforcement learning with generalized linear function approximation. We analyze the algorithm under a new expressivity assumption that we call "optimistic closure," which is strictly weaker than assumptions from prior analyses for the linear setting. With optimistic closure, we prove that our algorithm enjoys a regret bound of $\widetilde{O}\left(H\sqrt{d^3 T}\right)$ where $H$ is the horizon, $d$ is the dimensionality of the state-action features and $T$ is the number of...	Akshay Krishnamurthy, Ruosong Wang, Simon Shaolei Du, Yining Wang
203	Graph Traversal with Tensor Functionals: A Meta-Algorithm for Scalable Learning	Graph Representation Learning (GRL) methods have impacted fields from chemistry to social science. However, their algorithmic implementations are specialized to specific use-cases e.g. "message passing" methods are run differently from "node embedding" ones. Despite their apparent differences, all these methods utilize the graph structure, and therefore, their learning can be approximated with stochastic graph traversals. We propose Graph Traversal via Tensor Functionals (GTTF), a unifying meta-algorithm framework for easing the...	Aram Galstyan, Bryan Perozzi, Elan Sopher Markowitz, Greg Ver Steeg, Keshav Balasubramanian, Mehrnoosh Mirtaheri, Sami Abu-El-Haija
204	Diverse Video Generation using a Gaussian Process Trigger	Generating future frames given a few context (or past) frames is a challenging task. It requires modeling the temporal coherence of videos as well as multi-modality in terms of diversity in the potential future states. Current variational approaches for video generation tend to marginalize over multi-modal future outcomes. Instead, we propose to explicitly model the multi-modality in the future outcomes and leverage it to sample diverse futures. Our approach, Diverse Video Generator, uses a GP to learn priors on future states given the...	Abhinav Shrivastava, Gaurav Shrivastava
205	Signatory: differentiable computations of the signature and logsignature transforms, on both CPU and GPU	Signatory is a library for calculating and performing functionality related to the signature and logsignature transforms. The focus is on machine learning, and as such includes features such as CPU parallelism, GPU support, and backpropagation. To our knowledge it is the first GPU-capable library for these operations. Signatory implements new features not available in previous libraries, such as efficient precomputation strategies. Furthermore, several novel algorithmic improvements are introduced, producing substantial real-world...	Patrick Kidger, Terry J. Lyons
206	MoPro: Webly Supervised Learning with Momentum Prototypes	We propose a webly-supervised representation learning method that does not suffer from the annotation unscalability of supervised learning, nor the computation unscalability of self-supervised learning. Most existing works on webly-supervised representation learning adopt a vanilla supervised learning method without accounting for the prevalent noise in the training data, whereas most prior methods in learning with label noise are less effective for real-world large-scale noisy data. We propose momentum prototypes (MoPro), a simple...	Caiming Xiong, Junnan Li, Steven C. H. Hoi
207	A Universal Representation Transformer Layer for Few-Shot Image Classification	Few-shot classification aims to recognize unseen classes when presented with only a small number of samples. We consider the problem of multi-domain few-shot image classification, where unseen classes and examples come from diverse data sources. This problem has seen growing interest and has inspired the development of benchmarks such as Meta-Dataset. A key challenge in this multi-domain setting is to effectively integrate the feature representations from the diverse set of training domains. Here, we propose a Universal Representation...	Guodong Long, Hugo Larochelle, Jing Jiang, Lu Liu, William L. Hamilton
208	Primal Wasserstein Imitation Learning	Imitation Learning (IL) methods seek to match the behavior of an agent with that of an expert. In the present work, we propose a new IL method based on a conceptually simple algorithm: Primal Wasserstein Imitation Learning (PWIL), which ties to the primal form of the Wasserstein distance between the expert and the agent state-action distributions. We present a reward function which is derived offline, as opposed to recent adversarial IL algorithms that learn a reward function through interactions with the environment, and which...	Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Robert Dadashi
209	Learning perturbation sets for robust machine learning	Although much progress has been made towards robust deep learning, a significant gap in robustness remains between real-world perturbations and more narrowly defined sets typically studied in adversarial defenses. In this paper, we aim to bridge this gap by learning perturbation sets from data, in order to characterize real-world effects for robust training and evaluation. Specifically, we use a conditional generator that defines the perturbation set over a constrained region of the latent space. We formulate desirable properties that...	Eric Wong, J. Zico Kolter
210	CopulaGNN: Towards Integrating Representational and Correlational Roles of Graphs in Graph Neural Networks	Graph-structured data are ubiquitous. However, graphs encode diverse types of information and thus play different roles in data representation. In this paper, we distinguish the representational and the correlational roles played by the graphs in node-level prediction tasks, and we investigate how Graph Neural Network (GNN) models can effectively leverage both types of information. Conceptually, the representational information provides guidance for the model to construct better node features; while the correlational...	Bo Chang, Jiaqi Ma, Qiaozhu Mei, Xuefei Zhang
211	On the Critical Role of Conventions in Adaptive Human-AI Collaboration	Humans can quickly adapt to new partners in collaborative tasks (e.g. playing basketball), because they understand which fundamental skills of the task (e.g. how to dribble, how to shoot) carry over across new partners. Humans can also quickly adapt to similar tasks with the same partners by carrying over conventions that they have developed (e.g. raising hand signals pass the ball), without learning to coordinate from scratch. To collaborate seamlessly with humans, AI agents should adapt quickly to new partners and new tasks as well....	Andy Shih, Arjun Sawhney, Dorsa Sadigh, Jovana Kondic, Stefano Ermon
212	On the Bottleneck of Graph Neural Networks and its Practical Implications	Since the proposal of the graph neural network (GNN) by Gori et al. (2005) and Scarselli et al. (2008), one of the major problems in training GNNs was their struggle to propagate information between distant nodes in the graph. We propose a new explanation for this problem: GNNs are susceptible to a bottleneck when aggregating messages across a long path. This bottleneck causes the over-squashing of exponentially growing information into fixed-size vectors. As a result, GNNs fail to propagate messages originating from distant nodes and...	Eran Yahav, Uri Alon
213	The geometry of integration in text classification RNNs	Despite the widespread application of recurrent neural networks (RNNs), a unified understanding of how RNNs solve particular tasks remains elusive. In particular, it is unclear what dynamical patterns arise in trained RNNs, and how those patterns depend on the training dataset or task. This work addresses these questions in the context of text classification, building on earlier work studying the dynamics of binary sentiment-classification networks (Maheswaranathan et al., 2019). We study text-classification tasks beyond the binary...	Ankush Garg, David Sussillo, Kyle Aitken, Niru Maheswaranathan, Vinay Venkatesh Ramasesh, Yuan Cao
214	Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability	We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the value $2 / \text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to...	Ameet Talwalkar, J. Zico Kolter, Jeremy Cohen, Simran Kaur, Yuanzhi Li
215	CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning	Despite recent successes of reinforcement learning (RL), it remains a challenge for agents to transfer learned skills to related environments. To facilitate research addressing this problem, we propose CausalWorld, a benchmark for causal structure and transfer learning in a robotic manipulation environment. The environment is a simulation of an open-source robotic platform, hence offering the possibility of sim-to-real transfer. Tasks consist of constructing 3D shapes from a set of blocks - inspired by how children learn to build...	Alexander Neitz, Anirudh Goyal, Bernhard Schölkopf, Frederik Träuble, Manuel Wuthrich, Ossama Ahmed, Stefan Bauer, Yoshua Bengio
216	Empirical or Invariant Risk Minimization? A Sample Complexity Perspective	Recently, invariant risk minimization (IRM) was proposed as a promising solution to address out-of-distribution (OOD) generalization. However, it is unclear when IRM should be preferred over the widely-employed empirical risk minimization (ERM) framework. In this work, we analyze both these frameworks from the perspective of sample complexity, thus taking a firm step towards answering this important question. We find that depending on the type of data generation mechanism, the two approaches might have very different finite sample and...	Amit Dhurandhar, Jun Wang, Karthikeyan Shanmugam, Kartik Ahuja, Kush R. Varshney
217	Scaling Symbolic Methods using Gradients for Neural Model Explanation	Symbolic techniques based on Satisfiability Modulo Theory (SMT) solvers have been proposed for analyzing and verifying neural network properties, but their usage has been fairly limited owing to their poor scalability with larger networks. In this work, we propose a technique for combining gradient-based methods with symbolic techniques to scale such analyses and demonstrate its application for model explanation. In particular, we apply this technique to identify minimal regions in an input that are most relevant for a neural network's...	Li Li, Patrick Riley, Rishabh Singh, Subham Sekhar Sahoo, Subhashini Venugopalan
218	Control-Aware Representations for Model-based Reinforcement Learning	A major challenge in modern reinforcement learning (RL) is efficient control of dynamical systems from high-dimensional sensory observations. Learning controllable embedding (LCE) is a promising approach that addresses this challenge by embedding the observations into a lower-dimensional latent space, estimating the latent dynamics, and utilizing it to perform control in the latent space. Two important questions in this area are how to learn a representation that is amenable to the control problem at hand, and how to achieve an...	Brandon Cui, Mohammad Ghavamzadeh, Yinlam Chow
219	C-Learning: Learning to Achieve Goals via Recursive Classification	We study the problem of predicting and controlling the future state distribution of an autonomous agent. This problem, which can be viewed as a reframing of goal-conditioned reinforcement learning (RL), is centered around learning a conditional probability density function over future states. Instead of directly estimating this density function, we indirectly estimate this density function by training a classifier to predict whether an observation comes from the future. Via Bayes' rule, predictions from our classifier can be...	Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine
220	The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers	We propose a new framework for reasoning about generalization in deep learning. The core idea is to couple the Real World, where optimizers take stochastic gradient steps on the empirical loss, to an Ideal World, where optimizers take steps on the population loss. This leads to an alternate decomposition of test error into: (1) the Ideal World test error plus (2) the gap between the two worlds. If the gap (2) is universally small, this reduces the problem of generalization in offline learning to the problem of optimization in online...	Behnam Neyshabur, Hanie Sedghi, Preetum Nakkiran
221	Improving VAEs' Robustness to Adversarial Attack	Variational autoencoders (VAEs) have recently been shown to be vulnerable to adversarial attacks, wherein they are fooled into reconstructing a chosen target image. However, how to defend against such attacks remains an open problem. We make significant advances in addressing this issue by introducing methods for producing adversarially robust VAEs. Namely, we first demonstrate that methods proposed to obtain disentangled latent representations produce VAEs that are more robust to these attacks. However, this robustness comes at the...	Alexander Camuto, Christopher C. Holmes, Matthew Willetts, Stephen J. Roberts, Tom Rainforth
222	What Can You Learn From Your Muscles? Learning Visual Representation from Human Interactions	Learning effective representations of visual data that generalize to a variety of downstream tasks has been a long quest for computer vision. Most representation learning approaches rely solely on visual data such as images or videos. In this paper, we explore a novel approach, where we use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations. For this study, we collect a dataset of human interactions capturing body part movements and gaze in their daily...	Ali Farhadi, Daniel Gordon, Kiana Ehsani, Roozbeh Mottaghi, Thomas Hai Dang Nguyen
223	EEC: Learning to Encode and Regenerate Images for Continual Learning	The two main impediments to continual learning are catastrophic forgetting and memory limitations on the storage of data. To cope with these challenges, we propose a novel, cognitively-inspired approach which trains autoencoders with Neural Style Transfer to encode and store images. Reconstructed images from encoded episodes are replayed when training the classifier model on a new task to avoid catastrophic forgetting. The loss function for the reconstructed images is weighted to reduce its effect during classifier training to cope...	Alan R. Wagner, Ali Ayub
224	Impact of Representation Learning in Linear Bandits	We study how representation learning can improve the efficiency of bandit problems. We study the setting where we play $T$ linear bandits with dimension $d$ concurrently, and these $T$ bandit tasks share a common $k (\ll d)$ dimensional linear representation. For the finite-action setting, we present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit. When $T$ is sufficiently large, our algorithm significantly outperforms the naive algorithm...	Jason D. Lee, Jiaqi Yang, Simon Shaolei Du, Wei Hu
225	MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space	Data augmentation is an efficient way to expand a training dataset by creating additional artificial data. While data augmentation is found to be effective in improving the generalization capabilities of models for various machine learning tasks, the underlying augmentation methods are usually manually designed and carefully evaluated for each data modality separately, like image processing functions for image data and word-replacing rules for text data. In this work, we propose an automated data augmentation approach called MODALS...	Dit-Yan Yeung, Tsz-Him Cheung
226	The Recurrent Neural Tangent Kernel	The study of deep neural networks (DNNs) in the infinite-width limit, via the so-called neural tangent kernel (NTK) approach, has provided new insights into the dynamics of learning, generalization, and the impact of initialization. One key DNN architecture remains to be kernelized, namely, the recurrent neural network (RNN). In this paper we introduce and study the Recurrent Neural Tangent Kernel (RNTK), which provides new insights into the behavior of overparametrized RNNs. A key property of the RNTK should greatly benefit...	Randall Balestriero, Richard G. Baraniuk, Sina Alemohammad, Zichao Wang
227	Projected Latent Markov Chain Monte Carlo: Conditional Sampling of Normalizing Flows	We introduce Projected Latent Markov Chain Monte Carlo (PL-MCMC), a technique for sampling from the exact conditional distributions learned by normalizing flows. As a conditional sampling method, PL-MCMC enables Monte Carlo Expectation Maximization (MC-EM) training of normalizing flows from incomplete data. Through experimental tests applying normalizing flows to missing data tasks for a variety of data sets, we demonstrate the efficacy of PL-MCMC for conditional sampling from normalizing flows.	Chris Cannella, Mohammadreza Soltani, Vahid Tarokh
228	Learning the Pareto Front with Hypernetworks	Multi-objective optimization (MOO) problems are prevalent in machine learning. These problems have a set of optimal solutions, called the Pareto front, where each point on the front represents a different trade-off between possibly conflicting objectives. Recent MOO methods can target a specific desired ray in loss space; however, most approaches still face two grave limitations: (i) A separate model has to be trained for each point on the front; and (ii) The exact trade-off must be known before the optimization process. Here, we tackle...	Aviv Navon, Aviv Shamsian, Ethan Fetaya, Gal Chechik
229	Estimating and Evaluating Regression Predictive Uncertainty in Deep Object Detectors	Predictive uncertainty estimation is an essential next step for the reliable deployment of deep object detectors in safety-critical tasks. In this work, we focus on estimating predictive distributions for bounding box regression output with variance networks. We show that in the context of object detection, training variance networks with negative log likelihood (NLL) can lead to high entropy predictive distributions regardless of the correctness of the output mean. We propose to use the energy score as a non-local proper scoring rule...	Ali Harakeh, Steven L. Waslander
230	Predicting Classification Accuracy When Adding New Unobserved Classes	Multiclass classifiers are often designed and evaluated only on a sample from the classes on which they will eventually be applied. Hence, their final accuracy remains unknown. In this work we study how a classifier’s performance over the initial class sample can be used to extrapolate its expected accuracy on a larger, unobserved set of classes. For this, we define a measure of separation between correct and incorrect classes that is independent of the number of classes: the "reversed ROC" (rROC), which is obtained by replacing the...	Yuli Slavutsky, Yuval Benjamini
231	BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction	We study the challenging task of neural network quantization without end-to-end retraining, called Post-training Quantization (PTQ). PTQ usually requires a small subset of training data but produces less powerful quantized models than Quantization-Aware Training (QAT). In this work, we propose a novel PTQ framework, dubbed BRECQ, which pushes the limits of bitwidth in PTQ down to INT2 for the first time. BRECQ leverages the basic building blocks in neural networks and reconstructs them one-by-one. In a comprehensive theoretical study...	Fengwei Yu, Peng Hu, Qi Zhang, Ruihao Gong, Shi Gu, Wei Wang, Xu Tan, Yang Yang, Yuhang Li
232	No MCMC for me: Amortized sampling for fast and stable training of energy-based models	Energy-Based Models (EBMs) present a flexible and appealing way to represent uncertainty. Despite recent advances, training EBMs on high-dimensional data remains a challenging problem as the state-of-the-art approaches are costly, unstable, and require considerable tuning and domain expertise to apply successfully. In this work, we present a simple method for training EBMs at scale which uses an entropy-regularized generator to amortize the MCMC sampling typically used in EBM training. We improve upon prior MCMC-based entropy...	David Duvenaud, Jacob Jin Kelly, Kevin Swersky, Milad Hashemi, Mohammad Norouzi, Will Sussman Grathwohl
233	GraphCodeBERT: Pre-training Code Representations with Data Flow	Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of...	Alexey Svyatkovskiy, Colin B. Clement, Dawn Drain, Daxin Jiang, Daya Guo, Duyu Tang, Jian Yin, Long Zhou, Michele Tufano, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shuai Lu, Shujie Liu, Shuo Ren, Zhangyin Feng
234	Conservative Safety Critics for Exploration	Safe exploration presents a major challenge in reinforcement learning (RL): when active data collection requires deploying partially trained policies, we must ensure that these policies avoid catastrophically unsafe regions, while still enabling trial and error learning. In this paper, we target the problem of safe exploration in RL, by learning a conservative safety estimate of environment states through a critic, and provably upper bound the likelihood of catastrophic failures at every training iteration. We theoretically...	Animesh Garg, Aviral Kumar, Florian Shkurti, Homanga Bharadhwaj, Nicholas Rhinehart, Sergey Levine
235	Improve Object Detection with Feature-based Knowledge Distillation: Towards Accurate and Efficient Detectors	Knowledge distillation, in which a student model is trained to mimic a teacher model, has been proved as an effective technique for model compression and model accuracy boosting. However, most knowledge distillation methods, designed for image classification, have failed on more challenging tasks, such as object detection. In this paper, we suggest that the failure of knowledge distillation on object detection is mainly caused by two reasons: (1) the imbalance between pixels of foreground and background and (2) lack of distillation on...	Kaisheng Ma, Linfeng Zhang
236	A Temporal Kernel Approach for Deep Learning with Continuous-time Information	Sequential deep learning models such as RNN, causal CNN and attention mechanism do not readily consume continuous-time information. Discretizing the temporal data, as we show, causes inconsistency even for simple continuous-time processes. Current approaches often handle time in a heuristic manner to be consistent with the existing deep learning architectures and implementations. In this paper, we provide a principled way to characterize continuous-time systems using deep learning tools. Notably, the proposed approach applies to all...	Chuanwei Ruan, Da Xu, Evren Körpeoglu, Kannan Achan, Sushant Kumar
237	For self-supervised learning, Rationality implies generalization, provably	We prove a new upper bound on the generalization gap of classifiers that are obtained by first using self-supervision to learn a representation $r$ of the training data, and then fitting a simple (e.g., linear) classifier $g$ to the labels. Specifically, we show that (under the assumptions described below) the generalization gap of such classifiers tends to zero if $\mathsf{C}(g) \ll n$, where $\mathsf{C}(g)$ is an appropriately-defined measure of the simple classifier $g$'s complexity, and $n$ is the number of training samples. We...	Boaz Barak, Gal Kaplun, Yamini Bansal
238	How to Find Your Friendly Neighborhood: Graph Attention Design with Self-Supervision	Attention mechanism in graph neural networks is designed to assign larger weights to important neighbor nodes for better representation. However, what graph attention learns is not understood well, particularly when graphs are noisy. In this paper, we propose a self-supervised graph attention network (SuperGAT), an improved graph attention model for noisy graphs. Specifically, we exploit two attention forms compatible with a self-supervised task to predict edges, whose presence and absence contain the inherent information about the...	Alice Oh, Dongkwan Kim
239	Interpretable Models for Granger Causality Using Self-explaining Neural Networks	Exploratory analysis of time series data can yield a better understanding of complex dynamical systems. Granger causality is a practical framework for analysing interactions in sequential data, applied in a wide range of domains. In this paper, we propose a novel framework for inferring multivariate Granger causality under nonlinear dynamics based on an extension of self-explaining neural networks. This framework is more interpretable than other neural-network-based techniques for inferring Granger causality, since in addition to...	Julia E. Vogt, Ricards Marcinkevics
240	Meta-learning Symmetries by Reparameterization	Many successful deep learning architectures are equivariant to certain transformations in order to conserve parameters and improve generalization: most famously, convolution layers are equivariant to shifts of the input. This approach only works when practitioners know the symmetries of the task and can manually construct an architecture with the corresponding equivariances. Our goal is an approach for learning equivariances from data, without needing to design custom task-specific architectures. We present a method for learning and...	Allan Zhou, Chelsea Finn, Tom Knowles
241	Removing Undesirable Feature Contributions Using Out-of-Distribution Data	Several data augmentation methods deploy unlabeled-in-distribution (UID) data to bridge the gap between the training and inference of neural networks. However, these methods have clear limitations in terms of availability of UID data and dependence of algorithms on pseudo-labels. Herein, we propose a data augmentation method to improve generalization in both adversarial and standard learning by using out-of-distribution (OOD) data that are devoid of the abovementioned issues. We show how to improve generalization theoretically using...	Changhwa Park, Hyungyu Lee, Jihun Yi, Jonghyun Lee, Saehyung Lee, Sungroh Yoon
242	Mind the Gap when Conditioning Amortised Inference in Sequential Latent-Variable Models	Amortised inference enables scalable learning of sequential latent-variable models (LVMs) with the evidence lower bound (ELBO). In this setting, variational posteriors are often only partially conditioned. While the true posteriors depend, e.g., on the entire sequence of observations, approximate posteriors are only informed by past observations. This mimics the Bayesian filter, a mixture of smoothing posteriors. Yet, we show that the ELBO objective forces partially-conditioned amortised posteriors to approximate products of smoothing...	Atanas Mirchev, Baris Kayalibay, Justin Bayer, Maximilian Soelch, Patrick van der Smagt
243	On the Universality of the Double Descent Peak in Ridgeless Regression	We prove a non-asymptotic distribution-independent lower bound for the expected mean squared generalization error caused by label noise in ridgeless linear regression. Our lower bound generalizes a similar known result to the overparameterized (interpolating) regime. In contrast to most previous works, our analysis applies to a broad class of input distributions with almost surely full-rank feature matrices, which allows us to cover various types of deterministic or random feature maps. Our lower bound is asymptotically sharp and...	David Holzmüller
244	Fair Mixup: Fairness via Interpolation	Training classifiers under fairness constraints such as group fairness, regularizes the disparities of predictions between the groups. Nevertheless, even though the constraints are satisfied during training, they might not generalize at evaluation time. To improve the generalizability of fair classifiers, we propose fair mixup, a new data augmentation strategy for imposing the fairness constraint. In particular, we show that fairness can be achieved by regularizing the models on paths of interpolated samples between the groups. We use...	ChingYao Chuang, Youssef Mroueh
245	Self-supervised Learning from a Multi-view Perspective	As a subset of unsupervised representation learning, self-supervised representation learning adopts self-defined signals as supervision and uses the learned representation for downstream tasks, such as object detection and image captioning. Many proposed approaches for self-supervised learning follow naturally a multi-view perspective, where the input (e.g., original images) and the self-supervised signals (e.g., augmented images) can be seen as two redundant views of the data. Building from this multi-view perspective, this paper...	LouisPhilippe Morency, Ruslan Salakhutdinov, YaoHung Hubert Tsai, Yue Wu
246	Integrating Categorical Semantics into Unsupervised Domain Translation	While unsupervised domain translation (UDT) has seen a lot of success recently, we argue that mediating its translation via categorical semantic features could broaden its applicability. In particular, we demonstrate that categorical semantics improves the translation between perceptually different domains sharing multiple object categories. We propose a method to learn, in an unsupervised manner, categorical semantic features (such as object labels) that are invariant of the source and target domains. We show that conditioning the...	Aaron C. Courville, Faruk Ahmed, Samuel LavoieMarchildon
247	The Unreasonable Effectiveness of Patches in Deep Convolutional Kernels Methods	A recent line of work showed that various forms of convolutional kernel methods can be competitive with standard supervised deep convolutional networks on datasets like CIFAR-10, obtaining accuracies in the range of 87-90% while being more amenable to theoretical analysis. In this work, we highlight the importance of a data-dependent feature extraction step that is key to obtaining good performance in convolutional kernel methods. This step typically corresponds to a whitened dictionary of patches, and gives rise to a data-driven...	Edouard Oyallon, Eugene Belilovsky, Louis Thiry, Michael Arbel
248	Open Question Answering over Tables and Text	In open question answering (QA), the answer to a question is produced by retrieving and then analyzing documents that might contain answers to the question. Most open QA systems have considered only retrieving information from unstructured text. Here we consider for the first time open QA over both tabular and textual data and present a new large-scale dataset Open Table-and-Text Question Answering (OTT-QA) to evaluate performance on this task. Most questions in OTT-QA require multi-hop inference across tabular data and...	Eva Schlinger, MingWei Chang, Wenhu Chen, William W. Cohen, William Yang Wang
249	Evaluation of Similarity-based Explanations	Explaining the predictions made by complex machine learning models helps users to understand and accept the predicted outputs with confidence. One promising way is to use similarity-based explanation that provides similar instances as evidence to support model predictions. Several relevance metrics are used for this purpose. In this study, we investigated relevance metrics that can provide reasonable explanations to users. Specifically, we adopted three tests to evaluate whether the relevance metrics satisfy the minimal requirements...	Kazuaki Hanawa, Kentaro Inui, Satoshi Hara, Sho Yokoi
250	A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima	Stochastic Gradient Descent (SGD) and its variants are mainstream methods for training deep networks in practice. SGD is known to find a flat minimum that often generalizes well. However, it is mathematically unclear how deep learning can select a flat minimum among so many minima. To answer the question quantitatively, we develop a density diffusion theory to reveal how minima selection quantitatively depends on the minima sharpness and the hyperparameters. To the best of our knowledge, we are the first to theoretically and...	Issei Sato, Masashi Sugiyama, Zeke Xie
251	How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?	A recent line of research on deep learning focuses on the extremely over-parameterized setting, and shows that when the network width is larger than a high degree polynomial of the training sample size $n$ and the inverse of the target error $\epsilon^{-1}$, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees. Very recently, it is shown that under certain margin assumptions on the training data, a polylogarithmic width condition suffices for two-layer ReLU networks to...	Difan Zou, Quanquan Gu, Yuan Cao, Zixiang Chen
252	Auction Learning as a Two-Player Game	Designing an incentive compatible auction that maximizes expected revenue is a central problem in Auction Design. While theoretical approaches to the problem have hit some limits, a recent research direction initiated by Duetting et al. (2019) consists in building neural network architectures to find optimal auctions. We propose two conceptual deviations from their approach which result in enhanced performance. First, we use recent results in theoretical auction design to introduce a time-independent Lagrangian. This not only...	Jad Rahme, S. Matthew Weinberg, Samy Jelassi
253	Robust Reinforcement Learning on State Observations with Learned Optimal Adversary	We study the robustness of reinforcement learning (RL) with adversarially perturbed state observations, which aligns with the setting of many adversarial attacks to deep reinforcement learning (DRL) and is also important for rolling out real-world RL agent under unpredictable sensing noise. With a fixed agent policy, we demonstrate that an optimal adversary to perturb state observations can be found, which is guaranteed to obtain the worst case agent reward. For DRL settings, this leads to a novel empirical adversarial attack to RL...	ChoJui Hsieh, Duane S. Boning, Hongge Chen, Huan Zhang
254	Optimizing Memory Placement using Evolutionary Graph Reinforcement Learning	For deep neural network accelerators, memory movement is both energetically expensive and can bound computation. Therefore, optimal mapping of tensors to memory hierarchies is critical to performance. The growing complexity of neural networks calls for automated memory mapping instead of manual heuristic approaches; yet the search space of neural network computational graphs has previously been prohibitively large. We introduce Evolutionary Graph Reinforcement Learning (EGRL), a method designed for large search spaces, that combines...	Avrech BenDavid, Estelle Aflalo, Hanlin Tang, Mattias Marder, Santiago Miret, Shauharda Khadka, Shie Mannor, Somdeb Majumdar, Tamir Hazan
255	Hierarchical Autoregressive Modeling for Neural Video Compression	Recent work by Marino et al. (2020) showed improved performance in sequential density estimation by combining masked autoregressive flows with hierarchical latent variable models. We draw a connection between such autoregressive generative models and the task of lossy video compression. Specifically, we view recent neural video compression methods (Lu et al., 2019; Yang et al., 2020b; Agustsson et al., 2020) as instances of a generalized stochastic temporal autoregressive transform, and propose avenues for enhancement based on this...	Joseph Marino, Ruihan Yang, Stephan Mandt, Yibo Yang
256	Individually Fair Rankings	We develop an algorithm to train individually fair learning-to-rank (LTR) models. The proposed approach ensures items from minority groups appear alongside similar items from majority groups. This notion of fair ranking is based on the definition of individual fairness from supervised learning and is more nuanced than prior fair LTR approaches that simply ensure the ranking model provides underrepresented items with a basic level of exposure. The crux of our method is an optimal transport-based regularizer that enforces individual...	Amanda Bower, Hamid Eftekhari, Mikhail Yurochkin, Yuekai Sun
257	Learning Neural Generative Dynamics for Molecular Conformation Generation	We study how to generate molecule conformations (i.e., 3D structures) from a molecular graph. Traditional methods, such as molecular dynamics, sample conformations via computationally expensive simulations. Recently, machine learning methods have shown great potential by training on a large collection of conformation data. Challenges arise from the limited model capacity for capturing complex distributions of conformations and the difficulty in modeling long-range dependencies between atoms. Inspired by the recent progress in deep...	Jian Peng, Jian Tang, Minkai Xu, Shitong Luo, Yoshua Bengio
258	Efficient Certified Defenses Against Patch Attacks on Image Classifiers	Adversarial patches pose a realistic threat model for physical world attacks on autonomous systems via their perception component. Autonomous systems in safety-critical domains such as automated driving should thus contain a fail-safe fallback component that combines certifiable robustness against patches with efficient inference while maintaining high performance on clean inputs. We propose BagCert, a novel combination of model architecture and certification procedure that allows efficient certification. We derive a loss that enables...	Jan Hendrik Metzen, Maksym Yatsura
259	Convex Regularization behind Neural Reconstruction	Neural networks have shown tremendous potential for reconstructing high-resolution images in inverse problems. The non-convex and opaque nature of neural networks, however, hinders their utility in sensitive applications such as medical imaging. To cope with this challenge, this paper advocates a convex duality framework that makes a two-layer fully-convolutional ReLU denoising network amenable to convex optimization. The convex dual network not only offers the optimum training with convex solvers, but also facilitates interpreting...	Arda Sahiner, Batu Ozturkler, John M. Pauly, Mert Pilanci, Morteza Mardani
260	Targeted Attack against Deep Neural Networks via Flipping Limited Weight Bits	To explore the vulnerability of deep neural networks (DNNs), many attack paradigms have been well studied, such as the poisoning-based backdoor attack in the training stage and the adversarial attack in the inference stage. In this paper, we study a novel attack paradigm, which modifies model parameters in the deployment stage for malicious purposes. Specifically, our goal is to misclassify a specific sample into a target class without any sample modification, while not significantly reducing the prediction accuracy of other samples to...	Baoyuan Wu, Jiawang Bai, ShuTao Xia, Yiming Li, Yong Zhang, Zhifeng Li
261	Generalized Multimodal ELBO	Multiple data types naturally co-occur when describing real-world phenomena and learning from them is a long-standing goal in machine learning research. However, existing self-supervised generative models approximating an ELBO are not able to fulfill all desired requirements of multimodal models: their posterior approximation functions lead to a trade-off between the semantic coherence and the ability to learn the joint data distribution. We propose a new, generalized ELBO formulation for multimodal data that overcomes these...	Imant Daunhawer, Julia E. Vogt, Thomas M. Sutter
262	Large-width functional asymptotics for deep Gaussian neural networks	In this paper, we consider fully connected feed-forward deep neural networks where weights and biases are independent and identically distributed according to Gaussian distributions. Extending previous results (Matthews et al., 2018a;b; Yang, 2019) we adopt a function-space perspective, i.e. we look at neural networks as infinite-dimensional random elements on the input space $\mathbb{R}^I$. Under suitable assumptions on the activation function we show that: i) a network defines a continuous Gaussian process on the input space...	Daniele Bracale, Sandra Fortini, Stefano Favaro, Stefano Peluchetti
263	Distributed Momentum for Byzantine-resilient Stochastic Gradient Descent	Byzantine-resilient Stochastic Gradient Descent (SGD) aims at shielding model training from Byzantine faults, be they ill-labeled training datapoints, exploited software/hardware vulnerabilities, or malicious worker nodes in a distributed setting. Two recent attacks have been challenging state-of-the-art defenses though, often successfully precluding the model from even fitting the training set. The main identified weakness in current defenses is their requirement of a sufficiently low variance-norm ratio for the stochastic gradients....	El Mahdi El Mhamdi, Rachid Guerraoui, Sébastien Rouault
264	Fully Unsupervised Diversity Denoising with Convolutional Variational Autoencoders	Deep Learning based methods have emerged as the indisputable leaders for virtually all image restoration tasks. Especially in the domain of microscopy images, various content-aware image restoration (CARE) approaches are now used to improve the interpretability of acquired data. Naturally, there are limitations to what can be restored in corrupted images, and like for all inverse problems, many potential solutions exist, and one of them must be chosen. Here, we propose DivNoising, a denoising approach based on fully convolutional...	Alexander Krull, Florian Jug, Mangal Prakash
265	Auxiliary Learning by Implicit Differentiation	Training neural networks with auxiliary tasks is a common practice for improving the performance on a main task of interest. Two main challenges arise in this multi-task learning setting: (i) designing useful auxiliary tasks; and (ii) combining auxiliary tasks into a single coherent loss. Here, we propose a novel framework, AuxiLearn, that targets both challenges based on implicit differentiation. First, when useful auxiliaries are known, we propose learning a network that combines all losses into a single coherent objective function....	Aviv Navon, Ethan Fetaya, Gal Chechik, Haggai Maron, Idan Achituve
266	Balancing Constraints and Rewards with Meta-Gradient D4PG	Deploying Reinforcement Learning (RL) agents to solve real-world applications often requires satisfying complex system constraints. Often the constraint thresholds are incorrectly set due to the complex nature of a system or the inability to verify the thresholds offline (e.g., no simulator or reasonable offline evaluation procedure exists). This results in solutions where a task cannot be solved without violating the constraints. However, in many real-world cases, constraint violations are undesirable yet they are not catastrophic,...	Dan A. Calian, Daniel J. Mankowitz, Junhyuk Oh, Nir Levine, Timothy A. Mann, Tom Zahavy, Zhongwen Xu
267	Adversarially Guided Actor-Critic	Despite definite success in deep reinforcement learning problems, actor-critic algorithms are still confronted with sample inefficiency in complex environments, particularly in tasks where efficient exploration is a bottleneck. These methods consider a policy (the actor) and a value function (the critic) whose respective losses are built using different motivations and approaches. This paper introduces a third protagonist: the adversary. While the adversary mimics the actor by minimizing the KL-divergence between their respective...	Johan Ferret, Matthieu Geist, Olivier Pietquin, Philippe Preux, Yannis FletBerliac
268	DARTS-: Robustly Stepping out of Performance Collapse Without Indicators	Despite the fast development of differentiable architecture search (DARTS), it suffers from a standing instability issue regarding searching performance, which extremely limits its application. Existing robustifying methods draw clues from the outcome instead of finding out the causing factor. Various indicators such as Hessian eigenvalues are proposed as a signal of performance collapse, and the searching should be stopped once an indicator reaches a preset threshold. However, these methods tend to easily reject good architectures if...	Bo Zhang, Junchi Yan, Shun Lu, Xiangxiang Chu, Xiaolin Wei, Xiaoxing Wang
269	Are wider nets better given the same number of parameters?	Empirical studies demonstrate that the performance of neural networks improves with increasing number of parameters. In most of these studies, the number of parameters is increased by increasing the network width. This begs the question: Is the observed improvement due to the larger number of parameters, or is it due to the larger width itself? We compare different ways of increasing model width while keeping the number of parameters constant. We show that for models initialized with a random, static sparsity pattern in the weight...	Anna Golubeva, Behnam Neyshabur, Guy GurAri
270	Optimal Conversion of Conventional Artificial Neural Networks to Spiking Neural Networks	Spiking neural networks (SNNs) are biology-inspired artificial neural networks (ANNs) that comprise spiking neurons to process asynchronous discrete signals. While more efficient in power consumption and inference speed on the neuromorphic hardware, SNNs are usually difficult to train directly from scratch with spikes due to the discreteness. As an alternative, many efforts have been devoted to converting conventional ANNs into SNNs by copying the weights from ANNs and adjusting the spiking threshold potential of neurons in SNNs....	Shi Gu, Shikuang Deng
271	Deep Equals Shallow for ReLU Networks in Kernel Regimes	Deep networks are often considered to be more expressive than shallow ones in terms of approximation. Indeed, certain functions can be approximated by deep networks provably more efficiently than by shallow ones, however, no tractable algorithms are known for learning such deep models. Separately, a recent line of work has shown that deep networks trained with gradient descent may behave like (tractable) kernel methods in a certain over-parameterized regime, where the kernel is determined by the architecture and initialization, and...	Alberto Bietti, Francis R. Bach
272	Graph Coarsening with Neural Networks	As large-scale graphs become increasingly prevalent, it poses significant computational challenges to process, extract and analyze large graph data. Graph coarsening is one popular technique to reduce the size of a graph while maintaining essential properties. Despite rich graph coarsening literature, there is only limited exploration of data-driven methods in the field. In this work, we leverage the recent progress of deep learning on graphs for graph coarsening. We first propose a framework for measuring the quality of coarsening...	Chen Cai, Dingkang Wang, Yusu Wang
273	Early Stopping in Deep Networks: Double Descent and How to Eliminate it	Over-parameterized models, such as large deep networks, often exhibit a double descent phenomenon, where, as a function of model size, error first decreases, then increases, and finally decreases again. This intriguing double descent behavior also occurs as a function of training epochs and has been conjectured to arise because training epochs control the model complexity. In this paper, we show that such epoch-wise double descent occurs for a different reason: It is caused by a superposition of two or more bias-variance tradeoffs that arise...	Fatih Furkan Yilmaz, Reinhard Heckel
274	Efficient Inference of Flexible Interaction in Spiking-neuron Networks	Hawkes process provides an effective statistical framework for analyzing the time-dependent interaction of neuronal spiking activities. Although utilized in many real applications, the classic Hawkes process is incapable of modelling inhibitory interactions among neurons. Instead, the nonlinear Hawkes process allows for a more flexible influence pattern with excitatory or inhibitory interactions. In this paper, three sets of auxiliary latent variables (Polya-Gamma variables, latent marked Poisson processes and sparsity variables) are...	Feng Zhou, Jun Zhu, Yixuan Zhang
275	DICE: Diversity in Deep Ensembles via Conditional Redundancy Adversarial Estimation	Deep ensembles perform better than a single network thanks to the diversity among their members. Recent approaches regularize predictions to increase diversity; however, they also drastically decrease individual members’ performances. In this paper, we argue that learning strategies for deep ensembles need to tackle the trade-off between ensemble diversity and individual accuracies. Motivated by arguments from information theory and leveraging recent advances in neural estimation of conditional mutual information, we introduce a novel...	Alexandre Ramé, Matthieu Cord
276	Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks	Temporal networks serve as abstractions of many real-world dynamic systems. These networks typically evolve according to certain laws, such as the law of triadic closure, which is universal in social networks. Inductive representation learning of temporal networks should be able to capture such laws and further be applied to systems that follow the same laws but have not been seen during the training stage. Previous works in this area depend on either network node identities or rich edge attributes and typically fail to extract these...	Jure Leskovec, Pan Li, Yanbang Wang, YenYu Chang, Yunyu Liu
277	FairBatch: Batch Selection for Model Fairness	Training a fair machine learning model is essential to prevent demographic disparity. Existing techniques for improving model fairness require broad changes in either data preprocessing or model training, rendering themselves difficult-to-adopt for potentially already complex machine learning systems. We address this problem via the lens of bilevel optimization. While keeping the standard training algorithm as an inner optimizer, we incorporate an outer optimizer so as to equip the inner problem with an additional functionality:...	Changho Suh, Kangwook Lee, Steven Euijong Whang, Yuji Roh
278	Representation Balancing Offline Model-based Reinforcement Learning	One of the main challenges in offline and off-policy reinforcement learning is to cope with the distribution shift that arises from the mismatch between the target policy and the data collection policy. In this paper, we focus on a model-based approach, particularly on learning the representation for a robust model of the environment under the distribution shift, which has been first studied by Representation Balancing MDP (RepBM). Although this prior work has shown promising results, there are a number of shortcomings that still...	ByungJun Lee, Jongmin Lee, KeeEung Kim
279	Accelerating Convergence of Replica Exchange Stochastic Gradient MCMC via Variance Reduction	Replica exchange stochastic gradient Langevin dynamics (reSGLD) has shown promise in accelerating the convergence in non-convex learning; however, an excessively large correction for avoiding biases from noisy energy estimators has limited the potential of the acceleration. To address this issue, we study the variance reduction for noisy energy estimators, which promotes much more effective swaps. Theoretically, we provide a non-asymptotic analysis on the exponential convergence for the underlying continuous-time Markov jump process;...	Faming Liang, Georgios Karagiannis, Guang Lin, Qi Feng, Wei Deng
280	The Importance of Pessimism in Fixed-Dataset Policy Optimization	We study worst-case guarantees on the expected return of fixed-dataset policy optimization algorithms. Our core contribution is a unified conceptual and mathematical framework for the study of algorithms in this regime. This analysis reveals that for naive approaches, the possibility of erroneous value overestimation leads to a difficult-to-satisfy requirement: in order to guarantee that we select a policy which is near-optimal, we may need the dataset to be informative of the value of every policy. To avoid this, algorithms can follow...	Carles Gelada, Jacob Buckman, Marc G. Bellemare
281	Interpreting Knowledge Graph Relation Representation from Word Embeddings	Many models learn representations of knowledge graph data by exploiting its low-rank latent structure, encoding known relations between entities and enabling unknown facts to be inferred. To predict whether a relation holds between entities, embeddings are typically compared in the latent space following a relation-specific mapping. Whilst their predictive performance has steadily improved, how such models capture the underlying latent structure of semantic information remains unexplained. Building on recent theoretical understanding...	Carl Allen, Ivana Balazevic, Timothy M. Hospedales
282	Hopfield Networks is All You Need	We introduce a modern Hopfield network with continuous states and a corresponding update rule. The new Hopfield network can store exponentially (with the dimension of the associative space) many patterns, retrieves the pattern with one update, and has exponentially small retrieval errors. It has three types of energy minima (fixed points of the update): (1) global fixed point averaging over all patterns, (2) metastable states averaging over a subset of patterns, and (3) fixed points which store a single pattern. The new update rule is...	Bernhard Schäfl, David P. Kreil, Günter Klambauer, Hubert Ramsauer, Johannes Brandstetter, Johannes Lehner, Lukas Gruber, Markus Holzleitner, Michael K. Kopp, Michael Widrich, Philipp Seidl, Sepp Hochreiter, Thomas Adler
283	Uncertainty Estimation and Calibration with Finite-State Probabilistic RNNs	Uncertainty quantification is crucial for building reliable and trustable machine learning systems. We propose to estimate uncertainty in recurrent neural networks (RNNs) via stochastic discrete state transitions over recurrent timesteps. The uncertainty of the model can be quantified by running a prediction several times, each time sampling from the recurrent state transition distribution, leading to potentially different results if the model is uncertain. Alongside uncertainty quantification, our proposed method offers several...	Carolin Lawrence, Cheng Wang, Mathias Niepert
284	Understanding the failure modes of out-of-distribution generalization	Empirical studies suggest that machine learning models often rely on features, such as the background, that may be spuriously correlated with the label only during training time, resulting in poor accuracy during test-time. In this work, we identify the fundamental factors that give rise to this behavior, by explaining why models fail this way even in easy-to-learn tasks where one would expect these models to succeed. In particular, through a theoretical study of gradient-descent-trained linear classifiers on some easy-to-learn tasks,...	Anders Andreassen, Behnam Neyshabur, Vaishnavh Nagarajan
285	Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule	Vision-and-language navigation (VLN) is a task in which an agent is embodied in a realistic 3D environment and follows an instruction to reach the goal node. While most of the previous studies have built and investigated a discriminative approach, we notice that there are in fact two possible approaches to building such a VLN agent: discriminative and generative. In this paper, we design and investigate a generative language-grounded policy which uses a language model to compute the distribution over all possible instructions i.e. all...	Kyunghyun Cho, Shuhei Kurita
286	Emergent Road Rules In Multi-Agent Driving Environments	For autonomous vehicles to safely share the road with human drivers, autonomous vehicles must abide by specific "road rules" that human drivers have agreed to follow. "Road rules" include rules that drivers are required to follow by law – such as the requirement that vehicles stop at red lights – as well as more subtle social rules – such as the implicit designation of fast lanes on the highway. In this paper, we provide empirical evidence that suggests that – instead of hard-coding road rules into self-driving algorithms – a scalable...	Avik Pal, Jonah Philion, Sanja Fidler, YuanHong Liao
287	Wasserstein-2 Generative Networks	We propose a novel end-to-end non-minimax algorithm for training optimal transport mappings for the quadratic cost (Wasserstein-2 distance). The algorithm uses input convex neural networks and a cycle-consistency regularization to approximate Wasserstein-2 distance. In contrast to popular entropic and quadratic regularizers, cycle-consistency does not introduce bias and scales well to high dimensions. From the theoretical side, we estimate the properties of the generative mapping fitted by our algorithm. From the practical side, we...	Alexander Korotin, Alexander Safin, Arip Asadulaev, Evgeny Burnaev, Vage Egiazarian
288	Vulnerability-Aware Poisoning Mechanism for Online RL with Unknown Dynamics	Poisoning attacks on Reinforcement Learning (RL) systems could take advantage of RL algorithm’s vulnerabilities and cause failure of the learning. However, prior works on poisoning RL usually either unrealistically assume the attacker knows the underlying Markov Decision Process (MDP), or directly apply the poisoning methods in supervised learning to RL. In this work, we build a generic poisoning framework for online RL via a comprehensive investigation of heterogeneous poisoning models in RL. Without any prior knowledge of the MDP, we...	Da Huo, Furong Huang, Yanchao Sun
289	Tomographic Auto-Encoder: Unsupervised Bayesian Recovery of Corrupted Data	We propose a new probabilistic method for unsupervised recovery of corrupted data. Given a large ensemble of degraded samples, our method recovers accurate posteriors of clean values, allowing the exploration of the manifold of possible reconstructed data and hence characterising the underlying uncertainty. In this setting, direct application of classical variational methods often gives rise to collapsed densities that do not adequately explore the solution space. Instead, we derive our novel reduced entropy condition approximate...	Andreas C. Damianou, Francesco Tonolini, Pablo Garcia Moreno, Roderick MurraySmith
290	Monotonic Kronecker-Factored Lattice	It is computationally challenging to learn flexible monotonic functions that guarantee model behavior and provide interpretability beyond a few input features, and in a time where minimizing resource use is increasingly important, we must be able to learn such models that are still efficient. In this paper we show how to effectively and efficiently learn such functions using Kronecker-Factored Lattice ($\mathrm{KFL}$), an efficient reparameterization of flexible monotonic lattice regression via Kronecker product. Both computational and...	Erez Louidor, Nobuyuki Morioka, William Taylor Bakst
291	LEAF: A Learnable Frontend for Audio Classification	Mel-filterbanks are fixed, engineered audio features which emulate human perception and have been used through the history of audio understanding up to today. However, their undeniable qualities are counterbalanced by the fundamental limitations of handmade representations. In this work we show that we can train a single learnable frontend that outperforms mel-filterbanks on a wide range of audio signals, including speech, music, audio events and animal sounds, providing a general-purpose learned frontend for audio classification. To...	Félix de Chaumont Quitry, Marco Tagliasacchi, Neil Zeghidour, Olivier Teboul
292	Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms	Federated learning is typically approached as an optimization problem, where the goal is to minimize a global loss function by distributing computation across client devices that possess local data and specify different parts of the global objective. We present an alternative perspective and formulate federated learning as a posterior inference problem, where the goal is to infer a global posterior distribution by having client devices each infer the posterior of their local data. While exact inference is often intractable, this...	Afshin Rostamizadeh, Eric P. Xing, Jennifer Gillenwater, Maruan AlShedivat
293	Rank the Episodes: A Simple Approach for Exploration in Procedurally-Generated Environments	Exploration under sparse reward is a long-standing challenge of model-free reinforcement learning. The state-of-the-art methods address this challenge by introducing intrinsic rewards to encourage exploration in novel states or uncertain environment dynamics. Unfortunately, methods based on intrinsic rewards often fall short in procedurally-generated environments, where a different environment is generated in each episode so that the agent is not likely to visit the same state more than once. Motivated by how humans distinguish good...	Daochen Zha, Ji Liu, Lei Yuan, Wenye Ma, Xia Hu
294	Partitioned Learned Bloom Filters	Bloom filters are space-efficient probabilistic data structures that are used to test whether an element is a member of a set, and may return false positives. Recently, variations referred to as learned Bloom filters were developed that can provide improved performance in terms of the rate of false positives, by using a learned model for the represented set. However, previous methods for learned Bloom filters do not take full advantage of the learned model. Here we show how to frame the problem of optimal model utilization as an...	Eric Knorr, Kapil Vaidya, Michael Mitzenmacher, Tim Kraska
295	Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval	Conducting text retrieval in a learned dense representation space has many intriguing advantages. Yet dense retrieval (DR) often underperforms word-based sparse retrieval. In this paper, we first theoretically show the bottleneck of dense retrieval is the domination of uninformative negatives sampled in mini-batch training, which yield diminishing gradient norms, large gradient variances, and slow convergence. We then propose Approximate nearest neighbor Negative Contrastive Learning (ANCE), which selects hard training negatives...	Arnold Overwijk, Chenyan Xiong, Jialin Liu, Junaid Ahmed, KwokFung Tang, Lee Xiong, Paul N. Bennett, Ye Li
296	Auxiliary Task Update Decomposition: the Good, the Bad and the neutral	While deep learning has been very beneficial in data-rich settings, tasks with smaller training set often resort to pre-training or multitask learning to leverage data from other tasks. In this case, careful consideration is needed to select tasks and model parameterizations such that updates from the auxiliary tasks actually help the primary task. We seek to alleviate this burden by formulating a model-agnostic framework that performs fine-grained manipulation of the auxiliary task gradients. We propose to decompose auxiliary updates...	David Grangier, Lucio M. Dery, Yann N. Dauphin
297	SSD: A Unified Framework for Self-Supervised Outlier Detection	We ask the following question: what training information is required to design an effective outlier/out-of-distribution (OOD) detector, i.e., detecting samples that lie far away from training distribution? Since unlabeled data is easily accessible for many applications, the most compelling approach is to develop detectors based on only unlabeled in-distribution data. However, we observe that most existing detectors based on unlabeled data perform poorly, often equivalent to a random prediction. In contrast, existing state-of-the-art...	Mung Chiang, Prateek Mittal, Vikash Sehwag
298	Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning	Complex, multi-task problems have proven to be difficult to solve efficiently in a sparse-reward reinforcement learning setting. In order to be sample efficient, multi-task learning requires reuse and sharing of low-level policies. To facilitate the automatic decomposition of hierarchical tasks, we propose the use of step-by-step human demonstrations in the form of natural language instructions and action trajectories. We introduce a dataset of such demonstrations in a crafting-based grid world. Our model consists of a high-level...	Abhinav Gupta, Kenneth Marino, Valerie Chen
299	Revisiting Few-sample BERT Fine-tuning	This paper is a study of fine-tuning of BERT contextual representations, with focus on commonly observed instabilities in few-sample scenarios. We identify several factors that cause this instability: the common use of a non-standard optimization method with biased gradient estimation; the limited applicability of significant parts of the BERT network for down-stream tasks; and the prevalent practice of using a pre-determined, and small number of training iterations. We empirically test the impact of these factors, and identify...	Arzoo Katiyar, Felix Wu, Kilian Q. Weinberger, Tianyi Zhang, Yoav Artzi
300	Tilted Empirical Risk Minimization	Empirical risk minimization (ERM) is typically designed to perform well on the average loss, which can result in estimators that are sensitive to outliers, generalize poorly, or treat subgroups unfairly. While many methods aim to address these problems individually, in this work, we explore them through a unified framework---tilted empirical risk minimization (TERM). In particular, we show that it is possible to flexibly tune the impact of individual losses through a straightforward extension to ERM using a hyperparameter called the...	Ahmad Beirami, Maziar Sanjabi, Tian Li, Virginia Smith
301	Deep Neural Tangent Kernel and Laplace Kernel Have the Same RKHS	We prove that the reproducing kernel Hilbert spaces (RKHS) of a deep neural tangent kernel and the Laplace kernel include the same set of functions, when both kernels are restricted to the sphere $\mathbb{S}^{d-1}$. Additionally, we prove that the exponential power kernel with a smaller power (making the kernel less smooth) leads to a larger RKHS, when it is restricted to the sphere $\mathbb{S}^{d-1}$ and when it is defined on the entire $\mathbb{R}^d$.	Lin Chen, Sheng Xu
302	On the Transfer of Disentangled Representations in Realistic Settings	Learning meaningful representations that disentangle the underlying structure of the data generating process is considered to be of key importance in machine learning. While disentangled representations were found to be useful for diverse tasks such as abstract reasoning and fair classification, their scalability and real-world impact remain questionable. We introduce a new high-resolution dataset with 1M simulated images and over 1,800 annotated real-world images of the same setup. In contrast to previous work, this new dataset...	Andrea Dittadi, Bernhard Schölkopf, Francesco Locatello, Frederik Träuble, Manuel Wuthrich, Ole Winther, Stefan Bauer, Vaibhav Agrawal
303	Calibration tests beyond classification	Most supervised machine learning tasks are subject to irreducible prediction errors. Probabilistic predictive models address this limitation by providing probability distributions that represent a belief over plausible targets, rather than point estimates. Such models can be a valuable tool in decision-making under uncertainty, provided that the model output is meaningful and interpretable. Calibrated models guarantee that the probabilistic predictions are neither over- nor under-confident. In the machine learning literature, different...	Dave Zachariah, David Widmann, Fredrik Lindsten
304	Overparameterisation and worst-case generalisation: friend or foe?	Overparameterised neural networks have demonstrated the remarkable ability to perfectly fit training samples, while still generalising to unseen test samples. However, several recent works have revealed that such models' good average performance does not always translate to good worst-case performance: in particular, they may perform poorly on subgroups that are under-represented in the training set. In this paper, we show that in certain settings, overparameterised models' performance on under-represented subgroups may be improved via...	Aditya Krishna Menon, Ankit Singh Rawat, Sanjiv Kumar
305	You Only Need Adversarial Supervision for Semantic Image Synthesis	Despite their recent successes, GAN models for semantic image synthesis still suffer from poor image quality when trained with only adversarial supervision. Historically, additionally employing the VGG-based perceptual loss has helped to overcome this issue, significantly improving the synthesis quality, but at the same time limiting the progress of GAN models for semantic image synthesis. In this work, we propose a novel, simplified GAN model, which needs only adversarial supervision to achieve high quality results. We re-design the...	Anna Khoreva, Bernt Schiele, Dan Zhang, Edgar Schönfeld, Juergen Gall, Vadim Sushko
306	Learning to Recombine and Resample Data For Compositional Generalization	Flexible neural sequence models outperform grammar- and automaton-based counterparts on a variety of tasks. However, neural models perform poorly in settings requiring compositional generalization beyond the training data—particularly to rare or unseen subsequences. Past work has found symbolic scaffolding (e.g. grammars or automata) essential in these settings. We describe R&R, a learned data augmentation scheme that enables a large category of compositional generalizations without appeal to latent symbolic structure. R&R has two...	Afra Feyza Akyürek, Ekin Akyürek, Jacob Andreas
307	A Critique of Self-Expressive Deep Subspace Clustering	Subspace clustering is an unsupervised clustering technique designed to cluster data that is supported on a union of linear subspaces, with each subspace defining a cluster with dimension lower than the ambient space. Many existing formulations for this problem are based on exploiting the self-expressive property of linear subspaces, where any point within a subspace can be represented as linear combination of other points within the subspace. To extend this approach to data supported on a union of non-linear manifolds, numerous...	Benjamin David Haeffele, Chong You, René Vidal
308	INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving	In learning-assisted theorem proving, one of the most critical challenges is to generalize to theorems unlike those seen at training time. In this paper, we introduce INT, an INequality Theorem proving benchmark designed to test agents’ generalization ability. INT is based on a theorem generator, which provides theoretically infinite data and allows us to measure 6 different types of generalization, each reflecting a distinct challenge, characteristic of automated theorem proving. In addition, INT provides a fast theorem proving...	Albert Q. Jiang, Jimmy Ba, Roger Baker Grosse, Yuhuai Wu
309	Improved Estimation of Concentration Under ℓp-Norm Distance Metrics Using Half Spaces	Concentration of measure has been argued to be the fundamental cause of adversarial vulnerability. Mahloujifar et al. (2019) presented an empirical way to measure the concentration of a data distribution using samples, and employed it to find lower bounds on intrinsic robustness for several benchmark datasets. However, it remains unclear whether these lower bounds are tight enough to provide a useful approximation for the intrinsic robustness of a dataset. To gain a deeper understanding of the concentration of measure phenomenon, we...	David E. Evans, Jack Prescott, Xiao Zhang
310	Adaptive Federated Optimization	Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Adam,...	Hugh Brendan McMahan, Jakub Konecný, Keith Rush, Manzil Zaheer, Sanjiv Kumar, Sashank J. Reddi, Zachary Charles, Zachary Garrett
311	On the Dynamics of Training Attention Models	The attention mechanism has been widely used in deep neural networks as a model component. By now, it has become a critical building block in many state-of-the-art natural language models. Despite its great success established empirically, the working mechanism of attention has not been investigated at a sufficient theoretical depth to date. In this paper, we set up a simple text classification task and study the dynamics of training a simple attention-based classification model using gradient descent. In this setting, we show that,...	Amiya Nayak, Haoye Lu, Yongyi Mao
312	Linear Convergent Decentralized Optimization with Compression	Communication compression has become a key strategy to speed up distributed optimization. However, existing decentralized algorithms with compression mainly focus on compressing DGD-type algorithms. They are unsatisfactory in terms of convergence rate, stability, and the capability to handle heterogeneous data. Motivated by primal-dual algorithms, this paper proposes the first LinEAr convergent Decentralized algorithm with compression, LEAD. Our theory describes the coupled dynamics of the inexact...	Jiliang Tang, Ming Yan, Rongrong Wang, Xiaorui Liu, Yao Li
313	Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation	Many real-world applications such as robotics provide hard constraints on power and compute that limit the viable model complexity of Reinforcement Learning (RL) agents. Similarly, in many distributed RL settings, acting is done on un-accelerated hardware such as CPUs, which likewise restricts model size to prevent intractable experiment run times. These "actor-latency" constrained settings present a major obstruction to the scaling up of model complexity that has recently been extremely successful in supervised learning. To be able to...	Emilio Parisotto, Ruslan Salakhutdinov
314	Large Associative Memory Problem in Neurobiology and Machine Learning	Dense Associative Memories or modern Hopfield networks permit storage and reliable retrieval of an exponentially large (in the dimension of feature space) number of memories. At the same time, their naive implementation is non-biological, since it seemingly requires the existence of many-body synaptic junctions between the neurons. We show that these models are effective descriptions of a more microscopic (written in terms of biological degrees of freedom) theory that has additional (hidden) neurons and only requires two-body...	Dmitry Krotov, John J. Hopfield
315	Protecting DNNs from Theft using an Ensemble of Diverse Models	Several recent works have demonstrated highly effective model stealing (MS) attacks on Deep Neural Networks (DNNs) in black-box settings, even when the training data is unavailable. These attacks typically use some form of Out of Distribution (OOD) data to query the target model and use the predictions obtained to train a clone model. Such a clone model learns to approximate the decision boundary of the target model, achieving high accuracy on in-distribution examples. We propose Ensemble of Diverse Models (EDM) to defend against such...	Atul Prakash, Moinuddin K. Qureshi, Sanjay Kariyappa
316	Proximal Gradient Descent-Ascent: Variable Convergence under KŁ Geometry	The gradient descent-ascent (GDA) algorithm has been widely applied to solve minimax optimization problems. In order to achieve convergent policy parameters for minimax optimization, it is important that GDA generates convergent variable sequences rather than convergent sequences of function value or gradient norm. However, the variable convergence of GDA has been proved only under convexity geometries, and it is not well understood in general nonconvex minimax optimization. This paper fills such a gap by studying the convergence of...	Tengyu Xu, Yi Zhou, Yingbin Liang, Ziyi Chen
317	Contextual Dropout: An Efficient Sample-Dependent Dropout Module	Dropout has been demonstrated as a simple and effective module to not only regularize the training process of deep neural networks, but also provide the uncertainty estimation for prediction. However, the quality of uncertainty estimation is highly dependent on the dropout probabilities. Most current models use the same dropout distributions across all data samples due to its simplicity. Despite the potential gains in the flexibility of modeling uncertainty, sample-dependent dropout, on the other hand, is less explored as it often...	Korawat Tanwisuth, Mingyuan Zhou, Shujian Zhang, Xiaoning Qian, Xinjie Fan
318	Mirostat: a Neural Text decoding Algorithm that directly controls perplexity	Neural text decoding algorithms strongly influence the quality of texts generated using language models, but popular algorithms like top-k, top-p (nucleus), and temperature-based sampling may yield texts that have objectionable repetition or incoherence. Although these methods generate high-quality text after ad hoc parameter tuning that depends on the language model and the length of generated text, not much is known about the control they provide over the statistics of the output. This is important, however, since recent reports show...	Govardana Sachitanandam Ramachandran, Lav R. Varshney, Nitish Shirish Keskar, Sourya Basu
319	DialoGraph: Incorporating Interpretable Strategy-Graph Networks into Negotiation Dialogues	To successfully negotiate a deal, it is not enough to communicate fluently: pragmatic planning of persuasive negotiation strategies is essential. While modern dialogue agents excel at generating fluent sentences, they still lack pragmatic grounding and cannot reason strategically. We present DialoGraph, a negotiation system that incorporates pragmatic strategies in a negotiation dialogue using graph neural networks. DialoGraph explicitly incorporates dependencies between sequences of strategies to enable improved and interpretable...	Alan W. Black, Rishabh Joshi, Shikhar Vashishth, Vidhisha Balachandran, Yulia Tsvetkov
320	Multi-Time Attention Networks for Irregularly Sampled Time Series	Irregular sampling occurs in many time series modeling applications where it presents a significant challenge to standard deep learning models. This work is motivated by the analysis of physiological time series data in electronic health records, which are sparse, irregularly sampled, and multivariate. In this paper, we propose a new deep learning framework for this setting that we call Multi-Time Attention Networks. Multi-Time Attention Networks learn an embedding of continuous time values and use an attention mechanism to produce a...	Benjamin M. Marlin, Satya Narayan Shukla
321	Learning Energy-Based Generative Models via Coarse-to-Fine Expanding and Sampling	Energy-based models (EBMs) parameterized by neural networks can be trained by the Markov chain Monte Carlo (MCMC) sampling-based maximum likelihood estimation. Despite the recent significant success of EBMs in image generation, the current approaches to train EBMs are unstable and have difficulty synthesizing diverse and high-fidelity images. In this paper, we propose to train EBMs via a multistage coarse-to-fine expanding and sampling strategy, which starts with learning a coarse-level EBM from images at low resolution and then...	Jianwen Xie, Ping Li, Yang Zhao
322	Unsupervised Audiovisual Synthesis via Exemplar Autoencoders	We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially-infinitely many output speakers. Our approach builds on simple autoencoders that project out-of-sample data onto the distribution of the training set. We use exemplar autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target exemplar speech. In contrast to existing methods, the proposed approach can be easily extended to an arbitrarily large number of speakers and styles using...	Aayush Bansal, Deva Ramanan, Kangle Deng
323	A Learning Theoretic Perspective on Local Explainability	In this paper, we explore connections between interpretable machine learning and learning theory through the lens of local approximation explanations. First, we tackle the traditional problem of performance generalization and bound the test-time predictive accuracy of a model using a notion of how locally explainable it is. Second, we explore the novel problem of explanation generalization which is an important concern for a growing class of finite sample-based local approximation explanations. Finally, we validate our theoretical...	Ameet Talwalkar, Gregory Plumb, Jeffrey Li, Vaishnavh Nagarajan
324	SEED: Self-supervised Distillation For Visual Representation	This paper is concerned with self-supervised learning for small models. The problem is motivated by our empirical studies that while the widely used contrastive self-supervised learning method has shown great progress on large model training, it does not work well for small models. To address this problem, we propose a new learning paradigm, named SElf-SupErvised Distillation (SEED), where we leverage a larger network (as Teacher) to transfer its representational knowledge into a smaller...	Jianfeng Wang, Lei Zhang, Lijuan Wang, Yezhou Yang, Zhiyuan Fang, Zicheng Liu
325	Isometric Propagation Network for Generalized Zero-shot Learning	Zero-shot learning (ZSL) aims to classify images of an unseen class only based on a few attributes describing that class but no access to any training sample. A popular strategy is to learn a mapping between the semantic space of class attributes and the visual space of images based on the seen classes and their data. Thus, an unseen class image can be ideally mapped to its corresponding class attributes. The key challenge is how to align the representations in the two spaces. For most ZSL settings, the attributes for each seen/unseen...	Chengqi Zhang, Guodong Long, Jing Jiang, Lu Liu, Tianyi Zhou, Xuanyi Dong
326	Effective and Efficient Vote Attack on Capsule Networks	Standard Convolutional Neural Networks (CNNs) can be easily fooled by images with small quasi-imperceptible artificial perturbations. As alternatives to CNNs, the recently proposed Capsule Networks (CapsNets) are shown to be more robust to white-box attack than CNNs under popular attack protocols. Besides, the class-conditional reconstruction part of CapsNets is also used to detect adversarial examples. In this work, we investigate the adversarial robustness of CapsNets, especially how the inner workings of CapsNets change when the...	Baoyuan Wu, Jindong Gu, Volker Tresp
327	Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization	Real-world large-scale datasets are heteroskedastic and imbalanced --- labels have varying levels of uncertainty and label distributions are long-tailed. Heteroskedasticity and imbalance challenge deep learning algorithms due to the difficulty of distinguishing among mislabeled, ambiguous, and rare examples. Addressing heteroskedasticity and imbalance simultaneously is under-explored. We propose a data-dependent regularization technique for heteroskedastic datasets that regularizes different regions of the input space differently....	Adrien Gaidon, Junwei Lu, Kaidi Cao, Nikos Aréchiga, Tengyu Ma, Yining Chen
328	Continuous Wasserstein-2 Barycenter Estimation without Minimax Optimization	Wasserstein barycenters provide a geometric notion of the weighted average of probability measures based on optimal transport. In this paper, we present a scalable algorithm to compute Wasserstein-2 barycenters given sample access to the input measures, which are not restricted to being discrete. While past approaches rely on entropic or quadratic regularization, we employ input convex neural networks and cycle-consistency regularization to avoid introducing bias. As a result, our approach does not resort to minimax optimization. We...	Alexander Korotin, Evgeny Burnaev, Justin Solomon, Lingxiao Li
329	Neural Thompson Sampling	Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-armed bandit problems. In this paper, we propose a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation. At the core of our algorithm is a novel posterior distribution of the reward, where its mean is the neural network approximator, and its variance is built upon the neural tangent features of the corresponding neural network. We prove that, provided the underlying reward...	Dongruo Zhou, Lihong Li, Quanquan Gu, Weitong Zhang
330	Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics	Understanding the dynamics of neural network parameters during training is one of the key challenges in building a theoretical foundation for deep learning. A central obstacle is that the motion of a network in high-dimensional parameter space undergoes discrete finite steps along complex stochastic gradients derived from real-world datasets. We circumvent this obstacle through a unifying theoretical framework based on intrinsic symmetries embedded in a network's architecture that are present for any dataset. We show that any such...	Daniel Kunin, Daniel L. K. Yamins, Hidenori Tanaka, Javier SagastuyBreña, Surya Ganguli
331	Neural gradients are near-lognormal: improved quantized and sparse training	While training can mostly be accelerated by reducing the time needed to propagate neural gradients (loss gradients with respect to the intermediate neural layer outputs) back throughout the model, most previous works focus on the quantization/pruning of weights and activations. These methods are often not applicable to neural gradients, which have very different statistical properties. Distinguished from weights and activations, we find that the distribution of neural gradients is approximately lognormal. Considering this, we suggest...	Brian Chmiel, Daniel Soudry, Elad Hoffer, Liad BenUri, Moran Shkolnik, Ron Banner
332	RODE: Learning Roles to Decompose Multi-Agent Tasks	Role-based learning holds the promise of achieving scalable multi-agent learning by decomposing complex tasks using roles. However, it is largely unclear how to efficiently discover such a set of roles. To solve this problem, we propose to first decompose joint action spaces into restricted role action spaces by clustering actions according to their effects on the environment and other agents. Learning a role selector based on action effects makes role discovery much easier because it forms a bi-level learning hierarchy: the role...	Anuj Mahajan, Bei Peng, Chongjie Zhang, Shimon Whiteson, Tarun Gupta, Tonghan Wang
333	Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction	Learning to predict the long-term future of video frames is notoriously challenging due to the inherent ambiguities in a distant future and dramatic amplification of prediction error over time. Despite the recent advances in the literature, existing approaches are limited to moderately short-term prediction (less than a few seconds), while extrapolating it to a longer future quickly leads to destruction in structure and content. In this work, we revisit the hierarchical models in video prediction. Our method generates future frames by...	Han Zhang, Honglak Lee, Hyungsuk Yoon, Jing Yu Koh, Seunghoon Hong, Thomas E. Huang, Ting Chen, Whie Jung, Wonkwang Lee
334	Physics-aware, probabilistic model order reduction with guaranteed stability	Given (small amounts of) time-series' data from a high-dimensional, fine-grained, multiscale dynamical system, we propose a generative framework for learning an effective, lower-dimensional, coarse-grained dynamical model that is predictive of the fine-grained system's long-term evolution but also of its behavior under different initial conditions. We target fine-grained models as they arise in physical applications (e.g. molecular dynamics, agent-based models), the dynamics of which are strongly non-stationary but their transition to...	Phaedon-Stelios Koutsourelakis, Sebastian Kaltenbach
335	Modelling Hierarchical Structure between Dialogue Policy and Natural Language Generator with Option Framework for Task-oriented Dialogue System	Designing task-oriented dialogue systems is a challenging research topic, since it needs not only to generate utterances fulfilling user requests but also to guarantee the comprehensibility. Many previous works trained end-to-end (E2E) models with supervised learning (SL), however, the bias in annotated system utterances remains as a bottleneck. Reinforcement learning (RL) deals with the problem through using non-differentiable evaluation metrics (e.g., the success rate) as rewards. Nonetheless, existing works with RL showed that the...	Jianhong Wang, Tae-Kyun Kim, Yuan Zhang, Yunjie Gu
336	Learning explanations that are hard to vary	In this paper, we investigate the principle that good explanations are hard to vary in the context of deep learning. We show that averaging gradients across examples -- akin to a logical OR of patterns -- can favor memorization and 'patchwork' solutions that sew together different strategies, instead of identifying invariances. To inspect this, we first formalize a notion of consistency for minima of the loss surface, which measures to what extent a minimum appears only when examples are pooled. We then propose and experimentally...	Alexander Neitz, Antonio Orvieto, Bernhard Schölkopf, Giambattista Parascandolo, Luigi Gresele
337	Efficient Generalized Spherical CNNs	Many problems across computer vision and the natural sciences require the analysis of spherical data, for which representations may be learned efficiently by encoding equivariance to rotational symmetries. We present a generalized spherical CNN framework that encompasses various existing approaches and allows them to be leveraged alongside each other. The only existing non-linear spherical CNN layer that is strictly equivariant has complexity $\mathcal{O}(C^2L^5)$, where $C$ is a measure of representational capacity and $L$ the...	Augustin Marignier, Augustine N. Mavor-Parker, Christopher G. R. Wallis, Jason D. McEwen, Matthew A. Price, Mayeul d'Avezac, Oliver J. Cobb
338	Collective Robustness Certificates: Exploiting Interdependence in Graph Neural Networks	In tasks like node classification, image segmentation, and named-entity recognition we have a classifier that simultaneously outputs multiple predictions (a vector of labels) based on a single input, i.e. a single graph, image, or document respectively. Existing adversarial robustness certificates consider each prediction independently and are thus overly pessimistic for such tasks. They implicitly assume that an adversary can use different perturbed inputs to attack different predictions, ignoring the fact that we have a single shared...	Aleksandar Bojchevski, Jan Schuchardt, Johannes Klicpera, Stephan Günnemann
339	Entropic gradient descent algorithms and wide flat minima	The properties of flat minima in the empirical risk landscape of neural networks have been debated for some time. Increasing evidence suggests they possess better generalization capabilities with respect to sharp ones. In this work we first discuss the relationship between alternative measures of flatness: The local entropy, which is useful for analysis and algorithm development, and the local energy, which is easier to compute and was shown empirically in extensive tests on state-of-the-art networks to be the best predictor of...	Carlo Baldassi, Carlo Lucibello, Christoph Feinauer, Elizaveta Demyanenko, Fabrizio Pittorino, Gabriele Perugini, Riccardo Zecchina
340	Achieving Linear Speedup with Partial Worker Participation in Non-IID Federated Learning	Federated learning (FL) is a distributed machine learning architecture that leverages a large number of workers to jointly learn a model with decentralized data. FL has received increasing attention in recent years thanks to its data privacy protection, communication efficiency and a linear speedup for convergence in training (i.e., convergence performance increases linearly with respect to the number of workers). However, existing studies on linear speedup for convergence are only limited to the assumptions of i.i.d. datasets across...	Haibo Yang, Jia Liu, Minghong Fang
341	Categorical Normalizing Flows via Continuous Transformations	Despite their popularity, to date, the application of normalizing flows on categorical data stays limited. The current practice of using dequantization to map discrete data to a continuous space is inapplicable as categorical data has no intrinsic order. Instead, categorical data have complex and latent relations that must be inferred, like the synonymy between words. In this paper, we investigate Categorical Normalizing Flows, that is normalizing flows for categorical data. By casting the encoding of categorical data in continuous...	Efstratios Gavves, Phillip Lippe
342	Learning to Represent Action Values as a Hypergraph on the Action Vertices	Action-value estimation is a critical component of many reinforcement learning (RL) methods whereby sample complexity relies heavily on how fast a good estimator for action value can be learned. By viewing this problem through the lens of representation learning, good representations of both state and action can facilitate action-value estimation. While advances in deep learning have seamlessly driven progress in learning state representations, given the specificity of the notion of agency to RL, little attention has been paid to...	Arash Tavakoli, Mehdi Fatemi, Petar Kormushev
343	Debiasing Concept-based Explanations with Causal Analysis	Concept-based explanation approach is a popular model interpretability tool because it expresses the reasons for a model's predictions in terms of concepts that are meaningful for the domain experts. In this work, we study the problem of the concepts being correlated with confounding information in the features. We propose a new causal prior graph for modeling the impacts of unobserved variables and a method to remove the impact of confounding information and noise using a two-stage regression technique borrowed from the instrumental...	David Heckerman, Mohammad Taha Bahadori
344	Lifelong Learning of Compositional Structures	A hallmark of human intelligence is the ability to construct self-contained chunks of knowledge and adequately reuse them in novel combinations for solving different yet structurally related problems. Learning such compositional structures has been a significant challenge for artificial systems, due to the combinatorial nature of the underlying search problem. To date, research into compositional learning has largely proceeded separately from work on lifelong or continual learning. We integrate these two lines of work to present a...	Eric Eaton, Jorge A. Mendez
345	Rethinking Embedding Coupling in Pre-trained Language Models	We re-evaluate the standard practice of sharing weights between input and output embeddings in state-of-the-art pre-trained language models. We show that decoupled embeddings provide increased modeling flexibility, allowing us to significantly improve the efficiency of parameter allocation in the input embedding of multilingual models. By reallocating the input embedding parameters in the Transformer layers, we achieve dramatically better performance on standard natural language understanding tasks with the same number of parameters...	Henry Tsai, Hyung Won Chung, Melvin Johnson, Sebastian Ruder, Thibault Févry
346	Creative Sketch Generation	Sketching or doodling is a popular creative activity that people engage in. However, most existing work in automatic sketch understanding or generation has focused on sketches that are quite mundane. In this work, we introduce two datasets of creative sketches -- Creative Birds and Creative Creatures -- containing 10k sketches each along with part annotations. We propose DoodlerGAN -- a part-based Generative Adversarial Network (GAN) -- to generate unseen compositions of novel part appearances. Quantitative evaluations as well as human...	Devi Parikh, Larry Zitnick, Songwei Ge, Vedanuj Goswami
347	Concept Learners for Few-Shot Learning	Developing algorithms that are able to generalize to a novel task given only a few labeled examples represents a fundamental challenge in closing the gap between machine- and human-level performance. The core of human cognition lies in the structured, reusable concepts that help us to rapidly adapt to new tasks and provide reasoning behind our decisions. However, existing meta-learning methods learn complex representations across prior labeled tasks without imposing any structure on the learned representations. Here we propose COMET, a...	Jure Leskovec, Kaidi Cao, Maria Brbic
348	Domain Generalization with MixStyle	Though convolutional neural networks (CNNs) have demonstrated remarkable ability in learning discriminative features, they often generalize poorly to unseen domains. Domain generalization aims to address this problem by learning from a set of source domains a model that is generalizable to any unseen domain. In this paper, a novel approach is proposed based on probabilistically mixing instance-level feature statistics of training samples across source domains. Our method, termed MixStyle, is motivated by the observation that visual...	Kaiyang Zhou, Tao Xiang, Yongxin Yang, Yu Qiao
349	DeLighT: Deep and Light-weight Transformer	We introduce a deep and light-weight transformer, DeLighT, that delivers similar or better performance than standard transformer-based models with significantly fewer parameters. DeLighT more efficiently allocates parameters both (1) within each Transformer block using the DeLighT transformation, a deep and light-weight transformation and (2) across blocks using block-wise scaling, that allows for shallower and narrower DeLighT blocks near the input and wider and deeper DeLighT blocks near the output. Overall, DeLighT networks are 2.5...	Hannaneh Hajishirzi, Luke Zettlemoyer, Marjan Ghazvininejad, Sachin Mehta, Srinivasan Iyer
350	Single-Timescale Actor-Critic Provably Finds Globally Optimal Policy	We study the global convergence and global optimality of actor-critic, one of the most popular families of reinforcement learning algorithms. While most existing works on actor-critic employ bi-level or two-timescale updates, we focus on the more practical single-timescale setting, where the actor and critic are updated simultaneously. Specifically, in each iteration, the critic update is obtained by applying the Bellman evaluation operator only once while the actor is updated in the policy gradient direction computed using the critic....	Zhaoran Wang, Zhuoran Yang, Zuyue Fu
351	Mastering Atari with Discrete World Models	Intelligent agents need to generalize from past experience to achieve goals in complex environments. World models facilitate such generalization and allow learning behaviors from imagined outcomes to increase sample-efficiency. While learning world models from image inputs has recently become feasible for some tasks, modeling Atari games accurately enough to derive successful behaviors has remained an open challenge for many years. We introduce DreamerV2, a reinforcement learning agent that learns behaviors purely from predictions in...	Danijar Hafner, Jimmy Ba, Mohammad Norouzi, Timothy P. Lillicrap
352	Learning Neural Event Functions for Ordinary Differential Equations	The existing Neural ODE formulation relies on an explicit knowledge of the termination time. We extend Neural ODEs to implicitly defined termination criteria modeled by neural event functions, which can be chained together and differentiated through. Neural Event ODEs are capable of modeling discrete and instantaneous changes in a continuous-time system, without prior knowledge of when these changes should occur or how many such changes should exist. We test our approach in modeling hybrid discrete- and continuous- systems such as...	Brandon Amos, Maximilian Nickel, Ricky T. Q. Chen
353	Contemplating Real-World Object Classification	Deep object recognition models have been very successful over benchmark datasets such as ImageNet. How accurate and robust are they to distribution shifts arising from natural and synthetic variations in datasets? Prior research on this problem has primarily focused on ImageNet variations (e.g., ImageNetV2, ImageNet-A). To avoid potential inherited biases in these studies, we take a different approach. Specifically, we reanalyze the ObjectNet dataset recently proposed by Barbu et al. containing objects in daily life situations. They...	Ali Borji
354	Neural Spatio-Temporal Point Processes	We propose a new class of parameterizations for spatio-temporal point processes which leverage Neural ODEs as a computational method and enable flexible, high-fidelity models of discrete events that are localized in continuous time and space. Central to our approach is a combination of continuous-time neural networks with two novel neural architectures, i.e., Jump and Attentive Continuous-time Normalizing Flows. This approach allows us to learn complex distributions for both the spatial and temporal domain and to condition non-trivially...	Brandon Amos, Maximilian Nickel, Ricky T. Q. Chen
355	Generative Time-series Modeling with Fourier Flows	Generating synthetic time-series data is crucial in various application domains, such as medical prognosis, wherein research is hamstrung by the lack of access to data due to concerns over privacy. Most of the recently proposed methods for generating synthetic time-series rely on implicit likelihood modeling using generative adversarial networks (GANs)—but such models can be difficult to train, and may jeopardize privacy by “memorizing” temporal patterns in training data. In this paper, we propose an explicit likelihood model based on...	Ahmed M. Alaa, Alex James Chan, Mihaela van der Schaar
356	DOP: Off-Policy Multi-Agent Decomposed Policy Gradients	Multi-agent policy gradient (MAPG) methods recently witness vigorous progress. However, there is a significant performance discrepancy between MAPG methods and state-of-the-art multi-agent value-based approaches. In this paper, we investigate causes that hinder the performance of MAPG algorithms and present a multi-agent decomposed policy gradient method (DOP). This method introduces the idea of value function decomposition into the multi-agent actor-critic framework. Based on this idea, DOP supports efficient off-policy learning and...	Beining Han, Chongjie Zhang, Heng Dong, Tonghan Wang, Yihan Wang
357	The Risks of Invariant Risk Minimization	Invariant Causal Prediction (Peters et al., 2016) is a technique for out-of-distribution generalization which assumes that some aspects of the data distribution vary across the training set but that the underlying causal mechanisms remain constant. Recently, Arjovsky et al. (2019) proposed Invariant Risk Minimization (IRM), an objective based on this idea for learning deep, invariant features of data which are a complex function of latent variables; many alternatives have subsequently been suggested. However, formal guarantees for all...	Andrej Risteski, Elan Rosenfeld, Pradeep Kumar Ravikumar
358	DynaTune: Dynamic Tensor Program Optimization in Deep Neural Network Compilation	Recently, the DL compiler, together with Learning to Compile has proven to be a powerful technique for optimizing deep learning models. However, existing methods focus on accelerating the convergence speed of the individual tensor operator rather than the convergence speed of the entire model, which results in long optimization time to obtain a desired latency. In this paper, we present a new method called DynaTune, which provides significantly faster convergence speed to optimize a DNN model. In particular, we consider a Multi-Armed...	Chi Wang, Menghao Li, Mingqin Li, Minjia Zhang
359	Bag of Tricks for Adversarial Training	Adversarial training (AT) is one of the most effective strategies for promoting model robustness. However, recent benchmarks show that most of the proposed improvements on AT are less effective than simply early stopping the training procedure. This counter-intuitive fact motivates us to investigate the implementation details of tens of AT methods. Surprisingly, we find that the basic settings (e.g., weight decay, training schedule, etc.) used in these methods are highly inconsistent. In this work, we provide comprehensive evaluations...	Hang Su, Jun Zhu, Tianyu Pang, Xiao Yang, Yinpeng Dong
360	Learning with Instance-Dependent Label Noise: A Sample Sieve Approach	Human-annotated labels are often prone to noise, and the presence of such noise will degrade the performance of the resulting deep neural network (DNN) models. Much of the literature (with several recent exceptions) of learning with noisy labels focuses on the case when the label noise is independent of features. Practically, annotations errors tend to be instance-dependent and often depend on the difficulty levels of recognizing a certain task. Applying existing results from instance-independent settings would require a significant...	Hao Cheng, Xing Sun, Xingyu Li, Yang Liu, Yifei Gong, Zhaowei Zhu
361	Efficient Reinforcement Learning in Factored MDPs with Application to Constrained RL	Reinforcement learning (RL) in episodic, factored Markov decision processes (FMDPs) is studied. We propose an algorithm called FMDP-BF, which leverages the factorization structure of FMDP. The regret of FMDP-BF is shown to be exponentially smaller than that of optimal algorithms designed for non-factored MDPs, and improves on the best previous result for FMDPs (Osband & Van Roy, 2014) by a factor of $\sqrt{nH\lvert\mathcal{S}_i\rvert}$, where $\lvert\mathcal{S}_i\rvert$ is the cardinality of the factored state subspace, $H$ is the planning horizon and...	Jiachen Hu, Lihong Li, Liwei Wang, Xiaoyu Chen
362	Unbiased Teacher for Semi-Supervised Object Detection	Semi-supervised learning, i.e., training networks with both labeled and unlabeled data, has made significant progress recently. However, existing works have primarily focused on image classification tasks and neglected object detection which requires more annotation effort. In this work, we revisit the Semi-Supervised Object Detection (SS-OD) and identify the pseudo-labeling bias issue in SS-OD. To address this, we introduce Unbiased Teacher, a simple yet effective approach that jointly trains a student and a gradually progressing...	Bichen Wu, Chia-Wen Kuo, Chih-Yao Ma, Kan Chen, Peizhao Zhang, Peter Vajda, Yen-Cheng Liu, Zijian He, Zsolt Kira
363	Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks	Deep neural networks (DNNs) are known vulnerable to backdoor attacks, a training time attack that injects a trigger pattern into a small proportion of training data so as to control the model's prediction at the test time. Backdoor attacks are notably dangerous since they do not affect the model's performance on clean examples, yet can fool the model to make the incorrect prediction whenever the trigger pattern appears during testing. In this paper, we propose a novel defense framework Neural Attention Distillation (NAD) to erase...	Bo Li, Lingjuan Lyu, Nodens Koren, Xingjun Ma, Xixiang Lyu, Yige Li
364	Contrastive Learning with Adversarial Perturbations for Conditional Text Generation	Recently, sequence-to-sequence (seq2seq) models with the Transformer architecture have achieved remarkable performance on various conditional text generation tasks, such as machine translation. However, most of them are trained with teacher forcing with the ground truth label given at each time step, without being exposed to incorrectly generated tokens during training, which hurts its generalization to unseen inputs, that is known as the "exposure bias" problem. In this work, we propose to solve the conditional text generation problem...	Dong Bok Lee, Seanie Lee, Sung Ju Hwang
365	When Optimizing f-Divergence is Robust with Label Noise	We show when maximizing a properly defined $f$-divergence measure with respect to a classifier's predictions and the supervised labels is robust with label noise. Leveraging its variational form, we derive a nice decoupling property for a family of $f$-divergence measures when label noise presents, where the divergence is shown to be a linear combination of the variational difference defined on the clean distribution and a bias term introduced due to the noise. The above derivation helps us analyze the robustness of different...	Jiaheng Wei, Yang Liu
366	Conditional Generative Modeling via Learning the Latent Space	Although deep learning has achieved appealing results on several machine learning tasks, most of the models are deterministic at inference, limiting their application to single-modal settings. We propose a novel general-purpose framework for conditional generation in multimodal spaces, that uses latent variables to model generalizable learning patterns while minimizing a family of regression cost functions. At inference, the latent variables are optimized to find solutions corresponding to multiple output modes. Compared to existing...	Kanchana Nisal Ranasinghe, Nick Barnes, Salman H. Khan, Sameera Ramasinghe, Stephen Gould
367	Text Generation by Learning from Demonstrations	Current approaches to text generation largely rely on autoregressive models and maximum likelihood estimation. This paradigm leads to (i) diverse but low-quality samples due to mismatched learning objective and evaluation metric (likelihood vs. quality) and (ii) exposure bias due to mismatched history distributions (gold vs. model-generated). To alleviate these problems, we frame text generation as an offline reinforcement learning (RL) problem with expert demonstrations (i.e., the reference), where the goal is to maximize quality...	He He, Richard Yuanzhe Pang
368	Learning Long-term Visual Dynamics with Region Proposal Interaction Networks	Learning long-term dynamics models is the key to understanding physical common sense. Most existing approaches on learning dynamics from visual input sidestep long-term predictions by resorting to rapid re-planning with short-term models. This not only requires such models to be super accurate but also limits them only to tasks where an agent can continuously obtain feedback and take action at each step until completion. In this paper, we aim to leverage the ideas from success stories in visual recognition tasks to build object...	Deepak Pathak, Haozhi Qi, Jitendra Malik, Xiaolong Wang, Yi Ma
369	ChipNet: Budget-Aware Pruning with Heaviside Continuous Approximations	Structured pruning methods are among the effective strategies for extracting small resource-efficient convolutional neural networks from their dense counterparts with minimal loss in accuracy. However, most existing methods still suffer from one or more limitations, that include 1) the need for training the dense model from scratch with pruning-related parameters embedded in the architecture, 2) requiring model-specific hyperparameter settings, 3) inability to include budget-related constraint in the training process, and 4)...	Arnav Chavan, Deepak K. Gupta, Rishabh Tiwari, Udbhav Bamba
370	Learning to Deceive Knowledge Graph Augmented Models via Targeted Perturbation	Knowledge graphs (KGs) have helped neural models improve performance on various knowledge-intensive tasks, like question answering and item recommendation. By using attention over the KG, such KG-augmented models can also "explain" which KG information was most relevant for making a given prediction. In this paper, we question whether these models are really behaving as we expect. We show that, through a reinforcement learning policy (or even simple heuristics), one can produce deceptively perturbed KGs, which maintain the downstream...	Aaron Chan, Handong Zhao, Hansen Wang, Mrigank Raman, Nedim Lipka, Peifeng Wang, Ryan A. Rossi, Siddhant Agarwal, Sungchul Kim, Xiang Ren
371	IEPT: Instance-Level and Episode-Level Pretext Tasks for Few-Shot Learning	The need of collecting large quantities of labeled training data for each new task has limited the usefulness of deep neural networks. Given data from a set of source tasks, this limitation can be overcome using two transfer learning approaches: few-shot learning (FSL) and self-supervised learning (SSL). The former aims to learn 'how to learn' by designing learning episodes using source tasks to simulate the challenge of solving the target new task with few labeled samples. In contrast, the latter exploits an annotation-free pretext...	Jianhong Zhang, Manli Zhang, Mingyu Ding, Songfang Huang, Tao Xiang, Zhiwu Lu
372	The Role of Momentum Parameters in the Optimal Convergence of Adaptive Polyak's Heavy-ball Methods	The adaptive stochastic gradient descent (SGD) with momentum has been widely adopted in deep learning as well as convex optimization. In practice, the last iterate is commonly used as the final solution. However, the available regret analysis and the setting of constant momentum parameters only guarantee the optimal convergence of the averaged solution. In this paper, we fill this theory-practice gap by investigating the convergence of the last iterate (referred to as individual convergence), which is a more difficult task than...	Gaowei Wu, Qing Tao, Sheng Long, Wei Tao
373	Training with Quantization Noise for Extreme Model Compression	We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator. In this paper, we extend this approach to work with extreme compression methods where the approximations introduced by STE are severe. Our proposal is to only quantize a different random subset of weights during each forward, allowing for unbiased...	Angela Fan, Armand Joulin, Benjamin Graham, Edouard Grave, Hervé Jégou, Pierre Stock, Rémi Gribonval
374	Adaptive Extra-Gradient Methods for Min-Max Optimization and Games	We present a new family of min-max optimization algorithms that automatically exploit the geometry of the gradient data observed at earlier iterations to perform more informative extra-gradient steps in later ones. Thanks to this adaptation mechanism, the proposed method automatically detects whether the problem is smooth or not, without requiring any prior tuning by the optimizer. As a result, the algorithm simultaneously achieves order-optimal convergence rates, i.e., it converges to an $\varepsilon$-optimal solution within...	Elena Veronica Belmega, Kimon Antonakopoulos, Panayotis Mertikopoulos
375	Distilling Knowledge from Reader to Retriever for Question Answering	The task of information retrieval is an important component of many natural language processing systems, such as open domain question answering. While traditional methods were based on hand-crafted features, continuous representations based on neural networks recently obtained competitive results. A challenge of using such methods is to obtain supervised data to train the retriever model, corresponding to pairs of query and support documents. In this paper, we propose a technique to learn retriever models for downstream tasks, inspired...	Edouard Grave, Gautier Izacard
376	Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization	We propose a simple, general and effective technique, Reward Randomization for discovering diverse strategic policies in complex multi-agent games. Combining reward randomization and policy gradient, we derive a new algorithm, Reward-Randomized Policy Gradient (RPG). RPG is able to discover a set of multiple distinctive human-interpretable strategies in challenging temporal trust dilemmas, including grid-world games and a real-world game Agar.io, where multiple equilibria exist but standard multi-agent policy gradient algorithms always...	Boyuan Chen, Chao Yu, Fei Fang, Huazhe Xu, Simon Shaolei Du, Xiaolong Wang, Yi Wu, Yu Wang, Zhenggang Tang
377	not-MIWAE: Deep Generative Modelling with Missing not at Random Data	When a missing process depends on the missing values themselves, it needs to be explicitly modelled and taken into account while doing likelihood-based inference. We present an approach for building and fitting deep latent variable models (DLVMs) in cases where the missing process is dependent on the missing data. Specifically, a deep neural network enables us to flexibly model the conditional distribution of the missingness pattern given the data. This allows for incorporating prior information about the type of missingness...	Jes Frellsen, Niels Bruun Ipsen, Pierre-Alexandre Mattei
378	IDF++: Analyzing and Improving Integer Discrete Flows for Lossless Compression	In this paper we analyse and improve integer discrete flows for lossless compression. Integer discrete flows are a recently proposed class of models that learn invertible transformations for integer-valued random variables. Their discrete nature makes them particularly suitable for lossless compression with entropy coding schemes. We start by investigating a recent theoretical claim that states that invertible flows for discrete random variables are less flexible than their continuous counterparts. We demonstrate with a proof that this...	Alexey A. Gritsenko, Casper Kaae Sønderby, Mostafa Dehghani, Rianne van den Berg, Tim Salimans
379	Explaining by Imitating: Understanding Decisions by Interpretable Policy Learning	Understanding human behavior from observed data is critical for transparency and accountability in decision-making. Consider real-world settings such as healthcare, in which modeling a decision-maker’s policy is challenging—with no access to underlying states, no knowledge of environment dynamics, and no allowance for live experimentation. We desire learning a data-driven representation of decision-making behavior that (1) inheres transparency by design, (2) accommodates partial observability, and (3) operates completely offline. To...	Alihan Hüyük, Cem Tekin, Daniel Jarrett, Mihaela van der Schaar
380	Learning with AMIGo: Adversarially Motivated Intrinsic Goals	A key challenge for reinforcement learning (RL) consists of learning in environments with sparse extrinsic rewards. In contrast to current RL methods, humans are able to learn new skills with little or no reward by using various forms of intrinsic motivation. We propose AMIGo, a novel agent incorporating -- as form of meta-learning -- a goal-generating teacher that proposes Adversarially Motivated Intrinsic Goals to train a goal-conditioned "student" policy in the absence of (or alongside) environment reward. Specifically, through a...	Andres Campero, Edward Grefenstette, Heinrich Küttler, Joshua B. Tenenbaum, Roberta Raileanu, Tim Rocktäschel
381	Incorporating Symmetry into Deep Dynamics Models for Improved Generalization	Recent work has shown deep learning can accelerate the prediction of physical dynamics relative to numerical solvers. However, limited physical accuracy and an inability to generalize under distributional shift limit its applicability to the real world. We propose to improve accuracy and generalization by incorporating symmetries into convolutional neural networks. Specifically, we employ a variety of methods each tailored to enforce a different symmetry. Our models are both theoretically and experimentally robust to distributional...	Robin Walters, Rose Yu, Rui Wang
382	CaPC Learning: Confidential and Private Collaborative Learning	Machine learning benefits from large training datasets, which may not always be possible to collect by any single entity, especially when using privacy-sensitive data. In many contexts, such as healthcare and finance, separate parties may wish to collaborate and learn from each other's data but are prevented from doing so due to privacy regulations. Some regulations prevent explicit sharing of data between parties by joining datasets in a central location (confidentiality). Others also limit implicit sharing of data, e.g., through...	Adam Dziedzic, Christopher A. Choquette-Choo, Natalie Dullerud, Nicolas Papernot, Somesh Jha, Xiao Wang, Yunxiang Zhang
383	Heating up decision boundaries: isocapacitory saturation, adversarial scenarios and generalization bounds	In the present work we study classifiers' decision boundaries via Brownian motion processes in ambient data space and associated probabilistic techniques. Intuitively, our ideas correspond to placing a heat source at the decision boundary and observing how effectively the sample points warm up. We are largely motivated by the search for a soft measure that sheds further light on the decision boundary's geometry. En route, we bridge aspects of potential theory and geometric analysis (Maz'ya 2011, Grigor'Yan and Saloff-Coste 2002) with...	Bogdan Georgiev, Lukas Franken, Mayukh Mukherjee
384	A PAC-Bayesian Approach to Generalization Bounds for Graph Neural Networks	In this paper, we derive generalization bounds for two primary classes of graph neural networks (GNNs), namely graph convolutional networks (GCNs) and message passing GNNs (MPGNNs), via a PAC-Bayesian approach. Our result reveals that the maximum node degree and the spectral norm of the weights govern the generalization bounds of both models. We also show that our bound for GCNs is a natural generalization of the results developed in (Neyshabur et al., 2017) for fully-connected and convolutional neural networks. For MPGNNs, our...	Raquel Urtasun, Renjie Liao, Richard S. Zemel
385	Clairvoyance: A Pipeline Toolkit for Medical Time Series	Time-series learning is the bread and butter of data-driven clinical decision support, and the recent explosion in ML research has demonstrated great potential in various healthcare settings. At the same time, medical time-series problems in the wild are challenging due to their highly composite nature: They entail design choices and interactions among components that preprocess data, impute missing values, select features, issue predictions, estimate uncertainty, and interpret models. Despite exponential growth in electronic...	Ari Ercole, Daniel Jarrett, Ioana Bica, Jinsung Yoon, Mihaela van der Schaar, Zhaozhi Qian
386	Self-supervised Representation Learning with Relative Predictive Coding	This paper introduces Relative Predictive Coding (RPC), a new contrastive representation learning objective that maintains a good balance among training stability, minibatch size sensitivity, and downstream task performance. The key to the success of RPC is two-fold. First, RPC introduces the relative parameters to regularize the objective for boundedness and low variance. Second, RPC contains no logarithm and exponential score functions, which are the main cause of training instability in prior contrastive objectives. We empirically...	Han Zhao, Louis-Philippe Morency, Martin Q. Ma, Muqiao Yang, Ruslan Salakhutdinov, Yao-Hung Hubert Tsai
387	Offline Model-Based Optimization via Normalized Maximum Likelihood Estimation	In this work we consider data-driven optimization problems where one must maximize a function given only queries at a fixed set of points. This problem setting emerges in many domains where function evaluation is a complex and expensive process, such as in the design of materials, vehicles, or neural network architectures. Because the available data typically only covers a small manifold of the possible space of inputs, a principal challenge is to be able to construct algorithms that can reason about uncertainty and out-of-distribution...	Justin Fu, Sergey Levine
388	On the Impossibility of Global Convergence in Multi-Loss Optimization	Under mild regularity conditions, gradient-based methods converge globally to a critical point in the single-loss setting. This is known to break down for vanilla gradient descent when moving to multi-loss optimization, but can we hope to build some algorithm with global guarantees? We negatively resolve this open problem by proving that desirable convergence properties cannot simultaneously hold for any algorithm. Our result has more to do with the existence of games with no satisfactory outcomes, than with algorithms per se. More...	Alistair Letcher
389	A Block Minifloat Representation for Training Deep Neural Networks	Training Deep Neural Networks (DNN) with high efficiency can be difficult to achieve with native floating-point representations and commercially available hardware. Specialized arithmetic with custom acceleration offers perhaps the most promising alternative. Ongoing research is trending towards narrow floating-point representations, called minifloats, that pack more operations for a given silicon area and consume less power. In this paper, we introduce Block Minifloat (BM), a new spectrum of minifloat formats capable of training DNNs...	David Boland, Julian Faraone, Philip H. W. Leong, Sean Fox, Seyedramin Rasoulinezhad
390	Selectivity considered harmful: evaluating the causal impact of class selectivity in DNNs	The properties of individual neurons are often analyzed in order to understand the biological and artificial neural networks in which they're embedded. Class selectivity—typically defined as how different a neuron's responses are across different classes of stimuli or data samples—is commonly used for this purpose. However, it remains an open question whether it is necessary and/or sufficient for deep neural networks (DNNs) to learn class selectivity in individual units. We investigated the causal impact of class selectivity on network...	Ari S. Morcos, Matthew L. Leavitt
391	Discrete Graph Structure Learning for Forecasting Multiple Time Series	Time series forecasting is an extensively studied subject in statistics, economics, and computer science. Exploration of the correlation and causation among the variables in a multivariate time series shows promise in enhancing the performance of a time series model. When using deep neural networks as forecasting models, we hypothesize that exploiting the pairwise information among multiple (multivariate) time series also improves their forecast. If an explicit graph structure is known, graph neural networks (GNNs) have been...	Chao Shang, Jie Chen, Jinbo Bi
392	Contrastive Learning with Hard Negative Samples	We consider the question: how can you sample good negative examples for contrastive learning? We argue that, as with metric learning, learning contrastive representations benefits from hard negative samples (i.e., points that are difficult to distinguish from an anchor point). The key challenge toward using hard negatives is that contrastive methods must remain unsupervised, making it infeasible to adopt existing negative sampling strategies that use label information. In response, we develop a new class of unsupervised methods for...	Ching-Yao Chuang, Joshua David Robinson, Stefanie Jegelka, Suvrit Sra
393	Intraclass clustering: an implicit learning ability that regularizes DNNs	Several works have shown that the regularization mechanisms underlying deep neural networks' generalization performances are still poorly understood. In this paper, we hypothesize that deep neural networks are regularized through their ability to extract meaningful clusters among the samples of a class. This constitutes an implicit form of regularization, as no explicit training mechanisms or supervision target such behaviour. To support our hypothesis, we design four different measures of intraclass clustering, based on the neuron-...	Christophe De Vleeschouwer, Simon Carbonnelle
394	Sliced Kernelized Stein Discrepancy	Kernelized Stein discrepancy (KSD), though being extensively used in goodness-of-fit tests and model learning, suffers from the curse-of-dimensionality. We address this issue by proposing the sliced Stein discrepancy and its scalable and kernelized variants, which employs kernel-based test functions defined on the optimal one-dimensional projections. When applied to goodness-of-fit tests, extensive experiments show the proposed discrepancy significantly outperforms KSD and various baselines in high dimensions. For model learning, we...	José Miguel Hernández-Lobato, Wenbo Gong, Yingzhen Li
395	Denoising Diffusion Implicit Models	Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps in order to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a particular Markovian diffusion process. We generalize DDPMs via a class of...	Chenlin Meng, Jiaming Song, Stefano Ermon
396	Hierarchical Reinforcement Learning by Discovering Intrinsic Options	We propose a hierarchical reinforcement learning method, HIDIO, that can learn task-agnostic options in a self-supervised manner while jointly learning to utilize them to solve sparse-reward tasks. Unlike current hierarchical RL approaches that tend to formulate goal-reaching low-level tasks or pre-define ad hoc lower-level policies, HIDIO encourages lower-level option learning that is independent of the task at hand, requiring few assumptions or little knowledge about the task structure. These options are learned through an intrinsic...	Haonan Yu, Jesse Zhang, Wei Xu
397	Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval	We propose a simple and efficient multi-hop dense retrieval approach for answering complex open-domain questions, which achieves state-of-the-art performance on two multi-hop datasets, HotpotQA and multi-evidence FEVER. Contrary to previous work, our method does not require access to any corpus-specific information, such as inter-document hyperlinks or human-annotated entity markers, and can be applied to any unstructured text corpus. Our system also yields a much better efficiency-accuracy trade-off, matching the best published...	Barlas Oguz, Douwe Kiela, Jingfei Du, Patrick Lewis, Scott Yih, Sebastian Riedel, Srini Iyer, Wenhan Xiong, William Yang Wang, Xiang Lorraine Li, Yashar Mehdad
398	Rethinking Soft Labels for Knowledge Distillation: A Bias-Variance Tradeoff Perspective	Knowledge distillation is an effective approach to leverage a well-trained network or an ensemble of them, named as the teacher, to guide the training of a student network. The outputs from the teacher network are used as soft labels for supervising the training of a new network. Recent studies (Müller et al., 2019; Yuan et al., 2020) revealed an intriguing property of the soft labels that making labels soft serves as a good regularization to the student network. From the perspective of statistical learning, regularization aims to...	Guoli Wang, Helong Zhou, Jiajie Chen, Junsong Yuan, Liangchen Song, Qian Zhang, Ye Zhou
399	A Design Space Study for LISTA and Beyond	In recent years, great success has been witnessed in building problem-specific deep networks from unrolling iterative algorithms, for solving inverse problems and beyond. Unrolling is believed to incorporate the model-based prior with the learning capacity of deep learning. This paper revisits the role of unrolling as a design approach for deep networks: to what extent its resulting special architecture is superior, and can we find better? Using LISTA for sparse recovery as a representative example, we conduct the first...	Tianjian Meng, Xiaohan Chen, Yifan Jiang, Zhangyang Wang
400	What Should Not Be Contrastive in Contrastive Learning	Recent self-supervised contrastive methods have been able to produce impressive transferable visual representations by learning to be invariant to different data augmentations. However, these methods implicitly assume a particular set of representational invariances (e.g., invariance to color), and can perform poorly when a downstream task violates this assumption (e.g., distinguishing red vs. yellow cars). We introduce a contrastive learning framework which does not require prior knowledge of specific, task-dependent invariances. Our...	Alexei A. Efros, Tete Xiao, Trevor Darrell, Xiaolong Wang
401	Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth	A key factor in the success of deep neural networks is the ability to scale models to improve performance by varying the architecture depth and width. This simple property of neural network design has resulted in highly effective architectures for a variety of tasks. Nevertheless, there is limited understanding of effects of depth and width on the learned representations. In this paper, we study this fundamental question. We begin by investigating how varying depth and width affects model hidden representations, finding a...	Maithra Raghu, Simon Kornblith, Thao Nguyen
402	Learning to Set Waypoints for Audio-Visual Navigation	In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations. We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements: 1) waypoints that are dynamically set and learned end-to-end within the navigation policy, and 2) an...	Changan Chen, Kristen Grauman, Ruohan Gao, Sagnik Majumder, Santhosh Kumar Ramakrishnan, Ziad Al-Halah
403	Semi-supervised Keypoint Localization	Knowledge about the locations of keypoints of an object in an image can assist in fine-grained classification and identification tasks, particularly for the case of objects that exhibit large variations in poses that greatly influence their visual appearance, such as wild animals. However, supervised training of a keypoint detection network requires annotating a large image dataset for each animal species, which is a labor-intensive task. To reduce the need for labeled data, we propose to learn simultaneously keypoint heatmaps and pose...	Feras Dayoub, Frédéric Maire, Mahsa Baktashmotlagh, Olga Moskvyak
404	Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective	Neural Architecture Search (NAS) has been explosively studied to automate the discovery of top-performer neural networks. Current works require heavy training of supernet or intensive architecture evaluations, thus suffering from heavy resource consumption and often incurring search bias due to truncated training or approximations. Can we select the best neural architectures without involving any training and eliminate a drastic portion of the search cost? We provide an affirmative answer, by proposing a novel framework called...	Wuyang Chen, Xinyu Gong, Zhangyang Wang
405	Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers	We propose a simple, practical, and intuitive approach for domain adaptation in reinforcement learning. Our approach stems from the idea that the agent's experience in the source domain should look similar to its experience in the target domain. Building off of a probabilistic view of RL, we achieve this goal by compensating for the difference in dynamics by modifying the reward function. This modified reward function is simple to estimate by learning auxiliary classifiers that distinguish source-domain transitions from target-domain...	Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine, Shreyas Chaudhari, Swapnil Asawa
406	Federated Semi-Supervised Learning with Inter-Client Consistency & Disjoint Learning	While existing federated learning approaches mostly require that clients have fully-labeled data to train on, in realistic settings, data obtained at the client-side often comes without any accompanying labels. Such deficiency of labels may result from either high labeling cost, or difficulty of annotation due to the requirement of expert knowledge. Thus the private data at each client may be either partly labeled, or completely unlabeled with labeled data being available only at the server, which leads us to a new practical federated...	Eunho Yang, Jaehong Yoon, Sung Ju Hwang, Wonyong Jeong
407	Learning Safe Multi-agent Control with Decentralized Neural Barrier Certificates	We study the multi-agent safe control problem where agents should avoid collisions to static obstacles and collisions with each other while reaching their goals. Our core idea is to learn the multi-agent control policy jointly with learning the control barrier functions as safety certificates. We propose a new joint-learning framework that can be implemented in a decentralized fashion, which can adapt to an arbitrarily large number of agents. Building upon this framework, we further improve the scalability by incorporating neural...	Chuchu Fan, Jingkai Chen, Kaiqing Zhang, Yuxiao Chen, Zengyi Qin
408	Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate	Understanding the algorithmic bias of stochastic gradient descent (SGD) is one of the key challenges in modern machine learning and deep learning theory. Most of the existing works, however, focus on very small or even infinitesimal learning rate regime, and fail to cover practical scenarios where the learning rate is moderate and annealing. In this paper, we make an initial attempt to characterize the particular regularization effect of SGD in the moderate learning rate regime by studying its behavior for optimizing an...	Difan Zou, Jingfeng Wu, Quanquan Gu, Vladimir Braverman
409	Fast And Slow Learning Of Recurrent Independent Mechanisms	Decomposing knowledge into interchangeable pieces promises a generalization advantage when there are changes in distribution. A learning agent interacting with its environment is likely to be faced with situations requiring novel combinations of existing pieces of knowledge. We hypothesize that such a decomposition of knowledge is particularly relevant for being able to generalize in a systematic way to out-of-distribution changes. To study these ideas, we propose a particular training framework in which we assume that the pieces of...	Anirudh Goyal, Bernhard Schölkopf, Kanika Madan, Nan Rosemary Ke, Yoshua Bengio
410	Policy-Driven Attack: Learning to Query for Hard-label Black-box Adversarial Examples	To craft black-box adversarial examples, adversaries need to query the victim model and take proper advantage of its feedback. Existing black-box attacks generally suffer from high query complexity, especially when only the top-1 decision (i.e., the hard-label prediction) of the victim model is available. In this paper, we propose a novel hard-label black-box attack named Policy-Driven Attack, to reduce the query complexity. Our core idea is to learn promising search directions of the adversarial examples using a well-designed policy...	Changshui Zhang, Jian Liang, Yiwen Guo, Ziang Yan
411	A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks	Autoregressive language models, pretrained using large text corpora to do well on next word prediction, have been successful at solving many downstream tasks, even with zero-shot usage. However, there is little theoretical understanding of this success. This paper initiates a mathematical study of this phenomenon for the downstream task of text classification by considering the following questions: (1) What is the intuitive connection between the pretraining task of next word prediction and text classification? (2) How can we...	Nikunj Saunshi, Sadhika Malladi, Sanjeev Arora
412	Representation Learning for Sequence Data with Deep Autoencoding Predictive Components	We propose Deep Autoencoding Predictive Components (DAPC) -- a self-supervised representation learning method for sequence data, based on the intuition that useful representations of sequence data should exhibit a simple structure in the latent space. We encourage this latent structure by maximizing an estimate of predictive information of latent feature sequences, which is the mutual information between the past and future windows at each time step. In contrast to the mutual information lower bound commonly used by contrastive...	Caiming Xiong, Junwen Bai, Weiran Wang, Yingbo Zhou
413	A unifying view on implicit bias in training linear neural networks	We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases, and investigate the linear version of the formulation called linear tensor networks. With this formulation, we can characterize the convergence direction of the network parameters as singular vectors of a tensor defined by the network. For $L$-layer linear tensor...	Chulhee Yun, Hossein Mobahi, Shankar Krishnan
414	What Makes Instance Discrimination Good for Transfer Learning?	Contrastive visual pretraining based on the instance discrimination pretext task has made significant progress. Notably, recent work on unsupervised pretraining has shown to surpass the supervised counterpart for finetuning downstream applications such as object detection and segmentation. It comes as a surprise that image annotations would be better left unused for transfer learning. In this work, we investigate the following problems: What makes instance discrimination pretraining good for transfer learning? What knowledge is...	Nanxuan Zhao, Rynson W. H. Lau, Stephen Lin, Zhirong Wu
415	Learning Accurate Entropy Model with Global Reference for Image Compression	In recent deep image compression neural networks, the entropy model plays a critical role in estimating the prior distribution of deep image encodings. Existing methods combine hyperprior with local context in the entropy estimation function. This greatly limits their performance due to the absence of a global vision. In this work, we propose a novel Global Reference Model for image compression to effectively leverage both the local and the global context information, leading to an enhanced compression rate. The proposed method scans...	Dongyang Li, Hao Li, Ming Lin, Rong Jin, Xiuyu Sun, Yichen Qian, Zhenhong Sun, Zhiyu Tan
416	Loss Function Discovery for Object Detection via Convergence-Simulation Driven Search	Designing proper loss functions for vision tasks has been a long-standing research direction to advance the capability of existing models. For object detection, the well-established classification and regression loss functions have been carefully designed by considering diverse learning challenges (e.g. class imbalance, hard negative samples, and scale variances). Inspired by the recent progress in network architecture search, it is interesting to explore the possibility of discovering new loss function formulations via directly...	Bochao Wang, Gengwei Zhang, Hang Xu, Peidong Liu, Xiaodan Liang, Yong Jiang, Zhenguo Li
417	Effective Abstract Reasoning with Dual-Contrast Network	As a step towards improving the abstract reasoning capability of machines, we aim to solve Raven’s Progressive Matrices (RPM) with neural networks, since solving RPM puzzles is highly correlated with human intelligence. Unlike previous methods that use auxiliary annotations or assume hidden rules to produce appropriate feature representation, we only use the ground truth answer of each question for model learning, aiming for an intelligent agent to have a strong learning capability with a small amount of supervision. Based on the RPM...	Mohan S. Kankanhalli, Tao Zhuo
418	Do not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning	The privacy leakage of the model about the training data can be bounded in the differential privacy mechanism. However, for meaningful privacy parameters, a differentially private model degrades the utility drastically when the model comprises a large number of trainable parameters. In this paper, we propose an algorithm Gradient Embedding Perturbation (GEP) towards training differentially private deep models with decent accuracy. Specifically, in each gradient descent step, GEP first projects individual private gradient into a...	Da Yu, Huishuai Zhang, Tie-Yan Liu, Wei Chen
419	Set Prediction without Imposing Structure as Conditional Density Estimation	Set prediction is about learning to predict a collection of unordered variables with unknown interrelations. Training such models with set losses imposes the structure of a metric space over sets. We focus on stochastic and underdefined cases, where an incorrectly chosen loss function leads to implausible predictions. Example tasks include conditional point-cloud reconstruction and predicting future states of molecules. In this paper we propose an alternative to training via set losses, by viewing learning as conditional density...	Cees G. M. Snoek, David W. Zhang, Gertjan J. Burghouts
420	Clustering-friendly Representation Learning via Instance Discrimination and Feature Decorrelation	Clustering is one of the most fundamental tasks in machine learning. Recently, deep clustering has become a major trend in clustering techniques. Representation learning often plays an important role in the effectiveness of deep clustering, and thus can be a principal cause of performance degradation. In this paper, we propose a clustering-friendly representation learning method using instance discrimination and feature decorrelation. Our deep-learning-based representation learning method is motivated by the properties of classical...	Kentaro Takagi, Kouta Nakata, Yaling Tao
421	Language-Agnostic Representation Learning of Source Code from Structure and Context	Source code (Context) and its parsed abstract syntax tree (AST; Structure) are two complementary representations of the same computer program. Traditionally, designers of machine learning models have relied predominantly either on Structure or Context. We propose a new model, which jointly learns on Context and Structure of source code. In contrast to previous approaches, our model uses only language-agnostic features, i.e., source code and features that can be computed directly from the AST. Besides obtaining state-of-the-art on...	Daniel Zügner, Jure Leskovec, Michele Catasta, Stephan Günnemann, Tobias Kirschstein
422	Training GANs with Stronger Augmentations via Contrastive Discriminator	Recent works in Generative Adversarial Networks (GANs) are actively revisiting various data augmentation techniques as an effective way to prevent discriminator overfitting. It is still unclear, however, that which augmentations could actually improve GANs, and in particular, how to apply a wider range of augmentations in training. In this paper, we propose a novel way to address these questions by incorporating a recent contrastive representation learning scheme into the GAN discriminator, coined ContraD. This "fusion" enables the...	Jinwoo Shin, Jongheon Jeong
423	Influence Functions in Deep Learning Are Fragile	Influence functions approximate the effect of training samples in test-time predictions and have a wide variety of applications in machine learning interpretability and uncertainty estimation. A commonly-used (first-order) influence function can be implemented efficiently as a post-hoc method requiring access only to the gradients and Hessian of the model. For linear models, influence functions are well-defined due to the convexity of the underlying loss function and are generally accurate even across difficult settings where model...	Phillip Pope, Samyadeep Basu, Soheil Feizi
424	Separation and Concentration in Deep Networks	Numerical experiments demonstrate that deep neural network classifiers progressively separate class distributions around their mean, achieving linear separability on the training set, and increasing the Fisher discriminant ratio. We explain this mechanism with two types of operators. We prove that a rectifier without biases applied to sign-invariant tight frames can separate class means and increase Fisher ratios. On the opposite, a soft-thresholding on tight frames can reduce within-class variabilities while preserving class means....	Florentin Guth, John Zarka, Stéphane Mallat
425	Colorization Transformer	We present the Colorization Transformer, a novel approach for diverse high fidelity image colorization based on self-attention. Given a grayscale image, the colorization proceeds in three steps. We first use a conditional autoregressive transformer to produce a low resolution coarse coloring of the grayscale image. Our architecture adopts conditional transformer layers to effectively condition grayscale input. Two subsequent fully parallel networks upsample the coarse colored low resolution image into a finely colored high resolution...	Dirk Weissenborn, Manoj Kumar, Nal Kalchbrenner
426	Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization	Standard dynamics models for continuous control make use of feedforward computation to predict the conditional distribution of next state and reward given current state and action using a multivariate Gaussian with a diagonal covariance structure. This modeling choice assumes that different dimensions of the next state and reward are conditionally independent given the current state and action and may be driven by the fact that fully observable physics-based simulation environments entail deterministic transition dynamics. In this...	Cosmin Paduraru, George Tucker, Michael R. Zhang, Mohammad Norouzi, Ofir Nachum, Thomas Paine, Ziyu Wang
427	FedBN: Federated Learning on Non-IID Features via Local Batch Normalization	The emerging paradigm of federated learning (FL) strives to enable collaborative training of deep models on the network edge without centrally aggregating raw data and hence improving data privacy. In most cases, the assumption of independent and identically distributed samples across local clients does not hold for federated learning setups. Under this setting, neural network training performance may vary significantly according to the data distribution and even hurt training convergence. Most of the previous work has focused on a...	Meirui Jiang, Michael Kamp, Qi Dou, Xiaofei Zhang, Xiaoxiao Li
428	Learning Robust State Abstractions for Hidden-Parameter Block MDPs	Many control tasks exhibit similar dynamics that can be modeled as having common latent structure. Hidden-Parameter Markov Decision Processes (HiP-MDPs) explicitly model this structure to improve sample efficiency in multi-task settings. However, this setting makes strong assumptions on the observability of the state that limit its application in real-world scenarios with rich observation spaces. In this work, we leverage ideas of common structure from the HiP-MDP setting, and extend it to enable robust state abstractions inspired by...	Amy Zhang, Joelle Pineau, Khimya Khetarpal, Shagun Sodhani
429	Meta-Learning with Neural Tangent Kernels	Model Agnostic Meta-Learning (MAML) has emerged as a standard framework for meta-learning, where a meta-model is learned with the ability of fast adapting to new tasks. However, as a double-looped optimization problem, MAML needs to differentiate through the whole inner-loop optimization path for every outer-loop training step, which may lead to both computational inefficiency and sub-optimal solutions. In this paper, we generalize MAML to allow meta-learning to be defined in function spaces, and propose the first meta-learning...	Changyou Chen, Jiayi Xian, Jinhui Xu, Yufan Zhou, Zhenyi Wang
430	Continual learning in recurrent neural networks	While a diverse collection of continual learning (CL) methods has been proposed to prevent catastrophic forgetting, a thorough investigation of their effectiveness for processing sequential data with recurrent neural networks (RNNs) is lacking. Here, we provide the first comprehensive evaluation of established CL methods on a variety of sequential data benchmarks. Specifically, we shed light on the particularities that arise when applying weight-importance methods, such as elastic weight consolidation, to RNNs. In contrast to...	Alexander Meulemans, Benjamin Ehret, Benjamin F. Grewe, Christian Henning, Johannes von Oswald, Maria R. Cervera
431	A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention	We address the problem of learning on sets of features, motivated by the need of performing pooling operations in long biological sequences of varying sizes, with long-range dependencies, and possibly few labeled data. To address this challenging task, we introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference. Our approach scales to large datasets and allows end-to-end training of the...	Alexandre d'Aspremont, Dexiong Chen, Grégoire Mialon, Julien Mairal
432	Learning "What-if" Explanations for Sequential Decision-Making	Building interpretable parameterizations of real-world decision-making on the basis of demonstrated behavior--i.e. trajectories of observations and actions made by an expert maximizing some unknown reward function--is essential for introspecting and auditing policies in different institutions. In this paper, we propose learning explanations of expert decisions by modeling their reward function in terms of preferences with respect to "what if" outcomes: Given the current history of observations, what would happen if we took a...	Alihan Hüyük, Daniel Jarrett, Ioana Bica, Mihaela van der Schaar
433	Improving Transformation Invariance in Contrastive Representation Learning	We propose methods to strengthen the invariance properties of representations obtained by contrastive learning. While existing approaches implicitly induce a degree of invariance as representations are learned, we look to more directly enforce invariance in the encoding process. To this end, we first introduce a training objective for contrastive learning that uses a novel regularizer to control how the representation changes under transformation. We show that representations trained with this objective perform better on downstream...	Adam Foster, Rattana Pukdee, Tom Rainforth
434	Shapley explainability on the data manifold	Explainability in AI is crucial for model development, compliance with regulation, and providing operational nuance to predictions. The Shapley framework for explainability attributes a model’s predictions to its input features in a mathematically principled and model-agnostic way. However, general implementations of Shapley explainability make an untenable assumption: that the model’s features are uncorrelated. In this work, we demonstrate unambiguous drawbacks of this assumption and develop two solutions to Shapley explainability...	Christopher Frye, Damien de Mijolla, Ilya Feige, Laurence Cowton, Megan Stanley, Tom Begley
435	Noise or Signal: The Role of Image Backgrounds in Object Recognition	We assess the tendency of state-of-the-art object recognition models to depend on signals from image backgrounds. We create a toolkit for disentangling foreground and background signal on ImageNet images, and find that (a) models can achieve non-trivial accuracy by relying on the background alone, (b) models often misclassify images even in the presence of correctly classified foregrounds--up to 88% of the time with adversarially chosen backgrounds, and (c) more accurate models tend to depend on backgrounds less. Our analysis of...	Aleksander Madry, Andrew Ilyas, Kai Yuanqing Xiao, Logan Engstrom
436	Enjoy Your Editing: Controllable GANs for Image Editing via Latent Space Navigation	Controllable semantic image editing enables a user to change entire image attributes with a few clicks, e.g., gradually making a summer scene look like it was taken in winter. Classic approaches for this task use a Generative Adversarial Net (GAN) to learn a latent space and suitable latent-space transformations. However, current approaches often suffer from attribute edits that are entangled, global image identity changes, and diminished photo-realism. To address these concerns, we learn multiple attribute transformations...	Alexander G. Schwing, Oluwasanmi Koyejo, Peiye Zhuang
437	Perceptual Adversarial Robustness: Defense Against Unseen Threat Models	A key challenge in adversarial robustness is the lack of a precise mathematical characterization of human perception, used in the definition of adversarial attacks that are imperceptible to human eyes. Most current attacks and defenses try to get around this issue by considering restrictive adversarial threat models such as those bounded by $L_2$ or $L_\infty$ distance, spatial perturbations, etc. However, models that are robust against any of these restrictive threat models are still fragile against other threat models, i.e. they have...	Cassidy Laidlaw, Sahil Singla, Soheil Feizi
438	Zero-Cost Proxies for Lightweight NAS	Neural Architecture Search (NAS) is quickly becoming the standard methodology to design neural network models. However, NAS is typically compute-intensive because multiple models need to be evaluated before choosing the best one. To reduce the computational power and time needed, a proxy task is often used for evaluating each model instead of full training. In this paper, we evaluate conventional reduced-training proxies and quantify how well they preserve ranking between neural network models during search when compared with the...	Abhinav Mehrotra, Lukasz Dudziak, Mohamed S. Abdelfattah, Nicholas Donald Lane
439	Usable Information and Evolution of Optimal Representations During Training	We introduce a notion of usable information contained in the representation learned by a deep network, and use it to study how optimal representations for the task emerge during training. We show that the implicit regularization coming from training with Stochastic Gradient Descent with a high learning-rate and small batch size plays an important role in learning minimal sufficient representations for the task. In the process of arriving at a minimal sufficient representation, we find that the content of the representation changes...	Alessandro Achille, Daksh Idnani, Jonathan C. Kao, Michael Kleinman
440	Exploring the Uncertainty Properties of Neural Networks' Implicit Priors in the Infinite-Width Limit	Modern deep learning models have achieved great success in predictive accuracy for many data modalities. However, their application to many real-world tasks is restricted by poor uncertainty estimates, such as overconfidence on out-of-distribution (OOD) data and ungraceful failing under distributional shift. Previous benchmarks have found that ensembles of neural networks (NNs) are typically the best calibrated models on OOD data. Inspired by this, we leverage recent theoretical advances that characterize the function-space prior of an...	Ben Adlam, Jaehoon Lee, Jasper Snoek, Jeffrey Pennington, Lechao Xiao
441	On the geometry of generalization and memorization in deep neural networks	Understanding how large neural networks avoid memorizing training data is key to explaining their high generalization performance. To examine the structure of when and where memorization occurs in a deep network, we use a recently developed replica-based mean field theoretic geometric analysis method. We find that all layers preferentially learn from examples which share features, and link this behavior to generalization performance. Memorization predominately occurs in the deeper layers, due to decreasing object manifolds’ radius and...	Abhinav Ganesh, Cory Stephenson, Hanlin Tang, Suchismita Padhy, SueYeon Chung, Yue Hui
442	Deep Partition Aggregation: Provable Defenses against General Poisoning Attacks	Adversarial poisoning attacks distort training data in order to corrupt the test-time behavior of a classifier. A provable defense provides a certificate for each test sample, which is a lower bound on the magnitude of any adversarial distortion of the training set that can corrupt the test sample's classification. We propose two novel provable defenses against poisoning attacks: (i) Deep Partition Aggregation (DPA), a certified defense against a general poisoning threat model, defined as the insertion or deletion of a bounded number...	Alexander Levine, Soheil Feizi
443	DC3: A learning method for optimization with hard constraints	Large optimization problems with hard constraints arise in many settings, yet classical solvers are often prohibitively slow, motivating the use of deep networks as cheap "approximate solvers." Unfortunately, naive deep learning approaches typically cannot enforce the hard constraints of such problems, leading to infeasible solutions. In this work, we present Deep Constraint Completion and Correction (DC3), an algorithm to address this challenge. Specifically, this method enforces feasibility via a differentiable procedure, which...	David Rolnick, J. Zico Kolter, Priya L. Donti
444	Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study	This work aims to empirically clarify a recently discovered perspective that label smoothing is incompatible with knowledge distillation. We begin by introducing the motivation behind on how this incompatibility is raised, i.e., label smoothing erases relative information between teacher logits. We provide a novel connection on how label smoothing affects distributions of semantically similar and dissimilar classes. Then we propose a metric to quantitatively measure the degree of erased information in sample's representation. After...	Dejia Xu, KwangTing Cheng, Marios Savvides, Zechun Liu, Zhiqiang Shen, Zitian Chen
445	Shape-Texture Debiased Neural Network Training	Shape and texture are two prominent and complementary cues for recognizing objects. Nonetheless, Convolutional Neural Networks are often biased towards either texture or shape, depending on the training dataset. Our ablation shows that such bias degenerates model performance. Motivated by this observation, we develop a simple algorithm for shape-texture debiased learning. To prevent models from exclusively attending on a single cue in representation learning, we augment training data with images with conflicting shape and texture...	Alan L. Yuille, Cihang Xie, Jieru Mei, Mingxing Tan, Peng Tang, Qihang Yu, Wei Shen, Yingwei Li
446	Using latent space regression to analyze and leverage compositionality in GANs	In recent years, Generative Adversarial Networks have become ubiquitous in both research and public perception, but how GANs convert an unstructured latent code to a high quality output is still an open question. In this work, we investigate regression into the latent space as a probe to understand the compositional properties of GANs. We find that combining the regressor and a pretrained generator provides a strong image prior, allowing us to create composite images from a collage of random image parts at inference time while...	Jonas Wulff, Lucy Chai, Phillip Isola
447	Blending MPC & Value Function Approximation for Efficient Reinforcement Learning	Model-Predictive Control (MPC) is a powerful tool for controlling complex, real-world systems that uses a model to make predictions about future behavior. For each state encountered, MPC solves an online optimization problem to choose a control action that will minimize future cost. This is a surprisingly effective strategy, but real-time performance requirements warrant the use of simple models. If the model is not sufficiently accurate, then the resulting controller can be biased, limiting performance. We present a framework for...	Byron Boots, Mohak Bhardwaj, Sanjiban Choudhury
448	Model Patching: Closing the Subgroup Performance Gap with Data Augmentation	Classifiers in machine learning are often brittle when deployed. Particularly concerning are models with inconsistent performance on specific subgroups of a class, e.g., exhibiting disparities in skin cancer classification in the presence or absence of a spurious bandage. To mitigate these performance differences, we introduce model patching, a two-stage framework for improving robustness that encourages the model to be invariant to subgroup differences, and focus on class information shared by subgroups. Model patching first models...	Albert Gu, Christopher Ré, Karan Goel, Sharon Li
449	Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds	Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies. Therefore, OPE is a key step in applying reinforcement learning to real-world domains such as medical treatment, where interactive data collection is expensive or even unsafe. As the observed data tends to be noisy and limited, it is essential to provide rigorous uncertainty quantification, not just a point estimation, when applying OPE to make high stakes decisions. This work...	Na Zhang, Qiang Liu, Yihao Feng, Ziyang Tang
450	Linear Mode Connectivity in Multitask and Continual Learning	Continual (sequential) training and multitask (simultaneous) training are often attempting to solve the same overall objective: to find a solution that performs well on all considered tasks. The main difference is in the training regimes, where continual learning can only have access to one task at a time, which for neural networks typically leads to catastrophic forgetting. That is, the solution found for a subsequent task does not perform well on the previous ones anymore. However, the relationship between the different minima that...	Dilan Görür, Hassan Ghasemzadeh, Mehrdad Farajtabar, Razvan Pascanu, SeyedIman Mirzadeh
451	Robust and Generalizable Visual Representation Learning via Random Convolutions	While successful for various computer vision tasks, deep neural networks have shown to be vulnerable to texture style shifts and small perturbations to which humans are robust. In this work, we show that the robustness of neural networks can be greatly improved through the use of random convolutions as data augmentation. Random convolutions are approximately shape-preserving and may distort local textures. Intuitively, randomized convolutions create an infinite number of new domains with similar global shapes but random local texture....	Colin Raffel, Deyi Liu, Junlin Yang, Marc Niethammer, Zhenlin Xu
452	Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures	Proteins perform a large variety of functions in living organisms and thus play a key role in biology. However, commonly used algorithms in protein representation learning were not specifically designed for protein data, and are therefore not able to capture all relevant structural levels of a protein during learning. To fill this gap, we propose two new learning operators, specifically designed to process protein structures. First, we introduce a novel convolution operator that considers the primary, secondary, and tertiary structure...	Barbora Kozlíková, Gloria Fackelmann, Marco Schäfer, Matej Lang, Michael Krone, Pedro Hermosilla, PerePau Vázquez, Timo Ropinski, Tobias Ritschel
453	Variational State-Space Models for Localisation and Dense 3D Mapping in 6 DoF	We solve the problem of 6-DoF localisation and 3D dense reconstruction in spatial environments as approximate Bayesian inference in a deep state-space model. Our approach leverages both learning and domain knowledge from multiple-view geometry and rigid-body dynamics. This results in an expressive predictive model of the world, often missing in current state-of-the-art visual SLAM solutions. The combination of variational inference, neural networks and a differentiable raycaster ensures that our model is amenable to end-to-end...	Atanas Mirchev, Baris Kayalibay, Justin Bayer, Patrick van der Smagt
454	AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights	Normalization techniques, such as batch normalization (BN), are a boon for modern deep learning. They let weights converge more quickly with often better generalization performances. It has been argued that the normalization-induced scale invariance among the weights provides an advantageous ground for gradient descent (GD) optimizers: the effective step sizes are automatically reduced over time, stabilizing the overall training procedure. It is often overlooked, however, that the additional introduction of momentum in GD optimizers...	Byeongho Heo, Dongyoon Han, Gyuwan Kim, JungWoo Ha, Sangdoo Yun, Sanghyuk Chun, Seong Joon Oh, Youngjung Uh
455	MiCE: Mixture of Contrastive Experts for Unsupervised Image Clustering	We present Mixture of Contrastive Experts (MiCE), a unified probabilistic clustering framework that simultaneously exploits the discriminative representations learned by contrastive learning and the semantic structures captured by a latent mixture model. Motivated by the mixture of experts, MiCE employs a gating function to partition an unlabeled dataset into subsets according to the latent semantics and multiple experts to discriminate distinct subsets of instances assigned to them in a contrastive learning manner. To solve the...	Chongxuan Li, Jun Zhu, Tsung Wei Tsai
456	HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents	Motion forecasting is essential for making intelligent decisions in robotic navigation. As a result, the multi-agent behavioral prediction has become a core component of modern human-robot interaction applications such as autonomous driving. Due to various intentions and interactions among agents, agent trajectories can have multiple possible futures. Hence, the motion forecasting model's ability to cover possible modes becomes essential to enable accurate prediction. Towards this goal, we introduce HalentNet to better model the future...	Deyao Zhu, Li Erran Li, Mohamed Elhoseiny, Mohamed Zahran
457	Model-based micro-data reinforcement learning: what are the crucial model properties and which model to choose?	We contribute to micro-data model-based reinforcement learning (MBRL) by rigorously comparing popular generative models using a fixed (random shooting) control agent. We find that on an environment that requires multimodal posterior predictives, mixture density nets outperform all other models by a large margin. When multimodality is not required, our surprising finding is that we do not need probabilistic posterior predictives: deterministic models are on par, in fact they consistently (although non-significantly) outperform their...	Albert Thomas, Balázs Kégl, Gabriel Hurtado
458	Private Image Reconstruction from System Side Channels Using Generative Models	System side channels denote effects imposed on the underlying system and hardware when running a program, such as its accessed CPU cache lines. Side channel analysis (SCA) allows attackers to infer program secrets based on observed side channel signals. Given the ever-growing adoption of machine learning as a service (MLaaS), image analysis software on cloud platforms has been exploited by reconstructing private user images from system side channels. Nevertheless, to date, SCA is still highly challenging, requiring technical knowledge...	Junping Zhang, Shuai Wang, Yuanyuan Yuan
459	Contextual Transformation Networks for Online Continual Learning	Continual learning methods with fixed architectures rely on a single network to learn models that can perform well on all tasks. As a result, they often only accommodate common features of those tasks but neglect each task's specific features. On the other hand, dynamic architecture methods can have a separate network for each task, but they are too expensive to train and not scalable in practice, especially in online settings. To address this problem, we propose a novel online continual learning method named "Contextual...	Chenghao Liu, Doyen Sahoo, Quang Pham, Steven C. H. Hoi
460	A Unified Approach to Interpreting and Boosting Adversarial Transferability	In this paper, we use the interaction inside adversarial perturbations to explain and boost the adversarial transferability. We discover and prove the negative correlation between the adversarial transferability and the interaction inside adversarial perturbations. The negative correlation is further verified through different DNNs with various inputs. Moreover, this negative correlation can be regarded as a unified perspective to understand current transferability-boosting methods. To this end, we prove that some classic methods of...	Jie Ren, Quanshi Zhang, Shuyun Lin, Xiangming Zhu, Xin Wang, Yisen Wang
461	The inductive bias of ReLU networks on orthogonally separable data	We study the inductive bias of two-layer ReLU networks trained by gradient flow. We identify a class of easy-to-learn ('orthogonally separable') datasets, and characterise the solution that ReLU networks trained on such datasets converge to. Irrespective of network width, the solution turns out to be a combination of two max-margin classifiers: one corresponding to the positive data subset and one corresponding to the negative data subset. The proof is based on the recently introduced concept of extremal sectors, for which we prove a...	Christoph H. Lampert, Mary Phuong
462	A statistical theory of cold posteriors in deep neural networks	To get Bayesian neural networks to perform comparably to standard neural networks it is usually necessary to artificially reduce uncertainty using a tempered or cold posterior. This is extremely concerning: if the prior is accurate, Bayes inference/decision theory is optimal, and any artificial changes to the posterior should harm performance. While this suggests that the prior may be at fault, here we argue that in fact, BNNs for image classification use the wrong likelihood. In particular, standard image benchmark datasets such as...	Laurence Aitchison
463	IOT: Instance-wise Layer Reordering for Transformer Structures	With sequentially stacked self-attention, (optional) encoder-decoder attention, and feed-forward layers, Transformer achieves big success in natural language processing (NLP), and many variants have been proposed. Currently, almost all these models assume that the layer order is fixed and kept the same across data samples. We observe that different data samples actually favor different orders of the layers. Based on this observation, in this work, we break the assumption of the fixed layer order in Transformer and introduce...	Houqiang Li, Jinhua Zhu, Lijun Wu, Shufang Xie, Tao Qin, TieYan Liu, Wengang Zhou, Yingce Xia
464	Counterfactual Generative Networks	Neural networks are prone to learning shortcuts -- they often model simple correlations, ignoring more complex ones that potentially generalize better. Prior works on image classification show that instead of learning a connection to object shape, deep classifiers tend to exploit spurious correlations with low-level texture or the background for solving the classification task. In this work, we take a step towards more robust and interpretable classifiers that explicitly expose the task's causal structure. Building on current advances...	Andreas Geiger, Axel Sauer
465	Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data	Multi-Task Learning (MTL) networks have emerged as a promising method for transferring learned knowledge across different tasks. However, MTL must deal with challenges such as: overfitting to low resource tasks, catastrophic forgetting, and negative task transfer, or learning interference. Often, in Natural Language Processing (NLP), a separate model per task is needed to obtain the best performance. However, many fine-tuning approaches are both parameter inefficient, i.e., potentially involving one new model per task, and highly...	Amine Elhattami, Christopher J. Pal, Jonathan Pilault
466	Towards Impartial Multi-task Learning	Multi-task learning (MTL) has been widely used in representation learning. However, naively training all tasks simultaneously may lead to the partial training issue, where specific tasks are trained more adequately than others. In this paper, we propose to learn multiple tasks impartially. Specifically, for the task-shared parameters, we optimize the scaling factors via a closed-form solution, such that the aggregated gradient (sum of raw gradients weighted by the scaling factors) has equal projections onto individual tasks. For the...	JingHao Xue, Liyang Liu, Qingmin Liao, Wayne Zhang, Wenming Yang, Yi Li, Yimin Chen, Zhanghui Kuang
467	Theoretical bounds on estimation error for meta-learning	Machine learning models have traditionally been developed under the assumption that the training and test distributions match exactly. However, recent success in few-shot learning and related problems are encouraging signs that these models can be adapted to more realistic settings where train and test distributions differ. Unfortunately, there is severely limited theoretical support for these algorithms and little is known about the difficulty of these problems. In this work, we provide novel information-theoretic lower-bounds on...	Irene Raissa Kameni, James Lucas, Mengye Ren, Richard S. Zemel, Toniann Pitassi
468	Domain-Robust Visual Imitation Learning with Mutual Information Constraints	Human beings are able to understand objectives and learn by simply observing others perform a task. Imitation learning methods aim to replicate such capabilities, however, they generally depend on access to a full set of optimal states and actions taken with the agent's actuators and from the agent's point of view. In this paper, we introduce a new algorithm - called Disentangling Generative Adversarial Imitation Learning (DisentanGAIL) - with the purpose of bypassing such constraints. Our algorithm enables autonomous agents to learn...	Edoardo Cetin, Oya Çeliktutan
469	Unsupervised Representation Learning for Time Series with Temporal Neighborhood Coding	Time series are often complex and rich in information but sparsely labeled and therefore challenging to model. In this paper, we propose a self-supervised framework for learning robust and generalizable representations for time series. Our approach, called Temporal Neighborhood Coding (TNC), takes advantage of the local smoothness of a signal's generative process to define neighborhoods in time with stationary properties. Using a debiased contrastive objective, our framework learns time series representations by ensuring that in the...	Anna Goldenberg, Danny Eytan, Sana Tonekaboni
470	Enforcing robust control guarantees within neural network policies	When designing controllers for safety-critical systems, practitioners often face a challenging tradeoff between robustness and performance. While robust control methods provide rigorous guarantees on system stability under certain worst-case disturbances, they often yield simple controllers that perform poorly in the average (non-worst) case. In contrast, nonlinear control methods trained using deep learning have achieved state-of-the-art performance on many control tasks, but often lack robustness guarantees. In this paper, we propose...	J. Zico Kolter, Mahyar Fazlyab, Melrose Roderick, Priya L. Donti
471	Active Contrastive Learning of Audio-Visual Video Representations	Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing the lower bound on the mutual information (MI) between different views of an instance. However, obtaining a tight lower bound requires a sample size exponential in MI and thus a large set of negative samples. We can incorporate more samples by building a large queue-based dictionary, but there are theoretical limits to performance improvements even with a large number of negative samples. We hypothesize that random...	Daniel McDuff, Shuang Ma, Yale Song, Zhaoyang Zeng
472	Parameter Efficient Multimodal Transformers for Video Representation Learning	The recent success of Transformers in the language domain has motivated adapting it to a multimodal setting, where a new visual model is trained in tandem with an already pretrained language model. However, due to the excessive memory requirements from Transformers, existing work typically fixes the language model and train only the vision module, which limits its ability to learn cross-modal information in an end-to-end manner. In this work, we focus on reducing the parameters of multimodal Transformers in the context of audio-visual...	Gunhee Kim, Jan Kautz, Sangho Lee, Thomas M. Breuel, Yale Song, Youngjae Yu
473	Robust Pruning at Initialization	Overparameterized Neural Networks (NN) display state-of-the-art performance. However, there is a growing need for smaller, energy-efficient, neural networks to be able to use machine learning applications on devices with limited computational resources. A popular approach consists of using pruning techniques. While these techniques have traditionally focused on pruning pre-trained NN (LeCun et al., 1990; Hassibi et al., 1993), recent work by Lee et al. (2018) has shown promising results when pruning at initialization. However, for Deep...	Arnaud Doucet, JeanFrancois Ton, Soufiane Hayou, Yee Whye Teh
474	Efficient Wasserstein Natural Gradients for Reinforcement Learning	A novel optimization approach is proposed for application to policy gradient methods and evolution strategies for reinforcement learning (RL). The procedure uses a computationally efficient Wasserstein natural gradient (WNG) descent that takes advantage of the geometry induced by a Wasserstein penalty to speed optimization. This method follows the recent theme in RL of including divergence penalties in the objective to establish trust regions. Experiments on challenging tasks demonstrate improvements in both computational cost...	Arthur Gretton, Ferenc Huszar, Michael Arbel, Ted Moskovitz
475	Probing BERT in Hyperbolic Spaces	Recently, a variety of probing tasks are proposed to discover linguistic properties learned in contextualized word embeddings. Many of these works implicitly assume these embeddings lay in certain metric spaces, typically the Euclidean space. This work considers a family of geometrically special spaces, the hyperbolic spaces, that exhibit better inductive biases for hierarchical structures and may better reveal linguistic hierarchies encoded in contextualized representations. We introduce a Poincaré probe, a structural probe...	Boli Chen, Chuanqi Tan, Guangwei Xu, Liping Jing, Mosha Chen, Pengjun Xie, Yao Fu
476	On Fast Adversarial Robustness Adaptation in Model-Agnostic Meta-Learning	Model-agnostic meta-learning (MAML) has emerged as one of the most successful meta-learning techniques in few-shot learning. It enables us to learn a meta-initialization of model parameters (that we call meta-model) to rapidly adapt to new tasks using a small amount of labeled training data. Despite the generalization power of the meta-model, it remains elusive that how adversarial robustness can be maintained by MAML in few-shot learning. In addition to generalization, robustness is also desired for a...	Chuang Gan, Kaidi Xu, Meng Wang, PinYu Chen, Ren Wang, Sijia Liu, TsuiWei Weng
477	Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics	Catastrophic forgetting is a recurring challenge to developing versatile deep learning models. Despite its ubiquity, there is limited understanding of its connections to neural network (hidden) representations and task semantics. In this paper, we address this important knowledge gap. Through quantitative analysis of neural representations, we find that deeper layers are disproportionately responsible for forgetting, with sequential training resulting in an erasure of earlier task representational subspaces. Methods to mitigate...	Ethan Dyer, Maithra Raghu, Vinay Venkatesh Ramasesh
478	Trusted Multi-View Classification	Multi-view classification (MVC) generally focuses on improving classification accuracy by using information from different views, typically integrating them into a unified comprehensive representation for downstream tasks. However, it is also crucial to dynamically assess the quality of a view for different samples in order to provide reliable uncertainty estimations, which indicate whether predictions can be trusted. To this end, we propose a novel multi-view classification method, termed trusted multi-view classification, which...	Changqing Zhang, Huazhu Fu, Joey Tianyi Zhou, Zongbo Han
479	i-Mix: A Domain-Agnostic Strategy for Contrastive Representation Learning	Contrastive representation learning has shown to be effective to learn representations from unlabeled data. However, much progress has been made in vision domains relying on data augmentations carefully designed using domain knowledge. In this work, we propose i-Mix, a simple yet effective domain-agnostic regularization strategy for improving contrastive representation learning. We cast contrastive learning as training a non-parametric classifier by assigning a unique virtual class to each data in a batch. Then, data instances are...	ChunLiang Li, Honglak Lee, Jinwoo Shin, Kibok Lee, Kihyuk Sohn, Yian Zhu
480	Initialization and Regularization of Factorized Neural Layers	Factorized layers—operations parameterized by products of two or more matrices—occur in a variety of deep learning contexts, including compressed model training, certain types of knowledge distillation, and multi-head self-attention architectures. We study how to initialize and regularize deep nets containing such layers, examining two simple, understudied schemes, spectral initialization and Frobenius decay, for improving their performance. The guiding insight is to design optimization routines for these networks that are as close as...	Lester Mackey, Mikhail Khodak, Neil A. Tenenholtz, Nicolò Fusi
481	Learning to Generate 3D Shapes with Generative Cellular Automata	In this work, we present a probabilistic 3D generative model, named Generative Cellular Automata, which is able to produce diverse and high quality shapes. We formulate the shape generation process as sampling from the transition kernel of a Markov chain, where the sampling chain eventually evolves to the full shape of the learned distribution. The transition kernel employs the local update rules of cellular automata, effectively reducing the search space in a high-resolution 3D grid space by exploiting the connectivity and sparsity of...	Changwoon Choi, Dongsu Zhang, Jeonghwan Kim, Young Min Kim
482	Self-Supervised Learning of Compressed Video Representations	Self-supervised learning of video representations has received great attention. Existing methods typically require frames to be decoded before being processed, which increases compute and storage requirements and ultimately hinders large-scale training. In this work, we propose an efficient self-supervised approach to learn video representations by eliminating the expensive decoding step. We use a three-stream video architecture that encodes I-frames and P-frames of a compressed video. Unlike existing approaches that encode I-frames...	Gunhee Kim, Sangho Lee, Yale Song, Youngjae Yu
483	Factorizing Declarative and Procedural Knowledge in Structured, Dynamical Environments	Modeling a structured, dynamic environment like a video game requires keeping track of the objects and their states (declarative knowledge) as well as predicting how objects behave (procedural knowledge). Black-box models with a monolithic hidden state often fail to apply procedural knowledge consistently and uniformly, i.e., they lack systematicity. For example, in a video game, correct prediction of one enemy's trajectory does not ensure correct prediction of another's. We address this issue via an architecture that factorizes...	Alex Lamb, Anirudh Goyal, Charles Blundell, Michael Curtis Mozer, Phanideep Gampa, Philippe Beaudoin, Sergey Levine, Yoshua Bengio
484	Cut out the annotator, keep the cutout: better segmentation with weak supervision	Constructing large, labeled training datasets for segmentation models is an expensive and labor-intensive process. This is a common challenge in machine learning, addressed by methods that require few or no labeled data points such as few-shot learning (FSL) and weakly-supervised learning (WS). Such techniques, however, have limitations when applied to image segmentation---FSL methods often produce noisy results and are strongly dependent on which few datapoints are labeled, while WS models struggle to fully exploit rich image...	Christopher Ré, Curtis P. Langlotz, Frederic Sala, Hui Xue, Michael Wornow, Peter Kellman, Sarah M. Hooper, Ying Hang Seah
485	FastSpeech 2: Fast and High-Quality End-to-End Text to Speech	Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech has several...	Chenxu Hu, Sheng Zhao, Tao Qin, TieYan Liu, Xu Tan, Yi Ren, Zhou Zhao
486	On Learning Universal Representations Across Languages	Recent studies have demonstrated the overwhelming advantage of cross-lingual pre-trained models (PTMs), such as multilingual BERT and XLM, on cross-lingual NLP tasks. However, existing approaches essentially capture the co-occurrence among tokens through involving the masked language model (MLM) objective with token-level cross entropy. In this work, we extend these approaches to learn sentence-level representations and show the effectiveness on cross-lingual understanding and generation. Specifically, we propose a Hierarchical...	Heng Yu, Luxi Xing, Rongxiang Weng, Weihua Luo, Xiangpeng Wei, Yue Hu
487	Effective Distributed Learning with Random Features: Improved Bounds and Algorithms	In this paper, we study the statistical properties of distributed kernel ridge regression together with random features (DKRR-RF), and obtain optimal generalization bounds under the basic setting, which can substantially relax the restriction on the number of local machines in the existing state-of-art bounds. Specifically, we first show that the simple combination of divide-and-conquer technique and random features can achieve the same statistical accuracy as the exact KRR in expectation requiring only $\mathcal{O}(|\mathcal{D}|)$...	Jiankun Liu, Shuqiang Wang, Yong Liu
488	Multi-Class Uncertainty Calibration via Mutual Information Maximization-based Binning	Post-hoc multi-class calibration is a common approach for providing high-quality confidence estimates of deep neural network predictions. Recent work has shown that widely used scaling methods underestimate their calibration error, while alternative Histogram Binning (HB) methods often fail to preserve classification accuracy. When classes have small prior probabilities, HB also faces the issue of severe sample-inefficiency after the conversion into K one-vs-rest class-wise calibration problems. The goal of this paper is to resolve the...	Bin Yang, Dan Zhang, Kanil Patel, Michael Pfeiffer, William H. Beluch
489	Neural ODE Processes	Neural Ordinary Differential Equations (NODEs) use a neural network to model the instantaneous rate of change in the state of a system. However, despite their apparent suitability for dynamics-governed time-series, NODEs present a few disadvantages. First, they are unable to adapt to incoming data-points, a fundamental requirement for real-time applications imposed by the natural direction of time. Second, time-series are often composed of a sparse set of measurements that could be explained by many possible underlying dynamics. NODEs...	Alexander Norcliffe, Ben Day, Cristian Bodnar, Jacob Moss, Pietro Liò
490	Conformation-Guided Molecular Representation with Hamiltonian Neural Networks	Well-designed molecular representations (fingerprints) are vital to combine medical chemistry and deep learning. Whereas incorporating 3D geometry of molecules (i.e. conformations) in their representations seems beneficial, current 3D algorithms are still in infancy. In this paper, we propose a novel molecular representation algorithm which preserves 3D conformations of molecules with a Molecular Hamiltonian Network (HamNet). In HamNet, implicit positions and momentums of atoms in a molecule interact in the Hamiltonian Engine following...	Guojie Song, Lingsheng Cai, Shuwen Yang, Ziyao Li
491	An Unsupervised Deep Learning Approach for Real-World Image Denoising	Designing an unsupervised image denoising approach in practical applications is a challenging task due to the complicated data acquisition process. In the real-world case, the noise distribution is so complex that the simplified additive white Gaussian (AWGN) assumption rarely holds, which significantly deteriorates the Gaussian denoisers' performance. To address this problem, we apply a deep neural network that maps the noisy image into a latent space in which the AWGN assumption holds, and thus any existing Gaussian denoiser is...	Chenglong Bao, Dihan Zheng, Kaisheng Ma, Sia Huat Tan, Xiaowen Zhang, Zuoqiang Shi
492	Uncertainty in Gradient Boosting via Ensembles	For many practical, high-risk applications, it is essential to quantify uncertainty in a model's predictions to avoid costly mistakes. While predictive uncertainty is widely studied for neural networks, the topic seems to be under-explored for models based on gradient boosting. However, gradient boosting often achieves state-of-the-art results on tabular data. This work examines a probabilistic ensemble-based framework for deriving uncertainty estimates in the predictions of gradient boosting classification and regression models. We...	Aleksei Ustimenko, Andrey Malinin, Liudmila Prokhorenkova
493	Lossless Compression of Structured Convolutional Models via Lifting	Lifting is an efficient technique to scale up graphical models generalized to relational domains by exploiting the underlying symmetries. Concurrently, neural models are continuously expanding from grid-like tensor data into structured representations, such as various attributed graphs and relational databases. To address the irregular structure of the data, the models typically extrapolate on the idea of convolution, effectively introducing parameter sharing in their, dynamically unfolded, computation graphs. The computation graphs...	Filip Zelezný, Gustav Sourek, Ondrej Kuzelka
494	Neural networks with late-phase weights	The largely successful method of training neural networks is to learn their weights using some variant of stochastic gradient descent (SGD). Here, we show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning. At the end of learning, we obtain back a single model by taking a spatial average in weight space. To avoid incurring increased computational costs, we investigate a family of low-dimensional late-phase weight models which interact multiplicatively with the...	Alexander Meulemans, Benjamin F. Grewe, Christian Henning, Johannes von Oswald, João Sacramento, Seijin Kobayashi
495	Disambiguating Symbolic Expressions in Informal Documents	We propose the task of $\textit{disambiguating}$ symbolic expressions in informal STEM documents in the form of LaTeX files -- that is, determining their precise semantics and abstract syntax tree -- as a neural machine translation task. We discuss the distinct challenges involved and present a dataset with roughly 33,000 entries. We evaluated several baseline models on this dataset, which failed to yield even syntactically valid LaTeX before overfitting. Consequently, we describe a methodology using a $\textit{transformer}$ language model...	Cezary Kaliszyk, Dennis Müller
496	Learning Parametrised Graph Shift Operators	In many domains data is currently represented as graphs and therefore, the graph representation of this data becomes increasingly important in machine learning. Network data is, implicitly or explicitly, always represented using a graph shift operator (GSO) with the most common choices being the adjacency, Laplacian matrices and their normalisations. In this paper, a novel parametrised GSO (PGSO) is proposed, where specific parameter values result in the most commonly used GSOs and message-passing operators in graph neural network...	George Dasoulas, Johannes F. Lutzeyer, Michalis Vazirgiannis
497	Efficient Conformal Prediction via Cascaded Inference with Expanded Admission	In this paper, we present a novel approach for conformal prediction (CP), in which we aim to identify a set of promising prediction candidates---in place of a single prediction. This set is guaranteed to contain a correct answer with high probability, and is well-suited for many open-ended classification tasks. In the standard CP paradigm, the predicted set can often be unusably large and also costly to obtain. This is particularly pervasive in settings where the correct answer is not unique, and the number of total possible answers is...	Adam Fisch, Regina Barzilay, Tal Schuster, Tommi S. Jaakkola
498	GANs Can Play Lottery Tickets Too	Deep generative adversarial networks (GANs) have gained growing popularity in numerous scenarios, while usually suffer from high parameter complexities for resource-constrained real-world applications. However, the compression of GANs has less been explored. A few works show that heuristically applying compression techniques normally leads to unsatisfactory results, due to the notorious training instability of GANs. In parallel, the lottery ticket hypothesis shows prevailing success on discriminative models, in locating sparse matching...	Tianlong Chen, Xuxi Chen, Yongduo Sui, Zhenyu Zhang
499	ResNet After All: Neural ODEs and Their Numerical Solution	A key appeal of the recently proposed Neural Ordinary Differential Equation (ODE) framework is that it seems to provide a continuous-time extension of discrete residual neural networks. As we show herein, though, trained Neural ODE models actually depend on the specific numerical method used during training. If the trained model is supposed to be a flow generated from an ODE, it should be possible to choose another numerical solver with equal or smaller numerical error without loss of performance. We observe that if training relies on...	Katharina Ott, Michael Tiemann, Philipp Hennig, Prateek Katiyar
500	Semantic Re-tuning with Contrastive Tension	Extracting semantically useful natural language sentence representations from pre-trained deep neural networks such as Transformers remains a challenge. We first demonstrate that pre-training objectives impose a significant task bias onto the final layers of models with a layer-wise survey of the Semantic Textual Similarity (STS) correlations for multiple common Transformer language models. We then propose a new self-supervised method called Contrastive Tension (CT) to counter such biases. CT frames the training objective as a...	Amaru Cuba Gyllensten, Erik Ylipää Hellqvist, Evangelia Gogoulou, Fredrik Carlsson, Magnus Sahlgren
501	Property Controllable Variational Autoencoder via Invertible Mutual Dependence	Deep generative models have made important progress towards modeling complex, high dimensional data via learning latent representations. Their usefulness is nevertheless often limited by a lack of control over the generative process or a poor understanding of the latent representation. To overcome these issues, attention is now focused on discovering latent variables correlated to the data properties and ways to manipulate these properties. This paper presents the new Property controllable VAE (PCVAE), where a new Bayesian model is...	Liang Zhao, Xiaojie Guo, Yuanqi Du
502	Latent Convergent Cross Mapping	Discovering causal structures of temporal processes is a major tool of scientific inquiry because it helps us better understand and explain the mechanisms driving a phenomenon of interest, thereby facilitating analysis, reasoning, and synthesis for such systems. However, accurately inferring causal structures within a phenomenon based on observational data only is still an open problem. Indeed, this type of data usually consists in short time series with missing or noisy values for which causal inference is increasingly difficult. In...	Adam Arany, Edward De Brouwer, Jaak Simm, Yves Moreau
503	Adaptive Universal Generalized PageRank Graph Neural Network	In many important graph data processing applications the acquired information includes both node features and observations of the graph topology. Graph neural networks (GNNs) are designed to exploit both sources of evidence but they do not optimally trade-off their utility and integrate them in a manner that is also universal. Here, universality refers to independence on homophily or heterophily graph assumptions. We address these issues by introducing a new Generalized PageRank (GPR) GNN architecture that adaptively learns the GPR...	Eli Chien, Jianhao Peng, Olgica Milenkovic, Pan Li
504	Neural Learning of One-of-Many Solutions for Combinatorial Problems in Structured Output Spaces	Recent research has proposed neural architectures for solving combinatorial problems in structured output spaces. In many such problems, there may exist multiple solutions for a given input, e.g. a partially filled Sudoku puzzle may have many completions satisfying all constraints. Further, we are often interested in finding any "one" of the possible solutions, without any preference between them. Existing approaches completely ignore this solution multiplicity. In this paper, we argue that being oblivious to the presence of multiple...	Deepanshu Jindal, Mausam, Parag Singla, Yatin Nandwani
505	My Body is a Cage: the Role of Morphology in Graph-Based Incompatible Control	Multitask Reinforcement Learning is a promising way to obtain models with better performance, generalisation, data efficiency, and robustness. Most existing work is limited to compatible settings, where the state and action space dimensions are the same across tasks. Graph Neural Networks (GNN) are one way to address incompatible environments, because they can process graphs of arbitrary size. They also allow practitioners to inject biases encoded in the structure of the input graph. Existing work in graph-based continuous control uses...	Maximilian Igl, Shimon Whiteson, Tim Rocktäschel, Vitaly Kurin, Wendelin Boehmer
506	FedBE: Making Bayesian Model Ensemble Applicable to Federated Learning	Federated learning aims to collaboratively train a strong global model by accessing users' locally trained models but not their own data. A crucial step is therefore to aggregate local models into a global model, which has been shown challenging when users have non-i.i.d. data. In this paper, we propose a novel aggregation algorithm named FedBE, which takes a Bayesian inference perspective by sampling higher-quality global models and combining them via Bayesian model Ensemble, leading to much robust aggregation. We show that an...	HongYou Chen, WeiLun Chao
507	MALI: A memory efficient and reverse accurate integrator for Neural ODEs	Neural ordinary differential equations (Neural ODEs) are a new family of deep-learning models with continuous depth. However, the numerical estimation of the gradient in the continuous case is not well solved: existing implementations of the adjoint method suffer from inaccuracy in reverse-time trajectory, while the naive method and the adaptive checkpoint adjoint method (ACA) have a memory cost that grows with integration time. In this project, based on the asynchronous leapfrog (ALF) solver, we propose the Memory-efficient ALF...	James S. Duncan, Juntang Zhuang, Nicha C. Dvornek, Sekhar Tatikonda
508	Reducing the Computational Cost of Deep Generative Models with Binary Neural Networks	Deep generative models provide a powerful set of tools to understand real-world data. But as these models improve, they increase in size and complexity, so their computational cost in memory and execution time grows. Using binary weights in neural networks is one method which has shown promise in reducing this cost. However, whether binary neural networks can be used in generative models is an open problem. In this work we show, for the first time, that we can successfully train generative models which utilize binary neural networks....	David Barber, Friso H. Kingma, Thomas Bird
509	In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness	Consider a prediction setting with few in-distribution labeled examples and many unlabeled examples both in- and out-of-distribution (OOD). The goal is to learn a model which performs well both in-distribution and OOD. In these settings, auxiliary information is often cheaply available for every input. How should we best leverage this auxiliary information for the prediction task? Empirically across three image and time-series datasets, and theoretically in a multi-task linear regression setting, we show that (i) using auxiliary...	Ananya Kumar, Fereshte Khani, Percy Liang, Robbie Jones, Sang Michael Xie, Tengyu Ma
510	Incremental few-shot learning via vector quantization in deep embedded space	The capability of incrementally learning new tasks without forgetting old ones is a challenging problem due to catastrophic forgetting. This challenge becomes greater when novel tasks contain very few labelled training samples. Currently, most methods are dedicated to class-incremental learning and rely on sufficient training data to learn additional weights for newly added classes. Those methods cannot be easily extended to incremental regression tasks and could suffer from severe overfitting when learning few-shot novel tasks. In...	ChiGuhn Lee, Kuilin Chen
511	Contrastive Syn-to-Real Generalization	Training on synthetic data can be beneficial for label or data-scarce scenarios. However, synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work, we make a key observation that the diversity of the learned feature embeddings plays an important role in the generalization performance. To this end, we propose contrastive synthetic-to-real generalization (CSG), a novel framework that leverage the pre-trained ImageNet knowledge to prevent overfitting to the synthetic domain, while...	Anima Anandkumar, José M. Álvarez, Shalini De Mello, Sifei Liu, Wuyang Chen, Zhangyang Wang, Zhiding Yu
512	Remembering for the Right Reasons: Explanations Reduce Catastrophic Forgetting	The goal of continual learning (CL) is to learn a sequence of tasks without suffering from the phenomenon of catastrophic forgetting. Previous work has shown that leveraging memory in the form of a replay buffer can reduce performance degradation on prior tasks. We hypothesize that forgetting can be further reduced when the model is encouraged to remember the $\textit{evidence}$ for previously made decisions. As a first step towards exploring this hypothesis, we propose a simple novel training paradigm, called Remembering for the Right...	Akash Gokul, Joseph E. Gonzalez, Marcus Rohrbach, Sayna Ebrahimi, Suzanne Petryk, Trevor Darrell, William Gan
513	Fuzzy Tiling Activations: A Simple Approach to Learning Sparse Representations Online	Recent work has shown that sparse representations---where only a small percentage of units are active---can significantly reduce interference. Those works, however, relied on relatively complex regularization or meta-learning approaches, that have only been used offline in a pre-training phase. In this work, we pursue a direction that achieves sparsity by design, rather than by learning. Specifically, we design an activation function that produces sparse representations deterministically by construction, and so is more amenable to...	Kirby Banman, Martha White, Yangchen Pan
514	High-Capacity Expert Binary Networks	Network binarization is a promising hardware-aware direction for creating efficient deep models. Despite its memory and computational advantages, reducing the accuracy gap between binary models and their real-valued counterparts remains an unsolved challenging research problem. To this end, we make the following 3 contributions: (a) To increase model capacity, we propose Expert Binary Convolution, which, for the first time, tailors conditional computing to binary networks by learning to select one data-specific expert binary filter at...	Adrian Bulat, Brais Martínez, Georgios Tzimiropoulos
515	Learning What To Do by Simulating the Past	Since reward functions are hard to specify, recent work has focused on learning policies from human feedback. However, such approaches are impeded by the expense of acquiring such feedback. Recent work proposed that agents have access to a source of information that is effectively free: in any environment that humans have acted in, the state will already be optimized for human preferences, and thus an agent can extract information about what humans want from the state. Such learning is possible in principle, but requires simulating all...	Anca D. Dragan, David Lindner, Pieter Abbeel, Rohin Shah
516	Progressive Skeletonization: Trimming more fat from a network at initialization	Recent studies have shown that skeletonization (pruning parameters) of networks at initialization provides all the practical benefits of sparsity both at inference and training time, while only marginally degrading their performance. However, we observe that beyond a certain level of sparsity (approx 95%), these approaches fail to preserve the network performance, and to our surprise, in many cases perform even worse than trivial random pruning. To this end, we propose an objective to find a skeletonized network with maximum foresight...	Amartya Sanyal, Grégory Rogez, Harkirat S. Behl, Pau de Jorge, Philip H. S. Torr, Puneet K. Dokania
517	Filtered Inner Product Projection for Crosslingual Embedding Alignment	Due to widespread interest in machine translation and transfer learning, there are numerous algorithms for mapping multiple embeddings to a shared representation space. Recently, these algorithms have been studied in the setting of bilingual lexicon induction where one seeks to align the embeddings of a source and a target language such that translated word pairs lie close to one another in a common representation space. In this paper, we propose a method, Filtered Inner Product Projection (FIPP), for mapping embeddings to a common...	Chenguang Zhu, Vin Sachidananda, Ziyi Yang
518	Learning Manifold Patch-Based Representations of Man-Made Shapes	Choosing the right representation for geometry is crucial for making 3D models compatible with existing applications. Focusing on piecewise-smooth man-made shapes, we propose a new representation that is usable in conventional CAD modeling pipelines and can also be learned by deep neural networks. We demonstrate its benefits by applying it to the task of sketch-based modeling. Given a raster image, our system infers a set of parametric surfaces that realize the input in 3D. To capture piecewise smooth geometry, we learn a special shape...	Dmitriy Smirnov, Justin Solomon, Mikhail Bessmeltsev
519	Aligning AI With Shared Human Values	We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current...	Andrew Critch, Collin Burns, Dan Hendrycks, Dawn Song, Jacob Steinhardt, Jerry Li, Steven Basart
520	Kanerva++: Extending the Kanerva Machine With Differentiable, Locally Block Allocated Latent Memory	Episodic and semantic memory are critical components of the human memory model. The theory of complementary learning systems (McClelland et al., 1995) suggests that the compressed representation produced by a serial event (episodic memory) is later restructured to build a more generalized form of reusable knowledge (semantic memory). In this work, we develop a new principled Bayesian memory allocation scheme that bridges the gap between episodic and semantic memory via a hierarchical latent variable model. We take inspiration from...	Alexandros Kalousis, Jason Ramapuram, Yan Wu
521	Measuring Massive Multitask Language Understanding	We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial...	Andy Zou, Collin Burns, Dan Hendrycks, Dawn Song, Jacob Steinhardt, Mantas Mazeika, Steven Basart
522	Towards Robust Neural Networks via Close-loop Control	Despite their success in massive engineering applications, deep neural networks are vulnerable to various perturbations due to their black-box nature. Recent study has shown that a deep neural network can misclassify the data even if the input data is perturbed by an imperceptible amount. In this paper, we address the robustness issue of neural networks by a novel close-loop control method from the perspective of dynamic systems. Instead of modifying the parameters in a fixed neural network architecture, a close-loop control process is...	Qianxiao Li, Zheng Zhang, Zhuotong Chen
523	Statistical inference for individual fairness	As we rely on machine learning (ML) models to make more consequential decisions, the issue of ML models perpetuating unwanted social biases has come to the fore of the public's and the research community's attention. In this paper, we focus on the problem of detecting violations of individual fairness in ML models. We formalize the problem as measuring the susceptibility of ML models against a form of adversarial attack and develop a suite of inference tools for the adversarial loss. The tools allow practitioners to assess the...	Mikhail Yurochkin, Songkai Xue, Subha Maity, Yuekai Sun
524	HyperGrid Transformers: Towards A Single Model for Multiple Tasks	Achieving state-of-the-art performance on natural language understanding tasks typically relies on fine-tuning a fresh model for every task. Consequently, this approach leads to a higher overall parameter cost, along with higher technical maintenance for serving multiple models. Learning a single multi-task model that is able to do well for all the tasks has been a challenging and yet attractive proposition. In this paper, we propose HyperGrid Transformers, a new Transformer architecture that leverages task-conditioned hyper networks...	DaCheng Juan, Dara Bahri, Donald Metzler, Yi Tay, Zhe Zhao
525	Greedy-GQ with Variance Reduction: Finite-time Analysis and Improved Complexity	Greedy-GQ is a value-based reinforcement learning (RL) algorithm for optimal control. Recently, the finite-time analysis of Greedy-GQ has been developed under linear function approximation and Markovian sampling, and the algorithm is shown to achieve an $\epsilon$-stationary point with a sample complexity in the order of $\mathcal{O}(\epsilon^{-3})$. Such a high sample complexity is due to the large variance induced by the Markovian samples. In this paper, we propose a variance-reduced Greedy-GQ (VR-Greedy-GQ) algorithm for off-policy...	Shaocong Ma, Shaofeng Zou, Yi Zhou, Ziyi Chen
526	On InstaHide, Phase Retrieval, and Sparse Matrix Factorization	In this work, we examine the security of InstaHide, a scheme recently proposed by Huang et al. (2020) for preserving the security of private datasets in the context of distributed learning. To generate a synthetic training example to be shared among the distributed learners, InstaHide takes a convex combination of private feature vectors and randomly flips the sign of each entry of the resulting vector with probability 1/2. A salient question is whether this scheme is secure in any provable sense, perhaps under a plausible...	Danyang Zhuo, Sitan Chen, Xiaoxiao Li, Zhao Song
527	VA-RED2: Video Adaptive Redundancy Reduction	Performing inference on deep learning models for videos remains a challenge due to the large amount of computational resources required to achieve robust recognition. An inherent property of real-world videos is the high correlation of information across frames which can translate into redundancy in either temporal or spatial feature maps of the models, or both. The type of redundant features depends on the dynamics and type of events in the video: static videos have more temporal redundancy while videos focusing on objects tend to...	Alex J. Andonian, Aude Oliva, Bowen Pan, Camilo Luciano Fosco, ChungChing Lin, Kate Saenko, Rameswar Panda, Rogério Feris, Yue Meng
528	SEDONA: Search for Decoupled Neural Networks toward Greedy Block-wise Learning	Backward locking and update locking are well-known sources of inefficiency in backpropagation that prevent from concurrently updating layers. Several works have recently suggested using local error signals to train network blocks asynchronously to overcome these limitations. However, they often require numerous iterations of trial-and-error to find the best configuration for local training, including how to decouple network blocks and which auxiliary networks to use for each block. In this work, we propose a differentiable search...	Gunhee Kim, Jihwan Moon, Myeongjang Pyeon, Taeyoung Hahn
529	ALFWorld: Aligning Text and Embodied Environments for Interactive Learning	Given a simple request like "Put a washed apple in the kitchen fridge", humans can reason in purely abstract terms by imagining action sequences and scoring their likelihood of success, prototypicality, and efficiency, all without moving a muscle. Once we see the kitchen in question, we can update our abstract plans to fit the scene. Embodied agents require the same abilities, but existing work does not yet provide the infrastructure necessary for both reasoning abstractly and executing concretely. We address this limitation by...	Adam Trischler, Marc-Alexandre Côté, Matthew J. Hausknecht, Mohit Shridhar, Xingdi Yuan, Yonatan Bisk
530	Learning Task Decomposition with Ordered Memory Policy Network	Many complex real-world tasks are composed of several levels of subtasks. Humans leverage these hierarchical structures to accelerate the learning process and achieve better generalization. In this work, we study the inductive bias and propose Ordered Memory Policy Network (OMPN) to discover subtask hierarchy by learning from demonstration. The discovered subtask hierarchy could be used to perform task decomposition, recovering the subtask boundaries in an unstructured demonstration. Experiments on Craft and Dial demonstrate that our...	Aaron C. Courville, Chuang Gan, Joshua B. Tenenbaum, Siyuan Zhou, Yikang Shen, Yuchen Lu
531	Adversarially-Trained Deep Nets Transfer Better: Illustration on Image Classification	Transfer learning has emerged as a powerful methodology for adapting pre-trained deep neural networks on image recognition tasks to new domains. This process consists of taking a neural network pre-trained on a large feature-rich source dataset, freezing the early layers that encode essential generic image properties, and then fine-tuning the last few layers in order to capture specific information related to the target situation. This approach is particularly useful when only limited or weakly labeled data are available for the new...	Evan Kravitz, Francisco Utrera, Michael W. Mahoney, N. Benjamin Erichson, Rajiv Khanna
532	UMEC: Unified model and embedding compression for efficient recommendation systems	The recommendation system (RS) plays an important role in the content recommendation and retrieval scenarios. The core part of the system is the Ranking neural network, which is usually a bottleneck of whole system performance during online inference. In this work, we propose a unified model and embedding compression (UMEC) framework to hammer an efficient neural network-based recommendation system. Our framework jointly learns input feature selection and neural network compression together, and solve them as an end-to-end...	Haotao Wang, Ji Liu, Jianchao Tan, Jiayi Shen, Shupeng Gui, Zhangyang Wang
533	Exploring Balanced Feature Spaces for Representation Learning	Existing self-supervised learning (SSL) methods are mostly applied for training representation models from artificially balanced datasets (e.g., ImageNet). It is unclear how well they will perform in the practical scenarios where datasets are often imbalanced w.r.t. the classes. Motivated by this question, we conduct a series of studies on the performance of self-supervised contrastive learning and supervised learning methods over multiple datasets where training instance distributions vary from a balanced one to a long-tailed one. Our...	Bingyi Kang, Jiashi Feng, Sa Xie, Yu Li, Zehuan Yuan
534	Calibration of Neural Networks using Splines	Calibrating neural networks is of utmost importance when employing them in safety-critical applications where the downstream decision making depends on the predicted probabilities. Measuring calibration error amounts to comparing two empirical distributions. In this work, we introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test in which the main idea is to compare the respective cumulative probability distributions. From this, by approximating the empirical cumulative...	Amir Rahimi, Cristian Sminchisescu, Kartik Gupta, Richard Hartley, Thalaiyasingam Ajanthan, Thomas Mensink
535	Improving Relational Regularized Autoencoders with Spherical Sliced Fused Gromov Wasserstein	Relational regularized autoencoder (RAE) is a framework to learn the distribution of data by minimizing a reconstruction loss together with a relational regularization on the prior of latent space. A recent attempt to reduce the inner discrepancy between the prior and aggregated posterior distributions is to incorporate sliced fused Gromov-Wasserstein (SFG) between these distributions. That approach has a weakness since it treats every slicing direction similarly, meanwhile several directions are not useful for the discriminative task....	Hung Bui, Khai Nguyen, Nhat Ho, Son Nguyen, Tung Pham
536	Rethinking Positional Encoding in Language Pre-training	In this work, we investigate the positional encoding methods used in language pre-training (e.g., BERT) and identify several problems in the existing formulations. First, we show that in the absolute positional encoding, the addition operation applied on positional embeddings and word embeddings brings mixed correlations between the two heterogeneous information resources. It may bring unnecessary randomness in the attention and further limit the expressiveness of the model. Second, we question whether treating the position of the...	Di He, Guolin Ke, Tie-Yan Liu
537	Discovering Non-monotonic Autoregressive Orderings with Variational Inference	The predominant approach for language modeling is to encode a sequence of tokens from left to right, but this eliminates a source of information: the order by which the sequence was naturally generated. One strategy to recover this information is to decode both the content and ordering of tokens. Some prior work supervises content and ordering with hand-designed loss functions to encourage specific orders or bootstraps from a predefined ordering. These approaches require domain-specific insight. Other prior work searches over valid...	Brandon Trabucco, Dong Huk Park, Michael Luo, Sheng Shen, Trevor Darrell, Xuanlin Li, Yang Gao
538	Differentiable Trust Region Layers for Deep Reinforcement Learning	Trust region methods are a popular tool in reinforcement learning as they yield robust policy updates in continuous and discrete action spaces. However, enforcing such trust regions in deep reinforcement learning is difficult. Hence, many approaches, such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), are based on approximations. Due to those approximations, they violate the constraints or fail to find the optimal solution within the trust region. Moreover, they are difficult to implement, often lack...	Fabian Otto, Gerhard Neumann, Hanna Carolin Maria Ziesche, Ngo Anh Vien, Philipp Becker
539	SaliencyMix: A Saliency Guided Data Augmentation Strategy for Better Regularization	Advanced data augmentation strategies have widely been studied to improve the generalization ability of deep learning models. Regional dropout is one of the popular solutions that guides the model to focus on less discriminative parts by randomly removing image regions, resulting in improved regularization. However, such information removal is undesirable. On the other hand, recent strategies suggest to randomly cut and mix patches and their labels among training images, to enjoy the advantages of regional dropout without having any...	A. F. M. Shahab Uddin, Mst. Sirazam Monira, SungHo Bae, TaeChoong Chung, Wheemyung Shin
540	Task-Agnostic Morphology Evolution	Deep reinforcement learning primarily focuses on learning behavior, usually overlooking the fact that an agent's function is largely determined by form. So, how should one go about finding a morphology fit for solving tasks in a given environment? Current approaches that co-adapt morphology and behavior use a specific task's reward as a signal for morphology optimization. However, this often requires expensive policy optimization and results in task-dependent morphologies that are not built to generalize. In this work, we propose a new...	Donald Joseph Hejna III, Lerrel Pinto, Pieter Abbeel
541	Learning Associative Inference Using Fast Weight Memory	Humans can quickly associate stimuli to solve problems in novel contexts. Our novel neural network model learns state representations of facts that can be composed to perform such associative inference. To this end, we augment the LSTM model with an associative memory, dubbed Fast Weight Memory (FWM). Through differentiable operations at every step of a given input sequence, the LSTM updates and maintains compositional associations stored in the rapidly changing FWM weights. Our model is trained end-to-end by gradient...	Imanol Schlag, Jürgen Schmidhuber, Tsendsuren Munkhdalai
542	Boost then Convolve: Gradient Boosting Meets Graph Neural Networks	Graph neural networks (GNNs) are powerful models that have been successful in various graph representation learning tasks. Whereas gradient boosted decision trees (GBDT) often outperform other machine learning methods when faced with heterogeneous tabular data. But what approach should be used for graphs with tabular node features? Previous GNN models have mostly focused on networks with homogeneous sparse features and, as we show, are suboptimal in the heterogeneous setting. In this work, we propose a novel architecture that trains...	Liudmila Prokhorenkova, Sergei Ivanov
543	Degree-Quant: Quantization-Aware Training for Graph Neural Networks	Graph neural networks (GNNs) have demonstrated strong performance on a wide variety of tasks due to their ability to model non-uniform structured data. Despite their promise, there exists little research exploring methods to make them more efficient at inference time. In this work, we explore the viability of training quantized GNNs, enabling the usage of low precision integer arithmetic during inference. For GNNs, seemingly unimportant choices in quantization implementation cause dramatic changes in performance. We identify the sources...	Javier Fernández-Marqués, Nicholas Donald Lane, Shyam Anil Tailor
544	Network Pruning That Matters: A Case Study on Retraining Variants	Network pruning is an effective method to reduce the computational expense of over-parameterized neural networks for deployment on low-resource systems. Recent state-of-the-art techniques for retraining pruned networks such as weight rewinding and learning rate rewinding have been shown to outperform the traditional fine-tuning technique in recovering the lost accuracy (Renda et al., 2020), but so far it is unclear what accounts for such performance. In this work, we conduct extensive experiments to verify and analyze the uncanny...	BinhSon Hua, Duong H. Le
545	Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation	Designing proper loss functions is essential in training deep networks. Especially in the field of semantic segmentation, various evaluation metrics have been proposed for diverse scenarios. Despite the success of the widely adopted cross-entropy loss and its variants, the mis-alignment between the loss functions and evaluation metrics degrades the network performance. Meanwhile, manually designing loss functions for each specific metric requires expertise and significant manpower. In this paper, we propose to automate the design of...	Chenxin Tao, Gao Huang, Hao Li, Jifeng Dai, Xiaogang Wang, Xizhou Zhu
546	Differentiable Segmentation of Sequences	Segmented models are widely used to describe non-stationary sequential data with discrete change points. Their estimation usually requires solving a mixed discrete-continuous optimization problem, where the segmentation is the discrete part and all other model parameters are continuous. A number of estimation algorithms have been developed that are highly specialized for their specific model assumptions. The dependence on non-standard algorithms makes it hard to integrate segmented models in state-of-the-art deep learning architectures...	Emmanuel Müller, Erik Scharwächter, Jonathan Lennartz
547	Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning	We study the problem of dynamic visual reasoning on raw videos. This is a challenging problem; currently, state-of-the-art models often require dense supervision on physical object properties and events from simulation, which are impractical to obtain in real life. In this paper, we present the Dynamic Concept Learner (DCL), a unified framework that grounds physical objects and events from video and language. DCL first adopts a trajectory extractor to track each object over time and to represent it as a latent, object-centric feature...	Chuang Gan, Jiajun Wu, Jiayuan Mao, Joshua B. Tenenbaum, KwanYee Kenneth Wong, Zhenfang Chen
548	Learning Deep Features in Instrumental Variable Regression	Instrumental variable (IV) regression is a standard strategy for learning causal relationships between confounded treatment and outcome variables from observational data by using an instrumental variable, which affects the outcome only through the treatment. In classical IV regression, learning proceeds in two stages: stage 1 performs linear regression from the instrument to the treatment; and stage 2 performs linear regression from the treatment to the outcome, conditioned on the instrument. We propose a novel method, deep feature...	Arnaud Doucet, Arthur Gretton, Liyuan Xu, Nando de Freitas, Siddarth Srinivasan, Yutian Chen
549	Online Adversarial Purification based on Self-supervised Learning	Deep neural networks are known to be vulnerable to adversarial examples, where a perturbation in the input space leads to an amplified shift in the latent network representation. In this paper, we combine canonical supervised learning with self-supervised representation learning, and present Self-supervised Online Adversarial Purification (SOAP), a novel defense strategy that uses a self-supervised loss to purify adversarial examples at test-time. Our approach leverages the label-independent nature of self-supervised signals and...	Changhao Shi, Chester Holtz, Gal Mishne
550	Graph Information Bottleneck for Subgraph Recognition	Given the input graph and its label/property, several key problems of graph learning, such as finding interpretable subgraphs, graph denoising and graph compression, can be attributed to the fundamental problem of recognizing a subgraph of the original one. This subgraph shall be as informative as possible, yet contains less redundant and noisy structure. This problem setting is closely related to the well-known information bottleneck (IB) principle, which, however, has less been studied for the irregular graph data and graph neural...	Junchi Yu, Junzhou Huang, Ran He, Tingyang Xu, Yatao Bian, Yu Rong
551	In Search of Lost Domain Generalization	The goal of domain generalization algorithms is to predict well on distributions different from those seen during training. While a myriad of domain generalization algorithms exist, inconsistencies in experimental conditions (datasets, network architectures, and model selection criteria) render fair comparisons difficult. The goal of this paper is to understand how useful domain generalization algorithms are in realistic settings. As a first step, we realize that model selection is non-trivial for domain generalization tasks, and we...	David Lopez-Paz, Ishaan Gulrajani
552	Robust Curriculum Learning: from clean label detection to noisy label self-correction	Neural network training can easily overfit noisy labels resulting in poor generalization performance. Existing methods address this problem by (1) filtering out the noisy data and only using the clean data for training or (2) relabeling the noisy data by the model during training or by another model trained only on a clean dataset. However, the former does not leverage the features' information of wrongly-labeled data, while the latter may produce wrong pseudo-labels for some data and introduce extra noises. In this paper, we propose a...	Jeff A. Bilmes, Shengjie Wang, Tianyi Zhou
553	Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization	Temporally localizing actions in videos is one of the key components for video understanding. Learning from weakly-labeled data is seen as a potential solution towards avoiding expensive frame-level annotations. Different from other works which only depend on visual-modality, we propose to learn richer audiovisual representation for weakly-supervised action localization. First, we propose a multi-stage cross-attention mechanism to collaboratively fuse audio and visual features, which preserves the intra-modal characteristics. Second,...	Hyoungwoo Park, JunTae Lee, Mihir Jain, Sungrack Yun
554	CoCon: A Self-Supervised Approach for Controlled Text Generation	Pretrained Transformer-based language models (LMs) display remarkable natural language generation capabilities. With their immense potential, controlling text generation of such LMs is getting attention. While there are studies that seek to control high-level attributes (such as sentiment and topic) of generated text, there is still a lack of more precise control over its content at the word- and phrase-level. Here, we propose Content-Conditioner (CoCon) to control an LM's output text with a content input, at a fine-grained level. In...	Alvin Chan, Aston Zhang, Bill Pung, Jie Fu, YewSoon Ong
555	Group Equivariant Generative Adversarial Networks	Recent improvements in generative adversarial visual synthesis incorporate real and fake image transformation in a self-supervised setting, leading to increased stability and perceptual fidelity. However, these approaches typically involve image augmentations via additional regularizers in the GAN objective and thus spend valuable network capacity towards approximating transformation equivariance instead of their desired task. In this work, we explicitly incorporate inductive symmetry priors into the network architectures via...	Antong Chen, Neel Dey, Soheil Ghafurian
556	What they do when in doubt: a study of inductive biases in seq2seq learners	Sequence-to-sequence (seq2seq) learners are widely used, but we still have only limited knowledge about what inductive biases shape the way they generalize. We address that by investigating how popular seq2seq learners generalize in tasks that have high ambiguity in the training data. We use four new tasks to study learners' preferences for memorization, arithmetic, hierarchical, and compositional reasoning. Further, we connect to Solomonoff's theory of induction and propose to use description length as a principled and sensitive...	Eugene Kharitonov, Rahma Chaabouni
557	A teacher-student framework to distill future trajectories	By learning to predict trajectories of dynamical systems, model-based methods can make extensive use of all observations from past experience. However, due to partial observability, stochasticity, compounding errors, and irrelevant dynamics, training to predict observations explicitly often results in poor models. Model-free techniques try to side-step the problem by learning to predict values directly. While breaking the explicit dependency on future observations can result in strong performance, this usually comes at the cost of low...	Alexander Neitz, Bernhard Schölkopf, Giambattista Parascandolo
558	Learning a Latent Search Space for Routing Problems using Variational Autoencoders	Methods for automatically learning to solve routing problems are rapidly improving in performance. While most of these methods excel at generating solutions quickly, they are unable to effectively utilize longer run times because they lack a sophisticated search component. We present a learning-based optimization approach that allows a guided search in the distribution of high-quality solutions for a problem instance. More precisely, our method uses a conditional variational autoencoder that learns to map points in a continuous...	André Hottung, Bhanu Bhandari, Kevin Tierney
559	Universal approximation power of deep residual neural networks via nonlinear control theory	In this paper, we explain the universal approximation capabilities of deep residual neural networks through geometric nonlinear control. Inspired by recent work establishing links between residual networks and control systems, we provide a general sufficient condition for a residual network to have the power of universal approximation by asking the activation function, or one of its derivatives, to satisfy a quadratic differential equation. Many activation functions used in practice satisfy this assumption, exactly or approximately,...	Bahman Gharesifard, Paulo Tabuada
560	On the Universality of Rotation Equivariant Point Cloud Networks	Learning functions on point clouds has applications in many fields, including computer vision, computer graphics, physics, and chemistry. Recently, there has been a growing interest in neural architectures that are invariant or equivariant to all three shape-preserving transformations of point clouds: translation, rotation, and permutation. In this paper, we present a first study of the approximation power of these architectures. We first derive two sufficient conditions for an equivariant architecture to have the universal...	Haggai Maron, Nadav Dym
561	CT-Net: Channel Tensorization Network for Video Classification	3D convolution is powerful for video classification but often computationally expensive, recent studies mainly focus on decomposing it on spatial-temporal and/or channel dimensions. Unfortunately, most approaches fail to achieve a preferable balance between convolutional efficiency and feature-interaction sufficiency. For this reason, we propose a concise and novel Channel Tensorization Network (CT-Net), by treating the channel dimension of input feature as a multiplication of K sub-dimensions. On one hand, it naturally factorizes...	Jun Wang, Kunchang Li, Xianhang Li, Yali Wang, Yu Qiao
562	Learning to live with Dale's principle: ANNs with separate excitatory and inhibitory units	The units in artificial neural networks (ANNs) can be thought of as abstractions of biological neurons, and ANNs are increasingly used in neuroscience research. However, there are many important differences between ANN units and real neurons. One of the most notable is the absence of Dale's principle, which ensures that biological neurons are either exclusively excitatory or inhibitory. Dale's principle is typically left out of ANNs because its inclusion impairs learning. This is problematic, because one of the great advantages of ANNs...	Amélie Lamarquette, Blake Aaron Richards, Damjan Kalajdzievski, Dimitri Michael Kullmann, Jonathan Cornford, Marco Leite
563	Uncertainty Estimation in Autoregressive Structured Prediction	Uncertainty estimation is important for ensuring safety and robustness of AI systems. While most research in the area has focused on un-structured prediction tasks, limited work has investigated general uncertainty estimation approaches for structured prediction. Thus, this work aims to investigate uncertainty estimation for structured prediction tasks within a single unified and interpretable probabilistic ensemble-based framework. We consider: uncertainty estimation for sequence data at the token-level and complete sequence-level;...	Andrey Malinin, Mark J. F. Gales
564	Transformer protein language models are unsupervised structure learners	Unsupervised contact prediction is central to uncovering physical, structural, and functional constraints for protein structure determination and design. For decades, the predominant approach has been to infer evolutionary constraints from a set of related sequences. In the past year, protein language models have emerged as a potential alternative, but performance has fallen short of state-of-the-art approaches in bioinformatics. In this paper we demonstrate that Transformer attention maps learn contacts from the unsupervised language...	Alexander Rives, Joshua Meier, Roshan Rao, Sergey Ovchinnikov, Tom Sercu
565	ANOCE: Analysis of Causal Effects with Multiple Mediators via Constrained Structural Learning	In the era of causal revolution, identifying the causal effect of an exposure on the outcome of interest is an important problem in many areas, such as epidemics, medicine, genetics, and economics. Under a general causal graph, the exposure may have a direct effect on the outcome and also an indirect effect regulated by a set of mediators. An analysis of causal effects that interprets the causal mechanism contributed through mediators is hence challenging but on demand. To the best of our knowledge, there are no feasible algorithms...	Hengrui Cai, Rui Song, Wenbin Lu
566	Plan-Based Relaxed Reward Shaping for Goal-Directed Tasks	In high-dimensional state spaces, the usefulness of Reinforcement Learning (RL) is limited by the problem of exploration. This issue has been addressed using potential-based reward shaping (PB-RS) previously. In the present work, we introduce Final-Volume-Preserving Reward Shaping (FV-RS). FV-RS relaxes the strict optimality guarantees of PB-RS to a guarantee of preserved long-term behavior. Being less restrictive, FV-RS allows for reward shaping functions that are even better suited for improving the sample efficiency of RL...	Ingmar Schubert, Marc Toussaint, Ozgur S. Oguz
567	CcGAN: Continuous Conditional Generative Adversarial Networks for Image Generation	This work proposes the continuous conditional generative adversarial network (CcGAN), the first generative model for image generation conditional on continuous, scalar conditions (termed regression labels). Existing conditional GANs (cGANs) are mainly designed for categorical conditions (e.g., class labels); conditioning on a continuous label is mathematically distinct and raises two fundamental problems: (P1) Since there may be very few (even zero) real images for some regression labels, minimizing existing empirical versions of cGAN...	William J. Welch, Xin Ding, Yongwei Wang, Z. Jane Wang, Zuheng Xu
568	Single-Photon Image Classification	Quantum Computing based Machine Learning mainly focuses on quantum computing hardware that is experimentally challenging to realize due to requiring quantum gates that operate at very low temperature. We demonstrate the existence of a "quantum computing toy model" that illustrates key aspects of quantum information processing while being experimentally accessible with room temperature optics. Pondering the question of the theoretical classification accuracy performance limit for MNIST (respectively "Fashion-MNIST") classifiers, subject...	Luciano Sbaiz, Thomas Fischbacher
569	Self-supervised Adversarial Robustness for the Low-label, High-data Regime	Recent work discovered that training models to be invariant to adversarial perturbations requires substantially larger datasets than those required for standard classification. Perhaps more surprisingly, these larger datasets can be "mostly" unlabeled. Pseudo-labeling, a technique simultaneously pioneered by four separate and simultaneous works in 2019, has been proposed as a competitive alternative to labeled data for training adversarially robust models. However, when the amount of labeled data decreases, the performance of...	Aäron van den Oord, PoSen Huang, Pushmeet Kohli, Sven Gowal, Timothy A. Mann
570	Uncertainty-aware Active Learning for Optimal Bayesian Classifier	For pool-based active learning, in each iteration a candidate training sample is chosen for labeling by optimizing an acquisition function. In Bayesian classification, Expected Loss Reduction (ELR) methods maximize the expected reduction in the classification error given a new labeled candidate based on a one-step-look-ahead strategy. ELR is the optimal strategy with a single query; however, since such myopic strategies cannot identify the long-term effect of a query on the classification error, ELR may get stuck before reaching the...	Byung-Jun Yoon, Edward R. Dougherty, Francis J. Alexander, Guang Zhao, Xiaoning Qian
571	Latent Skill Planning for Exploration and Transfer	To quickly solve new tasks in complex environments, intelligent agents need to build up reusable knowledge. For example, a learned world model captures knowledge about the environment that applies to new tasks. Similarly, skills capture general behaviors that can apply to new tasks. In this paper, we investigate how these two approaches can be integrated into a single reinforcement learning agent. Specifically, we leverage the idea of partial amortization for fast adaptation at test time. For this, actions are produced by a policy that...	Animesh Garg, Danijar Hafner, Florian Shkurti, Homanga Bharadhwaj, Kevin Xie
572	Learning continuous-time PDEs from sparse data with graph neural networks	The behavior of many dynamical systems follow complex, yet still unknown partial differential equations (PDEs). While several machine learning methods have been proposed to learn PDEs directly from data, previous methods are limited to discrete-time approximations or make the limiting assumption of the observations arriving at regular grids. We propose a general continuous-time differential model for dynamical systems whose governing equations are parameterized by message passing graph neural networks. The model admits arbitrary space...	Harri Lähdesmäki, Markus Heinonen, Valerii Iakovlev
573	Characterizing signal propagation to close the performance gap in unnormalized ResNets	Batch Normalization is a key component in almost all state-of-the-art image classifiers, but it also introduces practical challenges: it breaks the independence between training examples within a batch, can incur compute and memory overhead, and often results in unexpected bugs. Building on recent theoretical analyses of deep ResNets at initialization, we propose a simple set of analysis tools to characterize signal propagation on the forward pass, and leverage these tools to design highly performant ResNets without activation...	Andrew Brock, Samuel L. Smith, Soham De
574	Robust Overfitting may be mitigated by properly learned smoothening	A recent study (Rice et al., 2020) revealed overfitting to be a dominant phenomenon in adversarially robust training of deep networks, and that appropriate early-stopping of adversarial training (AT) could match the performance gains of most recent algorithmic improvements. This intriguing problem of robust overfitting motivates us to seek more remedies. As a pilot study, this paper investigates two empirical means to inject more learned smoothening during AT: one leveraging knowledge distillation and self-training to smooth the...	Shiyu Chang, Sijia Liu, Tianlong Chen, Zhangyang Wang, Zhenyu Zhang
575	Long Live the Lottery: The Existence of Winning Tickets in Lifelong Learning	The lottery ticket hypothesis states that a highly sparsified sub-network can be trained in isolation, given the appropriate weight initialization. This paper extends that hypothesis from one-shot task learning, and demonstrates for the first time that such extremely compact and independently trainable sub-networks can be also identified in the lifelong learning scenario, which we call lifelong tickets. We show that the resulting lifelong ticket can further be leveraged to improve the performance of learning over continual tasks....	Shiyu Chang, Sijia Liu, Tianlong Chen, Zhangyang Wang, Zhenyu Zhang
576	Symmetry-Aware Actor-Critic for 3D Molecular Design	Automating molecular design using deep reinforcement learning (RL) has the potential to greatly accelerate the search for novel materials. Despite recent progress on leveraging graph representations to design molecules, such methods are fundamentally limited by the lack of three-dimensional (3D) information. In light of this, we propose a novel actor-critic architecture for 3D molecular design that can generate molecular structures unattainable with previous approaches. This is achieved by exploiting the symmetries of the design...	Gregor N. C. Simm, Gábor Csányi, José Miguel Hernández-Lobato, Robert Pinsler
577	PseudoSeg: Designing Pseudo Labels for Semantic Segmentation	Recent advances in semi-supervised learning (SSL) demonstrate that a combination of consistency regularization and pseudo-labeling can effectively improve image classification accuracy in the low-data regime. Compared to classification, semantic segmentation tasks require much more intensive labeling costs. Thus, these tasks greatly benefit from data-efficient training methods. However, structured outputs in segmentation render particular difficulties (e.g., designing pseudo-labeling and augmentation) to apply existing SSL strategies....	Chun-Liang Li, Han Zhang, Jia-Bin Huang, Tomas Pfister, Xiao Bian, Yuliang Zou, Zizhao Zhang
578	NAS-Bench-ASR: Reproducible Neural Architecture Search for Speech Recognition	Powered by innovations in novel architecture design, noise tolerance techniques and increasing model capacity, Automatic Speech Recognition (ASR) has made giant strides in reducing word-error-rate over the past decade. ASR models are often trained with tens of thousands of hours of high quality speech data to produce state-of-the-art (SOTA) results. Industry-scale ASR model training thus remains computationally heavy and time-consuming, and consequently has attracted little attention in adopting automatic techniques. On the other hand,...	Abhinav Mehrotra, Alberto Gil C. P. Ramos, Lukasz Dudziak, Mohamed S. Abdelfattah, Nicholas Donald Lane, Ravichander Vipperla, Samin Ishtiaq, Sourav Bhattacharya, Thomas Chau
579	Scaling the Convex Barrier with Active Sets	Tight and efficient neural network bounding is of critical importance for the scaling of neural network verification systems. A number of efficient specialised dual solvers for neural network bounds have been presented recently, but they are often too loose to verify more challenging properties. This lack of tightness is linked to the weakness of the employed relaxation, which is usually a linear program of size linear in the number of neurons. While a tighter linear relaxation for piecewise linear activations exists, it comes at the...	Alessandro De Palma, Harkirat S. Behl, M. Pawan Kumar, Philip H. S. Torr, Rudy Bunel
580	Local Convergence Analysis of Gradient Descent Ascent with Finite Timescale Separation	We study the role that a finite timescale separation parameter $\tau$ has on gradient descent-ascent in non-convex, non-concave zero-sum games where the learning rate of player 1 is denoted by $\gamma_1$ and the learning rate of player 2 is defined to be $\gamma_2=\tau\gamma_1$. We provide a non-asymptotic construction of the finite timescale separation parameter $\tau^{\ast}$ such that gradient descent-ascent locally converges to $x^{\ast}$ for all $\tau \in (\tau^{\ast}, \infty)$ if and only if it is a strict local minmax...	Lillian J. Ratliff, Tanner Fiez
581	Activation-level uncertainty in deep neural networks	Current approaches for uncertainty estimation in deep learning often produce too confident results. Bayesian Neural Networks (BNNs) model uncertainty in the space of weights, which is usually high-dimensional and limits the quality of variational approximations. The more recent functional BNNs (fBNNs) address this only partially because, although the prior is specified in the space of functions, the posterior approximation is still defined in terms of stochastic weights. In this work we propose to move uncertainty from the weights...	Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Pablo Morales-Alvarez, Rafael Molina
582	Efficient Continual Learning with Modular Networks and Task-Driven Priors	Existing literature in Continual Learning (CL) has focused on overcoming catastrophic forgetting, the inability of the learner to recall how to perform tasks observed in the past. There are however other desirable properties of a CL system, such as the ability to transfer knowledge from previous tasks and to scale memory and compute sub-linearly with the number of tasks. Since most current benchmarks focus only on forgetting using short streams of tasks, we first propose a new suite of benchmarks to probe CL algorithms across these new...	Ludovic Denoyer, Marc'Aurelio Ranzato, Tom Veniat
583	No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks	There has been increasing interest in building deep hierarchy-aware classifiers that aim to quantify and reduce the severity of mistakes, and not just reduce the number of errors. The idea is to exploit the label hierarchy (e.g., the WordNet ontology) and consider graph distances as a proxy for mistake severity. Surprisingly, on examining mistake-severity distributions of the top-1 prediction, we find that current state-of-the-art hierarchy-aware deep classifiers do not always show practical improvement over the standard cross-entropy...	Ameya Prabhu, Puneet K. Dokania, Shyamgopal Karthik, Vineet Gandhi
584	Ringing ReLUs: Harmonic Distortion Analysis of Nonlinear Feedforward Networks	In this paper, we apply harmonic distortion analysis to understand the effect of nonlinearities in the spectral domain. Each nonlinear layer creates higher-frequency harmonics, which we call “blueshift”, whose magnitude increases with network depth, thereby increasing the “roughness” of the output landscape. Unlike differential models (such as vanishing gradients, sharpness), this provides a more global view of how network architectures behave across larger areas of their parameter domain. For example, the model predicts that residual...	Christian H. X. Ali Mehmeti-Göpel, David Hartmann, Michael Wand
585	Distance-Based Regularisation of Deep Networks for Fine-Tuning	We investigate approaches to regularisation during fine-tuning of deep neural networks. First we provide a neural network generalisation bound based on Rademacher complexity that uses the distance the weights have moved from their initial values. This bound has no direct dependence on the number of weights and compares favourably to other bounds when applied to convolutional networks. Our bound is highly relevant for fine-tuning, because providing a network with a good initialisation based on transfer learning means that learning can...	Henry Gouk, Massimiliano Pontil, Timothy M. Hospedales
586	Genetic Soft Updates for Policy Evolution in Deep Reinforcement Learning	The combination of Evolutionary Algorithms (EAs) and Deep Reinforcement Learning (DRL) has been recently proposed to merge the benefits of both solutions. Existing mixed approaches, however, have been successfully applied only to actor-critic methods and present significant overhead. We address these issues by introducing a novel mixed framework that exploits a periodical genetic evaluation to soft update the weights of a DRL agent. The resulting approach is applicable with any DRL method and, in a worst-case scenario, it does not...	Alessandro Farinelli, Davide Corsi, Enrico Marchesini
587	Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis	Training Generative Adversarial Networks (GAN) on high-fidelity images usually requires large-scale GPU-clusters and a vast number of training images. In this paper, we study the few-shot image synthesis task for GAN with minimum computing cost. We propose a light-weight GAN structure that gains superior quality on $1024^{2}$ resolution. Notably, the model converges from scratch with just a few hours of training on a single RTX-2080 GPU, and has a consistent performance, even with less than 100 training samples. Two technique designs...	Ahmed Elgammal, Bingchen Liu, Kunpeng Song, Yizhe Zhu
588	IsarStep: a Benchmark for High-level Mathematical Reasoning	A well-defined benchmark is essential for measuring and accelerating research progress of machine learning models. In this paper, we present a benchmark for high-level mathematical reasoning and study the reasoning capabilities of neural sequence-to-sequence models. We build a non-synthetic dataset from the largest repository of proofs written by human experts in a theorem prover. The dataset has a broad coverage of undergraduate and research-level mathematical and computer science theorems. In our defined task, a model is required to...	Lawrence C. Paulson, Lei Yu, Wenda Li, Yuhuai Wu
589	Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning A Randomly Weighted Network	Recently, Frankle & Carbin (2019) demonstrated that randomly-initialized dense networks contain subnetworks that once found can be trained to reach test accuracy comparable to the trained dense network. However, finding these high performing trainable subnetworks is expensive, requiring an iterative process of training and pruning weights. In this paper, we propose (and prove) a stronger Multi-Prize Lottery Ticket Hypothesis: A sufficiently over-parameterized neural network with random weights contains several subnetworks (winning...	Bhavya Kailkhura, James Diffenderfer
590	Average-case Acceleration for Bilinear Games and Normal Matrices	Advances in generative modeling and adversarial learning have given rise to renewed interest in smooth games. However, the absence of symmetry in the matrix of second derivatives poses challenges that are not present in the classical minimization framework. While a rich theory of average-case analysis has been developed for minimization problems, little is known in the context of smooth games. In this work we take a first step towards closing this gap by developing average-case optimal first-order methods for a subset of smooth games....	Carles Domingo-Enrich, Damien Scieur, Fabian Pedregosa
591	Economic Hyperparameter Optimization with Blended Search Strategy	We study the problem of using low cost to search for hyperparameter configurations in a large search space with heterogeneous evaluation cost and model quality. We propose a blended search strategy to combine the strengths of global and local search, and prioritize them on the fly with the goal of minimizing the total cost spent in finding good configurations. Our approach demonstrates robust performance for tuning both tree-based models and deep neural networks on a large AutoML benchmark, as well as superior performance in model...	Amin Saied, Chi Wang, Qingyun Wu, Silu Huang
592	BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization	Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks, and thus has been widely investigated. However, it lacks a systematic method to determine the exact quantization scheme. Previous methods either examine only a small manually-designed search space or utilize a cumbersome neural architecture search to explore the vast search space. These approaches cannot lead to an optimal quantization scheme efficiently. This work proposes bit-level sparsity...	Hai Li, Huanrui Yang, Lin Duan, Yiran Chen
593	AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly	The learning rate (LR) schedule is one of the most important hyper-parameters needing careful tuning in training DNNs. However, it is also one of the least automated parts of machine learning systems and usually costs significant manual effort and computing. Though there are pre-defined LR schedules and optimizers with adaptive LR, they introduce new hyperparameters that need to be tuned separately for different tasks/datasets. In this paper, we consider the question: Can we automatically tune the LR over the course of training without...	Arvind Krishnamurthy, Chuanxiong Guo, Liangyu Zhao, Marco Canini, Tianyi Zhou, Yibo Zhu, Yuchen Jin
594	BERTology Meets Biology: Interpreting Attention in Protein Language Models	Transformer architectures have proven to learn useful representations for protein classification and generation tasks. However, these representations present challenges in interpretability. In this work, we demonstrate a set of methods for analyzing protein Transformer models through the lens of attention. We show that attention: (1) captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure, (2) targets binding sites, a key...	Ali Madani, Caiming Xiong, Jesse Vig, Lav R. Varshney, Nazneen Fatema Rajani, Richard Socher
595	Learning Task-General Representations with Generative Neuro-Symbolic Modeling	People can learn rich, general-purpose conceptual representations from only raw perceptual inputs. Current machine learning approaches fall well short of these human standards, although different modeling traditions often have complementary strengths. Symbolic models can capture the compositional and causal knowledge that enables flexible generalization, but they struggle to learn from raw inputs, relying on strong abstractions and simplifying assumptions. Neural network models can learn directly from raw data, but they struggle to...	Brenden M. Lake, Reuben Feinman
596	Zero-shot Synthesis with Group-Supervised Learning	Visual cognition of primates is superior to that of artificial neural networks in its ability to “envision” a visual object, even a newly-introduced one, in different attributes including pose, position, color, texture, etc. To aid neural networks to envision objects with different attributes, we propose a family of objective functions, expressed on groups of examples, as a novel learning framework that we term Group-Supervised Learning (GSL). GSL allows us to decompose inputs into a disentangled representation with swappable...	Gan Xin, Laurent Itti, Sami Abu-El-Haija, Yunhao Ge
597	Selective Classification Can Magnify Disparities Across Groups	Selective classification, in which models can abstain on uncertain predictions, is a natural approach to improving accuracy in settings where errors are costly but abstentions are manageable. In this paper, we find that while selective classification can improve average accuracies, it can simultaneously magnify existing accuracy disparities between various groups within a population, especially in the presence of spurious correlations. We observe this behavior consistently across five vision and NLP datasets. Surprisingly, increasing...	Ananya Kumar, Erik Jones, Pang Wei Koh, Percy Liang, Shiori Sagawa
598	Better Fine-Tuning by Reducing Representational Collapse	Although widely adopted, existing approaches for fine-tuning pre-trained language models have been shown to be unstable across hyper-parameter settings, motivating recent work on trust region methods. In this paper, we present a simplified and efficient method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise (sampling from either a normal or uniform distribution), thereby discouraging representation change during fine-tuning when possible without hurting performance. We also...	Akshat Shrivastava, Anchit Gupta, Armen Aghajanyan, Luke Zettlemoyer, Naman Goyal, Sonal Gupta
599	Training independent subnetworks for robust prediction	Recent approaches to efficiently ensemble neural networks have shown that strong robustness and uncertainty performance can be achieved with a negligible gain in parameters over the original network. However, these methods still require multiple forward passes for prediction, leading to a significant runtime cost. In this work, we show a surprising result: the benefits of using multiple predictions can be achieved 'for free' under a single model's forward pass. In particular, we show that, using a multi-input multi-output (MIMO)...	Andrew Mingbo Dai, Balaji Lakshminarayanan, Dustin Tran, Jasper Snoek, Jeremiah Zhe Liu, Marton Havasi, Rodolphe Jenatton, Stanislav Fort
600	Meta-Learning of Structured Task Distributions in Humans and Machines	In recent years, meta-learning, in which a model is trained on a family of tasks (i.e. a task distribution), has emerged as an approach to training neural networks to perform tasks that were previously assumed to require structured representations, making strides toward closing the gap between humans and machines. However, we argue that evaluating meta-learning remains a challenge, and can miss whether meta-learning actually uses the structure embedded within the tasks. These meta-learners might therefore still be significantly...	Ishita Dasgupta, Jonathan D. Cohen, Nathaniel D. Daw, Sreejan Kumar, Thomas L. Griffiths
601	BiPointNet: Binary Neural Network for Point Clouds	To alleviate the resource constraint for real-time point cloud applications that run on edge devices, in this paper we present BiPointNet, the first model binarization approach for efficient deep learning on point clouds. We discover that the immense performance drop of binarized models for point clouds mainly stems from two challenges: aggregation-induced feature homogenization that leads to a degradation of information entropy, and scale distortion that hinders optimization and invalidates scale-sensitive structures. With theoretical...	Haiyu Zhao, Hao Su, Haotong Qin, Mingyuan Zhang, Shuai Yi, Xianglong Liu, Yifu Ding, Zhongang Cai
602	Benchmarks for Deep Off-Policy Evaluation	Off-policy evaluation (OPE) holds the promise of being able to leverage large, offline datasets for both evaluating and selecting complex policies for decision making. The ability to learn offline is particularly important in many real-world domains, such as in healthcare, recommender systems, or robotics, where online data collection is an expensive and potentially dangerous process. Being able to accurately evaluate and select high-performing policies without requiring online interaction could yield significant benefits in safety,...	Alexander Novikov, Aviral Kumar, Cosmin Paduraru, George Tucker, Justin Fu, Mengjiao Yang, Michael R. Zhang, Mohammad Norouzi, Ofir Nachum, Sergey Levine, Tom Le Paine, Yutian Chen, Ziyu Wang
603	Planning from Pixels using Inverse Dynamics Models	Learning dynamics models in high-dimensional observation spaces can be challenging for model-based RL agents. We propose a novel way to learn models in a latent space by learning to predict sequences of future actions conditioned on task completion. These models track task-relevant environment dynamics over a distribution of tasks, while simultaneously serving as an effective heuristic for planning with sparse rewards. We evaluate our method on challenging visual goal completion tasks and show a substantial increase in performance...	Jimmy Ba, Keiran Paster, Sheila A. McIlraith
604	Understanding the effects of data parallelism and sparsity on neural network training	We study two factors in neural network training: data parallelism and sparsity; here, data parallelism means processing training data in parallel using distributed systems (or equivalently increasing batch size), so that training can be accelerated; for sparsity, we refer to pruning parameters in a neural network model, so as to reduce computational and memory cost. Despite their promising benefits, however, understanding of their effects on neural network training remains elusive. In this work, we first measure these effects...	Martin Jaggi, Namhoon Lee, Philip H. S. Torr, Thalaiyasingam Ajanthan
605	NOVAS: Non-convex Optimization via Adaptive Stochastic Search for End-to-end Learning and Control	In this work we propose the use of adaptive stochastic search as a building block for general, non-convex optimization operations within deep neural network architectures. Specifically, for an objective function located at some layer in the network and parameterized by some network parameters, we employ adaptive stochastic search to perform optimization over its output. This operation is differentiable and does not obstruct the passing of gradients during backpropagation, thus enabling us to incorporate it as a component in end-to-end...	Evangelos A. Theodorou, Ioannis Exarchos, Marcus Aloysius Pereira, Ziyi Wang
606	MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond	This paper focuses on visual counting, which aims to predict the number of occurrences given a natural image and a query (e.g. a question or a category). Unlike most prior works that use explicit, symbolic models which can be computationally expensive and limited in generalization, we propose a simple and effective alternative by revisiting modulated convolutions that fuse the query and the image locally. Following the design of residual bottleneck, we call our method MoVie, short for Modulated conVolutional bottlenecks. Notably, MoVie...	Duy-Kien Nguyen, Vedanuj Goswami, Xinlei Chen
607	NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation	3D pose estimation is a challenging but important task in computer vision. In this work, we show that standard deep learning approaches to 3D pose estimation are not robust to partial occlusion. Inspired by the robustness of generative vision models to partial occlusion, we propose to integrate deep neural networks with 3D generative representations of objects into a unified neural architecture that we term NeMo. In particular, NeMo learns a generative model of neural feature activations at each vertex on a dense 3D mesh. Using...	Adam Kortylewski, Alan L. Yuille, Angtian Wang
608	On Graph Neural Networks versus Graph-Augmented MLPs	From the perspectives of expressive power and learning, this work compares multi-layer Graph Neural Networks (GNNs) with a simplified alternative that we call Graph-Augmented Multi-Layer Perceptrons (GA-MLPs), which first augments node features with certain multi-hop operators on the graph and then applies learnable node-wise functions. From the perspective of graph isomorphism testing, we show both theoretically and numerically that GA-MLPs with suitable operators can distinguish almost all non-isomorphic graphs, just like the...	Joan Bruna, Lei Chen, Zhengdao Chen
609	Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling	Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting completed hypotheses. In this work, we propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context...	Anmol Gulati, Bo Li, Chung-Cheng Chiu, Jiahui Yu, Ruoming Pang, Tara N. Sainath, Wei Han, Yonghui Wu
610	Deep Learning meets Projective Clustering	A common approach for compressing Natural Language Processing (NLP) networks is to encode the embedding layer as a matrix $A\in\mathbb{R}^{n\times d}$, compute its rank-$j$ approximation $A_j$ via SVD (Singular Value Decomposition), and then factor $A_j$ into a pair of matrices that correspond to smaller fully-connected layers to replace the original embedding layer. Geometrically, the rows of $A$ represent points in $\mathbb{R}^d$, and the rows of $A_j$ represent their projections onto the $j$-dimensional subspace that minimizes the...	Alaa Maalouf, Dan Feldman, Daniela Rus, Harry Lang
611	Reinforcement Learning with Random Delays	Action and observation delays commonly occur in many Reinforcement Learning applications, such as remote control scenarios. We study the anatomy of randomly delayed environments, and show that partially resampling trajectory fragments in hindsight allows for off-policy multi-step value estimation. We apply this principle to derive Delay-Correcting Actor-Critic (DCAC), an algorithm based on Soft Actor-Critic with significantly better performance in environments with delays. This is shown theoretically and also demonstrated practically...	Christopher J. Pal, Giovanni Beltrame, Jonathan Binas, Simon Ramstedt, Yann Bouteiller
612	Isotropy in the Contextual Embedding Space: Clusters and Manifolds	The geometric properties of contextual embedding spaces for deep language models such as BERT and ERNIE have attracted considerable attention in recent years. Investigations on the contextual embeddings demonstrate a strong anisotropic space such that most of the vectors fall within a narrow cone, leading to high cosine similarities. It is surprising that these LMs are as successful as they are, given that most of their embedding vectors are as similar to one another as they are. In this paper, we argue that the isotropy indeed exists...	Jiaji Huang, Kenneth Church, Xingyu Cai, Yuchen Bian
613	Spatio-Temporal Graph Scattering Transform	Although spatio-temporal graph neural networks have achieved great empirical success in handling multiple correlated time series, they may be impractical in some real-world scenarios due to a lack of sufficient high-quality training data. Furthermore, spatio-temporal graph neural networks lack theoretical interpretation. To address these issues, we put forth a novel mathematically designed framework to analyze spatio-temporal data. Our proposed spatio-temporal graph scattering transform (ST-GST) extends traditional scattering transform...	Antonio Ortega, Chao Pan, Siheng Chen
614	Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization	Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become prohibitive to update the data-collection policy more than a few times during learning. With this view, we propose a novel concept of...	Hiroki Furuta, Ofir Nachum, Shixiang Gu, Tatsuya Matsushima, Yutaka Matsuo
615	gradSim: Differentiable simulation for system identification and visuomotor control	In this paper, we tackle the problem of estimating object physical properties such as mass, friction, and elasticity directly from video sequences. Such a system identification problem is fundamentally ill-posed due to the loss of information during image formation. Current best solutions to the problem require precise 3D labels which are labor intensive to gather, and infeasible to create for many systems such as deformable solids or cloth. In this work we present gradSim, a framework that overcomes the dependence on 3D supervision by...	Breandan Considine, Derek Nowrouzezahrai, Florian Golemo, Florian Shkurti, J. Krishna Murthy, Jérôme Parent-Lévesque, Kenny Erleben, Kevin Xie, Liam Paull, Linda Petrini, Martin Weiss, Miles Macklin, Sanja Fidler, Vikram Voleti
616	Evaluations and Methods for Explanation through Robustness Analysis	Feature-based explanations, which provide the importance of each feature towards the model prediction, are arguably one of the most intuitive ways to explain a model. In this paper, we establish a novel set of evaluation criteria for such feature-based explanations by robustness analysis. In contrast to existing evaluations which require us to specify some way to "remove" features that could inevitably introduce biases and artifacts, we make use of the subtler notion of smaller adversarial perturbations. By optimizing towards our proposed...	Cheng-Yu Hsieh, Chih-Kuan Yeh, Cho-Jui Hsieh, Pradeep Kumar Ravikumar, Sanjiv Kumar, Seungyeon Kim, Xuanqing Liu
617	RNNLogic: Learning Logic Rules for Reasoning on Knowledge Graphs	This paper studies learning logic rules for reasoning on knowledge graphs. Logic rules provide interpretable explanations when used for prediction as well as being able to generalize to other tasks, and hence are critical to learn. Existing methods either suffer from the problem of searching in a large search space (e.g., neural logic programming) or ineffective optimization due to sparse rewards (e.g., techniques based on reinforcement learning). To address these limitations, this paper proposes a probabilistic model called RNNLogic....	Jian Tang, Jun-Kun Chen, Louis-Pascal A. C. Xhonneux, Meng Qu, Yoshua Bengio
618	Can a Fruit Fly Learn Word Embeddings?	The mushroom body of the fruit fly brain is one of the best studied systems in neuroscience. At its core it consists of a population of Kenyon cells, which receive inputs from multiple sensory modalities. These cells are inhibited by the anterior paired lateral neuron, thus creating a sparse high dimensional representation of the inputs. In this work we study a mathematical formalization of this network motif and apply it to learning the correlational structure between words and their context in a corpus of unstructured text, a common...	Benjamin Hoover, Chaitanya K. Ryali, Dmitry Krotov, Leopold Grinberg, Mohammed J. Zaki, Saket Navlakha, Yuchen Liang
619	Neural representation and generation for RNA secondary structures	Our work is concerned with the generation and targeted design of RNA, a type of genetic macromolecule that can adopt complex structures which influence their cellular activities and functions. The design of large scale and complex biological structures spurs dedicated graph-based deep generative modeling techniques, which represents a key but underappreciated aspect of computational drug discovery. In this work, we investigate the principles behind representing and generating different RNA structural modalities, and propose a flexible...	Mathieu Blanchette, William L. Hamilton, Zichao Yan
620	WaNet - Imperceptible Warping-based Backdoor Attack	With the thriving of deep learning and the widespread practice of using pre-trained networks, backdoor attacks have become an increasing security threat drawing many research interests in recent years. A third-party model can be poisoned in training to work well in normal conditions but behave maliciously when a trigger pattern appears. However, the existing backdoor attacks are all built on noise perturbation triggers, making them noticeable to humans. In this paper, we instead propose using warping-based triggers. The proposed...	Anh Tuan Tran, Tuan Anh Nguyen
621	LowKey: Leveraging Adversarial Attacks to Protect Social Media Users from Facial Recognition	Facial recognition systems are increasingly deployed by private corporations, government agencies, and contractors for consumer services and mass surveillance programs alike. These systems are typically built by scraping social media profiles for user images. Adversarial perturbations have been proposed for bypassing facial recognition systems. However, existing methods fail on full-scale systems and commercial APIs. We develop our own adversarial filter that accounts for the entire image processing pipeline and is demonstrably...	Gavin Taylor, Harrison Foley, John P. Dickerson, Micah Goldblum, Shiyuan Duan, Tom Goldstein, Valeriia Cherepanova
622	Learning from others' mistakes: Avoiding dataset biases without modeling them	State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface form correlations instead of features that target the intended underlying task. Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available. We consider cases where the bias issues may not be explicitly identified, and show a method for training models that learn to ignore these problematic correlations. Our approach relies on the observation that models with limited capacity...	Alexander M. Rush, Thomas Wolf, Victor Sanh, Yonatan Belinkov
623	Prototypical Contrastive Learning of Unsupervised Representations	This paper presents Prototypical Contrastive Learning (PCL), an unsupervised representation learning method that bridges contrastive learning with clustering. PCL not only learns low-level features for the task of instance discrimination, but more importantly, it implicitly encodes semantic structures of the data into the learned embedding space. Specifically, we introduce prototypes as latent variables to help find the maximum-likelihood estimation of the network parameters in an Expectation-Maximization framework. We iteratively...	Caiming Xiong, Junnan Li, Pan Zhou, Steven C. H. Hoi
624	Extreme Memorization via Scale of Initialization	We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD, interpolating from good generalization performance to completely memorizing the training set while making little progress on the test set. Moreover, we find that the extent and manner in which generalization ability is affected depends on the activation and loss function used, with sin activation being the most extreme. In the case of the homogeneous ReLU activation, we show that this behavior...	Ashok Cutkosky, Behnam Neyshabur, Harsh Mehta
625	Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling	Obtaining large annotated datasets is critical for training successful machine learning models and it is often a bottleneck in practice. Weak supervision offers a promising alternative for producing labeled datasets without ground truth annotations by generating probabilistic labels using multiple noisy heuristics. This process can scale to large datasets and has demonstrated state of the art performance in diverse domains such as healthcare and e-commerce. One practical issue with learning from user-generated heuristics is that their...	Artur Dubrawski, Benedikt Boecking, Eric P. Xing, Willie Neiswanger
626	Adaptive Procedural Task Generation for Hard-Exploration Problems	We introduce Adaptive Procedural Task Generation (APT-Gen), an approach to progressively generate a sequence of tasks as curricula to facilitate reinforcement learning in hard-exploration problems. At the heart of our approach, a task generator learns to create tasks from a parameterized task space via a black-box procedural generation module. To enable curriculum learning in the absence of a direct indicator of learning progress, we propose to train the task generator by balancing the agent's performance in the generated tasks and the...	Kuan Fang, Li Fei-Fei, Silvio Savarese, Yuke Zhu
627	Multi-timescale Representation Learning in LSTM Language Models	Language models must capture statistical dependencies between words at timescales ranging from very short to very long. Earlier work has demonstrated that dependencies in natural language tend to decay with distance between words according to a power law. However, it is unclear how this knowledge can be used for analyzing or designing neural network language models. In this work, we derived a theory for how the memory gating mechanism in long short-term memory (LSTM) language models can capture power law decay. We found that unit...	Alexander Huth, Javier S. Turek, Shivangi Mahto, Vy Ai Vo
628	Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation	Much recent effort has been invested in non-autoregressive neural machine translation, which appears to be an efficient alternative to state-of-the-art autoregressive machine translation on modern GPUs. In contrast to the latter, where generation is sequential, the former allows generation to be parallelized across target token positions. Some of the latest non-autoregressive models have achieved impressive translation quality-speed tradeoffs compared to autoregressive baselines. In this work, we reexamine this tradeoff and argue that...	Hao Peng, James Cross, Jungo Kasai, Nikolaos Pappas, Noah A. Smith
629	Learning Better Structured Representations Using Low-rank Adaptive Label Smoothing	Training with soft targets instead of hard targets has been shown to improve performance and calibration of deep neural networks. Label smoothing is a popular way of computing soft targets, where one-hot encoding of a class is smoothed with a uniform distribution. Owing to its simplicity, label smoothing has found wide-spread use for training deep neural networks on a wide variety of tasks, ranging from image and text classification to machine translation and semantic parsing. Complementing recent empirical justification for label...	Asish Ghoshal, Luke Zettlemoyer, Sonal Gupta, Xilun Chen, Yashar Mehdad
630	Predicting Inductive Biases of Pre-Trained Models	Most current NLP systems are based on a pre-train-then-fine-tune paradigm, in which a large neural network is first trained in a self-supervised way designed to encourage the network to extract broadly-useful linguistic features, and then fine-tuned for a specific task of interest. Recent work attempts to understand why this recipe works and explain when it fails. Currently, such analyses have produced two sets of apparently-contradictory results. Work that analyzes the representations that result from pre-training (via "probing...	Charles Lovering, Ellie Pavlick, Rohan Jha, Tal Linzen
631	Molecule Optimization by Explainable Evolution	Optimizing molecules for desired properties is a fundamental yet challenging task in chemistry, material science, and drug discovery. This paper develops a novel algorithm for optimizing molecular properties via an Expectation-Maximization (EM) like explainable evolutionary process. The algorithm is designed to mimic human experts in the process of searching for desirable molecules and alternate between two stages: the first stage on explainable local search which identifies rationales, i.e., critical subgraph patterns accounting for...	Binghong Chen, Chengtao Li, Hanjun Dai, Le Song, Tianzhe Wang
632	Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies	Learning continuous representations of discrete objects such as text, users, movies, and URLs lies at the heart of many applications including language and user modeling. When using discrete objects as input to neural networks, we often ignore the underlying structures (e.g., natural groupings and similarities) and embed the objects independently into individual vectors. As a result, existing methods do not scale to large vocabulary sizes. In this paper, we design a simple and efficient embedding algorithm that learns a small set of...	Amr Ahmed, Manzil Zaheer, Paul Pu Liang, Yuan Wang
633	PSTNet: Point Spatio-Temporal Convolution on Point Cloud Sequences	Point cloud sequences are irregular and unordered in the spatial dimension while exhibiting regularities and order in the temporal dimension. Therefore, existing grid based convolutions for conventional video processing cannot be directly applied to spatio-temporal modeling of raw point cloud sequences. In this paper, we propose a point spatio-temporal (PST) convolution to achieve informative representations of point cloud sequences. The proposed PST convolution first disentangles space and time in point cloud sequences. Then, a...	Hehe Fan, Mohan S. Kankanhalli, Xin Yu, Yi Yang, Yuhang Ding
634	Group Equivariant Conditional Neural Processes	We present the group equivariant conditional neural process (EquivCNP), a meta-learning method with permutation invariance in a data set as in conventional conditional neural processes (CNPs), and it also has transformation equivariance in data space. Incorporating group equivariance, such as rotation and scaling equivariance, provides a way to consider the symmetry of real-world data. We give a decomposition theorem for permutation-invariant and group-equivariant maps, which leads us to construct EquivCNPs with an infinite-dimensional...	Akiyoshi Sannai, Makoto Kawano, Wataru Kumagai, Yusuke Iwasawa, Yutaka Matsuo
635	When does preconditioning help or hurt generalization?	While second order optimizers such as natural gradient descent (NGD) often speed up optimization, their effect on generalization has been called into question. This work presents a more nuanced view on how the implicit bias of optimizers affects the comparison of generalization properties. We provide an exact asymptotic bias-variance decomposition of the generalization error of preconditioned ridgeless regression in the overparameterized regime, and consider the inverse population Fisher information matrix (used in NGD) as a...	Atsushi Nitanda, Denny Wu, Ji Xu, Jimmy Ba, Roger Baker Grosse, Shunichi Amari, Taiji Suzuki, Xuechen Li
636	Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues	Compared to traditional visual question answering, video-grounded dialogues require additional reasoning over dialogue context to answer questions in a multi-turn setting. Previous approaches to video-grounded dialogues mostly use dialogue context as a simple text input without modelling the inherent information flows at the turn level. In this paper, we propose a novel framework of Reasoning Paths in Dialogue Context (PDC). PDC model discovers information flows among dialogue turns through a semantic graph constructed based on lexical...	Hung Le, Nancy F. Chen, Steven C. H. Hoi
637	Prototypical Representation Learning for Relation Extraction	Recognizing relations between entities is a pivotal task of relational learning. Learning relation representations from distantly-labeled datasets is difficult because of the abundant label noise and complicated expressions in human language. This paper aims to learn predictive, interpretable, and robust relation representations from distantly-labeled data that are effective in different settings, including supervised, distantly supervised, and few-shot learning. Instead of solely relying on the supervision from noisy labels, we...	Fei Huang, Guangwei Xu, Haitao Zheng, Ning Ding, Pengjun Xie, Rui Wang, Rui Zhang, Xiaobin Wang, Yao Fu, Ying Shen
638	Layer-adaptive Sparsity for the Magnitude-based Pruning	Recent discoveries on neural network pruning reveal that, with a carefully chosen layerwise sparsity, a simple magnitude-based pruning achieves state-of-the-art tradeoff between sparsity and performance. However, without a clear consensus on "how to choose," the layerwise sparsities are mostly selected algorithm-by-algorithm, often resorting to handcrafted heuristics or an extensive hyperparameter search. To fill this gap, we propose a novel importance score for global pruning, coined layer-adaptive magnitude-based pruning (LAMP)...	Jaeho Lee, Jinwoo Shin, Sangwoo Mo, Sejun Park, Sungsoo Ahn
639	Refining Deep Generative Models via Discriminator Gradient Flow	Deep generative modeling has seen impressive advances in recent years, to the point where it is now commonplace to see simulated samples (e.g., images) that closely resemble real-world data. However, generation quality is generally inconsistent for any given model and can vary dramatically between samples. We introduce Discriminator Gradient $f$low (DG$f$low), a new technique that improves generated samples via the gradient flow of entropy-regularized $f$-divergences between the real and the generated data distributions. The gradient...	Abdul Fatir Ansari, Harold Soh, Ming Liang Ang
640	Explaining the Efficacy of Counterfactually Augmented Data	In attempts to produce machine learning models less reliant on spurious patterns in NLP datasets, researchers have recently proposed curating counterfactually augmented data (CAD) via a human-in-the-loop process in which given some documents and their (initial) labels, humans must revise the text to make a counterfactual label applicable. Importantly, edits that are not necessary to flip the applicable label are prohibited. Models trained on the augmented (original and revised) data appear, empirically, to rely less on semantically...	Amrith Setlur, Divyansh Kaushik, Eduard H. Hovy, Zachary Chase Lipton
641	Lipschitz Recurrent Neural Networks	Viewing recurrent neural networks (RNNs) as continuous-time dynamical systems, we propose a recurrent unit that describes the hidden state's evolution with two parts: a well-understood linear component plus a Lipschitz nonlinearity. This particular functional form facilitates stability analysis of the long-term behavior of the recurrent unit using tools from nonlinear systems theory. In turn, this enables architectural design decisions before experimentation. Sufficient conditions for global stability of the recurrent unit are...	Alejandro F. Queiruga, Liam Hodgkinson, Michael W. Mahoney, N. Benjamin Erichson, Omri Azencot
642	Learning Hyperbolic Representations of Topological Features	Learning task-specific representations of persistence diagrams is an important problem in topological data analysis and machine learning. However, current state of the art methods are restricted in terms of their expressivity as they are focused on Euclidean representations. Persistence diagrams often contain features of infinite persistence (i.e., essential features) and Euclidean spaces shrink their importance relative to non-essential features because they cannot assign infinite distance to finite points. To deal with this issue, we...	Iordanis Fostiropoulos, Panagiotis Kyriakis, Paul Bogdan
643	Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech	Although early text-to-speech (TTS) models such as Tacotron 2 have succeeded in generating human-like speech, their autoregressive architectures have several limitations: (1) They require a lot of time to generate a mel-spectrogram consisting of hundreds of steps. (2) The autoregressive speech generation shows a lack of robustness due to its error propagation property. In this paper, we propose a novel non-autoregressive TTS model called BVAE-TTS, which eliminates the architectural limitations and generates a mel-spectrogram in...	Joongbo Shin, Kyomin Jung, Yoonhyung Lee
644	Risk-Averse Offline Reinforcement Learning	Training Reinforcement Learning (RL) agents in high-stakes applications might be too prohibitive due to the risk associated to exploration. Thus, the agent can only use data previously collected by safe policies. While previous work considers optimizing the average performance using offline data, we focus on optimizing a risk-averse criteria, namely the CVaR. In particular, we present the Offline Risk-Averse Actor-Critic (O-RAAC), a model-free RL algorithm that is able to learn risk-averse policies in a fully offline setting. We show...	Andreas Krause, Núria Armengol Urpí, Sebastian Curi
645	Group Equivariant Stand-Alone Self-Attention For Vision	We provide a general self-attention formulation to impose group equivariance to arbitrary symmetry groups. This is achieved by defining positional encodings that are invariant to the action of the group considered. Since the group acts on the positional encoding directly, group equivariant self-attention networks (GSA-Nets) are steerable by nature. Our experiments on vision benchmarks demonstrate consistent improvements of GSA-Nets over non-equivariant self-attention networks.	David W. Romero, Jean-Baptiste Cordonnier
646	A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning	Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed computing systems. A key bottleneck of such systems is the communication overhead for exchanging information across the workers, such as stochastic gradients. Among the many techniques proposed to remedy this issue, one of the most successful is the framework of compressed communication with error feedback (EF). EF remains the only known technique that can deal with the error induced by contractive compressors...	Peter Richtárik, Samuel Horváth
647	Neural Delay Differential Equations	Neural Ordinary Differential Equations (NODEs), a framework of continuous-depth neural networks, have been widely applied, showing exceptional efficacy in coping with some representative datasets. Recently, an augmented framework has been successfully developed for conquering some limitations emergent in application of the original framework. Here we propose a new class of continuous-depth neural networks with delay, named as Neural Delay Differential Equations (NDDEs), and, for computing the corresponding gradients, we use the adjoint...	Qunxi Zhu, Wei Lin, Yao Guo
648	Capturing Label Characteristics in VAEs	We present a principled approach to incorporating labels in variational autoencoders (VAEs) that captures the rich characteristic information associated with those labels. While prior work has typically conflated these by learning latent variables that directly correspond to label values, we argue this is contrary to the intended effect of supervision in VAEs—capturing rich label characteristics with the latents. For example, we may want to capture the characteristics of a face that make it look young, rather than just the age of the...	Philip H. S. Torr, Sebastian M. Schmon, Siddharth Narayanaswamy, Tom Joy, Tom Rainforth
649	Graph Edit Networks	While graph neural networks have made impressive progress in classification and regression, few approaches to date perform time series prediction on graphs, and those that do are mostly limited to edge changes. We suggest that graph edits are a more natural interface for graph-to-graph learning. In particular, graph edits are general enough to describe any graph-to-graph change, not only edge changes; they are sparse, making them easier to understand for humans and more efficient computationally; and they are local, avoiding the need...	Barbara Hammer, Benjamin Paassen, Cesare Alippi, Daniele Grattarola, Daniele Zambon
650	InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective	Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks. Recent studies, however, show that such BERT-based models are vulnerable facing the threats of textual adversarial attacks. We aim to address this problem from an information-theoretic perspective, and propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models. InfoBERT contains two mutual-information-based regularizers for model training: (i) an Information Bottleneck regularizer,...	Bo Li, Boxin Wang, Jingjing Liu, Ruoxi Jia, Shuohang Wang, Yu Cheng, Zhe Gan
651	DrNAS: Dirichlet Neural Architecture Search	This paper proposes a novel differentiable architecture search method by formulating it into a distribution learning problem. We treat the continuously relaxed architecture mixing weight as random variables, modeled by Dirichlet distribution. With recently developed pathwise derivatives, the Dirichlet parameters can be easily optimized with gradient-based optimizer in an end-to-end manner. This formulation improves the generalization ability and induces stochasticity that naturally encourages exploration in the search space....	Cho-Jui Hsieh, Minhao Cheng, Ruochen Wang, Xiangning Chen, Xiaocheng Tang
652	Drop-Bottleneck: Learning Discrete Compressed Representation for Noise-Robust Exploration	We propose a novel information bottleneck (IB) method named Drop-Bottleneck, which discretely drops features that are irrelevant to the target variable. Drop-Bottleneck not only enjoys a simple and tractable compression objective but also additionally provides a deterministic compressed representation of the input variable, which is useful for inference tasks that require consistent representation. Moreover, it can jointly learn a feature extractor and select features considering each feature dimension's relevance to the target task,...	Dongyeon Woo, Gunhee Kim, Jaekyeom Kim, Minjung Kim
653	Monte-Carlo Planning and Learning with Language Action Value Estimates	Interactive Fiction (IF) games provide a useful testbed for language-based reinforcement learning agents, posing significant challenges of natural language understanding, commonsense reasoning, and non-myopic planning in the combinatorial search space. Agents based on standard planning algorithms struggle to play IF games due to the massive search space of language actions. Thus, language-grounded planning is a key ability of such agents, since inferring the consequence of language action based on semantic understanding can drastically...	Jongmin Lee, Kee-Eung Kim, Seokin Seo, Youngsoo Jang
654	Robust early-learning: Hindering the memorization of noisy labels	The memorization effects of deep networks show that they will first memorize training data with clean labels and then those with noisy labels. The early stopping method therefore can be exploited for learning with noisy labels. However, the side effect brought by noisy labels will influence the memorization of clean labels before early stopping. In this paper, motivated by the lottery ticket hypothesis which shows that only partial parameters are important for generalization, we find that only partial...	Bo Han, Chen Gong, Nannan Wang, Tongliang Liu, Xiaobo Xia, Yi Chang, Zongyuan Ge
655	Identifying Physical Law of Hamiltonian Systems via Meta-Learning	Hamiltonian mechanics is an effective tool to represent many physical processes with concise yet well-generalized mathematical expressions. A well-modeled Hamiltonian makes it easy for researchers to analyze and forecast many related phenomena that are governed by the same physical law. However, in general, identifying a functional or shared expression of the Hamiltonian is very difficult. It requires carefully designed experiments and the researcher's insight that comes from years of experience. We propose that meta-learning...	Haesang Yang, Seungjun Lee, Woojae Seong
656	Reweighting Augmented Samples by Minimizing the Maximal Expected Loss	Data augmentation is an effective technique to improve the generalization of deep neural networks. However, previous data augmentation methods usually treat the augmented samples equally without considering their individual impacts on the model. To address this, for the augmented samples from the same training example, we propose to assign different weights to them. We construct the maximal expected loss which is the supremum over any reweighted loss on augmented samples. Inspired by adversarial training, we minimize this maximal...	Lifeng Shang, Lu Hou, Mingyang Yi, Qun Liu, Xin Jiang, Zhi-Ming Ma
657	Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning	Encoder layer fusion (EncoderFusion) is a technique to fuse all the encoder layers (instead of the uppermost layer) for sequence-to-sequence (Seq2Seq) models, which has proven effective on various NLP tasks. However, it is still not entirely clear why and when EncoderFusion should work. In this paper, our main contribution is to take a step further in understanding EncoderFusion. Many of previous studies believe that the success of EncoderFusion comes from exploiting surface and syntactic information embedded in lower encoder layers....	Derek F. Wong, Liang Ding, Lidia S. Chao, Longyue Wang, Xuebo Liu, Zhaopeng Tu
658	Spatial Dependency Networks: Neural Layers for Improved Generative Image Modeling	How to improve generative modeling by better exploiting spatial regularities and coherence in images? We introduce a novel neural network for building image generators (decoders) and apply it to variational autoencoders (VAEs). In our spatial dependency networks (SDNs), feature maps at each level of a deep neural net are computed in a spatially coherent way, using a sequential gating-based mechanism that distributes contextual information across 2-D space. We show that augmenting the decoder of a hierarchical VAE by spatial dependency...	Aleksandar Stanic, Joachim M. Buhmann, Jürgen Schmidhuber, Stefan Bauer, Đorđe Miladinovic
659	Deep Repulsive Clustering of Ordered Data Based on Order-Identity Decomposition	We propose the deep repulsive clustering (DRC) algorithm of ordered data for effective order learning. First, we develop the order-identity decomposition (ORID) network to divide the information of an object instance into an order-related feature and an identity feature. Then, we group object instances into clusters according to their identity features using a repulsive term. Moreover, we estimate the rank of a test instance, by comparing it with references within the same cluster. Experimental results on facial age estimation,...	Chang-Su Kim, Seon-Ho Lee
660	Revisiting Locally Supervised Learning: an Alternative to End-to-end Training	Due to the need to store the intermediate activations for back-propagation, end-to-end (E2E) training of deep networks usually suffers from high GPUs memory footprint. This paper aims to address this problem by revisiting the locally supervised learning, where a network is split into gradient-isolated modules and trained with local supervision. We experimentally show that simply training local modules with E2E loss tends to collapse task-relevant information at early layers, and hence hurts the performance of the full model. To avoid...	Gao Huang, Le Yang, Shiji Song, Yulin Wang, Zanlin Ni
661	Nonseparable Symplectic Neural Networks	Predicting the behaviors of Hamiltonian systems has been drawing increasing attention in scientific machine learning. However, the vast majority of the literature was focused on predicting separable Hamiltonian systems with their kinematic and potential energy terms being explicitly decoupled, while building data-driven paradigms to predict nonseparable Hamiltonian systems that are ubiquitous in fluid dynamics and quantum mechanics were rarely explored. The main computational challenge lies in the effective embedding of symplectic...	Bo Zhu, Cheng Yang, Shiying Xiong, Shuqi Yang, Xingzhe He, Yunjin Tong
662	Gradient Origin Networks	This paper proposes a new type of generative model that is able to quickly learn a latent representation without an encoder. This is achieved using empirical Bayes to calculate the expectation of the posterior, which is implemented by initialising a latent vector with zeros, then using the gradient of the log-likelihood of the data with respect to this zero vector as new latent points. The approach has similar characteristics to autoencoders, but with a simpler architecture, and is demonstrated in a variational autoencoder equivalent...	Chris G. Willcocks, Sam Bond-Taylor
663	Learning to Sample with Local and Global Contexts in Experience Replay Buffer	Experience replay, which enables the agents to remember and reuse experience from the past, has played a significant role in the success of off-policy reinforcement learning (RL). To utilize the experience replay efficiently, the existing sampling methods allow selecting out more meaningful experiences by imposing priorities on them based on certain metrics (e.g. TD-error). However, they may result in sampling highly biased, redundant transitions since they compute the sampling rate for each transition independently, without...	Eunho Yang, Jinwoo Shin, Kimin Lee, Sung Ju Hwang, Youngmin Oh
664	Provable Rich Observation Reinforcement Learning with Combinatorial Latent States	We propose a novel setting for reinforcement learning that combines two common real-world difficulties: presence of observations (such as camera images) and factored states (such as location of objects). In our setting, the agent receives observations generated stochastically from a "latent" factored state. These observations are "rich enough" to enable decoding of the latent state and remove partial observability concerns. Since the latent state is combinatorial, the size of state space is exponential in the number of latent factors....	Chi Jin, Dipendra Misra, John Langford, Qinghua Liu
665	Sharper Generalization Bounds for Learning with Gradient-dominated Objective Functions	Stochastic optimization has become the workhorse behind many successful machine learning applications, which motivates a lot of theoretical analysis to understand its empirical behavior. As a comparison, there is far less work to study the generalization behavior especially in a non-convex learning setting. In this paper, we study the generalization behavior of stochastic optimization by leveraging the algorithmic stability for learning with $\beta$-gradient-dominated objective functions. We develop generalization bounds of the order...	Yiming Ying, Yunwen Lei
666	Rapid Neural Architecture Search by Learning to Generate Graphs from Datasets	Despite the success of recent Neural Architecture Search (NAS) methods on various tasks which have shown to output networks that largely outperform human-designed networks, conventional NAS methods have mostly tackled the optimization of searching for the network architecture for a single task (dataset), which does not generalize well across multiple tasks (datasets). Moreover, since such task-specific methods search for a neural architecture from scratch for every given task, they incur a large computational cost, which is problematic...	Eunyoung Hyung, Hayeon Lee, Sung Ju Hwang
667	Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models	Multimodal learning for generative models often refers to the learning of abstract concepts from the commonality of information in multiple modalities, such as vision and language. While it has proven effective for learning generalisable representations, the training of such models often requires a large amount of related multimodal data that shares commonality, which can be expensive to come by. To mitigate this, we develop a novel contrastive framework for generative model learning, allowing us to train the model not just by the...	Brooks Paige, N. Siddharth, Philip H. S. Torr, Yuge Shi
668	FedMix: Approximation of Mixup under Mean Augmented Federated Learning	Federated learning (FL) allows edge devices to collectively learn a model without directly sharing data within each device, thus preserving privacy and eliminating the need to store data globally. While there are promising results under the assumption of independent and identically distributed (iid) local data, current state-of-the-art algorithms suffer a performance degradation as the heterogeneity of local data across clients increases. To resolve this issue, we propose a simple framework, Mean Augmented Federated Learning...	Eunho Yang, Sumin Shin, Sung Ju Hwang, Tehrim Yoon
669	Generalized Variational Continual Learning	Continual learning deals with training models on new tasks and datasets in an online fashion. One strand of research has used probabilistic regularization for continual learning, with two of the main approaches in this vein being Online Elastic Weight Consolidation (Online EWC) and Variational Continual Learning (VCL). VCL employs variational inference, which in other settings has been improved empirically by applying likelihood-tempering. We show that applying this modification to VCL recovers Online EWC as a limiting case, allowing...	Noel Loo, Richard E. Turner, Siddharth Swaroop
670	Understanding and Improving Lexical Choice in Non-Autoregressive Translation	Knowledge distillation (KD) is essential for training non-autoregressive translation (NAT) models by reducing the complexity of the raw data with an autoregressive teacher model. In this study, we empirically show that as a side effect of this training, the lexical choice errors on low-frequency words are propagated to the NAT model from the teacher model. To alleviate this problem, we propose to expose the raw data to NAT models to restore the useful information of low-frequency words, which are missed in the distilled data. To this...	Dacheng Tao, Derek F. Wong, Liang Ding, Longyue Wang, Xuebo Liu, Zhaopeng Tu
671	Bayesian Context Aggregation for Neural Processes	Formulating scalable probabilistic regression models with reliable uncertainty estimates has been a long-standing challenge in machine learning research. Recently, casting probabilistic regression as a multi-task learning problem in terms of conditional latent variable (CLV) models such as the Neural Process (NP) has shown promising results. In this paper, we focus on context aggregation, a central component of such architectures, which fuses information from multiple context data points. So far, this aggregation operation has been...	Christian Daniel, Fabian Flürenbrock, Gerhard Neumann, Lukas Großberger, Michael Volpp
672	Variational Intrinsic Control Revisited	In this paper, we revisit variational intrinsic control (VIC), an unsupervised reinforcement learning method for finding the largest set of intrinsic options available to an agent. In the original work by Gregor et al. (2016), two VIC algorithms were proposed: one that represents the options explicitly, and the other that does it implicitly. We show that the intrinsic reward used in the latter is subject to bias in stochastic environments, causing convergence to suboptimal solutions. To correct this behavior, we propose two methods...	Taehwan Kwon
673	Implicit Gradient Regularization	Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization. We find that the discrete steps of gradient descent implicitly regularize models by penalizing gradient descent trajectories that have large loss gradients. We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization. We confirm empirically that implicit gradient regularization biases gradient descent toward flat minima, where test...	Benoit Dherin, David G. T. Barrett
674	Return-Based Contrastive Representation Learning for Reinforcement Learning	Recently, various auxiliary tasks have been proposed to accelerate representation learning and improve sample efficiency in deep reinforcement learning (RL). However, existing auxiliary tasks do not take the characteristics of RL problems into consideration and are unsupervised. By leveraging returns, the most important feedback signals in RL, we propose a novel auxiliary task that forces the learnt representations to discriminate state-action pairs with different returns. Our auxiliary loss is theoretically justified to learn...	Chuheng Zhang, Guoqing Liu, Jian Li, Jinhua Zhu, Li Zhao, Nenghai Yu, Tao Qin, Tie-Yan Liu
675	Scalable Bayesian Inverse Reinforcement Learning	Bayesian inference over the reward presents an ideal solution to the ill-posed nature of the inverse reinforcement learning problem. Unfortunately current methods generally do not scale well beyond the small tabular setting due to the need for an inner-loop MDP solver, and even non-Bayesian methods that do themselves scale often require extensive interaction with the environment to perform well, being inappropriate for high stakes or costly applications such as healthcare. In this paper we introduce our method, Approximate Variational...	Alex James Chan, Mihaela van der Schaar
676	Exemplary Natural Images Explain CNN Activations Better than State-of-the-Art Feature Visualization	Feature visualizations such as synthetic maximally activating images are a widely used explanation method to better understand the information processing of convolutional neural networks (CNNs). At the same time, there are concerns that these visualizations might not accurately represent CNNs' inner workings. Here, we measure how much extremely activating images help humans to predict CNN activations. Using a well-controlled psychophysical paradigm, we compare the informativeness of synthetic images by Olah et al. (2017) with a simple...	Judith Schepers, Judy Borowski, Matthias Bethge, Robert Geirhos, Roland Simon Zimmermann, Thomas S. A. Wallis, Wieland Brendel
677	LiftPool: Bidirectional ConvNet Pooling	Pooling is a critical operation in convolutional neural networks for increasing receptive fields and improving robustness to input variations. Most existing pooling operations downsample the feature maps, which is a lossy process. Moreover, they are not invertible: upsampling a downscaled feature map can not recover the lost information in the downsampling. By adopting the philosophy of the classical Lifting Scheme from signal processing, we propose LiftPool for bidirectional pooling layers, including LiftDownPool and LiftUpPool....	Cees G. M. Snoek, Jiaojiao Zhao
678	Adversarial score matching and improved sampling for image generation	Denoising Score Matching with Annealed Langevin Sampling (DSM-ALS) has recently found success in generative modeling. The approach works by first training a neural network to estimate the score of a distribution, and then using Langevin dynamics to sample from the data distribution assumed by the score network. Despite the convincing visual quality of samples, this method appears to perform worse than Generative Adversarial Networks (GANs) under the Fréchet Inception Distance, a standard metric for generative models. We show that this...	Alexia Jolicoeur-Martineau, Ioannis Mitliagkas, Remi Tachet des Combes, Rémi Piché-Taillefer
679	Transient Non-stationarity and Generalisation in Deep Reinforcement Learning	Non-stationarity can arise in Reinforcement Learning (RL) even in stationary environments. For example, most RL algorithms collect new data throughout training, using a non-stationary behaviour policy. Due to the transience of this non-stationarity, it is often not explicitly addressed in deep RL and a single neural network is continually updated. However, we find evidence that neural networks exhibit a memory effect, where these transient non-stationarities can permanently impact the latent representation and adversely affect...	Gregory Farquhar, Jelena Luketina, Maximilian Igl, Shimon Whiteson, Wendelin Boehmer
680	On the Origin of Implicit Regularization in Stochastic Gradient Descent	For infinitesimal learning rates, stochastic gradient descent (SGD) follows the path of gradient flow on the full batch loss function. However moderately large learning rates can achieve higher test accuracies, and this generalization benefit is not explained by convergence bounds, since the learning rate which maximizes test accuracy is often larger than the learning rate which minimizes training loss. To interpret this phenomenon we prove that for SGD with random shuffling, the mean SGD iterate also stays close to the path of...	Benoit Dherin, David G. T. Barrett, Samuel L. Smith, Soham De
681	Analyzing the Expressive Power of Graph Neural Networks in a Spectral Perspective	In the recent literature of Graph Neural Networks (GNN), the expressive power of models has been studied through their capability to distinguish if two given graphs are isomorphic or not. Since the graph isomorphism problem is NP-intermediate, and Weisfeiler-Lehman (WL) test can give sufficient but not enough evidence in polynomial time, the theoretical power of GNNs is usually evaluated by the equivalence of WL-test order, followed by an empirical analysis of the models on some reference inductive and transductive datasets. However,...	Benoit Gaüzère, Guillaume Renton, Muhammet Balcilar, Paul Honeine, Pierre Héroux, Sébastien Adam
682	Pruning Neural Networks at Initialization: Why Are We Missing the Mark?	Recent work has explored the possibility of pruning neural networks at initialization. We assess proposals for doing so: SNIP (Lee et al., 2019), GraSP (Wang et al., 2020), SynFlow (Tanaka et al., 2020), and magnitude pruning. Although these methods surpass the trivial baseline of random pruning, they remain below the accuracy of magnitude pruning after training, and we endeavor to understand why. We show that, unlike pruning after training, randomly shuffling the weights these methods prune within each layer or sampling new initial...	Daniel M. Roy, Gintare Karolina Dziugaite, Jonathan Frankle, Michael Carbin
683	SkipW: Resource Adaptable RNN with Strict Upper Computational Limit	We introduce Skip-Window, a method to allow recurrent neural networks (RNNs) to trade off accuracy for computational cost during the analysis of a sequence. Similarly to existing approaches, Skip-Window extends existing RNN cells by adding a mechanism to encourage the model to process fewer inputs. Unlike existing approaches, Skip-Window is able to respect a strict computational budget, making this model more suitable for limited hardware. We evaluate this approach on two datasets: a human activity recognition task and adding task. Our...	Anne Lambert, François Schnitzler, Françoise Le Bolzer, Pascal Leguyadec, Tsiry Mayet
684	Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds	Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos. Prior audio-visual separation work assumed artificial limitations on the domain of sound classes (e.g., to speech or music), constrained the...	Aren Jansen, Dan Ellis, Efthymios Tzinis, John R. Hershey, Scott Wisdom, Shawn Hershey, Tal Remez
685	Simple Augmentation Goes a Long Way: ADRL for DNN Quantization	Mixed precision quantization improves DNN performance by assigning different layers with different bit-width values. Searching for the optimal bit-width for each layer, however, remains a challenge. Deep Reinforcement Learning (DRL) shows some recent promise. It however suffers instability due to function approximation errors, causing large variances in the early training stages, slow convergence, and suboptimal policies in the mixed-precision quantization problem. This paper proposes augmented DRL (ADRL) as a way to alleviate these...	Guoyang Chen, Lin Ning, Weifeng Zhang, Xipeng Shen
686	Few-Shot Bayesian Optimization with Deep Kernel Surrogates	Hyperparameter optimization (HPO) is a central pillar in the automation of machine learning solutions and is mainly performed via Bayesian optimization, where a parametric surrogate is learned to approximate the black box response function (e.g. validation error). Unfortunately, evaluating the response function is computationally intensive. As a remedy, earlier work emphasizes the need for transfer learning surrogates which learn to optimize hyperparameters for an algorithm from other tasks. In contrast to previous work, we propose to...	Josif Grabocka, Martin Wistuba
687	AdaSpeech: Adaptive Text to Speech for Custom Voice	Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims to adapt a source TTS model to synthesize personal voice for a target speaker using few speech from her/him. Custom voice presents two unique challenges for TTS adaptation: 1) to support diverse customers, the adaptation model needs to handle diverse acoustic conditions which could be very different from source speech data, and 2) to support a large number of customers, the adaptation parameters need to be small enough for each target speaker to...	Bohan Li, Mingjian Chen, Sheng Zhao, Tao Qin, Tie-Yan Liu, Xu Tan, Yanqing Liu
688	HeteroFL: Computation and Communication Efficient Federated Learning for Heterogeneous Clients	Federated Learning (FL) is a method of training machine learning models on private data distributed over a large number of possibly heterogeneous clients such as mobile phones and IoT devices. In this work, we propose a new federated learning framework named HeteroFL to address heterogeneous clients equipped with very different computation and communication capabilities. Our solution can enable the training of heterogeneous local models with varying computation complexities and still produce a single global inference model. For the...	Enmao Diao, Jie Ding, Vahid Tarokh
689	DINO: A Conditional Energy-Based GAN for Domain Translation	Domain translation is the process of transforming data from one domain to another while preserving the common semantics. Some of the most popular domain translation systems are based on conditional generative adversarial networks, which use source domain data to drive the generator and as an input to the discriminator. However, this approach does not enforce the preservation of shared semantics since the conditional input can often be ignored by the discriminator. We propose an alternative method for conditioning and present a new...	Konstantinos Vougioukas, Maja Pantic, Stavros Petridis
690	Universal Weakly Supervised Segmentation by Pixel-to-Segment Contrastive Learning	Weakly supervised segmentation requires assigning a label to every pixel based on training instances with partial annotations such as image-level tags, object bounding boxes, labeled points and scribbles. This task is challenging, as coarse annotations (tags, boxes) lack precise pixel localization whereas sparse annotations (points, scribbles) lack broad region coverage. Existing methods tackle these two types of weak supervision differently: Class activation maps are used to localize coarse labels and iteratively refine the...	Jyh-Jing Hwang, Stella X. Yu, Tsung-Wei Ke
691	PC2WF: 3D Wireframe Reconstruction from Raw Point Clouds	We introduce PC2WF, the first end-to-end trainable deep network architecture to convert a 3D point cloud into a wireframe model. The network takes as input an unordered set of 3D points sampled from the surface of some object, and outputs a wireframe of that object, i.e., a sparse set of corner points linked by line segments. Recovering the wireframe is a challenging task, where the numbers of both vertices and edges are different for every instance, and a-priori unknown. Our architecture gradually builds up the model: It starts by...	Jan Dirk Wegner, Konrad Schindler, Stefano D'Aronco, Yujia Liu
692	Multi-resolution modeling of a discrete stochastic process identifies causes of cancer	Detection of cancer-causing mutations within the vast and mostly unexplored human genome is a major challenge. Doing so requires modeling the background mutation rate, a highly non-stationary stochastic process, across regions of interest varying in size from one to millions of positions. Here, we present the split-Poisson-Gamma (SPG) distribution, an extension of the classical Poisson-Gamma formulation, to model a discrete stochastic process at multiple resolutions. We demonstrate that the probability model has a closed-form...	Adam Uri Yaari, Andrei Barbu, Bonnie Berger, Boris Katz, Maxwell Sherman, Oliver Clarke Priebe, Po-Ru Loh
693	C-Learning: Horizon-Aware Cumulative Accessibility Estimation	Multi-goal reaching is an important problem in reinforcement learning needed to achieve algorithmic generalization. Despite recent advances in this field, current algorithms suffer from three major challenges: high sample complexity, learning only a single way of reaching the goals, and difficulties in solving complex motion planning tasks. In order to address these limitations, we introduce the concept of cumulative accessibility functions, which measure the reachability of a goal from a given state within a specified horizon. We show...	Animesh Garg, Anthony L. Caterini, Gabriel Loaiza-Ganem, Harry J. Braviner, Jesse C. Cresswell, Panteha Naderian, Tong Li
694	Shapley Explanation Networks	Shapley values have become one of the most popular feature attribution explanation methods. However, most prior work has focused on post-hoc Shapley explanations, which can be computationally demanding due to its exponential time complexity and preclude model regularization based on Shapley explanations during training. Thus, we propose to incorporate Shapley values themselves as latent representations in deep models, thereby making Shapley explanations first-class citizens in the modeling paradigm. This intrinsic explanation approach...	David I. Inouye, Rui Wang, Xiaoqian Wang
695	The role of Disentanglement in Generalisation	Combinatorial generalisation — the ability to understand and produce novel combinations of familiar elements — is a core capacity of human intelligence that current AI systems struggle with. Recently, it has been suggested that learning disentangled representations may help address this problem. It is claimed that such representations should be able to capture the compositional structure of the world which can then be combined to support combinatorial generalisation. In this study, we systematically tested how the degree of...	Casimir J. H. Ludwig, Gaurav Malhotra, Jeffrey S. Bowers, Milton Llera Montero, Rui Ponte Costa
696	Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch	Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate the models on resource-constrained environments. It can be generally categorized into unstructured fine-grained sparsity that zeroes out multiple individual weights distributed across the neural network, and structured coarse-grained sparsity which prunes blocks of sub-networks of a neural network. Fine-grained sparsity can achieve a high compression ratio but is not hardware friendly and hence receives limited speed gains. On the other hand,...	Aojun Zhou, Hongsheng Li, Jianbo Liu, Junnan Zhu, Kun Yuan, Wenxiu Sun, Yukun Ma, Zhijie Zhang
697	Unsupervised Meta-Learning through Latent-Space Interpolation in Generative Models	Several recently proposed unsupervised meta-learning approaches rely on synthetic meta-tasks created using techniques such as random selection, clustering and/or augmentation. In this work, we describe a novel approach that generates meta-tasks using generative models. The proposed family of algorithms generate pairs of in-class and out-of-class samples from the latent space in a principled way, allowing us to create synthetic classes forming the training and validation data of a meta-task. We find that the proposed approach, LAtent...	Bill Lin, Ladislau Bölöni, Saeed Vahidian, Sharare Zehtabian, Siavash Khodadadeh, Weijia Wang
698	On Data-Augmentation and Consistency-Based Semi-Supervised Learning	Recently proposed consistency-based Semi-Supervised Learning (SSL) methods such as the Pi-model, temporal ensembling, the mean teacher, or the virtual adversarial training, achieve the state of the art results in several SSL tasks. These methods can typically reach performances that are comparable to their fully supervised counterparts while using only a fraction of labelled examples. Despite these methodological advances, the understanding of these methods is still relatively limited. To make progress, we analyse (variations of) the...	Alexandre H. Thiéry, Atin Ghosh
699	Learning from Demonstration with Weakly Supervised Disentanglement	Robotic manipulation tasks, such as wiping with a soft sponge, require control from multiple rich sensory modalities. Human-robot interaction, aimed at teaching robots, is difficult in this setting as there is potential for mismatch between human and machine comprehension of the rich data streams. We treat the task of interpretable learning from demonstration as an optimisation problem over a probabilistic generative model. To account for the high-dimensionality of the data, a high-capacity neural network is chosen to represent the...	Subramanian Ramamoorthy, Yordan Hristov
700	Neurally Augmented ALISTA	It is well-established that many iterative sparse reconstruction algorithms can be unrolled to yield a learnable neural network for improved empirical performance. A prime example is learned ISTA (LISTA) where weights, step sizes and thresholds are learned from training data. Recently, Analytic LISTA (ALISTA) has been introduced, combining the strong empirical performance of a fully learned approach like LISTA, while retaining theoretical guarantees of classical compressed sensing algorithms and significantly reducing the number of...	Freya Behrens, Jonathan Sauder, Peter Jung
701	Shape or Texture: Understanding Discriminative Features in CNNs	Contrasting the previous evidence that neurons in the later layers of a Convolutional Neural Network (CNN) respond to complex object shapes, recent studies have shown that CNNs actually exhibit a 'texture bias': given an image with both texture and shape cues (e.g., a stylized image), a CNN is biased towards predicting the category corresponding to the texture. However, these previous studies conduct experiments on the final classification output of the network, and fail to robustly evaluate the bias contained (i) in the latent...	Björn Ommer, Konstantinos G. Derpanis, Matthew Kowal, Md. Amirul Islam, Neil D. B. Bruce, Patrick Esser, Sen Jia
702	Convex Potential Flows: Universal Probability Distributions with Optimal Transport and Convex Optimization	Flow-based models are powerful tools for designing probabilistic models with tractable density. This paper introduces Convex Potential Flows (CP-Flow), a natural and efficient parameterization of invertible models inspired by the optimal transport (OT) theory. CP-Flows are the gradient map of a strongly convex neural potential function. The convexity implies invertibility and allows us to resort to convex optimization to solve the convex conjugate for efficient inversion. To enable maximum likelihood training, we derive a new gradient...	Aaron C. Courville, Chin-Wei Huang, Christos Tsirigotis, Ricky T. Q. Chen
703	Wasserstein Embedding for Graph Learning	We present Wasserstein Embedding for Graph Learning (WEGL), a novel and fast framework for embedding entire graphs in a vector space, in which various machine learning models are applicable for graph-level prediction tasks. We leverage new insights on defining similarity between graphs as a function of the similarity between their node embedding distributions. Specifically, we use the Wasserstein distance to measure the dissimilarity between node embeddings of different graphs. Unlike prior work, we avoid pairwise calculation of...	Gustavo K. Rohde, Heiko Hoffmann, Navid Naderializadeh, Soheil Kolouri
704	Meta-learning with negative learning rates	Deep learning models require a large amount of data to perform well. When data is scarce for a target task, we can transfer the knowledge gained by training on similar tasks to quickly learn the target. A successful approach is meta-learning, or "learning to learn" a distribution of tasks, where "learning" is represented by an outer loop, and "to learn" by an inner loop of gradient descent. However, a number of recent empirical studies argue that the inner loop is unnecessary and more simple models work equally well or even better. We...	Alberto Bernacchia
705	Representing Partial Programs with Blended Abstract Semantics	Synthesizing programs from examples requires searching over a vast, combinatorial space of possible programs. In this search process, a key challenge is representing the behavior of a partially written program before it can be executed, to judge if it is on the right track and predict where to search next. We introduce a general technique for representing partially written programs in a program synthesis engine. We take inspiration from the technique of abstract interpretation, in which an approximate execution model is used to...	Armando Solar-Lezama, Jacob Andreas, Joshua B. Tenenbaum, Matthew Bowers, Maxwell I. Nye, Yewen Pu
706	Fast convergence of stochastic subgradient method under interpolation	This paper studies the behaviour of the stochastic subgradient descent (SSGD) method applied to over-parameterized nonsmooth optimization problems that satisfy an interpolation condition. By leveraging the composite structure of the empirical risk minimization problems, we prove that SSGD converges, respectively, with rates $O(1/\epsilon)$ and $O(\log(1/\epsilon))$ for convex and strongly-convex objectives when interpolation holds. These rates coincide with established rates for the stochastic gradient descent (SGD) method applied to...	Huang Fang, Michael P. Friedlander, Zhenan Fan
707	A Hypergradient Approach to Robust Regression without Correspondence	We consider a regression problem, where the correspondence between the input and output data is not available. Such shuffled data are commonly observed in many real world problems. Take flow cytometry as an example: the measuring instruments are unable to preserve the correspondence between the samples and the measurements. Due to the combinatorial nature of the problem, most of the existing methods are only applicable when the sample size is small, and are limited to linear regression models. To overcome such bottlenecks, we propose a...	Hongteng Xu, Hongyuan Zha, Simiao Zuo, Tuo Zhao, Xiaojing Ye, Yixiu Mao, Yujia Xie
708	On the role of planning in model-based deep reinforcement learning	Model-based planning is often thought to be necessary for deep, careful reasoning and generalization in artificial agents. While recent successes of model-based reinforcement learning (MBRL) with deep function approximation have strengthened this hypothesis, the resulting diversity of model-based methods has also made it difficult to track which components drive success and why. In this paper, we seek to disentangle the contributions of recent methods by focusing on three questions: (1) How does planning benefit MBRL agents? (2) Within...	Abram L. Friesen, Arthur Guez, Fabio Viola, Feryal M. P. Behbahani, Jessica B. Hamrick, Lars Holger Buesing, Petar Velickovic, Sims Witherspoon, Theophane Weber, Thomas Anthony
709	Trajectory Prediction using Equivariant Continuous Convolution	Trajectory prediction is a critical part of many AI applications, for example, the safe operation of autonomous vehicles. However, current methods are prone to making inconsistent and physically unrealistic predictions. We leverage insights from fluid dynamics to overcome this limitation by considering internal symmetry in real-world trajectories. We propose a novel model, Equivariant Continuous COnvolution (ECCO) for improved trajectory prediction. ECCO uses rotationally-equivariant continuous convolutions to embed the symmetries of...	Jinxi Li, Robin Walters, Rose Yu
710	Grounding Language to Autonomously-Acquired Skills via Goal Generation	We are interested in the autonomous acquisition of repertoires of skills. Language-conditioned reinforcement learning (LC-RL) approaches are great tools in this quest, as they allow to express abstract goals as sets of constraints on the states. However, most LC-RL agents are not autonomous and cannot learn without external instructions and feedback. Besides, their direct language condition cannot account for the goal-directed behavior of pre-verbal infants and strongly limits the expression of behavioral diversity for a given language...	Ahmed Akakzia, Cédric Colas, Mohamed Chetouani, Olivier Sigaud, Pierre-Yves Oudeyer
711	Chaos of Learning Beyond Zero-sum and Coordination via Game Decompositions	It is of primary interest for ML to understand how agents learn and interact dynamically in competitive environments and games (e.g. GANs). But this has been a difficult task, as irregular behaviors are commonly observed in such systems. This can be explained theoretically, for instance, by the works of Cheung and Piliouras (COLT 2019; NeurIPS 2020), which showed that in two-person zero-sum games, if agents employ one of the most well-known learning algorithms, Multiplicative Weights Update (MWU), then Lyapunov chaos occurs everywhere...	Yixin Tao, Yun Kuen Cheung
712	Isometric Transformation Invariant and Equivariant Graph Convolutional Networks	Graphs are one of the most important data structures for representing pairwise relations between objects. Specifically, a graph embedded in a Euclidean space is essential to solving real problems, such as physical simulations. A crucial requirement for applying graphs in Euclidean spaces to physical simulations is learning and inferring the isometric transformation invariant and equivariant features in a computationally efficient manner. In this paper, we propose a set of transformation invariant and equivariant models based on graph...	Masanobu Horie, Naoki Morita, Naoto Mitsume, Toshiaki Hishinuma, Yu Ihara
713	R-GAP: Recursive Gradient Attack on Privacy	Federated learning frameworks have been regarded as a promising approach to break the dilemma between demands on privacy and the promise of learning from large collections of distributed data. Many such frameworks only ask collaborators to share their local update of a common model, i.e. gradients with respect to locally stored data, instead of exposing their raw data to other collaborators. However, recent optimization-based gradient attacks show that raw data can often be accurately recovered from gradients. It has been shown that...	Junyi Zhu, Matthew B. Blaschko
714	Multi-Level Local SGD: Distributed SGD for Heterogeneous Hierarchical Networks	We propose Multi-Level Local SGD, a distributed stochastic gradient method for learning a smooth, non-convex objective in a multi-level communication network with heterogeneous workers. Our network model consists of a set of disjoint sub-networks, with a single hub and multiple workers; further, workers may have different operating rates. The hubs exchange information with one another via a connected, but not necessarily complete communication network. In our algorithm, sub-networks execute a distributed SGD algorithm, using a...	Anirban Das, Stacy Patterson, Timothy Castiglia
715	GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding	Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. In this paper we demonstrate conditional computation as a remedy to the above mentioned impediments, and demonstrate its efficacy and utility....	Dehao Chen, Dmitry Lepikhin, HyoukJoong Lee, Maxim Krikun, Noam Shazeer, Orhan Firat, Yanping Huang, Yuanzhong Xu, Zhifeng Chen
716	Representation learning for improved interpretability and classification accuracy of clinical factors from EEG	Despite extensive standardization, diagnostic interviews for mental health disorders encompass substantial subjective judgment. Previous studies have demonstrated that EEG-based neural measures can function as reliable objective correlates of depression, or even predictors of depression and its course. However, their clinical utility has not been fully realized because of 1) the lack of automated ways to deal with the inherent noise associated with EEG data at scale, and 2) the lack of knowledge of which aspects of the EEG signal may...	Garrett Honke, Greg Hajcak, Irina Higgins, Julia Klawohn, Katie Link, Nina Thigpen, Pramod Gupta, Sunny Duan, Vladimir Miskovic
717	Multiplicative Filter Networks	Although deep networks are typically used to approximate functions over high dimensional inputs, recent work has increased interest in neural networks as function approximators for low-dimensional-but-complex functions, such as representing images as a function of pixel coordinates, solving differential equations, or representing signed distance fields or neural radiance fields. Key to these recent successes has been the use of new elements such as sinusoidal nonlinearities, or Fourier features in positional encodings, which vastly...	Anit Kumar Sahu, Devin Willmott, J. Zico Kolter, Rizal Fathony
718	Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks	Neural networks (NNs) whose subnetworks implement reusable functions are expected to offer numerous advantages, including compositionality through efficient recombination of functional building blocks, interpretability, preventing catastrophic interference, etc. Understanding if and how NNs are modular could provide insights into how to improve them. Current inspection methods, however, fail to link modules to their functionality. In this paper, we present a novel method based on learning binary weight masks to identify individual...	Jürgen Schmidhuber, Róbert Csordás, Sjoerd van Steenkiste
719	Modeling the Second Player in Distributionally Robust Optimization	Distributionally robust optimization (DRO) provides a framework for training machine learning models that are able to perform well on a collection of related data distributions (the "uncertainty set"). This is done by solving a min-max game: the model is trained to minimize its maximum expected loss among all distributions in the uncertainty set. While careful design of the uncertainty set is critical to the success of the DRO procedure, previous work has been limited to relatively simple alternatives that keep the min-max optimization...	Graham Neubig, Paul Michel, Tatsunori Hashimoto
720	Private Post-GAN Boosting	Differentially private GANs have proven to be a promising approach for generating realistic synthetic data without compromising the privacy of individuals. Due to the privacy-protective noise introduced in the training, the convergence of GANs becomes even more elusive, which often leads to poor utility in the output generator at the end of training. We propose Private post-GAN boosting (Private PGB), a differentially private method that combines samples produced by the sequence of generators obtained during GAN training to create a...	Cynthia Dwork, Marcel Neunhoeffer, Steven Wu
721	Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis	In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with style transfer and speech variation. Flowtron borrows insights from Autoregressive Flows and revamps Tacotron 2 in order to provide high-quality and expressive mel-spectrogram synthesis. Flowtron is optimized by maximizing the likelihood of the training data, which makes training simple and stable. Flowtron learns an invertible mapping of data to a latent space that can be used to modulate many aspects of speech...	Bryan Catanzaro, Kevin J. Shih, Rafael Valle, Ryan Prenger
722	Learning Structural Edits via Incremental Tree Transformations	While most neural generative models generate outputs in a single pass, the human creative process is usually one of iterative building and refinement. Recent work has proposed models of editing processes, but these mostly focus on editing sequential data and/or only model a single editing pass. In this paper, we present a generic model for incremental editing of structured data (i.e. "structural edits"). Particularly, we focus on tree-structured data, taking abstract syntax trees of computer programs as our canonical example. Our...	Frank F. Xu, Graham Neubig, Huan Sun, Pengcheng Yin, Ziyu Yao
723	Sample-Efficient Automated Deep Reinforcement Learning	Despite significant progress in challenging problems across various domains, applying state-of-the-art deep reinforcement learning (RL) algorithms remains challenging due to their sensitivity to the choice of hyperparameters. This sensitivity can partly be attributed to the non-stationarity of the RL problem, potentially requiring different hyperparameter settings at various stages of the learning process. Additionally, in the RL setting, hyperparameter optimization (HPO) requires a large number of environment interactions, hindering...	André Biedenkapp, Frank Hutter, Gregor Köhler, Jörg K. H. Franke
724	Unsupervised Discovery of 3D Physical Objects from Video	We study the problem of unsupervised physical object discovery. While existing frameworks aim to decompose scenes into 2D segments based off each object's appearance, we explore how physics, especially object interactions, facilitates disentangling of 3D geometry and position of objects from video, in an unsupervised manner. Drawing inspiration from developmental psychology, our Physical Object Discovery Network (POD-Net) uses both multi-scale pixel cues and physical motion cues to accurately segment observable and partially occluded...	Jiajun Wu, Joshua B. Tenenbaum, Kevin A. Smith, Tomer D. Ullman, Yilun Du
725	Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime	We study the problem of policy optimization for infinite-horizon discounted Markov Decision Processes with softmax policy and nonlinear function approximation trained with policy gradient algorithms. We concentrate on the training dynamics in the mean-field regime, modeling e.g. the behavior of wide single hidden layer neural networks, when exploration is encouraged through entropy regularization. The dynamics of these models is established as a Wasserstein gradient flow of distributions in parameter space. We further prove global...	Andrea Agazzi, Jianfeng Lu
726	Extracting Strong Policies for Robotics Tasks from Zero-Order Trajectory Optimizers	Solving high-dimensional, continuous robotic tasks is a challenging optimization problem. Model-based methods that rely on zero-order optimizers like the cross-entropy method (CEM) have so far shown strong performance and are considered state-of-the-art in the model-based reinforcement learning community. However, this success comes at the cost of high computational complexity, being therefore not suitable for real-time control. In this paper, we propose a technique to jointly optimize the trajectory and distill a policy, which is...	Cristina Pinneri, Georg Martius, Sebastian Blaes, Shambhuraj Sawant
727	Temporally-Extended ε-Greedy Exploration	Recent work on exploration in reinforcement learning (RL) has led to a series of increasingly complex solutions to the problem. This increase in complexity often comes at the expense of generality. Recent empirical studies suggest that, when applied to a broader set of domains, some sophisticated exploration methods are outperformed by simpler counterparts, such as ε-greedy. In this paper we propose an exploration algorithm that retains the simplicity of ε-greedy while reducing dithering. We build on a simple hypothesis: the main...	André Barreto, Georg Ostrovski, Will Dabney
728	Rapid Task-Solving in Novel Environments	We propose the challenge of rapid task-solving in novel environments (RTS), wherein an agent must solve a series of tasks as rapidly as possible in an unfamiliar environment. An effective RTS agent must balance between exploring the unfamiliar environment and solving its current task, all while building a model of the new environment over which it can plan when faced with later tasks. While modern deep RL agents exhibit some of these abilities in isolation, none are suitable for the full RTS challenge. To enable progress toward RTS, we...	Adam Santoro, David Raposo, Laurent Sartran, Matthew M. Botvinick, Ryan Faulkner, Samuel Ritter
729	Tradeoffs in Data Augmentation: An Empirical Study	Though data augmentation has become a standard component of deep neural network training, the underlying mechanism behind the effectiveness of these techniques remains poorly understood. In practice, augmentation policies are often chosen using heuristics of distribution shift or augmentation diversity. Inspired by these, we conduct an empirical study to quantify how data augmentation improves model generalization. We introduce two interpretable and easy-to-compute measures: Affinity and Diversity. We find that augmentation performance...	Ekin Dogus Cubuk, Ethan Dyer, Raphael Gontijo Lopes, Sylvia J. Smullin
730	Multiscale Score Matching for Out-of-Distribution Detection	We present a new methodology for detecting out-of-distribution (OOD) images by utilizing norms of the score estimates at multiple noise scales. A score is defined to be the gradient of the log density with respect to the input data. Our methodology is completely unsupervised and follows a straightforward training scheme. First, we train a deep network to estimate scores for $L$ levels of noise. Once trained, we calculate the noisy score estimates for $N$ in-distribution samples and take the L2-norms across the input dimensions...	Ahsan Mahmood, Junier Oliva, Martin Andreas Styner
731	Understanding Over-parameterization in Generative Adversarial Networks	A broad class of unsupervised deep learning methods such as Generative Adversarial Networks (GANs) involve training of overparameterized models where the number of parameters of the model exceeds a certain threshold. Indeed, most successful GANs used in practice are trained using overparameterized generator and discriminator networks, both in terms of depth and width. A large body of work in supervised learning has shown the importance of model overparameterization in the convergence of the gradient descent (GD) to globally optimal...	Dominik Stöger, Mahdi Soltanolkotabi, Mohammadmahdi Sajedi, Mucong Ding, Neha Mukund Kalibhat, Soheil Feizi, Yogesh Balaji
732	Go with the flow: Adaptive control for Neural ODEs	Despite their elegant formulation and lightweight memory cost, neural ordinary differential equations (NODEs) suffer from known representational limitations. In particular, the single flow learned by NODEs cannot express all homeomorphisms from a given data space to itself, and their static weight parameterization restricts the type of functions they can learn compared to discrete architectures with layer-dependent weights. Here, we describe a new module called neurally-controlled ODE (N-CODE) designed to improve the expressivity of...	Mathieu Chalvidal, Matthew Ricci, Rufin VanRullen, Thomas Serre
733	Linear Last-iterate Convergence in Constrained Saddle-point Optimization	Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative Weights Update (OMWU) for saddle-point optimization have received growing attention due to their favorable last-iterate convergence. However, their behaviors for simple bilinear games over the probability simplex are still not fully understood --- previous analysis lacks explicit convergence rates, only applies to an exponentially small learning rate, or requires additional assumptions such as the uniqueness of the optimal solution. In this work, we significantly...	Chen-Yu Wei, Chung-Wei Lee, Haipeng Luo, Mengxiao Zhang
734	Learning advanced mathematical computations from examples	Using transformers over large generated datasets, we train models to learn mathematical properties of differential systems, such as local stability, behavior at infinity and controllability. We achieve near perfect prediction of qualitative characteristics, and good approximations of numerical features of the system. This demonstrates that neural networks can learn to perform complex computations, grounded in advanced theory, from examples, without built-in mathematical knowledge.	Amaury Hayat, François Charton, Guillaume Lample
735	WaveGrad: Estimating Gradients for Waveform Generation	This paper introduces WaveGrad, a conditional model for waveform generation which estimates gradients of the data density. The model is built on prior work on score matching and diffusion probabilistic models. It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram. WaveGrad offers a natural way to trade inference speed for sample quality by adjusting the number of refinement steps, and bridges the gap between non-autoregressive and autoregressive...	Heiga Zen, Mohammad Norouzi, Nanxin Chen, Ron J. Weiss, William Chan, Yu Zhang
736	SALD: Sign Agnostic Learning with Derivatives	Learning 3D geometry directly from raw data, such as point clouds, triangle soups, or unoriented meshes is still a challenging task that feeds many downstream computer vision and graphics applications. In this paper, we introduce SALD: a method for learning implicit neural representations of shapes directly from raw data. We generalize sign agnostic learning (SAL) to include derivatives: given an unsigned distance function to the input raw data, we advocate a novel sign agnostic regression loss, incorporating both pointwise values and...	Matan Atzmon, Yaron Lipman
737	Generalized Energy Based Models	We introduce the Generalized Energy Based Model (GEBM) for generative modelling. These models combine two trained components: a base distribution (generally an implicit model), which can learn the support of data with low intrinsic dimension in a high dimensional space; and an energy function, to refine the probability mass on the learned support. Both the energy function and base jointly constitute the final model, unlike GANs, which retain only the base distribution (the "generator"). GEBMs are trained by alternating between learning...	Arthur Gretton, Liang Zhou, Michael Arbel
738	Long Range Arena : A Benchmark for Efficient Transformers	Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity. In the recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To this date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative model...	Dara Bahri, Donald Metzler, Jinfeng Rao, Liu Yang, Mostafa Dehghani, Philip Pham, Samira Abnar, Sebastian Ruder, Yi Tay, Yikang Shen
739	Beyond Categorical Label Representations for Image Classification	We find that the way we choose to represent data labels can have a profound effect on the quality of trained models. For example, training an image classifier to regress audio labels rather than traditional categorical probabilities produces a more reliable classification. This result is surprising, considering that audio labels are more complex than simpler numerical probabilities or text. We hypothesize that high dimensional, high entropy label representations are generally more useful because they provide a stronger error signal. We...	Boyuan Chen, Hod Lipson, Sunand Raghupathi, Yu Li
740	CoCo: Controllable Counterfactuals for Evaluating Dialogue State Trackers	Dialogue state trackers have made significant progress on benchmark datasets, but their generalization capability to novel and realistic scenarios beyond the held-out conversations is less understood. We propose controllable counterfactuals (COCO) to bridge this gap and evaluate dialogue state tracking (DST) models on novel scenarios, i.e., would the system successfully tackle the request if the user responded differently but still consistently with the dialogue flow? COCO leverages turn-level belief states as counterfactual...	Caiming Xiong, Jia Li, Kazuma Hashimoto, Nazneen Fatema Rajani, Semih Yavuz, Shiyang Li, Tong Niu, Xifeng Yan, Yingbo Zhou
741	Stochastic Security: Adversarial Defense Using Long-Run Dynamics of Energy-Based Models	The vulnerability of deep networks to adversarial attacks is a central problem for deep learning from the perspective of both cognition and security. The current most successful defense method is to train a classifier using adversarial images created during learning. Another defense approach involves transformation or purification of the original input to remove adversarial signals before the image is classified. We focus on defending naturally-trained classifiers using Markov Chain Monte Carlo (MCMC) sampling with an Energy-Based...	Jonathan Craig Mitchell, Mitch Hill, Song-Chun Zhu
742	X2T: Training an X-to-Text Typing Interface with Online Learning from User Feedback	We aim to help users communicate their intent to machines using flexible, adaptive interfaces that translate arbitrary user input into desired actions. In this work, we focus on assistive typing applications in which a user cannot operate a keyboard, but can instead supply other inputs, such as webcam images that capture eye gaze or neural activity measured by a brain implant. Standard methods train a model on a fixed dataset of user inputs, then deploy a static interface that does not learn from its mistakes; in part, because...	Anca D. Dragan, Glen Berseth, Jensen Gao, Karunesh Ganguly, Nicholas Hardy, Nikhilesh Natraj, Sergey Levine, Siddharth Reddy
743	Mapping the Timescale Organization of Neural Language Models	In the human brain, sequences of language input are processed within a distributed and hierarchical architecture, in which higher stages of processing encode contextual information over longer timescales. In contrast, in recurrent neural networks which perform natural language processing, we know little about how the multiple timescales of contextual information are functionally organized. Therefore, we applied tools developed in neuroscience to map the “processing timescales” of individual units within a word-level LSTM language...	Christopher J. Honey, Hsiang-Yun Sherry Chien, Jinhan Zhang
744	PDE-Driven Spatiotemporal Disentanglement	A recent line of work in the machine learning community addresses the problem of predicting high-dimensional spatiotemporal phenomena by leveraging specific tools from the differential equations theory. Following this direction, we propose in this article a novel and general paradigm for this task based on a resolution method for partial differential equations: the separation of variables. This inspiration allows us to introduce a dynamical interpretation of spatiotemporal disentanglement. It induces a principled model based on...	Jean-Yves Franceschi, Jérémie Donà, Patrick Gallinari, Sylvain Lamprier
745	OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning	Reinforcement learning (RL) has achieved impressive performance in a variety of online settings in which an agent’s ability to query the environment for transitions and rewards is effectively unlimited. However, in many practical applications, the situation is reversed: an agent may have access to large amounts of undirected offline experience data, while access to the online environment is severely limited. In this work, we focus on this offline setting. Our main insight is that, when presented with offline data composed of a variety...	Anurag Ajay, Aviral Kumar, Ofir Nachum, Pulkit Agrawal, Sergey Levine
746	Does enhanced shape bias improve neural network robustness to common corruptions?	Convolutional neural networks (CNNs) learn to extract representations of complex features, such as object shapes and textures to solve image recognition tasks. Recent work indicates that CNNs trained on ImageNet are biased towards features that encode textures and that these alone are sufficient to generalize to unseen test data from the same distribution as the training data but often fail to generalize to out-of-distribution data. It has been shown that augmenting the training data with different image styles decreases this texture...	Chaithanya Kumar Mummadi, Jan Hendrik Metzen, Julien Vitay, Ranjitha Subramaniam, Robin Hutmacher, Volker Fischer
747	Directed Acyclic Graph Neural Networks	Graph-structured data ubiquitously appears in science and engineering. Graph neural networks (GNNs) are designed to exploit the relational inductive bias exhibited in graphs; they have been shown to outperform other forms of neural networks in scenarios where structure information supplements node features. The most common GNN architecture aggregates information from neighborhoods based on message passing. Its generality has made it broadly applicable. In this paper, we focus on a special, yet widely used, type of graphs---DAGs---and...	Jie Chen, Veronika Thost
748	QPLEX: Duplex Dueling Multi-Agent Q-Learning	We explore value-based multi-agent reinforcement learning (MARL) in the popular paradigm of centralized training with decentralized execution (CTDE). CTDE has an important concept, Individual-Global-Max (IGM) principle, which requires the consistency between joint and local action selections to support efficient local decision-making. However, in order to achieve scalability, existing MARL methods either limit representation expressiveness of their value function classes or relax the IGM consistency, which may suffer from instability...	Chongjie Zhang, Jianhao Wang, Terry Liu, Yang Yu, Zhizhou Ren
749	Learning Energy-Based Models by Diffusion Recovery Likelihood	While energy-based models (EBMs) exhibit a number of desirable properties, training and sampling on high-dimensional datasets remains challenging. Inspired by recent progress on diffusion probabilistic models, we present a diffusion recovery likelihood method to tractably learn and sample from a sequence of EBMs trained on increasingly noisy versions of a dataset. Each EBM is trained with recovery likelihood, which maximizes the conditional probability of the data at a certain noise level given their noisy versions at a higher noise...	Ben Poole, Diederik P. Kingma, Ruiqi Gao, Yang Song, Ying Nian Wu
750	Neural Networks for Learning Counterfactual G-Invariances from Single Environments	Despite —or maybe because of— their astonishing capacity to fit data, neural networks are believed to have difficulties extrapolating beyond training data distribution. This work shows that, for extrapolations based on finite transformation groups, a model’s inability to extrapolate is unrelated to its capacity. Rather, the shortcoming is inherited from a learning hypothesis: Examples not explicitly observed with infinitely many training examples have underspecified outcomes in the learner’s model. In order to endow neural networks...	Bruno Ribeiro, S. Chandra Mouli
751	Model-Based Offline Planning	Offline learning is a key part of making reinforcement learning (RL) useable in real systems. Offline RL looks at scenarios where there is data from a system's operation, but no direct access to the system when learning a policy. Recent work on training RL policies from offline data has shown results both with model-free policies learned directly from the data, or with planning on top of learnt models of the data. Model-free policies tend to be more performant, but are more opaque, harder to command externally, and less easy to...	Arthur Argenson, Gabriel Dulac-Arnold
752	On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections	Disparate impact has raised serious concerns in machine learning applications and its societal impacts. In response to the need of mitigating discrimination, fairness has been regarded as a crucial property in algorithmic design. In this work, we study the problem of disparate impact on graph-structured data. Specifically, we focus on dyadic fairness, which articulates a fairness concept that a predictive relationship between two instances should be independent of the sensitive attributes. Based on this, we theoretically relate the...	Han Zhao, Hongfu Liu, Peizhao Li, Pengyu Hong, Yifei Wang
753	Coping with Label Shift via Distributionally Robust Optimisation	The label shift problem refers to the supervised learning setting where the train and test label distributions do not match. Existing work addressing label shift usually assumes access to an unlabelled test sample. This sample may be used to estimate the test label distribution, and to then train a suitably re-weighted classifier. While approaches using this idea have proven effective, their scope is limited as it is not always feasible to access the target domain; further, they require repeated retraining if the model is to be...	Aditya Krishna Menon, Andreas Veit, Jingzhao Zhang, Sanjiv Kumar, Srinadh Bhojanapalli, Suvrit Sra
754	Faster Binary Embeddings for Preserving Euclidean Distances	We propose a fast, distance-preserving, binary embedding algorithm to transform a high-dimensional dataset $\mathcal{T}\subseteq\mathbb{R}^n$ into binary sequences in the cube $\{\pm 1\}^m$. When $\mathcal{T}$ consists of well-spread (i.e., non-sparse) vectors, our embedding method applies a stable noise-shaping quantization scheme to $A x$ where $A\in\mathbb{R}^{m\times n}$ is a sparse Gaussian random matrix. This contrasts with most binary embedding methods, which usually use $x\mapsto \mathrm{sign}(Ax)$ for the embedding. Moreover,...	Jinjie Zhang, Rayan Saab
755	Learning and Evaluating Representations for Deep One-Class Classification	We present a two-stage framework for deep one-class classification. We first learn self-supervised representations from one-class data, and then build one-class classifiers on learned representations. The framework not only allows to learn better representations, but also permits building one-class classifiers that are faithful to the target task. We argue that classifiers inspired by the statistical perspective in generative or discriminative models are more effective than existing approaches, such as a normality score from a...	Chun-Liang Li, Jinsung Yoon, Kihyuk Sohn, Minho Jin, Tomas Pfister
756	Conditional Negative Sampling for Contrastive Learning of Visual Representations	Recent methods for learning unsupervised visual representations, dubbed contrastive learning, optimize the noise-contrastive estimation (NCE) bound on mutual information between two transformations of an image. NCE typically uses randomly sampled negative examples to normalize the objective, but this may often include many uninformative examples either because they are too easy or too hard to discriminate. Taking inspiration from metric learning, we show that choosing semi-hard negatives can yield stronger contrastive representations....	Chengxu Zhuang, Daniel Yamins, Mike Wu, Milan Mosse, Noah D. Goodman
757	On Position Embeddings in BERT	Various Position Embeddings (PEs) have been proposed in Transformer based architectures (e.g. BERT) to model word order. These are empirically-driven and perform well, but no formal framework exists to systematically study them. To address this, we present three properties of PEs that capture word distance in vector space: translation invariance, monotonicity, and symmetry. These properties formally capture the behaviour of PEs and allow us to reinterpret sinusoidal PEs in a principled way. Moreover, we propose a new probing test...	Benyou Wang, Christina Lioma, Hao Yang, Jakob Grue Simonsen, Lifeng Shang, Qun Liu, Xin Jiang
758	Repurposing Pretrained Models for Robust Out-of-domain Few-Shot Learning	Model-agnostic meta-learning (MAML) is a popular method for few-shot learning but assumes that we have access to the meta-training set. In practice, training on the meta-training set may not always be an option due to data privacy concerns, intellectual property issues, or merely lack of computing resources. In this paper, we consider the novel problem of repurposing pretrained MAML checkpoints to solve new few-shot classification tasks. Because of the potential distribution mismatch, the original MAML steps may no longer be optimal....	Gabriel Huang, Hwidong Na, Namyeong Kwon, Simon Lacoste-Julien
759	Dataset Meta-Learning from Kernel Ridge-Regression	One of the most fundamental aspects of any machine learning algorithm is the training data used by the algorithm. We introduce the novel concept of $\epsilon$-approximation of datasets, obtaining datasets which are much smaller than or are significant corruptions of the original training data while maintaining similar performance. We introduce a meta-learning algorithm Kernel Inducing Points (KIP) for obtaining such remarkable datasets, drawing inspiration from recent developments in the correspondence between infinitely-wide neural...	Jaehoon Lee, Timothy Nguyen, Zhourong Chen
760	AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition	Temporal modelling is the key for efficient video action recognition. While understanding temporal information can improve recognition accuracy for dynamic actions, removing temporal redundancy and reusing past features can significantly save computation leading to efficient action recognition. In this paper, we introduce an adaptive temporal fusion network, called AdaFuse, that dynamically fuses channels from current and past feature maps for strong temporal modelling. Specifically, the necessary information from the historical...	Aude Oliva, Chung-Ching Lin, Kate Saenko, Leonid Karlinsky, Prasanna Sattigeri, Rameswar Panda, Rogério Feris, Yue Meng
761	One Network Fits All? Modular versus Monolithic Task Formulations in Neural Networks	Can deep learning solve multiple, very different tasks simultaneously? We investigate how the representations of the underlying tasks affect the ability of a single neural network to learn them jointly. We present theoretical and empirical findings that a single neural network is capable of simultaneously learning multiple tasks from a combined data set, for a variety of methods for representing tasks---for example, when the distinct tasks are encoded by well-separated clusters or decision trees over some task-code attributes. Indeed,...	Abhimanyu Das, Atish Agarwala, Brendan Juba, Qiuyi Zhang, Rina Panigrahy, Vatsal Sharan, Xin Wang
762	Vector-output ReLU Neural Network Problems are Copositive Programs: Convex Analysis of Two Layer Networks and Polynomial-time Algorithms	We describe the convex semi-infinite dual of the two-layer vector-output ReLU neural network training problem. This semi-infinite dual admits a finite dimensional representation, but its support is over a convex set which is difficult to characterize. In particular, we demonstrate that the non-convex neural network training problem is equivalent to a finite-dimensional convex copositive program. Our work is the first to identify this strong connection between the global optima of neural networks and those of copositive programs. We...	Arda Sahiner, John M. Pauly, Mert Pilanci, Tolga Ergen
763	In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning	The recent research in semi-supervised learning (SSL) is mostly dominated by consistency regularization based methods which achieve strong performance. However, they heavily rely on domain-specific data augmentations, which are not easy to generate for all data modalities. Pseudo-labeling (PL) is a general SSL approach that does not have this constraint but performs relatively poorly in its original formulation. We argue that PL underperforms due to the erroneous high confidence predictions from poorly calibrated models; these...	Kevin Duarte, Mamshad Nayeem Rizve, Mubarak Shah, Yogesh S. Rawat
764	MELR: Meta-Learning via Modeling Episode-Level Relationships for Few-Shot Learning	Most recent few-shot learning (FSL) approaches are based on episodic training whereby each episode samples few training instances (shots) per class to imitate the test condition. However, this strict adhering to test condition has a negative side effect, that is, the trained model is susceptible to the poor sampling of few shots. In this work, for the first time, this problem is addressed by exploiting inter-episode relationships. Specifically, a novel meta-learning via modeling episode-level relationships (MELR) framework is proposed....	Nanyi Fei, Songfang Huang, Tao Xiang, Zhiwu Lu
765	Why resampling outperforms reweighting for correcting sampling bias with stochastic gradients	A data set sampled from a certain population is biased if the subgroups of the population are sampled at proportions that are significantly different from their underlying proportions. Training machine learning models on biased data sets requires correction techniques to compensate for the bias. We consider two commonly-used techniques, resampling and reweighting, that rebalance the proportions of the subgroups to maintain the desired objective function. Though statistically equivalent, it has been observed that resampling outperforms...	Jing An, Lexing Ying, Yuhua Zhu
766	Prediction and generalisation over directed actions by grid cells	Knowing how the effects of directed actions generalise to new situations (e.g. moving North, South, East and West, or turning left, right, etc.) is key to rapid generalisation across new situations. Markovian tasks can be characterised by a state space and a transition matrix and recent work has proposed that neural grid codes provide an efficient representation of the state space, as eigenvectors of a transition matrix reflecting diffusion across states, that allows efficient prediction of future state distributions. Here we extend...	Changmin Yu, Neil Burgess, Timothy Behrens
767	Hopper: Multi-hop Transformer for Spatiotemporal Reasoning	This paper considers the problem of spatiotemporal object-centric reasoning in videos. Central to our approach is the notion of object permanence, i.e., the ability to reason about the location of objects as they move through the video while being occluded, contained or carried by other objects. Existing deep learning based approaches often suffer from spatiotemporal biases when applied to video reasoning problems. We propose Hopper, which uses a Multi-hop Transformer for reasoning object permanence in videos. Given a video and a...	Alexandru NiculescuMizil, Asim Kadav, Farley Lai, Hans Peter Graf, Honglu Zhou, Martin Renqiang Min, Mubbasir Kapadia
768	Sparse encoding for more-interpretable feature-selecting representations in probabilistic matrix factorization	Dimensionality reduction methods for count data are critical to a wide range of applications in medical informatics and other fields where model interpretability is paramount. For such data, hierarchical Poisson matrix factorization (HPF) and other sparse probabilistic non-negative matrix factorization (NMF) methods are considered to be interpretable generative models. They consist of sparse transformations for decoding their learned representations into predictions. However, sparsity in representation decoding does not necessarily...	Ayah Zirikly, Bart Desmet, Carson C. Chow, Joshua C. Chang, Jungmin Han, Patrick Fletcher, Shashaank Vattikuti, Ted L. Chang
769	Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning	Dancing to music is one of human's innate abilities since ancient times. In machine learning research, however, synthesizing dance movements from music is a challenging problem. Recently, researchers synthesize human motion sequences through autoregressive models like recurrent neural network (RNN). Such an approach often generates short sequences due to an accumulation of prediction errors that are fed back into the neural network. This problem becomes even more severe in the long motion sequence generation. Besides, the consistency...	Daxin Jiang, Huang Hu, Kei Sawada, Mi Zhang, Ruozi Huang, Wei Wu
770	PAC Confidence Predictions for Deep Neural Network Classifiers	A key challenge for deploying deep neural networks (DNNs) in safety critical settings is the need to provide rigorous ways to quantify their uncertainty. In this paper, we propose a novel algorithm for constructing predicted classification confidences for DNNs that comes with provable correctness guarantees. Our approach uses Clopper-Pearson confidence intervals for the Binomial distribution in conjunction with the histogram binning approach to calibrated prediction. In addition, we demonstrate how our predicted confidences can be used...	Insup Lee, Osbert Bastani, Sangdon Park, Shuo Li
771	BREEDS: Benchmarks for Subpopulation Shift	We develop a methodology for assessing the robustness of models to subpopulation shift, specifically their ability to generalize to novel data subpopulations that were not observed during training. Our approach leverages the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions. This enables us to synthesize realistic distribution shifts whose sources can be precisely controlled and characterized, within existing large-scale datasets. Applying this...	Aleksander Madry, Dimitris Tsipras, Shibani Santurkar
772	Bypassing the Ambient Dimension: Private SGD with Gradient Subspace Identification	Differentially private SGD (DP-SGD) is one of the most popular methods for solving differentially private empirical risk minimization (ERM). Due to its noisy perturbation on each gradient update, the error rate of DP-SGD scales with the ambient dimension $p$, the number of parameters in the model. Such dependence can be problematic for over-parameterized models where $p \gg n$, the number of training samples. Existing lower bounds on private ERM show that such dependence on $p$ is inevitable in the worst case. In this paper, we...	Arindam Banerjee, Steven Wu, Yingxue Zhou
773	End-to-End Egospheric Spatial Memory	Spatial memory, or the ability to remember and recall specific locations and objects, is central to autonomous agents' ability to carry out tasks in real environments. However, most existing artificial memory modules are not very adept at storing spatial information. We propose a parameter-free module, Egospheric Spatial Memory (ESM), which encodes the memory in an ego-sphere around the agent, enabling expressive 3D representations. ESM can be trained end-to-end via either imitation or reinforcement learning, and improves both training...	Andrew J. Davison, Daniel James Lenton, Ronald Clark, Stephen James
774	Evaluating the Disentanglement of Deep Generative Models through Manifold Topology	Learning disentangled representations is regarded as a fundamental task for improving the generalization, robustness, and interpretability of generative models. However, measuring disentanglement has been challenging and inconsistent, often dependent on an ad-hoc external model or specific to a certain dataset. To address this, we present a method for quantifying disentanglement that only uses the generative model, by measuring the topological similarity of conditional submanifolds in the learned representation. This method showcases...	Andrew Y. Ng, Eric Zelikman, Fred Lu, Gunnar E. Carlsson, Sharon Zhou, Stefano Ermon
775	SCoRe: Pre-Training for Context Representation in Conversational Semantic Parsing	Conversational Semantic Parsing (CSP) is the task of converting a sequence of natural language queries to formal language (e.g., SQL, SPARQL) that can be executed against a structured ontology (e.g. databases, knowledge bases). To accomplish this task, a CSP system needs to model the relation between the unstructured language utterance and the structured ontology while representing the multi-turn dynamics of the dialog. Pre-trained language models (LMs) are the state-of-the-art for various natural language processing tasks. However,...	Ahmed Hassan Awadallah, Alex Polozov, Christopher Meek, Rui Zhang, Tao Yu
776	Decoupling Global and Local Representations via Invertible Generative Flows	In this work, we propose a new generative model that is capable of automatically decoupling global and local representations of images in an entirely unsupervised setting, by embedding a generative flow in the VAE framework to model the decoder. Specifically, the proposed model utilizes the variational auto-encoding framework to learn a (low-dimensional) vector of latent variables to capture the global information of an image, which is fed as a conditional input to a flow-based invertible decoder with architecture borrowed from style...	Eduard H. Hovy, Shanghang Zhang, Xiang Kong, Xuezhe Ma
777	Pre-training Text-to-Text Transformers for Concept-centric Common Sense	Pretrained language models (PTLM) have achieved impressive results in a range of natural language understanding (NLU) and generation (NLG) tasks that require a syntactic and semantic understanding of the text. However, current pre-training objectives such as masked token prediction (for BERT-style PTLMs) and masked span infilling (for T5-style PTLMs) do not explicitly model the relational and compositional commonsense knowledge about everyday concepts, which is crucial to many downstream tasks requiring commonsense reasoning. To...	DongHo Lee, Ravi Kiran Selvam, Seyeon Lee, Wangchunshu Zhou, Xiang Ren
778	Local Search Algorithms for Rank-Constrained Convex Optimization	We propose greedy and local search algorithms for rank-constrained convex optimization, namely solving $\underset{\mathrm{rank}(A)\leq r^*}{\min}\, R(A)$ given a convex function $R:\mathbb{R}^{m\times n}\rightarrow \mathbb{R}$ and a parameter $r^*$. These algorithms consist of repeating two steps: (a) adding a new rank-1 matrix to $A$ and (b) enforcing the rank constraint on $A$. We refine and improve the theoretical analysis of Shalev-Shwartz et al. (2011), and show that if the rank-restricted condition number of $R$ is $\kappa$, a...	Kyriakos Axiotis, Maxim Sviridenko
779	Combining Label Propagation and Simple Models out-performs Graph Neural Networks	Graph Neural Networks (GNNs) are a predominant technique for learning over graphs. However, there is relatively little understanding of why GNNs are successful in practice and whether they are necessary for good performance. Here, we show that for many standard transductive node classification benchmarks, we can exceed or match the performance of state-of-the-art GNNs by combining shallow models that ignore the graph structure with two simple post-processing steps that exploit correlation in the label structure: (i) an “error...	Abhay Singh, Austin R. Benson, Horace He, Qian Huang, SerNam Lim
780	Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning	State-of-the-art natural language understanding classification models follow two-stages: pre-training a large language model on an auxiliary task, and then fine-tuning the model on a task-specific labeled dataset using cross-entropy loss. However, the cross-entropy loss has several shortcomings that can lead to sub-optimal generalization and instability. Driven by the intuition that good generalization requires capturing the similarity between examples in one class and contrasting them with examples in other classes, we propose a...	Alexis Conneau, Beliz Gunel, Jingfei Du, Veselin Stoyanov
781	SAFENet: A Secure, Accurate and Fast Neural Network Inference	The advances in neural networks have driven many companies to provide prediction services to users in a wide range of applications. However, current prediction systems raise privacy concerns regarding the user's private data. A cryptographic neural network inference service is an efficient way to allow two parties to execute neural network inference without revealing either party’s data or model. Nevertheless, existing cryptographic neural network inference services suffer from huge running latency; in particular, the latency of...	Hongxia Jin, Lei Jiang, Qian Lou, Yilin Shen
782	Provably robust classification of adversarial examples with detection	Adversarial attacks against deep networks can be defended against either by building robust classifiers or by creating classifiers that can detect the presence of adversarial perturbations. Although it may intuitively seem easier to simply detect attacks rather than build a robust classifier, this has not been borne out in practice even empirically, as most detection methods have subsequently been broken by adaptive attacks, thus necessitating verifiable performance for detection mechanisms. In this paper, we propose a new...	Ali Lotfi, Fatemeh Sheikholeslami, J. Zico Kolter
783	Saliency is a Possible Red Herring When Diagnosing Poor Generalization	Poor generalization is one symptom of models that learn to predict target variables using spuriously-correlated image features present only in the training distribution instead of the true image features that denote a class. It is often thought that this can be diagnosed visually using attribution (aka saliency) maps. We study if this assumption is correct. In some prediction tasks, such as for medical images, one may have some images with masks drawn by a human expert, indicating a region of the image containing relevant information...	Becks Simpson, Francis Dutil, Joseph D. Viviano, Joseph Paul Cohen, Yoshua Bengio
784	Fourier Neural Operator for Parametric Partial Differential Equations	The classical development of neural networks has primarily focused on learning mappings between finite-dimensional Euclidean spaces. Recently, this has been generalized to neural operators that learn mappings between function spaces. For partial differential equations (PDEs), neural operators directly learn the mapping from any functional parametric dependence to the solution. Thus, they learn an entire family of PDEs, in contrast to classical methods which solve one instance of the equation. In this work, we formulate a new neural...	Andrew M. Stuart, Anima Anandkumar, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Nikola Borislavov Kovachki, Zongyi Li
785	Combining Ensembles and Data Augmentation Can Harm Your Calibration	Ensemble methods which average over multiple neural network predictions are a simple approach to improve a model’s calibration and robustness. Similarly, data augmentation techniques, which encode prior information in the form of invariant feature transformations, are effective for improving calibration and robustness. In this paper, we show a surprising pathology: combining ensembles and data augmentation can harm model calibration. This leads to a trade-off in practice, whereby improved accuracy by combining the two techniques comes...	Balaji Lakshminarayanan, Dustin Tran, Ghassen Jerfel, Jasper Snoek, Michael W. Dusenberry, Rafael Muller, Yeming Wen
786	SOLAR: Sparse Orthogonal Learned and Random Embeddings	Dense embedding models are commonly deployed in commercial search engines, wherein all the document vectors are pre-computed, and near-neighbor search (NNS) is performed with the query vector to find relevant documents. However, the bottleneck of indexing a large number of dense vectors and performing an NNS hurts the query time and accuracy of these models. In this paper, we argue that high-dimensional and ultra-sparse embedding is a significantly superior alternative to dense low-dimensional embedding for both query efficiency and...	Anshumali Shrivastava, Beidi Chen, Tharun Medini
787	Efficient Empowerment Estimation for Unsupervised Stabilization	Intrinsically motivated artificial agents learn advantageous behavior without externally-provided rewards. Previously, it was shown that maximizing mutual information between agent actuators and future states, known as the empowerment principle, enables unsupervised stabilization of dynamical systems at upright positions, which is a prototypical intrinsically motivated behavior for upright standing and walking. This follows from the coincidence between the objective of stabilization and the objective of empowerment. Unfortunately,...	Kevin Lu, Pieter Abbeel, Ruihan Zhao, Stas Tiomkin
788	More or Less: When and How to Build Convolutional Neural Network Ensembles	Convolutional neural networks are utilized to solve increasingly more complex problems and with more data. As a result, researchers and practitioners seek to scale the representational power of such models by adding more parameters. However, increasing parameters requires additional critical resources in terms of memory and compute, leading to increased training and inference cost. Thus a consistent challenge is to obtain as high as possible accuracy within a parameter budget. As neural network designers navigate this complex...	Abdul Wasay, Stratos Idreos
789	VTNet: Visual Transformer Network for Object Goal Navigation	Object goal navigation aims to steer an agent towards a target object based on observations of the agent. It is of pivotal importance to design effective visual representations of the observed scene in determining navigation actions. In this paper, we introduce a Visual Transformer Network (VTNet) for learning informative visual representation in navigation. VTNet is a highly effective structure that embodies two key properties for visual representations: First, the relationships among all the object instances in a scene are exploited;...	Heming Du, Liang Zheng, Xin Yu
790	Class Normalization for (Continual)? Generalized Zero-Shot Learning	Normalization techniques have proved to be a crucial ingredient of successful training in a traditional supervised learning regime. However, in the zero-shot learning (ZSL) world, these ideas have received only marginal attention. This work studies normalization in ZSL scenario from both theoretical and practical perspectives. First, we give a theoretical explanation to two popular tricks used in zero-shot learning: normalize+scale and attributes normalization and show that they help training by preserving variance during a forward...	Ivan Skorokhodov, Mohamed Elhoseiny
791	Batch Reinforcement Learning Through Continuation Method	Many real-world applications of reinforcement learning (RL) require the agent to learn from a fixed set of trajectories, without collecting new interactions. Policy optimization under this setting is extremely challenging as: 1) the geometry of the objective function is hard to optimize efficiently; 2) the shift of data distributions causes high noise in the value estimation. In this work, we propose a simple yet effective policy iteration approach to batch RL using global optimization techniques known as continuation. By constraining...	Ed H. Chi, Honglak Lee, Minmin Chen, Nicolas Le Roux, Shengyu Feng, Yijie Guo
792	Towards Resolving the Implicit Bias of Gradient Descent for Matrix Factorization: Greedy Low-Rank Learning	Matrix factorization is a simple and natural test-bed to investigate the implicit regularization of gradient descent. Gunasekar et al. (2017) conjectured that gradient flow with infinitesimal initialization converges to the solution that minimizes the nuclear norm, but a series of recent papers argued that the language of norm minimization is not sufficient to give a full characterization for the implicit regularization. In this work, we provide theoretical and empirical evidence that for depth-2 matrix factorization, gradient flow...	Kaifeng Lyu, Yuping Luo, Zhiyuan Li
793	Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs	A wide variety of deep learning techniques from style transfer to multitask learning rely on training affine transformations of features. Most prominent among these is the popular feature normalization technique BatchNorm, which normalizes activations and then subsequently applies a learned affine transform. In this paper, we aim to understand the role and expressive power of affine parameters used to transform features in this way. To isolate the contribution of these parameters from that of the learned features they transform, we...	Ari S. Morcos, David J. Schwab, Jonathan Frankle
794	CompOFA - Compound Once-For-All Networks for Faster Multi-Platform Deployment	The emergence of CNNs in mainstream deployment has necessitated methods to design and train efficient architectures tailored to maximize the accuracy under diverse hardware and latency constraints. To scale these resource-intensive tasks with an increasing number of deployment targets, Once-For-All (OFA) proposed an approach to jointly train several models at once with a constant training cost. However, this cost remains as high as 40-50 GPU days and also suffers from a combinatorial explosion of sub-optimal model configurations. We...	Alexey Tumanov, Alind Khare, Manas Sahni, Shreya Varshini
795	Adaptive and Generative Zero-Shot Learning	We address the problem of generalized zero-shot learning (GZSL) where the task is to predict the class label of a target image whether its label belongs to the seen or unseen category. Similar to ZSL, the learning setting assumes that all class-level semantic features are given, while only the images of seen classes are available for training. By exploring the correlation between image features and the corresponding semantic features, the main idea of the proposed approach is to enrich the semantic-to-visual (S2V) embeddings via a...	HsuanTien Lin, TyngLuh Liu, YuYing Chou
796	Learning to Make Decisions via Submodular Regularization	Many sequential decision making tasks can be viewed as combinatorial optimization problems over a large number of actions. When the cost of evaluating an action is high, even a greedy algorithm, which iteratively picks the best action given the history, is prohibitive to run. In this paper, we aim to learn a greedy heuristic for sequentially selecting actions as a surrogate for invoking the expensive oracle when evaluating an action. In particular, we focus on a class of combinatorial problems that can be solved via submodular...	Aiden Aceves, Ayya Alieva, Jialin Song, Stephen Mayo, Yisong Yue, Yuxin Chen
797	Generating Furry Cars: Disentangling Object Shape and Appearance across Multiple Domains	We consider the novel task of learning disentangled representations of object shape and appearance across multiple domains (e.g., dogs and cars). The goal is to learn a generative model that learns an intermediate distribution, which borrows a subset of properties from each domain, enabling the generation of images that did not exist in any domain exclusively. This challenging problem requires an accurate disentanglement of object shape, appearance, and background from each domain, so that the appearance and shape factors from the two...	Krishna Kumar Singh, Utkarsh Ojha, Yong Jae Lee
798	Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning	We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network. We characterize this loss of expressivity via a drop in the rank of the learned value network features, and show that this typically corresponds to a...	Aviral Kumar, Dibya Ghosh, Rishabh Agarwal, Sergey Levine
799	Disentangling 3D Prototypical Networks for Few-Shot Concept Learning	We present neural architectures that disentangle RGB-D images into objects’ shapes and styles and a map of the background scene, and explore their applications for few-shot 3D object detection and few-shot concept classification. Our networks incorporate architectural biases that reflect the image formation process, 3D geometry of the world scene, and shape-style interplay. They are trained end-to-end self-supervised by predicting views in static scenes, alongside a small number of 3D object boxes. Objects and scenes are represented in...	Adam W. Harley, Darshan Patil, HsiaoYu Tung, Katerina Fragkiadaki, Mihir Prabhudesai, Shamit Lal
800	Anytime Sampling for Autoregressive Models via Ordered Autoencoding	Autoregressive models are widely used for tasks such as image and audio generation. The sampling process of these models, however, does not allow interruptions and cannot adapt to real-time computational resources. This challenge impedes the deployment of powerful autoregressive models, which involve a slow sampling process that is sequential in nature and typically scales linearly with respect to the data dimension. To address this difficulty, we propose a new family of autoregressive models that enables anytime sampling. Inspired by...	Aditya Grover, Linyuan Gong, Rui Shu, Sahaj Garg, Stefano Ermon, Yang Song, Yilun Xu
801	HyperDynamics: Meta-Learning Object and Agent Dynamics with Hypernetworks	We propose HyperDynamics, a dynamics meta-learning framework that conditions on an agent’s interactions with the environment and optionally its visual observations, and generates the parameters of neural dynamics models based on inferred properties of the dynamical system. Physical and visual properties of the environment that are not part of the low-dimensional state yet affect its temporal dynamics are inferred from the interaction history and visual observations, and are implicitly captured in the generated parameters. We test...	Emmanouil Antonios Platanios, HsiaoYu Tung, Katerina Fragkiadaki, Shamit Lal, Zhou Xian
802	Improving Zero-Shot Voice Style Transfer via Disentangled Representation Learning	Voice style transfer, also called voice conversion, seeks to modify one speaker's voice to generate speech as if it came from another (target) speaker. Previous works have made progress on voice conversion with parallel training data and pre-known speakers. However, zero-shot voice style transfer, which learns from non-parallel data and generates voices for previously unseen speakers, remains a challenging problem. In this paper we propose a novel zero-shot voice transfer method via disentangled representation learning. The proposed...	Lawrence Carin, Pengyu Cheng, Ruiyi Zhang, Siyang Yuan, Weituo Hao, Zhe Gan
803	GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing	We present GraPPa, an effective pre-training approach for table semantic parsing that learns a compositional inductive bias in the joint representations of textual and tabular data. We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar (SCFG). We pre-train our model on the synthetic data to inject important structural properties commonly found in semantic parsing into the pre-training language model. To maintain the model's ability to represent real-world data, we also include masked...	Bailin Wang, Caiming Xiong, ChienSheng Wu, Dragomir R. Radev, Richard Socher, Tao Yu, Xi Victoria Lin, Xinyi Yang, Yi Chern Tan
804	Estimating Lipschitz constants of monotone deep equilibrium models	Several methods have been proposed in recent years to provide bounds on the Lipschitz constants of deep networks, which can be used to provide robustness guarantees, generalization bounds, and characterize the smoothness of decision boundaries. However, existing bounds get substantially weaker with increasing depth of the network, which makes it unclear how to apply such bounds to recently proposed models such as the deep equilibrium (DEQ) model, which can be viewed as representing an infinitely-deep network. In this paper, we show...	Chirag Pabbaraju, Ezra Winston, J. Zico Kolter
805	Estimating informativeness of samples with Smooth Unique Information	We define a notion of information that an individual sample provides to the training of a neural network, and we specialize it to measure both how much a sample informs the final weights and how much it informs the function computed by the weights. Though related, we show that these quantities have a qualitatively different behavior. We give efficient approximations of these quantities using a linearized network and demonstrate empirically that the approximation is accurate for real-world architectures, such as pre-trained ResNets. We...	Alessandro Achille, Avinash Ravichandran, Giovanni Paolini, Hrayr Harutyunyan, Orchid Majumder, Rahul Bhotika, Stefano Soatto
806	NBDT: Neural-Backed Decision Tree	Machine learning applications such as finance and medicine demand accurate and justifiable predictions, barring most deep learning methods from use. In response, previous work combines decision trees with deep learning, yielding models that (1) sacrifice interpretability for accuracy or (2) sacrifice accuracy for interpretability. We forgo this dilemma by jointly improving accuracy and interpretability using Neural-Backed Decision Trees (NBDTs). NBDTs replace a neural network's final linear layer with a differentiable sequence of...	Alvin Wan, Daniel Ho, Jihan Yin, Joseph E. Gonzalez, Lisa Dunlap, Sarah Adel Bargal, Scott Lee, Suzanne Petryk
807	Accurate Learning of Graph Representations with Graph Multiset Pooling	Graph neural networks have been widely used on modeling graph data, achieving impressive results on node classification and link prediction tasks. Yet, obtaining an accurate representation for a graph further requires a pooling function that maps a set of node representations into a compact form. A simple sum or average over all node representations considers all node features equally without consideration of their task relevance, and any structural dependencies among them. Recently proposed hierarchical graph pooling methods, on the...	Jinheon Baek, Minki Kang, Sung Ju Hwang
808	Byzantine-Resilient Non-Convex Stochastic Gradient Descent	We study adversary-resilient stochastic distributed optimization, in which $m$ machines can independently compute stochastic gradients, and cooperate to jointly optimize over their local objective functions. However, an $\alpha$-fraction of the machines are Byzantine, in that they may behave in arbitrary, adversarial ways. We consider a variant of this procedure in the challenging non-convex case. Our main result is a new algorithm SafeguardSGD, which can provably escape saddle points and find approximate local minima of the non-convex...	Dan Alistarh, Faeze Ebrahimianghazani, Jerry Li, Zeyuan AllenZhu
809	MetaNorm: Learning to Normalize Few-Shot Batches Across Domains	Batch normalization plays a crucial role when training deep neural networks. However, batch statistics become unstable with small batch sizes and are unreliable in the presence of distribution shifts. We propose MetaNorm, a simple yet effective meta-learning normalization. It tackles the aforementioned issues in a unified way by leveraging the meta-learning setting and learns to infer adaptive statistics for batch normalization. MetaNorm is generic, flexible and model-agnostic, making it a simple plug-and-play module that is seamlessly...	Cees G. M. Snoek, Ling Shao, Xiantong Zhen, YingJun Du
810	Large Batch Simulation for Deep Reinforcement Learning	We accelerate deep reinforcement learning-based training in visually complex 3D environments by two orders of magnitude over prior work, realizing end-to-end training speeds of over 19,000 frames of experience per second on a single GPU and up to 72,000 frames per second on a single eight-GPU machine. The key idea of our approach is to design a 3D renderer and embodied navigation simulator around the principle of “batch simulation”: accepting and executing large batches of requests simultaneously. Beyond exposing large amounts of work...	Aleksei Petrenko, Brennan Shacklett, Dhruv Batra, Erik Wijmans, Kayvon Fatahalian, Manolis Savva, Vladlen Koltun
811	Personalized Federated Learning with First Order Model Optimization	While federated learning traditionally aims to train a single global model across decentralized local datasets, one model may not always be ideal for all participating clients. Here we propose an alternative, where each client only federates with other relevant clients to obtain a stronger model per client-specific objectives. To achieve this personalization, rather than computing a single model average with constant weights for the entire federation as in traditional FL, we efficiently calculate optimal weighted model combinations for...	José M. Álvarez, Karan Sapra, Michael Zhang, Sanja Fidler, Serena Yeung
812	Combining Physics and Machine Learning for Network Flow Estimation	The flow estimation problem consists of predicting missing edge flows in a network (e.g., traffic, power, and water) based on partial observations. These missing flows depend both on the underlying physics (edge features and a flow conservation law) as well as the observed edge flows. This paper introduces an optimization framework for computing missing edge flows and solves the problem using bilevel optimization and deep learning. More specifically, we learn regularizers that depend on edge features (e.g., number of lanes in...	Ambuj K. Singh, Ananthram Swami, Arlei Lopes da Silva, Francesco Bullo, Furkan Kocayusufoglu, Saber Jafarpour
813	Knowledge Distillation as Semiparametric Inference	A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model. Surprisingly, this two-step knowledge distillation process often leads to higher accuracy than training the student directly on labeled data. To explain and enhance this phenomenon, we cast knowledge distillation as a semiparametric inference problem with the optimal student model as the target, the unknown Bayes class probabilities as nuisance, and the teacher...	Govinda M. Kamath, Lester Mackey, Tri Dao, Vasilis Syrgkanis
814	Learning Value Functions in Deep Policy Gradients using Residual Variance	Policy gradient algorithms have proven to be successful in diverse decision making and control tasks. However, these methods suffer from high sample complexity and instability issues. In this paper, we address these challenges by providing a different approach for training the critic in the actor-critic framework. Our work builds on recent studies indicating that traditional actor-critic algorithms do not succeed in fitting the true value function, calling for the need to identify a better objective for the critic. In our method, the...	Odalric-Ambrym Maillard, Philippe Preux, Reda Ouhamma, Yannis Flet-Berliac
815	Randomized Ensembled Double Q-Learning: Learning Fast Without a Model	Using a high Update-To-Data (UTD) ratio, model-based methods have recently achieved much higher sample efficiency than previous model-free methods for continuous-action DRL benchmarks. In this paper, we introduce a simple model-free algorithm, Randomized Ensembled Double Q-Learning (REDQ), and show that its performance is just as good as, if not better than, a state-of-the-art model-based algorithm for the MuJoCo benchmark. Moreover, REDQ can achieve this performance using fewer parameters than the model-based method, and with less...	Che Wang, Keith W. Ross, Xinyue Chen, Zijian Zhou
816	Decentralized Attribution of Generative Models	Growing applications of generative models have led to new threats such as malicious personation and digital copyright infringement. One solution to these threats is model attribution, i.e., the identification of user-end models where the contents under question are generated. Existing studies showed empirical feasibility of attribution through a centralized classifier trained on all existing user-end models. However, this approach is not scalable in a reality where the number of models ever grows. Neither does it provide an...	Changhoon Kim, Yezhou Yang, Yi Ren
817	Attentional Constellation Nets for Few-Shot Learning	The success of deep convolutional neural networks builds on top of the learning of effective convolution operations, capturing a hierarchy of structured features via filtering, activation, and pooling. However, the explicit structured features, e.g. object parts, are not expressive in the existing CNN frameworks. In this paper, we tackle the few-shot learning problem and make an effort to enhance structured features by expanding CNNs with a constellation model, which performs cell feature clustering and encoding with a dense part...	Huaijin Wang, Weijian Xu, Yifan Xu, Zhuowen Tu
818	Adapting to Reward Progressivity via Spectral Reinforcement Learning	In this paper we consider reinforcement learning tasks with progressive rewards; that is, tasks where the rewards tend to increase in magnitude over time. We hypothesise that this property may be problematic for value-based deep reinforcement learning agents, particularly if the agent must first succeed in relatively unrewarding regions of the task in order to reach more rewarding regions. To address this issue, we propose Spectral DQN, which decomposes the reward into frequencies such that the high frequencies only activate when large...	John Thangarajah, Michael Dann
819	TropEx: An Algorithm for Extracting Linear Terms in Deep Neural Networks	Deep neural networks with rectified linear (ReLU) activations are piecewise linear functions, where hyperplanes partition the input space into an astronomically high number of linear regions. Previous work focused on counting linear regions to measure the network's expressive power and on analyzing geometric properties of the hyperplane configurations. In contrast, we aim to understand the impact of the linear terms on network performance, by examining the information encoded in their coefficients. To this end, we derive TropEx, a...	Cristian Sminchisescu, Henning Petzka, Martin Trimmel
820	Reset-Free Lifelong Learning with Skill-Space Planning	The objective of lifelong reinforcement learning (RL) is to optimize agents which can continuously adapt and interact in changing environments. However, current RL approaches fail drastically when environments are non-stationary and interactions are non-episodic. We propose Lifelong Skill Planning (LiSP), an algorithmic framework for lifelong RL based on planning in an abstract space of higher-order skills. We learn the skills in an unsupervised manner using intrinsic rewards and plan over the learned skills using a...	Aditya Grover, Igor Mordatch, Kevin Lu, Pieter Abbeel
821	Robust Learning of Fixed-Structure Bayesian Networks in Nearly-Linear Time	We study the problem of learning Bayesian networks where an $\epsilon$-fraction of the samples are adversarially corrupted. We focus on the fully-observable case where the underlying graph structure is known. In this work, we present the first nearly-linear time algorithm for this problem with a dimension-independent error guarantee. Previous robust algorithms with comparable error guarantees are slower by at least a factor of $(d/\epsilon)$, where $d$ is the number of variables in the Bayesian network and $\epsilon$ is the fraction of...	Honghao Lin, Yu Cheng
822	Teaching Temporal Logics to Neural Networks	We study two fundamental questions in neuro-symbolic computing: can deep learning tackle challenging problems in logics end-to-end, and can neural networks learn the semantics of logics. In this work we focus on linear-time temporal logic (LTL), as it is widely used in verification. We train a Transformer on the problem to directly predict a solution, i.e. a trace, to a given LTL formula. The training data is generated with classical solvers, which, however, only provide one of many possible solutions to each formula. We demonstrate...	Bernd Finkbeiner, Christopher Hahn, Frederik Schmitt, Jens U. Kreber, Markus Norman Rabe
823	Spatially Structured Recurrent Modules	Capturing the structure of a data-generating process by means of appropriate inductive biases can help in learning models that generalise well and are robust to changes in the input distribution. While methods that harness spatial and temporal structures find broad application, recent work has demonstrated the potential of models that leverage sparse and modular structure using an ensemble of sparingly interacting modules. In this work, we take a step towards dynamic models that are capable of simultaneously exploiting both modular and...	Anirudh Goyal, Bernhard Schölkopf, Manuel Wuthrich, Muhammad Waleed Gondal, Nasim Rahaman, Stefan Bauer, Yash Sharma, Yoshua Bengio
824	Bayesian Few-Shot Classification with One-vs-Each Pólya-Gamma Augmented Gaussian Processes	Few-shot classification (FSC), the task of adapting a classifier to unseen classes given a small labeled dataset, is an important step on the path toward human-like machine learning. Bayesian methods are well-suited to tackling the fundamental issue of overfitting in the few-shot scenario because they allow practitioners to specify prior beliefs and update those beliefs in light of observed data. Contemporary approaches to Bayesian few-shot classification maintain a posterior distribution over model parameters, which is slow and...	Jake Snell, Richard S. Zemel
825	Parameter-Based Value Functions	Traditional off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy. However, when value functions are updated to track the learned policy, they forget potentially useful information about old policies. We introduce a class of value functions called Parameter-Based Value Functions (PBVFs) whose inputs include the policy parameters. They can generalize across different policies. PBVFs can evaluate the performance of any policy given a state, a state-action pair, or a distribution...	Francesco Faccio, Jürgen Schmidhuber, Louis Kirsch
826	Hyperbolic Neural Networks++	Hyperbolic spaces, which have the capacity to embed tree structures without distortion owing to their exponential volume growth, have recently been applied to machine learning to better capture the hierarchical nature of data. In this study, we generalize the fundamental components of neural networks in a single hyperbolic geometry model, namely, the Poincaré ball model. This novel methodology constructs a multinomial logistic regression, fully-connected layers, convolutional layers, and attention mechanisms under a unified...	Ryohei Shimizu, Tatsuya Harada, Yusuke Mukuta
827	Neural Jump Ordinary Differential Equations: Consistent Continuous-Time Prediction and Filtering	Combinations of neural ODEs with recurrent neural networks (RNN), like GRU-ODE-Bayes or ODE-RNN are well suited to model irregularly observed time series. While those models outperform existing discrete-time approaches, no theoretical guarantees for their predictive capabilities are available. Assuming that the irregularly-sampled time series data originates from a continuous stochastic process, the $L^2$-optimal online prediction is the conditional expectation given the currently available information. We introduce the Neural Jump ODE...	Calypso Herrera, Florian Krach, Josef Teichmann
828	Seq2Tens: An Efficient Representation of Sequences by Low-Rank Tensor Projections	Sequential data such as time series, video, or text can be challenging to analyse as the ordered structure gives rise to complex dependencies. At the heart of this is non-commutativity, in the sense that reordering the elements of a sequence can completely change its meaning. We use a classical mathematical object -- the free algebra -- to capture this non-commutativity. To address the innate computational complexity of this algebra, we use compositions of low-rank tensor projections. This yields modular and scalable building blocks...	Csaba Tóth, Harald Oberhauser, Patric Bonnier
829	FOCAL: Efficient Fully-Offline Meta-Reinforcement Learning via Distance Metric Learning and Behavior Regularization	We study the offline meta-reinforcement learning (OMRL) problem, a paradigm which enables reinforcement learning (RL) algorithms to quickly adapt to unseen tasks without any interactions with the environments, making RL truly practical in many real-world applications. This problem is still not fully understood, for which two major challenges need to be addressed. First, offline RL usually suffers from bootstrapping errors of out-of-distribution state-actions which leads to divergence of value functions. Second, meta-RL requires...	Dijun Luo, Lanqing Li, Rui Yang
830	On the Curse of Memory in Recurrent Neural Networks: Approximation and Optimization Analysis	We study the approximation properties and optimization dynamics of recurrent neural networks (RNNs) when applied to learn input-output relationships in temporal data. We consider the simple but representative setting of using continuous-time linear RNNs to learn from data generated by linear relationships. Mathematically, the latter can be understood as a sequence of linear functionals. We prove a universal approximation theorem of such linear functionals and characterize the approximation rate. Moreover, we perform a fine-grained...	Jiequn Han, Qianxiao Li, Weinan E, Zhong Li
831	Generating Adversarial Computer Programs using Optimized Obfuscations	Machine learning (ML) models that learn and predict properties of computer programs are increasingly being adopted and deployed. These models have demonstrated success in applications such as auto-completing code, summarizing large programs, and detecting bugs and malware in programs. In this work, we investigate principled ways to adversarially perturb a computer program to fool such learned models, and thus determine their adversarial robustness. We use program obfuscations, which have conventionally been used to avoid attempts at...	Gaoyuan Zhang, Quanfu Fan, Shashank Srikant, Shiyu Chang, Sijia Liu, Tamara Mitrovska, Una-May O'Reilly
832	BOIL: Towards Representation Change for Few-shot Learning	Model Agnostic Meta-Learning (MAML) is one of the most representative of gradient-based meta-learning algorithms. MAML learns new tasks with a few data samples using inner updates from a meta-initialization point and learns the meta-initialization parameters with outer updates. It has recently been hypothesized that representation reuse, which makes little change in efficient representations, is the dominant factor in the performance of the meta-initialized model through MAML in contrast to representation change, which causes a...	ChangHwan Kim, Hyungjun Yoo, Jaehoon Oh, SeYoung Yun
833	Interpreting and Boosting Dropout from a Game-Theoretic View	This paper aims to understand and improve the utility of the dropout operation from the perspective of game-theoretical interactions. We prove that dropout can suppress the strength of interactions between input variables of deep neural networks (DNNs). The theoretical proof is also verified by various experiments. Furthermore, we find that such interactions were strongly related to the over-fitting problem in deep learning. So, the utility of dropout can be regarded as decreasing interactions to alleviate the significance of...	Hao Zhang, Mingjie Li, Quanshi Zhang, Sen Li, Yichen Xie, Yinchao Ma
834	Representation Learning via Invariant Causal Mechanisms	Self-supervised learning has emerged as a strategy to reduce the reliance on costly supervised signal by pretraining representations only using unlabeled data. These methods combine heuristic proxy classification tasks with data augmentations and have achieved significant success, but our theoretical understanding of this success remains limited. In this paper we analyze self-supervised representation learning using a causal framework. We show how data augmentations can be more effectively utilized through explicit invariance...	Brian McWilliams, Charles Blundell, Jacob C. Walker, Jovana Mitrovic, Lars Holger Buesing
835	Fooling a Complete Neural Network Verifier	The efficient and accurate characterization of the robustness of neural networks to input perturbation is an important open problem. Many approaches exist including heuristic and exact (or complete) methods. Complete methods are expensive but their mathematical formulation guarantees that they provide exact robustness metrics. However, this guarantee is valid only if we assume that the verified network applies arbitrary-precision arithmetic and the verifier is reliable. In practice, however, both the networks and the verifiers apply...	Balázs Bánhelyi, Dániel Zombori, István Megyeri, Márk Jelasity, Tibor Csendes
836	CPR: Classifier-Projection Regularization for Continual Learning	We propose a general, yet simple patch that can be applied to existing regularization-based continual learning methods called classifier-projection regularization (CPR). Inspired by both recent results on neural networks with wide local minima and information theory, CPR adds an additional regularization term that maximizes the entropy of a classifier's output probability. We demonstrate that this additional term can be interpreted as a projection of the conditional probability given by a classifier's output to the uniform...	Flávio P. Calmon, Hsiang Hsu, Sungmin Cha, Taebaek Hwang, Taesup Moon
837	CO2: Consistent Contrast for Unsupervised Visual Representation Learning	Contrastive learning has recently been a core for unsupervised visual representation learning. Without human annotation, the common practice is to perform an instance discrimination task: Given a query image crop, label crops from the same image as positives, and crops from other randomly sampled images as negatives. An important limitation of this label assignment is that it can not reflect the heterogeneous similarity of the query crop to crops from other images, but regarding them as equally negative. To address this issue, inspired...	Alan L. Yuille, Chen Wei, Huiyu Wang, Wei Shen
838	GAN2GAN: Generative Noise Learning for Blind Denoising with Single Noisy Images	We tackle a challenging blind image denoising problem, in which only single distinct noisy images are available for training a denoiser, and no information about noise is known, except for it being zero-mean, additive, and independent of the clean image. In such a setting, which often occurs in practice, it is not possible to train a denoiser with the standard discriminative training or with the recently developed Noise2Noise (N2N) training; the former requires the underlying clean image for the given noisy image, and the latter...	Byeongjoon Kim, Jongduk Baek, Sungmin Cha, Taeeon Park, Taesup Moon
839	Learning Subgoal Representations with Slow Dynamics	In goal-conditioned Hierarchical Reinforcement Learning (HRL), a high-level policy periodically sets subgoals for a low-level policy, and the low-level policy is trained to reach those subgoals. A proper subgoal representation function, which abstracts a state space to a latent subgoal space, is crucial for effective goal-conditioned HRL, since different low-level behaviors are induced by reaching subgoals in the compressed representation space. Observing that the high-level agent operates at an abstract temporal scale, we propose a...	Chongjie Zhang, Jianhao Wang, Lulu Zheng, Siyuan Li
840	Bowtie Networks: Generative Modeling for Joint Few-Shot Recognition and Novel-View Synthesis	We propose a novel task of joint few-shot recognition and novel-view synthesis: given only one or few images of a novel object from arbitrary views with only category annotation, we aim to simultaneously learn an object classifier and generate images of that type of object from new viewpoints. While existing work copes with two or more tasks mainly by multi-task learning of shareable feature representations, we take a different perspective. We focus on the interaction and cooperation between a generative model and a discriminative...	Martial Hebert, Yu-Xiong Wang, Zhipeng Bao
841	Taming GANs with Lookahead-Minmax	Generative Adversarial Networks are notoriously challenging to train. The underlying minmax optimization is highly susceptible to the variance of the stochastic gradient and the rotational component of the associated game vector field. To tackle these challenges, we propose the Lookahead algorithm for minmax optimization, originally developed for single objective minimization only. The backtracking step of our Lookahead–minmax naturally handles the rotational game dynamics, a property which was identified to be key for enabling...	François Fleuret, Martin Jaggi, Matteo Pagliardini, Sebastian U. Stich, Tatjana Chavdarova
842	Certify or Predict: Boosting Certified Robustness with Compositional Architectures	A core challenge with existing certified defense mechanisms is that while they improve certified robustness, they also tend to drastically decrease natural accuracy, making it difficult to use these methods in practice. In this work, we propose a new architecture which addresses this challenge and enables one to boost the certified robustness of any state-of-the-art deep network, while controlling the overall accuracy loss, without requiring retraining. The key idea is to combine this model with a (smaller) certified network where at...	Mark Niklas Müller, Martin T. Vechev, Mislav Balunovic
843	New Bounds For Distributed Mean Estimation and Variance Reduction	We consider the problem of distributed mean estimation (DME), in which $n$ machines are each given a local $d$-dimensional vector $\mathbf{x}_v \in \mathbb{R}^d$, and must cooperate to estimate the mean of their inputs $\mathbf{\mu} = \frac{1}{n}\sum_{v=1}^{n} \mathbf{x}_v$, while minimizing total communication cost. DME is a fundamental construct in distributed machine learning, and there has been considerable work on variants of this problem, especially in the context of distributed variance reduction for stochastic gradients in parallel...	Dan Alistarh, Niusha Moshrefi, Peter Davies, Saleh Ashkboos, Vijaykrishna Gurunanthan
844	Interpretable Neural Architecture Search via Bayesian Optimisation with Weisfeiler-Lehman Kernels	Current neural architecture search (NAS) strategies focus only on finding a single, good, architecture. They offer little insight into why a specific network is performing well, or how we should modify the architecture if we want further improvements. We propose a Bayesian optimisation (BO) approach for NAS that combines the Weisfeiler-Lehman graph kernel with a Gaussian process surrogate. Our method not only optimises the architecture in a highly data-efficient manner, but also affords interpretability by discovering useful network...	Bin Xin Ru, Michael A. Osborne, Xiaowen Dong, Xingchen Wan
845	A Discriminative Gaussian Mixture Model with Sparsity	In probabilistic classification, a discriminative model based on the softmax function has a potential limitation in that it assumes unimodality for each class in the feature space. The mixture model can address this issue, although it leads to an increase in the number of parameters. We propose a sparse classifier based on a discriminative GMM, referred to as a sparse discriminative Gaussian mixture (SDGM). In the SDGM, a GMM-based discriminative model is trained via sparse Bayesian learning. Using this sparse learning framework, we...	Hideaki Hayashi, Seiichi Uchida
846	Communication in Multi-Agent Reinforcement Learning: Intention Sharing	Communication is one of the core components for learning coordinated behavior in multi-agent systems. In this paper, we propose a new communication scheme named Intention Sharing (IS) for multi-agent reinforcement learning in order to enhance the coordination among agents. In the proposed IS scheme, each agent generates an imagined trajectory by modeling the environment dynamics and other agents' actions. The imagined trajectory is the simulated future trajectory of each agent based on the learned model of the environment dynamics and...	Jongeui Park, Woojun Kim, Youngchul Sung
847	Is Attention Better Than Matrix Decomposition?	As an essential ingredient of modern deep learning, attention mechanism, especially self-attention, plays a vital role in the global correlation discovery. However, is hand-crafted attention irreplaceable when modeling the global context? Our intriguing finding is that self-attention is not better than the matrix decomposition (MD) model developed 20 years ago regarding the performance and computational cost for encoding the long-distance dependencies. We model the global context issue as a low-rank completion problem and show that its...	Hongxu Chen, Ke Wei, Meng-Hao Guo, Xia Li, Zhengyang Geng, Zhouchen Lin
848	Fast and Complete: Enabling Complete Neural Network Verification with Rapid and Massively Parallel Incomplete Verifiers	Formal verification of neural networks (NNs) is a challenging and important problem. Existing efficient complete solvers typically require the branch-and-bound (BaB) process, which splits the problem domain into sub-domains and solves each sub-domain using faster but weaker incomplete verifiers, such as Linear Programming (LP) on linearly relaxed sub-domains. In this paper, we propose to use the backward mode linear relaxation based perturbation analysis (LiRPA) to replace LP during the BaB process, which can be efficiently implemented...	Cho-Jui Hsieh, Huan Zhang, Kaidi Xu, Shiqi Wang, Suman Jana, Xue Lin, Yihan Wang
849	A Geometric Analysis of Deep Generative Image Models and Its Applications	Generative adversarial networks (GANs) have emerged as a powerful unsupervised method to model the statistical patterns of real-world data sets, such as natural images. These networks are trained to map random inputs in their latent space to new samples representative of the learned data. However, the structure of the latent space is hard to intuit due to its high dimensionality and the non-linearity of the generator, which limits the usefulness of the models. Understanding the latent space requires a way to identify input codes for...	Binxu Wang, Carlos R. Ponce
850	Solving Compositional Reinforcement Learning Problems via Task Reduction	We propose a novel learning paradigm, Self-Imitation via Reduction (SIR), for solving compositional reinforcement learning problems. SIR is based on two core ideas: task reduction and self-imitation. Task reduction tackles a hard-to-solve task by actively reducing it to an easier task whose solution is known by the RL agent. Once the original hard task is successfully solved by task reduction, the agent naturally obtains a self-generated solution trajectory to imitate. By continuously collecting and imitating such demonstrations, the...	Huazhe Xu, Xiaolong Wang, Yi Wu, Yilin Wu, Yunfei Li
851	ARMOURED: Adversarially Robust MOdels using Unlabeled data by REgularizing Diversity	Adversarial attacks pose a major challenge for modern deep neural networks. Recent advancements show that adversarially robust generalization requires a large amount of labeled data for training. If annotation becomes a burden, can unlabeled data help bridge the gap? In this paper, we propose ARMOURED, an adversarially robust training method based on semi-supervised learning that consists of two components. The first component applies multi-view learning to simultaneously optimize multiple independent networks and utilizes unlabeled...	Chuan-Sheng Foo, Cuong Manh Nguyen, Kangkang Lu, Kiran Chari, Xun Xu, Yu Jing Goh
852	Acting in Delayed Environments with Non-Stationary Markov Policies	The standard Markov Decision Process (MDP) formulation hinges on the assumption that an action is executed immediately after it was chosen. However, assuming it is often unrealistic and can lead to catastrophic failures in applications such as robotic manipulation, cloud computing, and finance. We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed with a delay of $m$ steps. The brute-force state augmentation baseline where the state is concatenated to the last $m$ ...	Esther Derman, Gal Dalal, Shie Mannor
853	Overfitting for Fun and Profit: Instance-Adaptive Data Compression	Neural data compression has been shown to outperform classical methods in terms of $RD$ performance, with results still improving rapidly. At a high level, neural compression is based on an autoencoder that tries to reconstruct the input instance from a (quantized) latent representation, coupled with a prior that is used to losslessly compress these latents. Due to limitations on model capacity and imperfect optimization and generalization, such models will suboptimally compress test data in general. However, one of the great strengths...	Iris A. M. Huijben, Taco Cohen, Ties van Rozendaal
854	Learnable Embedding sizes for Recommender Systems	The embedding-based representation learning is commonly used in deep learning recommendation models to map the raw sparse features to dense vectors. The traditional embedding manner that assigns a uniform size to all features has two issues. First, the numerous features inevitably lead to a gigantic embedding table that causes a high memory usage cost. Second, it is likely to cause the over-fitting problem for those features that do not require too large representation capacity. Existing works that try to address the problem always...	Chen Gao, Depeng Jin, Siyi Liu, Yihong Chen, Yong Li
855	Generative Scene Graph Networks	Human perception excels at building compositional hierarchies of parts and objects from unlabeled scenes that help systematic generalization. Yet most work on generative scene modeling either ignores the part-whole relationship or assumes access to predefined part labels. In this paper, we propose Generative Scene Graph Networks (GSGNs), the first deep generative model that learns to discover the primitive parts and infer the part-whole relationship jointly from multi-object scenes without supervision and in an end-to-end trainable...	Donghun Lee, Fei Deng, Sungjin Ahn, Zhuo Zhi
856	Deconstructing the Regularization of BatchNorm	Batch normalization (BatchNorm) has become a standard technique in deep learning. Its popularity is in no small part due to its often positive effect on generalization. Despite this success, the regularization effect of the technique is still poorly understood. This study aims to decompose BatchNorm into separate mechanisms that are much simpler. We identify three effects of BatchNorm and assess their impact directly with ablations and interventions. Our experiments show that preventing explosive growth at the final layer at...	Ekin Dogus Cubuk, Yann N. Dauphin
857	PolarNet: Learning to Optimize Polar Keypoints for Keypoint Based Object Detection	A variety of anchor-free object detectors have been actively proposed as possible alternatives to the mainstream anchor-based detectors that often rely on complicated design of anchor boxes. Despite achieving promising performance on par with anchor-based detectors, the existing anchor-free detectors such as FCOS or CenterNet predict objects based on standard Cartesian coordinates, which often yield poor quality keypoints. Further, the feature representation is also scale-sensitive. In this paper, we propose a new anchor-free keypoint...	Doyen Sahoo, Steven C. H. Hoi, Xiongwei Wu
858	Simple Spectral Graph Convolution	Graph Convolutional Networks (GCNs) are leading methods for learning graph representations. However, without specially designed architectures, the performance of GCNs degrades quickly with increased depth. As the aggregated neighborhood size and neural network depth are two completely orthogonal aspects of graph representation, several methods focus on summarizing the neighborhood by aggregating K-hop neighborhoods of nodes while using shallow neural networks. However, these methods still encounter oversmoothing, and suffer from high...	Hao Zhu, Piotr Koniusz
859	Explainable Subgraph Reasoning for Forecasting on Temporal Knowledge Graphs	Modeling time-evolving knowledge graphs (KGs) has recently gained increasing interest. Here, graph representation learning has become the dominant paradigm for link prediction on temporal KGs. However, the embedding-based approaches largely operate in a black-box fashion, lacking the ability to interpret their predictions. This paper provides a link forecasting framework that reasons over query-relevant subgraphs of temporal KGs and jointly models the structural dependencies and the temporal dynamics. Especially, we propose a temporal...	Peng Chen, Volker Tresp, Yunpu Ma, Zhen Han
860	Evaluation of Neural Architectures trained with square Loss vs Cross-Entropy in Classification Tasks	Modern neural architectures for classification tasks are trained using the cross-entropy loss, which is widely believed to be empirically superior to the square loss. In this work we provide evidence indicating that this belief may not be well-founded. We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in the literature, perform comparably or...	Like Hui, Mikhail Belkin