Abstracts
Keynote Abstracts
Challenges in Privacy-preserving Data Analysis
Kamalika Chaudhuri, PhD, University of California San Diego
Machine learning algorithms increasingly work with sensitive information on individuals, and hence the problem of privacy-preserving data analysis—how to design data analysis algorithms that operate on the sensitive data of individuals while still guaranteeing the privacy of individuals in the data—has achieved great practical importance. In this talk, we address two problems in privacy-preserving data analysis.
First, we address the problem of privacy-preserving classification, and present an efficient classifier which is private in the differential privacy model of Dwork et al. Our classifier works in the ERM (empirical loss minimization) framework, and includes privacy preserving logistic regression and privacy preserving support vector machines.
We next address the question of learning from sensitive correlated data, such as private information on users connected together in a social network, and measurements of physical activity of a single user across time. We consider a recent generalization of differential privacy, called Pufferfish, that can be used to address privacy in correlated data and present new privacy mechanisms in this framework.
Joint work with Claire Monteleoni (George Washington University), Anand Sarwate (Rutgers), Yizhen Wang (UCSD) and Shuang Song (UCSD).
Machine Learning for Individualized Healthcare
Suchi Saria, PhD, Johns Hopkins University
Ever increasing quantities of health-related data are being collected in electronic health records (EHRs), by clinicians and through personal monitoring devices. I develop statistical learning algorithms for leveraging this data in optimizing decision-making in healthcare. In this talk, I plan to focus on the following overarching research question: how can we individualize prognosis tools using the diverse clinical data in EHRs? I will introduce challenges associated with personalization from observational clinical data. And, on the technical front, we will tackle questions such as:
i) how can diversity-in baseline health and in disease presentation-across individuals be modeled and estimated from data to individualize prognostic tools?
ii) how can we do joint analysis of heterogeneous data sources-demographics, history, biological markers, time-varying tests, treatments, and
iii) how do we develop models that generalize well to shifts in provider practice patterns.
Applications of Learning Theory in Algorithmic Game Theory
Tim Roughgarden, PhD, Stanford University
Algorithmic game theory is a field that uses and extends tools from economics and game theory to reason about fundamental computer science problems. The field is important both for its applications, which span the gamut from network routing to online advertising, and for its remarkably diverse and rich connections to other areas of theoretical computer science, including complexity theory and approximation algorithms. In this talk, we survey two ways in which definitions and tools from learning theory have been crucial to recent advances in algorithmic game theory. First, we outline a theory of robust bounds on the "price of anarchy"—meaning approximation guarantees for game-theoretic equilibria—that apply to all outcome sequences generated by no-regret learners playing a multi-player game. Second, we explain how to use concepts from learning theory to make traditional (Bayesian) optimal auction theory operational, replacing the practically problematic "common prior" assumption with a data-driven approach.
Spotlight Abstracts
Corralling a Band of Bandit Algorithms
Haipeng Luo, PhD, Microsoft Research, New York, New York
We study the problem of combining multiple bandit algorithms (that is, online learning algorithms with partial feedback) with the goal of creating a master algorithm that performs almost as well as the best base algorithm if it were to be run on its own. The main challenge is that when run with a master, base algorithms unavoidably receive much less feedback and it is thus critical that the master not starve a base algorithm that might perform uncompetitively initially but would eventually outperform others if given enough feedback. We address this difficulty by devising a version of Online Mirror Descent with a special mirror map together with a sophisticated learning rate scheme. We show that this approach manages to achieve a more delicate balance between exploiting and exploring base algorithms than previous works yielding superior regret bounds.
Our results are applicable to many settings, such as multi-armed bandits, contextual bandits, and convex bandits. As examples, we present two main applications. The first is to create an algorithm that enjoys worst-case robustness while at the same time performing much better when the environment is relatively easy. The second is to create an algorithm that works simultaneously under different assumptions of the environment, such as different priors or different loss structures.
Coauthors: Alekh Agarwal1, Behnam Neyshabur2, and Robert E. Schapire1.
1. Microsoft Research, New York, New York.
2. Toyota Technological Institute, Chicago, Illinois.
Edward: A Library for Probabilistic Modeling, Inference, and Criticism
Dustin Tran, MS, Columbia University, New York, New York
Probabilistic modeling is a powerful approach for analyzing empirical information. In this talk, I will provide an overview of Edward, a software library for probabilistic modeling. Formally, Edward is a probabilistic programming system built on computational graphs, supporting compositions of both models and inference for flexible experimentation. For example, Edward makes it easy to fit the same model using a variety of composable inferences, ranging from point estimation, to variational inference, to MCMC. Edward is also integrated into TensorFlow, providing significant speedups over existing probabilistic systems. As examples, I will show how Edward can be leveraged for expanding the frontier of variational inference and deep generative models.
Coauthors: Matthew D. Hoffman2,3, Kevin Murphy2, Eugene Brevdo2, Rif A. Saurous2, and David M. Blei1.
1. Columbia University, New York, New York.
2. Google Brain, Mountain View, California.
3. Adobe Research, Mountain View, California.
Generalized Topic Modeling
Nika Haghtalab, MMath, Carnegie Mellon University, Pittsburgh, Pennsylvania
Recently there has been significant activity in developing algorithms with provable guarantees for topic modeling. In standard topic models, a topic (such as sports, business, or politics) is viewed as a probability distribution over words, and a document is generated by first selecting a mixture over topics, and then generating words i.i.d. from the associated mixture . Given a large collection of such documents, the goal is to recover the topic vectors and then to correctly classify new documents according to their topic mixture.
In this work we consider a broad generalization of this framework in which words are no longer assumed to be drawn i.i.d. and instead a topic is a complex distribution over sequences of paragraphs. Since one could not hope to even represent such a distribution in general (even if paragraphs are given using some natural feature representation), we aim instead to directly learn a document classifier. That is, we aim to learn a predictor that given a new document, accurately predicts its topic mixture, without learning the distributions explicitly. We present several natural conditions under which one can do this efficiently and discuss issues such as noise tolerance and sample complexity in this model. More generally, our model can be viewed as a generalization of the multi-view or co-training setting in machine learning.
Coauthor: Avrim Blum, PhD, Carnegie Mellon University.
A New Theory of Exploration in Reinforcement Learning with Function Approximation
Nan Jiang, MSc, University of Michigan, Ann Arbor, Michigan
In reinforcement learning, autonomous agents solve sequential decision-making problems by (1) actively exploring the environment to collect data, and (2) improving behavior by learning from the collected data. The recent success of reinforcement learning can largely be attributed to the use of advanced function approximation techniques (e.g., deep neural networks) for the learning component. In contrast, advances in exploration techniques have been rather limited: most existing algorithms that perform systematic exploration only apply to simple problems with small state spaces and cannot accommodate sophisticated function approximation schemes that are critical for solving real-world problems with rich observations.
In this work, we take a unified view of reinforcement learning with rich observations and function approximation through a new model called the Contextual Decision Process (CDP), which encompasses existing models such as MDPs and POMDPs. Several special cases of CDPs are, however, provably intractable in terms of the amount of data needed to find the near-optimal behavior. To overcome this challenge, we identify a structural property, called the Bellman rank, which is naturally small in a range of important reinforcement learning scenarios. We propose a new algorithm, whose sample complexity scales polynomially in the Bellman rank and other relevant parameters, and crucially has no dependence on the number of unique observations. The algorithm uses Bellman error minimization with optimistic exploration, and provides new insights into exploration in complex reinforcement learning problems.
Coauthors: Akshay Krishnamurthy1, Alekh Agarwal2, John Langford2, and Robert E. Schapire2.
1. University of Massachusetts, Amherst, Massachusetts.
2. Microsoft Research, New York, New York.
Unsupervised Learning of Word-Sequence Representations from Scratch via Convolutional Tensor Decomposition
Furong Huang, PhD, Microsoft Research, New York, New York
Unsupervised text embedding extraction is crucial for text understanding in machine learning. Word2Vec and its variations are successful in mapping words with similar syntactic or semantic meaning to vectors close to each other. However, extracting context-aware word-sequence embedding remains a challenging task. Training over large corpus is difficult as labels are difficult to get. More importantly, it is challenging for pre-trained models to obtain word-sequence embeddings that are universally good for all downstream tasks or for any new datasets.
We propose a two-phased ConvDic+DeconvDec framework to solve the problem by combining a word-sequence dictionary learning model with a word-sequence embedding decode model. We propose a convolutional tensor decomposition mechanism to learn good word-sequence phrase dictionary in the learning phase. It is proved to be more accurate and much more efficient than the popular alternating minimization method. In the decode phase, we introduce a deconvolution framework that is immune to the problem of varying sentence lengths. The word-sequence embeddings we extracted using ConvDic+DeconvDec are universally good for a few downstream tasks we test on. The framework requires neither pre-training nor prior/outside information.
Coauthor: Anima Anandkumar, University of California, Irvine.
Oracle-Efficient Learning and Auction Design
Nika Haghtalab, MMath, Carnegie Mellon University, Pittsburgh, Pennsylvania
We consider the design of online no-regret algorithms that are computationally efficient, given access to an offline optimization oracle. Our first main contribution is introducing an oracle-based online algorithm and conditions under which it achieves vanishing regret. Our learning algorithm is a generalization of the Follow-The-Perturbed-Leader algorithm of Kalai and Vempala that at every step plays the best-performing action subject to some independent random perturbations. Our design uses a shared source of randomness across all actions that can be efficiently implemented by adding to the history of the game a set of adversary's actions from a carefully designed distribution. Our work extends to oracle-efficient algorithms for contextual learning, learning with Maximal-in-Range approximation algorithms, and no-regret bidding in simultaneous auctions, answering an open problem of Daskalakis and Syrgkanis in the latter case.
Our second main contribution is introducing a new adversarial auction-design framework for revenue maximization and applying our oracle-efficient learning results to adaptive auction design. We give oracle-efficient learning results for: (1) VCG auctions with bidder-specific reserves in single-parameter settings, (2) envy-free item pricing in multi-item auctions, and (3) s-level auctions of Morgenstern and Roughgarden for single-item settings. The last result leads to an approximation of the optimal Myerson auction for the stationary distribution of a Markov process, extending prior work that only gave such guarantees for the i.i.d. setting. We also extend our framework to allow the auctioneer to use side information about the bidders in the design of the optimal auction (contextual learning).
Coauthors: Miroslav Didík1, Haipeng Luo1, Robert E. Schapire1, Vasilis Syrgkanis2, and Jennifer Wortman Vaughan1.
1. Microsoft Research, New York, New York.
2. Microsoft Research, Cambridge, Massachusetts.
AdaCluster: Adaptive Clustering for Heterogeneous Data
Mehmet E. Basbug, PhD, Lion Cave Capital, Chatham, New Jersey, and Princeton University
Clustering algorithms start with a fixed divergence, which captures the possibly asymmetric distance between a sample and a centroid. In the mixture model setting, the sample distribution plays the same role. When all attributes have the same topology and dispersion, the data are said to be homogeneous. If the prior knowledge of the distribution is inaccurate or the set of plausible distributions is large, an adaptive approach is essential. The motivation is more compelling for heterogeneous data, where the dispersion or the topology differs among attributes. We propose an adaptive approach to clustering using classes of parametrized Bregman divergences. We first show that the density of a steep exponential dispersion model (EDM) can be represented with a Bregman divergence. We then propose AdaCluster, an expectation-maximization (EM) algorithm to cluster heterogeneous data using classes of steep EDMs. We compare AdaCluster with EM for a Gaussian mixture model on synthetic data and nine UCI data sets. We also propose an adaptive hard clustering algorithm based on Generalized Method of Moments. We compare the hard clustering algorithm with k-means on the UCI data sets. We empirically verified that adaptively learning the underlying topology yields better clustering of heterogeneous data.
Coauthor: Barbara E. Engelhardt, Princeton University.
Time Series Prediction and Online Learning: Model Selection and Ensemble Learning
Vitaly Kuznetsov, PhD, Google Research, New York, New York
We present a series of theoretical and algorithmic results for time series prediction leveraging recent advances in the statistical learning analysis of this problem and on-line learning. We prove the first generalization bounds for a hypothesis derived by online-to-batch conversion of the sequence of hypotheses output by an online algorithm, in the general setting of a non-stationary non-mixing stochastic process. Our guarantees hold for adapted sequences of hypotheses both for convex and non-convex losses. We give generalization bounds for sequences of hypotheses that may not be adapted but that admit a stability property. Our bounds are given in terms of a discrepancy measure, which we show can be accurately estimated from data under a mild assumption.
We also highlight the application of our results to two related problems: model selection in time series prediction, and the design of accurate ensembles of time series predictors. Model selection for time series prediction appears to be a difficult task: in contrast with the i.i.d. scenario, in time series analysis, there is no straightforward method for splitting a sample into a training and validation sets. Using the most recent data for validation may result in models that ignore the most recent information. Validating over the most distant past may lead to selecting sub-optimal parameters. Any other split of the sample may result in the destruction of important statistical patterns and correlations across time. We show that, remarkably, our on-line-to-batch conversions enable us to use the same time series for both training and model selection.
Coauthors: Mehryar Mohri, Google Research and Courant Institute, New York, New York.
The Temporal Neural Coding Network: Towards Lifelong Language Learning
Alexander G. Ororbia II, BS, Pennsylvania State University
We present a novel lifelong neural architecture, the Temporal Neural Coding Network (TNCN), and its learning algorithm, for uncovering multiple levels of distributed representations for language data streams. The TNCN model adapts its parameters iteratively as new samples are observed without resorting to the popular, but expensive back-propagation through time procedure needed to calculate gradients for recurrent neural networks, notably requiring no unrolling of internally defined recurrence relations. Furthermore, we integrate concepts from variational encoder-decoder models, such as those in [1], to easily demonstrate how uncertainty can be characterized in the model's next step predictions. We discuss how the proposed TNCN works specifically on language-based tasks formulated in the streaming setting, compared to various neural models, as well as how the TNCN may be easily adapted to other real-time tasks, such as video sequence modeling, yielding a natural framework for online multi-modal and online semi-supervised learning.
Coauthors: David Reitter and C. Lee Giles, Pennsylvania State University.
References: Serban, Iulian; Ororbia II, Alexander G.; Pineau, Joelle; Courville, Aaron. Multi-modal Variational Encoder-Decoders. 2017. arXiv:1612.00377 [cs.CL]
The Topology of Time Series Networks
Ben Cassidy, PhD, Columbia University, New York, New York
Large high-dimensional networks constructed from time series arise naturally in many areas of machine learning such as genomics, neuroimaging, econometrics, and social networks. But the interpretation and comparison of differently sized networks remains problematic, since many measures assume the networks to be compared are defined on the same node set, or those measures scale with network (graph) size. A further problem is to meaningfully compare between dense and sparse time series networks.
In this work we introduce a new network comparison method based on Persistent Homology, an approach from Topological Signal Processing. Persistent homology studies the mesoscopic architecture within networks, to find distributed patterns that appear or disappear over a range of scales.
We apply the new method to compare networks from neuroimaging (brain activity networks) and econometrics datasets; we show which fundamental topological features are equivalent or different between networks of different sizes, and between sparse and dense networks. In the process we explain why many existing time series network identification methods fail in real world circumstances.
Coauthors: DuBois Bowman1, Goran Marjanovic2, and Victor Solo2,3.
1. Columbia University, New York, New York.
2. The University of New South Wales, Sydney.
3. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Charlestown, Massachusetts.
AdaNet: Adaptive Structural Learning of Artificial Neural Networks
Scott Yang, BS, Courant Institute, New York, New York
We present a new theoretical framework for building and learning artificial neural networks. Formulating the configuration of a neural network as an effective parametrized model for supervised learning, we design a method for training neural networks that adapts model structure and complexity to the difficulty of the particular problem, with no pre-defined architecture.
Starting from a simple single layer network, we add additional neurons and layers to the model using rigorous estimates of the generalization ability from statistical learning theory. Specifically, we determine whether to add new nodes to the network by measuring the incremental reduction in empirical error against the additional model complexity.
This structural risk minimization-type approach is in contrast to existing work on structure learning, which has been either based on heuristics for growing and pruning the network or focused on designing more efficient methods for large-scale hyperparameter tuning. The former is unable to provide concrete theoretical guarantees, and the latter is wasteful of data and statistically equivalent to standard grid search.
Our structure learning algorithm is based on directly optimizing for generalization performance. This allows us to not only enforce model sparsity but to also provide strong accompanying generalization bounds, in contrast to previous work in this area. Remarkably, the resulting method is also convex and hence more stable than many of the current deep learning methodologies employed. Preliminary experiments support and validate our theory and framework.
Coauthors: Corinna Cortes1, Xavi Gonzalvo1, Vitaly Kuznetsov1, and Mehryar Mohri1,2.
1. Google Research, New York, New York.
2. Courant Institute, New York, New York.
Learning with Rejection
Giulia DeSalvo, BS, New York University Courant Institute, New York, New York
We consider a flexible binary classification scenario where the learner is given the option to reject an instance instead of returning its prediction, thereby incurring a cost. We introduce a novel framework for this setting that consists of simultaneously learning two functions: a classifier along with a rejection function. We present a full theoretical analysis of this framework including new data-dependent learning bounds in terms of the Rademacher complexities of the classifier and rejection families as well as consistency and calibration results. These theoretical guarantees guide us in designing new algorithms that can exploit different kernel-based hypothesis sets for the two sets of classifier and rejection functions. We also present a boosting-style algorithm where at each round, it selects a pair of functions, a base classifier and a base rejection function. We give its convergence guarantees along with a linear-time weak-learning algorithm for rejection stumps. We compare and contrast to the special case of confidence-based rejection, devising alternative loss functions and algorithms for this setting as well. We report the results of several experiments showing that our algorithms can yield a notable improvement over the best existing confidence-based rejection algorithm.
Coauthors: Corinna Cortes2 and Mehryar Mohri1.
1. New York University Courant Institute, New York, New York.
2. Google Research, New York, New York.
Reconstructing Hidden Permutations Using the Average-Precision (AP) Correlation Statistic
Lorenzo De Stefani, PhD, Brown University, Providence, Rhode Island
We study the problem of learning probabilistic models for permutations, where the order between highly ranked items in the observed permutations is more reliable (i.e., consistent in different rankings) than the order between lower ranked items, a typical phenomena observed in many applications such as web search results and product ranking. We introduce and study a variant of the Mallows model where the distribution is a function of the widely used Average-Precision (AP) Correlation statistic, instead of the standard Kendall's tau distance.
We present a generative model for constructing samples from this distribution and prove useful properties of that distribution. Using these properties we develop an efficient algorithm that provably computes an asymptotically unbiased estimate of the center permutation, and a faster algorithm that learns with high probability the hidden central permutation for a wide range of the parameters of the model.
We show experimentally the accuracy of our estimators in the reconstruction of the hidden permutation with a limited number of samples. We also show with real data that unsupervised methods based on our model can precisely (and efficiently) identify ground-truth clusters of rankings. Compared to the Kendall's tau based methods, our methods are less affected by noise in low-rank items.
Finally, we show that supervised classification algorithms based on the AP statistic outperform Kendall's tau distance based algorithms by comparing their performance in classifying large genomic expressions from ten publicly available datasets from human cancer research.
Appeared in the proceedings of the Thirtieth (AAAI) Conference on Artificial Intelligence.
Coauthors: Alessandro Epasto2, Eli Upfal1, and Fabio Vandin3.
1. Brown University, Providence, Rhode Island.
2. Google Research, New York, New York.
3. Univeristà degli Studi di Padova, Padova, Italy.
Human Teaching by Demonstration: Showing versus Doing Reinforcement Learning Tasks
Mark K. Ho, ScM, Brown University, Providence, Rhode Island
What is the relationship between doing a task and showing another agent how to do a task? In inverse reinforcement learning, an algorithm attempts to learn a policy or task representation by observing example demonstrations. Typically, these example demonstrations are sampled from the optimal policy for a task. That is, they are samples of an expert "doing" a task. However, when teaching by demonstration, people often do not simply do a task but actively attempt to show others how to do a task. A better understanding of this showing behavior can be used to improve inverse reinforcement learning algorithms.
In this work, we present a computational account of showing as a form of Bayesian pedagogy: optimally selecting sequences of behavior that lead an observer to infer a particular underlying reward function. This formulation can be understood as "planning in a learner's belief space" and extends previous accounts of teaching by example to sequential domains. In several human experiments, we demonstrate that people's behavior when "showing" a task closely matches this model, while their "doing" a task does not. Similarly, we show that these human-generated showing trajectories are more effective for training standard inverse reinforcement learning algorithms. Moreover, this work provides a basis for developing inverse reinforcement learning algorithms that can benefit from reasoning about intentional teaching.
Coauthors: Michael Littman1, James MacGlashan1, Fiery Cushman2, and Joseph L. Austerweil3.
1. Brown University, Providence, Rhode Island.
2. Harvard University, Cambridge, Massachusetts.
3. University of Wisconsin–Madison, Madison, Wisconsin.