Sum-Product Networks: Deep Models with Tractable Inference
Pedro Domingos, PhD, University of Washington, Seattle, Washington, United States
Big data makes it possible in principle to learn very rich probabilistic models, but inference in them is prohibitively expensive. Since inference is typically a subroutine of learning, in practice learning such models is very hard. Sum-product networks (SPNs) are a new model class that squares this circle by providing maximum flexibility while guaranteeing tractability. In contrast to Bayesian networks and Markov random fields, SPNs can remain tractable even in the absence of conditional independence. SPNs are defined recursively: an SPN is either a univariate distribution, a product of SPNs over disjoint variables, or a weighted sum of SPNs over the same variables. It's easy to show that the partition function, all marginals and all conditional MAP states of an SPN can be computed in time linear in its size. SPNs have most tractable distributions as special cases, including hierarchical mixture models, thin junction trees, and nonrecursive probabilistic context-free grammars. I will present generative and discriminative algorithms for learning SPN weights, and an algorithm for learning SPN structure. SPNs have achieved impressive results in a wide variety of domains, including object recognition, image completion, collaborative filtering, and click prediction. Our algorithms can easily learn SPNs with many layers of latent variables, making them arguably the most powerful type of deep learning to date. (Joint work with Rob Gens and Hoifung Poon.)
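To make the recursive definition concrete, here is a minimal Python sketch (an illustration, not the authors' implementation): leaves are univariate distributions, product nodes multiply children defined over disjoint variables, and sum nodes take weighted sums of children over the same variables. A single bottom-up pass evaluates every node once, which is why the partition function and marginals cost time linear in the network size.

```python
# Minimal SPN sketch following the recursive definition in the abstract.

class Leaf:
    """Univariate distribution over one variable (here: a Bernoulli)."""
    def __init__(self, var, p):
        self.var, self.p = var, p

    def value(self, evidence):
        x = evidence.get(self.var)          # None = variable marginalized out
        if x is None:
            return 1.0                      # distribution sums to 1 over its support
        return self.p if x == 1 else 1.0 - self.p

class Product:
    """Product of child SPNs over disjoint sets of variables."""
    def __init__(self, children):
        self.children = children

    def value(self, evidence):
        v = 1.0
        for child in self.children:
            v *= child.value(evidence)
        return v

class Sum:
    """Weighted sum of child SPNs over the same set of variables."""
    def __init__(self, weighted_children):
        self.weighted_children = weighted_children  # list of (weight, child)

    def value(self, evidence):
        return sum(w * c.value(evidence) for w, c in self.weighted_children)

# A naive-Bayes-style mixture over two binary variables, written as an SPN:
spn = Sum([(0.6, Product([Leaf("A", 0.9), Leaf("B", 0.2)])),
           (0.4, Product([Leaf("A", 0.1), Leaf("B", 0.7)]))])

print(spn.value({"A": 1, "B": 0}))  # joint probability P(A=1, B=0)
print(spn.value({"A": 1}))          # marginal P(A=1): B summed out in one pass
```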
Overcoming Computational Hardness by Agnostic Non-Proper Learning
Elad Hazan, PhD, Princeton University Department of Computer Science and Microsoft Research, Herzliya, Israel
Numerous learning problems are naturally modeled by NP-hard formulations. Examples include predicting outcomes in a sports tournament, learning the preferences of users in media recommendation systems, and learning the optimal ranking of web search results. Recent literature overcomes the computational hardness for these and other problems through convex optimization and regret minimization techniques, giving rise to agnostic non-proper learning algorithms that are both efficient and come with provable generalization guarantees. We describe a recent example: the problem of learning from low-rank missing data arising in recommendation systems.
Coauthors: Roi Livni and Yishay Mansour, Tel Aviv University Department of Computer Science and Microsoft Research, Herzliya, Israel.
Deep Generative Models
Yoshua Bengio, PhD, Université de Montréal, Québec, Canada
Boltzmann machines and their variants (restricted or deep) have long been the dominant generative neural network models, and they are appealing, among other things, because of their relative biological plausibility (say, compared to back-prop). We start this presentation by discussing some of the difficulties we encountered in training them, and undirected graphical models in general, and ask whether there exist credible alternatives that avoid the issues with the partition function gradient and with mixing between modes in MCMC methods. We review advances of recent years in training deep unsupervised models that capture the data distribution, all related to auto-encoders, and that avoid the partition function and MCMC issues. In particular, recent theoretical and empirical work on denoising auto-encoders can be extended to train deep generative models with latent variables. The presentation will end with an introduction to ongoing exploration of a framework that is meant to be more biologically plausible as well as an alternative to both Boltzmann machines and back-propagation.
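As a point of reference for the denoising auto-encoder building block mentioned above, the following is a minimal numpy sketch of the basic objective (corrupt the input, reconstruct the clean version, back-propagate the reconstruction error). It illustrates only the elementary single-layer case; the deep, latent-variable generative extensions discussed in the talk go well beyond this.

```python
import numpy as np

rng = np.random.default_rng(0)
X = (rng.random((256, 64)) > 0.5).astype(float)   # toy binary data
d, h, lr = X.shape[1], 32, 0.1
W = rng.normal(0, 0.01, (d, h))                   # tied encoder/decoder weights
b, c = np.zeros(h), np.zeros(d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(50):
    noisy = X * (rng.random(X.shape) > 0.3)       # masking corruption
    H = sigmoid(noisy @ W + b)                    # encoder
    Xhat = sigmoid(H @ W.T + c)                   # decoder (tied weights)
    grad_out = (Xhat - X) / len(X)                # d(cross-entropy)/d(logits)
    grad_pre = (grad_out @ W) * H * (1 - H)       # back-prop into encoder
    W -= lr * (noisy.T @ grad_pre + grad_out.T @ H)
    b -= lr * grad_pre.sum(0)
    c -= lr * grad_out.sum(0)
    if epoch % 10 == 0:
        loss = -(X * np.log(Xhat + 1e-9)
                 + (1 - X) * np.log(1 - Xhat + 1e-9)).mean()
        print(epoch, loss)                        # reconstruction error falls
```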
Achieving All with No Parameters: Adaptive NormalHedge
Haipeng Luo, Department of Computer Science, Princeton University
The problem of predicting with expert advice was pioneered in [1, 2, 3, 4] about two decades ago, but it continues to pose new challenges driven by both theory and practice, and has seen advances from several directions in recent years. In this work, we address the challenging problem of catching any unknown competitor with zero prior information, even when the competitor changes over time. Our main contribution is a novel expert algorithm that is truly parameter-free and adaptive to the environment.
Roughly speaking, in the expert problem, a player tries to lose as little as possible by cleverly spreading a fixed amount of money to bet on a set of experts each day. The player's goal is to have small regret, i.e., a total loss that is not much worse than that of any single expert, or more generally, of any fixed and unknown convex combination of experts that we want to compare to. When this competitor is known, the well-known exponential weights algorithm already gives the optimal results [2, 5]. However, when the competitor is unknown beforehand, existing algorithms fail to do the job, let alone handle the case where the unknown competitor is allowed to vary over time.
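For concreteness, here is a minimal sketch of the exponential weights (Hedge) strategy referred to above, with the standard learning-rate tuning for losses in [0, 1] and a known horizon T (both assumptions of this illustration). The learning rate eta is exactly the kind of parameter that the algorithm introduced below dispenses with.

```python
import numpy as np

def exp_weights(loss_matrix):             # loss_matrix: T x N losses in [0, 1]
    T, N = loss_matrix.shape
    eta = np.sqrt(2.0 * np.log(N) / T)    # standard tuning for a known horizon
    cum_loss = np.zeros(N)
    player_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * cum_loss)       # weight experts by past performance
        p = w / w.sum()                   # bet distribution over experts
        player_loss += p @ loss_matrix[t]
        cum_loss += loss_matrix[t]
    return player_loss, cum_loss.min()

rng = np.random.default_rng(0)
total, best = exp_weights(rng.random((1000, 10)))
print("regret vs. best expert:", total - best)
```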
We address this problem by proposing an improved version of the NormalHedge.DT algorithm [6], called adaptive NormalHedge (or AdaNormalHedge for short). On one hand, this new algorithm is completely parameter-free and able to compete with any convex combination of experts, with a regret in terms of the relative entropy between the prior and the competitor. On the other hand, it ensures a new regret bound in terms of the cumulative magnitude of the instantaneous regrets, which is always at most the bound for NormalHedge.DT. More importantly, this new form of regret implies 1) a small regret when the loss of the competitor is small and 2) an almost constant regret when the losses are stochastically generated. This resolves an open problem proposed by [7] and [8]. In fact, our results are even better and more general.
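The following sketch illustrates a NormalHedge-style potential-based update consistent with the description above; the specific potential Phi(R, C) = exp([R]_+^2 / (3C)) and the halved-difference weighting are assumptions of this illustration rather than code from the paper. R tracks each expert's cumulative instantaneous regret, C the cumulative magnitude of those regrets, and no learning rate appears anywhere.

```python
import numpy as np

def potential(R, C):
    # Phi(R, C) = exp([R]_+^2 / (3C)); only ever called with C >= 1 below.
    return np.exp(np.maximum(R, 0.0) ** 2 / (3.0 * C))

def ada_normal_hedge(loss_matrix, prior=None):
    T, N = loss_matrix.shape
    prior = np.full(N, 1.0 / N) if prior is None else prior
    R = np.zeros(N)                 # cumulative instantaneous regret per expert
    C = np.zeros(N)                 # cumulative |instantaneous regret|
    player_loss = 0.0
    for t in range(T):
        # Parameter-free weights from a discrete derivative of the potential.
        w = prior * (potential(R + 1, C + 1) - potential(R - 1, C + 1)) / 2
        p = w / w.sum() if w.sum() > 0 else prior
        r = p @ loss_matrix[t] - loss_matrix[t]   # instantaneous regrets
        player_loss += p @ loss_matrix[t]
        R += r
        C += np.abs(r)
    return player_loss

rng = np.random.default_rng(0)
print(ada_normal_hedge(rng.random((1000, 10))))
```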
We then extend the results to the so-called sleeping expert setting and provide two applications to illustrate the power of AdaNormalHedge: 1) competing with time-varying unknown competitors and 2) predicting almost as well as the best pruning tree. Our results on these applications significantly improve on previous work in several respects, and a special case of the first application resolves another open problem on whether one can simultaneously achieve optimal shifting regret for both adversarial and stochastic losses [9].
Coauthor: Robert E. Schapire, Department of Computer Science, Princeton University and Microsoft Research, New York City.
[1] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.
[2] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.
[3] Nicolo Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, May 1997.
[4] V. G. Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2):153–173, April 1998.
[5] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.
[6] Haipeng Luo and Robert E. Schapire. A drifting-games analysis for online learning and applications to boosting. In Advances in Neural Information Processing Systems 27, 2014.
[7] Kamalika Chaudhuri, Yoav Freund, and Daniel Hsu. A parameter-free hedging algorithm. In Advances in Neural Information Processing Systems 22, 2009.
[8] Alexey Chernov and Vladimir Vovk. Prediction with advice of unknown number of experts. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010.
[9] Manfred K. Warmuth and Wouter M. Koolen. Open problem: Shifting experts on easy data. In Proceedings of the 27th Annual Conference on Learning Theory, 2014.
Approximate Kernel Methods for Speech Recognition
In this work, we investigate how to scale kernel methods to large-scale problems in speech recognition and computer vision, and compare their performance with deep neural networks (DNNs). We show that on these tasks, our methods are competitive with well-engineered neural networks.
The difficulty in scaling kernel methods is that their running times are generally at least quadratic in the size of the training set; simply computing the kernel matrix takes O(N²d) time and O(N²) space, where N is the training set size and d is the dimensionality of the data. This makes standard algorithms like kernel SVM intractable for large-scale problems. In this work, we investigate whether it is possible to use approximation techniques to help scale kernel methods, in the context of the acoustic modeling problem in speech recognition, as well as the digit and object recognition problems from computer vision. Our work builds on the kernel approximation method proposed by Rahimi and Recht, which uses random projections, passed through a cosine non-linearity, to generate feature representations. In our work, we use these features to train a multinomial logistic regression model. We leverage GPUs extensively during training, both for generating the random features and for model updates.
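The random feature construction described above can be sketched in a few lines. This is a generic illustration of the Rahimi-Recht approximation for the RBF kernel k(x, y) = exp(-||x - y||² / (2σ²)); the kernels, bandwidths, and feature counts used in the actual experiments are not specified here. With D random features, training costs O(NDd) rather than the O(N²d) needed for the exact kernel matrix.

```python
import numpy as np

def random_fourier_features(X, D, sigma, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = rng.normal(0.0, 1.0 / sigma, size=(d, D))   # samples from the kernel's spectrum
    b = rng.uniform(0.0, 2 * np.pi, size=D)         # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)     # cosine non-linearity

X = np.random.default_rng(1).normal(size=(500, 20))
Z = random_fourier_features(X, D=2000, sigma=1.0)
approx = Z @ Z.T                                    # inner products approximate the kernel
exact = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2.0)
print(np.abs(approx - exact).max())                 # small for large D
```

These features Z would then be fed to a multinomial logistic regression model, as the abstract describes.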
Our contributions are as follows: First, we propose ways of incorporating ideas from multiple kernel learning (MKL) in the context of the "Random Kitchen Sink" (RKS) features of Rahimi and Recht. Second, we show how to scale training to very large numbers of random features, by training separate models on disjoint subsets of the features in parallel, and then combining the models. Third, we provide extensive experiments which compare these kernel methods with deep neural networks, on two speech recognition datasets, as well as two computer vision tasks, and show that these methods match or surpass the DNNs in many cases. Lastly, we show that the representations learned from DNNs appear to be complementary to the RKS features, as combining these models achieves better performance than any single model.
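The second contribution, training separate models on disjoint feature blocks and then combining them, might look roughly as follows; averaging the class probabilities is an assumed combination rule for this illustration, since the abstract does not spell out the exact scheme.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_on_blocks(Z, y, n_blocks):
    # Split the feature columns into disjoint blocks; each block's model
    # can be trained independently, hence in parallel across machines.
    blocks = np.array_split(np.arange(Z.shape[1]), n_blocks)
    models = [LogisticRegression(max_iter=500).fit(Z[:, idx], y)
              for idx in blocks]
    return blocks, models

def predict(blocks, models, Z):
    # Combine per-block models by averaging their class probabilities.
    probs = [m.predict_proba(Z[:, idx]) for idx, m in zip(blocks, models)]
    return np.mean(probs, axis=0).argmax(axis=1)

rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 40))                   # e.g., random features
y = (Z[:, 0] + Z[:, 20] > 0).astype(int)
blocks, models = train_on_blocks(Z, y, n_blocks=4)
print((predict(blocks, models, Z) == y).mean())  # training accuracy
```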
Our empirical work is as follows: For the speech recognition problem, we worked with the IARPA Babel Program Cantonese (IARPA-babel101-v0.4c) and Bengali (IARPA-babel103b-v0.4b) limited language packs. Each pack contains a 20-hour training set and a 20-hour test set. We show that although DNN models often attain lower perplexity ("Perp") and higher frame-level state accuracy ("Accuracy"), the kernel models are quite competitive in these metrics, and sometimes beat the DNNs in the token error rates ("TER") measured after speech recognition is performed. See the table below for details.
For the computer vision problem, we worked with the MNIST-8M dataset for digit recognition and the CIFAR-10 dataset for object recognition. We observe very comparable performance between these methods, with kernel methods outperforming DNNs on CIFAR-10, but DNNs winning on MNIST-8M. See detailed results in the table below. It is important to acknowledge that convolutional neural networks (CNNs) can significantly outperform DNNs on these tasks, and thus these results are not state of the art. However, they do indicate comparable performance between DNNs and kernels in a challenging domain.
These results call into question the assumption commonly made today that deep architectures are necessary to achieve state of the art results in speech recognition and computer vision. They open the door for new kernel-based methods to be explored for speech recognition and computer vision, as well as in other domains where deep architectures are accepted as the de facto standard.
Coauthors: Zhiyun Lu¹†, Avner May²†*, Kuan Liu¹‡, Alireza Bagheri Garakani¹‡, Dong Guo¹‡, Aurélien Bellet⁴‡, Linxi Fan²‡, Michael Collins², Brian Kingsbury³, Michael Picheny³, and Fei Sha¹
1 Department of Computer Science, University of Southern California, Los Angeles, CA
2 Department of Computer Science, Columbia University, New York, NY
3 IBM T.J. Watson Research Center, Yorktown Heights, NY
4 LTCI UMR 5141, Télécom ParisTech & CNRS, France
† and ‡: Shared first and second co-authorships, respectively.
Probabilistic Bayesian Analysis of Genetic Associations with Clinical Features in Cancer
Melanie F. Pradier, Universidad Carlos III de Madrid and Memorial Sloan-Kettering Cancer Center
Understanding key genotype-phenotype relationships remains one of the central challenges in biomedical research [1]. Such studies make it possible to identify genetic risk factors in patients and to improve current clinical diagnosis. They also give more insight into the diseases at hand, which may be valuable for marker discovery or treatment personalization. This work deals with the task of finding associations between genetic variants and clinical features in cancer. The features considered are extracted directly from the patients' Electronic Health Records (EHR). To the best of our knowledge, no previous work considers such high-level features.
We first present a classical approach to finding pair-wise associations using Linear Mixed Models (LMMs) [2]. Such models are often used because of their capacity to deal with confounding effects, like population structure, which can cause false positive associations if ignored. Known covariates such as age, gender, or cancer type are accounted for as structured noise in the model. We estimate additional hidden confounders using PANAMA (Probabilistic ANAlysis of GenoMic Data), which is a statistical model often used in expression quantitative trait loci (eQTL) studies [3]. This model combines a Gaussian Process Latent Variable Model of the phenotypes with an LMM, treating the obtained latent projections as noise or signal in an iterative fashion.
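As a rough illustration of the single-variant LMM test described above, consider the following sketch. Fixing the variance components and omitting the PANAMA-style latent confounders are simplifications of this example, not features of the authors' pipeline.

```python
import numpy as np
from scipy import stats

def lmm_pvalues(y, snps, covariates, K, sg2=0.5, se2=0.5):
    """Wald test for each variant under y ~ N(X*beta, sg2*K + se2*I)."""
    N = len(y)
    V = sg2 * K + se2 * np.eye(N)          # covariance: kinship + iid noise
    L = np.linalg.cholesky(V)
    yw = np.linalg.solve(L, y)             # whiten, then run ordinary least squares
    pvals = []
    for s in snps.T:                       # test one variant at a time
        X = np.linalg.solve(L, np.column_stack([np.ones(N), covariates, s]))
        beta, *_ = np.linalg.lstsq(X, yw, rcond=None)
        r = yw - X @ beta
        dof = N - X.shape[1]
        sigma2 = (r @ r) / dof
        se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[-1, -1])
        t = beta[-1] / se                  # test statistic for the variant effect
        pvals.append(2 * stats.t.sf(abs(t), dof))
    return np.array(pvals)

rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(100, 50)).astype(float)   # toy genotypes 0/1/2
K = np.corrcoef(G)                                     # toy kinship matrix
print(lmm_pvalues(rng.normal(size=100), G, rng.normal(size=(100, 2)), K)[:5])
```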
The first approach considers neither epistasis, i.e., complex interactions between genetic variants, nor pleiotropy, i.e., multiple traits being influenced by the same mutation. Also, complex diseases such as breast cancer often present very heterogeneous phenotypes, which might cause some associations to remain hidden [4]. Therefore, we propose an alternative approach in which multiple clinical features and multiple genetic markers are tested all together. Two infinite mixture models are used to identify genetic and clinical patterns independently [5]. The previous LMM approach can then be used with the mixture assignment variables as input, accounting for both groups of clinical features and groups of genetic variants, while identifying sub-phenotypes in the population.
The LMM also assumes that the features are Gaussian-distributed, an assumption that does not hold for most clinical features. A common practice is to apply a rank transformation to the features to make them Gaussian-distributed, despite the information loss. We additionally propose an alternative approach that uses a joint Bayesian partition model to generate both the discrete clinical features and the genetic information. Bayesian modeling has already proved useful in epistasis [6], pleiotropy [6,7], and sub-phenotyping [8,9] applications. Our model combines ideas from these previous contributions and is the first to deal with clinical text data and genetic information, capturing multiple-marker interactions beyond additive effects.
Coauthors: Fernando Perez-Cruz, Universidad Carlos III de Madrid; Julia E. Vogt, Stefan Stark, and Gunnar Rätsch, Memorial Sloan-Kettering Cancer Center
[1] M. I. McCarthy, G. R. Abecasis, L. R. Cardon, D. B. Goldstein, J. Little, J. P. A. Ioannidis, and J. N. Hirschhorn. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet, vol. 9, no. 5, pp. 356–369, May 2008.
[2] C. Lippert, F. P. Casale, B. Rakitsch, and O. Stegle. LIMIX: genetic analysis of multiple traits. bioRxiv, 2014.
[3] N. Fusi, O. Stegle, and N. D. Lawrence. Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies. PLoS Comput Biol, vol. 8, no. 1, p. e1002330, Jan. 2012.
[4] M. D. Ritchie, L. W. Hahn, N. Roodi, L. R. Bailey, W. D. Dupont, F. F. Parl, and J. H. Moore. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am. J. Hum. Genet., vol. 69, no. 1, pp. 138–147, Jul. 2001.
[5] A. Rodriguez and K. Ghosh. Nested partition models. Jack Baskin School of Engineering, 2009.
[6] Y. Zhang and J. S. Liu. Bayesian inference of epistatic interactions in case-control studies. Nature Genetics, vol. 39, no. 9, pp. 1167–1173, Sep. 2007.
[7] W. Zhang, J. Zhu, E. E. Schadt, and J. S. Liu. A Bayesian partition method for detecting pleiotropic and epistatic eQTL modules. PLoS Comput Biol, vol. 6, no. 1, Jan. 2010.
[8] D. Warde-Farley, M. Brudno, Q. Morris, and A. Goldenberg. Mixture model for subphenotyping in GWAS. In Pac. Symp. Biocomput., 2012, vol. 17, pp. 363–374.
[9] L. Parts, O. Stegle, J. Winn, and R. Durbin. Joint genetic analysis of gene expression data with inferred cellular phenotypes. PLoS Genetics, vol. 7, no. 1, p. e1001276, Jan. 2011.
Learning With Deep Cascades
Giulia DeSalvo, Courant Institute of Mathematical Sciences, New York University
Anchored Factor Analysis
Yonatan Halpern, New York University
On-line Learning Approach to Ensemble Methods for Structured Prediction
Vitaly Kuznetsov, Courant Institute of Mathematical Sciences, New York University
Regret Minimization in Posted Price Auctions against Strategic Buyers
Andrés Muñoz Medina, Courant Institute of Mathematical Sciences, New York University
Finding a Sparse Vector in a Subspace: Linear Sparsity Using Alternating Directions
Qing Qu, Columbia University
Large-Scale Clustering of Sentences and Patients Based on Electronic Health Records
Stefan Stark, Memorial Sloan-Kettering Cancer Center
Theoretical Foundations for Learning Kernels in Supervised Kernel PCA
Dmitry Storcheus, Google Research