Big Data and Machine Learning: A Modern Approach

Krzysztof Choromanski
Google Research
New York, NY, USA

June 21, 2015

1 Course Description

The goal of the course is to present very recent techniques used in machine learning and big data analysis. Contrary to popular belief, several of the most powerful modern methods in data mining apply mathematical techniques that are accessible not only to experts in the field. The course will present those techniques in a way that enables students to quickly start working on many problems on the frontier of machine learning. The goal of this class is not to cover the major statistical methods for data mining, but rather to show various connections between different branches of science that researchers come across while working on big data. The course will cover some important new real-life applications of the presented tools, as well as propose open problems that can hopefully be solved by using a mixture of these techniques.

We will focus on two topics. The first is neural networks. The deep neural network paradigm has proven over several years to be an extremely effective and fruitful method for solving several challenging machine learning problems. Its usefulness in tasks such as speech, image and video recognition can hardly be overestimated: the most accurate known algorithms performing these tasks are based on neural networks. Despite the unquestionable advances in machine learning obtained with the use of neural networks, not much is known about their theoretical guarantees. In particular, we still have only a very vague idea why the multilayer architecture, where linear projections are followed by nonlinear pointwise transformations, turns out to beat most state-of-the-art approaches with well-established theoretical background for a variety of problems. One of the goals of this course is to shed light on that.
We present an intriguing connection between neural networks and renormalization techniques in physics, show some very recent results on the application of neural networks with structured matrices (which in practice are competitive with state-of-the-art constructions in terms of accuracy, but significantly outperform them in terms of training speed and memory consumption), and explain a connection between neural networks and graph theory. Some time in this part of the course will be devoted to exploring open problems regarding the presented material and working on them with the students.

In the second part of the course we will describe several methods that can be used to learn with privacy guarantees. This branch of machine learning is relatively new, but has become very important in recent years due to its straightforward applications. Modern data that machine learning algorithms operate on is very often sensitive, and thus certain privacy constraints must be satisfied before the results of these algorithms can be published. One of the strongest currently used notions of privacy is so-called differential privacy. In this course we will present new machine learning algorithms preserving this notion of privacy (shortly called ml-priv algorithms), such as differentially-private random projection trees or differentially-private random decision trees, as well as other techniques for preserving privacy based on graph theory (such as the b-matching algorithm). We explain how new, purely combinatorial approaches to some fundamental graph theory problems lead to more efficient ml-priv algorithms. We will also show how the problem of constructing machine learning algorithms with privacy guarantees relates to the notion of stability of machine learning algorithms. As in the neural networks setting, some time in this part of the course will be devoted to exploring open problems regarding ml-priv algorithms and working on them with the students.
2 Prerequisites

The course does not require its attendees to have any prior knowledge of machine learning. All concepts will be explained during the lectures. It is assumed that participants have some basic background in linear algebra, calculus and probability theory.

3 Content

• Neural Networks and Deep Learning (6 lectures)
  – Lecture 1: Neural Networks - basic concepts:
    ∗ feedforward neural networks, backpropagation algorithm
    ∗ convolutional neural networks with applications in image and speech recognition
    ∗ regularization techniques, drop-outs, different neuron models, autoencoders.
  – Lecture 2: Speeding up Neural Nets - structured matrices:
    ∗ drawbacks of machine learning with neural networks
    ∗ Toeplitz and circulant matrices for parameter reduction and fast matrix-vector product computations
    ∗ Johnson-Lindenstrauss Lemma
    ∗ binary embeddings - preprocessing with structured hashing via Ψ-regular matrices, connections with chromatic numbers χ(G) of related graphical representations.
  – Lecture 3: “Open problems lecture” - working on a few open problems regarding material covered in lectures 1-2.
  – Lecture 4: Graph Theory & Neural Networks - Boltzmann Machine Paradigm:
    ∗ Boltzmann Machine and Restricted Boltzmann Machine (RBM)
    ∗ combinatorial optimization with Boltzmann machines
    ∗ learning with Boltzmann machines
    ∗ quantum computing and Boltzmann machines.
  – Lecture 5: Renormalization techniques in Physics and Deep Neural Networks:
    ∗ 1D and 2D Ising models - a closer look
    ∗ coupling the DNN architecture with the multilayer renormalization pipeline.
  – Lecture 6: “Open problems lecture” - working on a few open problems regarding material covered in lectures 4-5.
• Learning with Privacy Guarantees (6 lectures)
  – Lecture 7: Different notions of privacy - basic concepts:
    ∗ differential privacy - introduction
    ∗ k-anonymity and l-diversity
    ∗ state-of-the-art applications
    ∗ differential privacy in social networks.
  – Lecture 8: Differentially-private Random Projection Trees (RPT) & Random Decision Trees (RDT):
    ∗ random decision trees - an introduction
    ∗ the random projection trees algorithm for low-dimensional manifolds by Dasgupta & Freund
    ∗ efficient differentially-private RPT and RDT algorithms via buckets’ counts.
  – Lecture 9: “Open problems lecture” - working on a few open problems regarding material covered in lectures 7-8.
  – Lecture 10: Adaptive Anonymity via b-matching:
    ∗ matching and b-matching problems in graphs
    ∗ applications of b-matching in algorithms preserving k-anonymity
    ∗ analysis of the new approaches for anonymizing data via b-matching in the heterogeneous privacy setting.
  – Lecture 11: Limits of learning with privacy:
    ∗ the fast Dinur-Nissim linear-programming algorithm for breaking privacy of statistical databases
    ∗ Dinur-Nissim algorithm extensions for more general classes of queries and database models (the Dwork-McSherry-Talwar algorithm, the Choromanski-Malkin algorithm)
    ∗ online algorithms for retrieving heavily perturbed statistical databases in the low-dimensional querying model with low error rate & applications in adversarial machine learning.
  – Lecture 12: “Open problems lecture” - working on a few open problems regarding material covered in lectures 10-11.

4 Detailed description

4.1 Lecture 1

This lecture will introduce basic neural network concepts. We will present the feedforward neural network paradigm and briefly discuss other possible approaches (such as recurrent neural networks). It will be shown how, via the backpropagation algorithm, neural networks can be used to solve image and speech recognition problems. We will also introduce convolutional neural networks and discuss the advantages that, for the time being, make them the most effective tool scientists can apply to these problems. We will discuss different techniques that are used to improve the accuracy of deep neural networks and how they affect the entire computational pipeline.
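As a point of reference for Lecture 1, the feedforward-plus-backpropagation idea can be sketched in a few lines of NumPy. This is only a toy illustration: the 2-8-1 architecture, the XOR task, the learning rate and the number of iterations are arbitrary choices, not part of the lecture material.

```python
import numpy as np

# Toy feedforward network trained on XOR with plain backpropagation.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.], [1.], [1.], [0.]])

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)   # hidden layer
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    # forward pass: linear projection followed by pointwise nonlinearity
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: chain rule, layer by layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(0)

print(np.round(out.ravel(), 2))  # should approach [0, 1, 1, 0]
```

Every layer is exactly the pattern discussed in the course: a linear projection followed by a nonlinear pointwise transformation.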
Literature: [1] (a very good and comprehensive tutorial on the deep neural nets paradigm), [2] (the paper that introduced the backpropagation algorithm), [3], [4] (book available online), Stanford University machine learning lectures: deeplearning.stanford.edu/tutorial/.

4.2 Lecture 2

Very recently (the most important papers on this topic are a few months old) a new paradigm for computations in neural networks has been proposed. The core idea is to impose a restriction on the structure of the linear projection layer, which boils down to a specific restriction on the related matrix. This can lead to a significant reduction in the space needed to store the matrices encoding all linear projections in the DNN pipeline (which opens the way to performing neural network computations directly on mobile devices), and very often also to speeding up the entire computation (a smaller set of parameters to be learned, as well as faster matrix-vector computations via tools such as the Fast Fourier Transform). We will discuss all these new advances in detail and explain some theoretical challenges regarding these models. We will introduce all the necessary mathematical machinery, which will involve methods for reducing space dimensionality with small distance distortion (several variants of the so-called Johnson-Lindenstrauss Lemma). We will also talk about binary embeddings, which are related to some of these new techniques and, due to their nonlinearity, can model nonlinear maps (such as the sigmoid function) applied by neural units.

Literature: [5], [6] (a very compact proof of the basic version of the Johnson-Lindenstrauss Lemma), [7] (a fast circulant version of the Johnson-Lindenstrauss Lemma), [8] (available at Amazon; an excellent mathematical tutorial on structured matrices and algorithms that may be performed on them).

4.3 Lecture 3

The entire course will consist of four “Open problems” lectures.
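Before moving on, the structured-matrix idea from Lecture 2 can be made concrete with a short sketch: a circulant matrix is fully determined by its first column, and its matrix-vector product can be computed in O(n log n) via the FFT while still behaving roughly like a Johnson-Lindenstrauss random projection. The dimensions below are arbitrary, and a full circulant JL construction would also apply a random diagonal sign matrix first, which is omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1024
c = rng.normal(size=n)   # first column defines the whole circulant matrix

def circulant_matvec(c, x):
    # C x computed in O(n log n) via the convolution theorem,
    # instead of O(n^2) for an explicit dense matrix
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

x = rng.normal(size=n)
# sanity check against the explicit circulant matrix, built column by column
C = np.stack([np.roll(c, j) for j in range(n)], axis=1)
assert np.allclose(C @ x, circulant_matvec(c, x))

# Johnson-Lindenstrauss flavour: keeping k coordinates of the projection and
# rescaling by 1/sqrt(k) approximately preserves the norm of x
k = 256
y = circulant_matvec(c, x)[:k] / np.sqrt(k)
print(np.linalg.norm(y) / np.linalg.norm(x))  # close to 1
```

Only the n-vector c is stored instead of an n-by-n matrix, which is exactly the parameter reduction the lecture discusses.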
The goal of this session is to work on a few open problems regarding the material covered in the last two lectures. Some partial results regarding these problems will be presented. It will be explained where the main challenges lie, and students will be encouraged to work together with the lecturer on making some advances. All four sets of problems will be given to students in advance at the beginning of the course, so that those participants who would like to start working on the problems earlier will have an opportunity to do so. It could be the case that some open problems will be added to the list later. The goal of these lectures is to work on some problems on the frontier of modern machine learning and to ignite collaboration between students that can hopefully lead to some publications in the field.

In this lecture some open problems regarding theoretical guarantees of structured approaches in deep learning will be given. In particular, we will work on providing learning guarantees for neural networks using circulant or more general Toeplitz matrices. We will also work on extending new results on binary embeddings to more general nonlinear mappings than the sign map used in most of the theoretical results on the subject. This is motivated by the types of nonlinearities applied in neural networks.

Literature: From the previous two lectures.

4.4 Lecture 4

This lecture explains in more detail the connection between neural networks and graph theory. We will present different topologies of neural networks and introduce related models such as the Boltzmann Machine and the Restricted Boltzmann Machine. We will show how general neural networks can be used to approximately solve several difficult graph optimization problems. Finally, we will explain why the Boltzmann machine paradigm is so attractive for researchers working on quantum computing. This lecture will gently introduce the relationship between neural networks and complex physical systems.
This will be further explored in the subsequent lecture.

Literature: [9], Stanford University machine learning lectures: deeplearning.stanford.edu/tutorial/, [10], [11], [12], [13] (the last two positions regard restricted Boltzmann machines and quantum computing).

4.5 Lecture 5

This entire lecture will focus on the paper [14], which was published only a few months ago but has already gained huge attention in the machine learning community. This paper establishes a one-to-one mapping between deep neural networks with a specific nonlinear operation performed by neural units and the Ising models heavily exploited by physicists. We will start by explaining the renormalization techniques for the 1D Ising model. This model, in contrast to the 2D Ising model, can be solved exactly with the use of renormalization techniques. We will show how it can be done. Then we will couple the aforementioned renormalization techniques with the multilayer architecture of deep neural networks. Understanding the content of Lecture 4 will be very useful for following this one.

Literature: [14], [15] (available online, a very good introduction to renormalization techniques).

4.6 Lecture 6

This is one of the “Open problems” lectures (see the description of Lecture 3). The open problems of this lecture will regard (restricted) Boltzmann machines, their relation to complex physical systems, and quantum computing with Boltzmann machines. Scale-freeness and a potential connection of autoencoders with objects extracting general properties of massive systems with rich structure (such as graphons, which are limiting objects for several families of graphs) is another topic that can be considered here.

Literature: From the previous two lectures.

4.7 Lecture 7

In this lecture we will introduce several definitions of privacy and show how they can be utilized in algorithms solving practical problems in data mining. We will pay special attention to differential privacy, which is nowadays widely considered the strongest used notion of privacy.
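As a minimal concrete illustration (not taken from the lecture materials), the standard Laplace mechanism answers a counting query with noise calibrated to the query's sensitivity; the dataset and the parameters below are invented for the example.

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng):
    """Release an epsilon-differentially-private count.

    A counting query has sensitivity 1 (adding or removing one record
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon suffices for epsilon-differential privacy.
    """
    true_count = sum(1 for row in data if predicate(row))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=10_000)          # synthetic sensitive data
noisy = laplace_count(ages, lambda a: a >= 65, epsilon=0.1, rng=rng)
exact = int((ages >= 65).sum())
print(exact, round(noisy))  # noisy answer is typically within ~1/epsilon of exact
```

The trade-off visible here, smaller epsilon means stronger privacy but noisier answers, is the recurring theme of this part of the course.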
Algorithms preserving differential privacy usually meet all privacy requirements needed in practice, yet it is usually hard to establish differentially-private algorithms that solve machine learning problems effectively. Other introduced notions of privacy will include l-diversity and k-anonymity. These are weaker than differential privacy, but can be successfully applied in a much wider range of scenarios; they relate to the data rather than to a specific algorithm, and thus can be applied when the way in which the data will be used is not known in advance.

Literature: [16], [17], [18], [19], [20], [21].

4.8 Lecture 8

This lecture will be about two very recent successful applications (see [22, 23]) of differential privacy in machine learning: random projection trees, introduced (in the non-differentially-private version) by Dasgupta & Freund, and random decision trees. Both applications give good intuition about which machine learning applications differential privacy makes sense for. In particular, they show that machine learning algorithms operating on random (or pseudorandom) structures, where the parameters’ dependence on the data is significantly limited, are good candidates for achieving high privacy requirements. In these cases the analysis of the privacy part is usually very straightforward, but it is hard to understand from the theoretical point of view the quality of the proposed solution. Some of these algorithms were for years considered “interesting heuristics”, and only recently have some theoretical guarantees been established (an excellent example here is the random decision tree paradigm). All new concepts (such as decision and projection trees) will be explained at the beginning of the lecture, so no previous knowledge is required.
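The “buckets’ counts” idea can be conveyed with a toy sketch: a tree whose structure is chosen at random, independently of the data, routes points to leaves, and only the per-leaf label counts are released with Laplace noise. This is a simplified stand-in, not the actual algorithms of [22, 23]; the depth, data and labels below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d, depth = 5, 3                          # feature dimension, tree depth

# A random tree of fixed depth: every level splits on a random feature at a
# random threshold, chosen WITHOUT looking at the data.
features = rng.integers(0, d, size=depth)
thresholds = rng.uniform(0, 1, size=depth)

def leaf_index(x):
    # route a point to one of 2^depth leaves
    idx = 0
    for f, t in zip(features, thresholds):
        idx = 2 * idx + int(x[f] > t)
    return idx

X = rng.uniform(0, 1, size=(2000, d))
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # toy labels

# per-leaf label counts: the only data-dependent statistics in the model
counts = np.zeros((2 ** depth, 2))
for xi, yi in zip(X, y):
    counts[leaf_index(xi), yi] += 1

# each point affects exactly one cell, so the sensitivity is 1 and adding
# Lap(1/eps) noise to every count gives eps-differential privacy
eps = 1.0
noisy_counts = counts + rng.laplace(scale=1.0 / eps, size=counts.shape)

def predict(x):
    return int(np.argmax(noisy_counts[leaf_index(x)]))

acc = np.mean([predict(xi) == yi for xi, yi in zip(X, y)])
print(f"training accuracy of one noisy random tree: {acc:.2f}")
```

Because the tree structure never touches the data, the privacy analysis reduces to the noisy counts alone, which is exactly why such random structures are good candidates for privacy.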
Literature: [22], [23], [24] (a good introduction to random decision trees), [25], [26], [27], [28] (the non-differentially-private random projection trees algorithm by Dasgupta & Freund), [29] (application of random projection trees in vector quantization).

4.9 Lecture 9

In this “Open problems” lecture we will work on constructing differentially-private versions of important machine learning/data mining algorithms for which no such versions are known. In particular, we will focus on algorithms working on graph data, as well as those that operate in the heterogeneous setting, where different users specify different privacy requirements and thus the state-of-the-art definition of differential privacy needs to be relaxed.

Literature: From the previous two lectures.

4.10 Lecture 10

The goal of this lecture is to present applications of graph theory concepts (such as b-matching or b-edge cover) in machine learning/data mining algorithms achieving k-anonymity/l-diversity guarantees. Our use case will be a recent result [30]. We will show how these new graph theory techniques naturally lead to adaptive privacy, where users get a level of privacy adjusted to their own needs. Furthermore, we explain why these techniques may potentially lead to much more stable machine learning algorithms, and how new combinatorial algorithms solving the b-matching problem are applied in some of the fastest known data anonymization pipelines. All necessary graph theory concepts (matching, b-matching, edge cover, etc.) will be explained at the beginning of the lecture. No advanced knowledge of graph theory is required.

Literature: [30], [31], [32], [33] (positions [31]-[33] consider the matching problem in graphs), [34], [35], [36] (the last three positions regard alternative applications of the b-matching method in machine learning to the one presented in [30]).

4.11 Lecture 11

The last standard lecture of this part of the course will be about the limitations of privacy techniques in data mining.
We will show very efficient algorithms for breaking database privacy. They are based on different techniques that are interesting in themselves. We start with the celebrated Dinur-Nissim algorithm, which was proposed a few years ago and is based on particularly beautiful ideas involving linear programming and probabilistic analysis. We present several of its extensions that work under much more relaxed conditions regarding the database model, the queries’ structure, etc. Some of the extensions regard databases that can be described by an underlying graph model. All these variants assume the offline model of querying, where first all the answers to the queries are collected and then the adversary decides how to use them to gain additional information about the database system. Later we will consider breaking database privacy in the online setting and show efficient algorithms for breaking privacy in that scenario. All these results, even though at first glance negative, may also serve as machine learning algorithms that reconstruct data from noisy, perturbed statistics, which is the case in most practical applications. For instance, the aforementioned online algorithms can be applied in adversarial machine learning.

Note: We will not consider here any cryptographic aspects of breaking database privacy. Instead we will analyze this topic from the machine learning point of view.

Literature: [37] (the aforementioned online algorithm for breaking database privacy), [38] (the original Dinur-Nissim algorithm), [39] (the Dwork-McSherry-Talwar algorithm), [40] (the Choromanski-Malkin algorithm).

4.12 Lecture 12

This “Open problems” lecture is dedicated to extending the general tools used in the Dinur-Nissim algorithm and other presented techniques for breaking database privacy, as well as to the adaptive anonymity algorithm. In particular, we will work on constructing online algorithms for breaking database privacy that do not rely on any low-dimensionality assumptions regarding the data.
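The flavour of the Dinur-Nissim attack from Lecture 11 can be conveyed by a small simulation (all parameters below are arbitrary): the adversary asks random subset-sum queries, receives answers perturbed by bounded noise, and reconstructs the secret bit vector by decoding; here ordinary least squares followed by rounding stands in for the linear-programming step of the actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 100, 400                       # database size, number of queries

secret = rng.integers(0, 2, size=n)   # the private bit vector
Q = rng.integers(0, 2, size=(m, n))   # random subset-sum queries

# perturbed answers: each subset sum is released with bounded noise,
# mimicking a privacy mechanism with perturbation o(sqrt(n))
noise = rng.uniform(-3, 3, size=m)
answers = Q @ secret + noise

# decoding step: least-squares stand-in for the Dinur-Nissim LP,
# then round each coordinate to the nearest bit
estimate, *_ = np.linalg.lstsq(Q, answers, rcond=None)
reconstructed = (estimate > 0.5).astype(int)

fraction_correct = (reconstructed == secret).mean()
print(f"fraction of bits recovered: {fraction_correct:.2f}")
```

With enough queries and small enough perturbation, almost all bits are recovered, which is the blatant non-privacy phenomenon the lecture analyzes.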
Another interesting open problem may involve extending Dinur-Nissim techniques to general graph-database models with minimal requirements regarding the structure of the graph queries used in that model. In the adaptive anonymity setting we will work on adapting the presented b-matching technique to the differential privacy setting. We will also work on proving much stronger theoretical guarantees even in the k-anonymity scenario, by a careful analysis of the structure of the set of perfect matchings of the corresponding comparability bipartite graph produced by the b-matching algorithm.

References

[1] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[2] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
[3] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision, page 319, 1999.
[4] David Kriesel. A Brief Introduction to Neural Networks.
[5] Yu Cheng, Felix X. Yu, Rogerio S. Feris, Sanjiv Kumar, Alok Choudhary, and Shih-Fu Chang. Fast neural networks with circulant projections. arXiv:1502.03436.
[6] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22(1):60–65, 2003.
[7] Nir Ailon and Bernard Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302–322, 2009.
[8] Victor Y. Pan. Structured Matrices and Polynomials: Unified Superfast Algorithms.
[9] Emile H. L. Aarts and Jan H. M. Korst. Boltzmann machines and their applications. In PARLE, Parallel Architectures and Languages Europe, Volume I: Parallel Architectures, pages 34–50, 1987.
[10] Geoffrey E. Hinton. A practical guide to training restricted Boltzmann machines.
In Neural Networks: Tricks of the Trade - Second Edition, pages 599–619, 2012.
[11] Geoffrey E. Hinton and Terrence J. Sejnowski. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, pages 282–317, 1986.
[12] Misha Denil and Nando de Freitas. Toward the implementation of a quantum RBM. In NIPS 2011 Deep Learning and Unsupervised Feature Learning Workshop, 2011.
[13] Vincent Dumoulin, Ian J. Goodfellow, Aaron C. Courville, and Yoshua Bengio. On the challenges of physical implementations of RBMs. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, Québec City, Québec, Canada, pages 1199–1205, 2014.
[14] Pankaj Mehta and David J. Schwab. An exact mapping between the variational renormalization group and deep learning. arXiv:1410.3831, 2014.
[15] William David McComb. Renormalization Methods: A Guide for Beginners. 2004.
[16] C. Dwork. Differential privacy: A survey of results. In Theory and Applications of Models of Computation, TAMC 2008, Xi’an, China, pages 1–19, 2008.
[17] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
[18] Latanya Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, 2002.
[19] Xiaokui Xiao and Yufei Tao. Personalized privacy preservation. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, pages 229–240, 2006.
[20] Mingqiang Xue, Panagiotis Karras, Chedy Raïssi, Jaideep Vaidya, and Kian-Lee Tan. Anonymizing set-valued data by nonreciprocal recoding. In KDD ’12, Beijing, China, pages 1050–1058, 2012.
[21] Graham Cormode, Divesh Srivastava, Smriti Bhagat, and Balachander Krishnamurthy. Class-based graph anonymization for social network data. PVLDB, 2(1):766–777, 2009.
[22] Anna Choromanska, Krzysztof Choromanski, Geetha Jagannathan, and Claire Monteleoni. Differentially-private learning of low dimensional manifolds. In Algorithmic Learning Theory - 24th International Conference, ALT 2013, Singapore, pages 249–263, 2013.
[23] Mariusz Bojarski, Anna Choromanska, Krzysztof Choromanski, and Yann LeCun. Differentially- and non-differentially-private random decision trees. arXiv:1410.6973, 2015.
[24] W. Fan, H. Wang, P. S. Yu, and S. Ma. Is random model better? On its accuracy and efficiency. In ICDM, 2003.
[25] Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101:578–590, 2006.
[26] G. Biau, L. Devroye, and G. Lugosi. Consistency of random forests and other averaging classifiers. J. Mach. Learn. Res., 9:2015–2033, 2008.
[27] G. Jagannathan, K. Pillaipakkamnatt, and R. N. Wright. A practical differentially private random decision tree classifier. Trans. Data Privacy, 5(1):273–295, 2012.
[28] Sanjoy Dasgupta and Yoav Freund. Random projection trees and low dimensional manifolds. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pages 537–546, 2008.
[29] Sanjoy Dasgupta and Yoav Freund. Random projection trees for vector quantization. IEEE Transactions on Information Theory, 55(7):3229–3242, 2009.
[30] Krzysztof Choromanski, Tony Jebara, and Kui Tang. Adaptive anonymity via b-matching. In Advances in Neural Information Processing Systems 26, pages 3192–3200, 2013.
[31] Harold N. Gabow and Robert Endre Tarjan.
Faster scaling algorithms for general graph-matching problems. J. ACM, 38(4):815–853, 1991.
[32] John E. Hopcroft and Richard M. Karp. An n^{5/2} algorithm for maximum matchings in bipartite graphs. SIAM J. Comput., 2(4):225–231, 1973.
[33] Silvio Micali and Vijay V. Vazirani. An O(√|V|·|E|) algorithm for finding maximum matching in general graphs. In 21st Annual Symposium on Foundations of Computer Science, pages 17–27, 1980.
[34] Tony Jebara and Vlad Shchogolev. B-matching for spectral clustering. In Machine Learning: ECML 2006, pages 679–686, 2006.
[35] Bert C. Huang and Tony Jebara. Fast b-matching via sufficient selection belief propagation. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, pages 361–369, 2011.
[36] Tony Jebara, Jun Wang, and Shih-Fu Chang. Graph construction and b-matching for semi-supervised learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, pages 441–448, 2009.
[37] Krzysztof Choromanski, Afshin Rostamizadeh, and Umar Syed. An Õ(1/√t)-error online algorithm for retrieving heavily perturbated statistical databases in the low-dimensional querying model. arXiv:1504.01117, 2015.
[38] Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), pages 202–210, 2003.
[39] Cynthia Dwork, Frank McSherry, and Kunal Talwar. The price of privacy and the limits of LP decoding. In STOC, pages 85–94, 2007.
[40] Krzysztof Choromanski and Tal Malkin. The power of the Dinur-Nissim algorithm: breaking privacy of statistical and graph databases.
In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012, pages 65–76, 2012.