Big Data and Machine Learning: A Modern Approach
Krzysztof Choromanski
Google Research
New York, NY, USA
June 21, 2015
1 Course Description
The goal of the course is to present very recent techniques used in machine learning and big data analysis. Contrary to popular belief, several of the most powerful modern methods in data mining rely on mathematical techniques that are accessible not only to experts in the field. The course will present those techniques to the students in a way that will enable them to quickly start working on many problems on the frontier of machine learning. The goal of this class is not to cover the major statistical methods for data mining, but rather to show various connections between different branches of science that researchers come across while working on big data. The course will cover some important new real-life applications of the presented tools, as well as propose open problems that can hopefully be solved by using a mixture of these techniques.
We will focus on two topics. The first is neural networks. The deep neural network paradigm has proven over the past several years to be an extremely effective and fruitful method for solving several challenging machine learning problems. Its usefulness in tasks such as speech, image and video recognition can hardly be overestimated: the most accurate known algorithms performing these tasks are based on neural networks. Despite the unquestionable advances in machine learning that were obtained with the use of neural networks, not much is known about their theoretical guarantees. In particular, we still have only a very vague idea why the multilayer architecture, where linear projections are followed by nonlinear pointwise transformations, turns out to beat most state-of-the-art approaches with a well-established theoretical background on a variety of problems. One of the goals of this course is to shed light on that. We present an intriguing connection between neural networks and renormalization techniques in physics, show some very recent results regarding the application of neural networks with structured matrices (which in practice are competitive with the state-of-the-art constructions in terms of accuracy, but significantly outperform them in terms of training speed and memory consumption), and explain the connection between neural networks and graph theory. Some time in this part of the course will be devoted to exploring open problems regarding the presented material and working on them with the students.
In the second part of the course we will describe several methods that can be used to learn with privacy guarantees. This branch of machine learning is relatively new, but has become very important in recent years due to its immediate applications. The data that modern machine learning algorithms operate on are very often sensitive, and thus certain privacy constraints must be satisfied before the results of these algorithms can be published. One of the strongest currently used notions of privacy is so-called differential privacy. In this course we will present new machine learning algorithms preserving this notion of privacy, called ml-priv algorithms for short (such as differentially-private random projection trees or differentially-private random decision trees), as well as other privacy-preserving techniques based on graph theory (such as the b-matching algorithm). We explain how new, purely combinatorial approaches to solving some fundamental graph theory problems lead to more efficient ml-priv algorithms. We will also show how the problem of constructing machine learning algorithms with privacy guarantees relates to the notion of stability of machine learning algorithms. As in the neural networks part, some time in this part of the course will be devoted to exploring open problems regarding ml-priv algorithms and working on them with the students.
2 Prerequisites
The course does not require its attendees to have any prior knowledge of machine learning. All concepts will be explained during the lectures. It is assumed that participants have some basic background in linear algebra, calculus and probability theory.
3 Content
• Neural Networks and Deep Learning (6 lectures)
– Lecture 1:
Neural Networks - basic concepts:
∗ feedforward neural networks, backpropagation algorithm
∗ convolutional neural networks with application in image and speech recognition
∗ regularization techniques, dropout, different neuron models, autoencoders.
– Lecture 2:
Speeding up Neural Nets - structured matrices:
∗ drawbacks of machine learning with neural networks
∗ Toeplitz and circulant matrices for parameter reduction and fast matrix-vector product computations
∗ Johnson-Lindenstrauss Lemma
∗ binary embeddings - preprocessing with structured hashing via Ψ-regular matrices,
connections with chromatic numbers χ(G) of related graphical representations.
– Lecture 3:
“Open problems lecture” - working on a few open problems regarding material covered in Lectures 1-2.
– Lecture 4:
Graph Theory & Neural Networks - Boltzmann Machine Paradigm:
∗ Boltzmann Machine and Restricted Boltzmann Machine (RBM)
∗ combinatorial optimization with Boltzmann machines
∗ learning with Boltzmann machines
∗ quantum computing and Boltzmann machines.
– Lecture 5:
Renormalization techniques in Physics and Deep Neural Networks:
∗ 1D and 2D Ising models - closer look
∗ coupling DNN architecture with multilayer renormalization pipeline.
– Lecture 6:
“Open problems lecture” - working on a few open problems regarding material covered in Lectures 4-5.
• Learning with Privacy Guarantees (6 lectures)
– Lecture 7:
Different notions of privacy - basic concepts:
∗ differential privacy - introduction
∗ k-anonymity and l-diversity
∗ state-of-the-art applications
∗ differential privacy in social networks.
– Lecture 8:
Differentially-private Random Projection Trees (RPT) & Random Decision Trees (RDT):
∗ random decision trees - an introduction
∗ random projection trees algorithm for low dimensional manifolds by Dasgupta &
Freund
∗ efficient differentially-private RPT and RDT algorithms via bucket counts.
– Lecture 9:
“Open problems lecture” - working on a few open problems regarding material covered in Lectures 7-8.
– Lecture 10:
Adaptive Anonymity via b-matching:
∗ matching and b-matching problems in graphs
∗ applications of b-matching in algorithms preserving k-anonymity
∗ analysis of the new approaches for anonymizing data via b-matching in the heterogeneous privacy setting.
– Lecture 11:
Limits of learning with privacy:
∗ Fast Dinur-Nissim linear-programming algorithm for breaking privacy of statistical
databases
∗ Dinur-Nissim algorithm extensions for more general classes of queries and database
models (Dwork-McSherry-Talwar algorithm, Choromanski-Malkin algorithm)
∗ online algorithms for retrieving heavily perturbed statistical databases in the low-dimensional querying model with low error rate & applications in adversarial machine learning.
– Lecture 12:
“Open problems lecture” - working on a few open problems regarding material covered in Lectures 10-11.
4 Detailed description
4.1 Lecture 1
This lecture will introduce basic neural network concepts. We will present the feedforward neural network paradigm and briefly discuss other possible approaches (such as recurrent neural networks). It will be shown how, via the backpropagation algorithm, neural networks can be trained to solve image and speech recognition problems. We will also introduce convolutional neural networks and discuss the advantages that, for the time being, make them the most effective tool scientists have for these problems. We will discuss different techniques that are used to improve the accuracy of deep neural networks and how they affect the entire computational pipeline.
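To make the mechanics concrete, below is a minimal sketch of a feedforward network trained by backpropagation, written in plain NumPy; the layer sizes, learning rate and toy XOR data are illustrative assumptions, not part of the course material.

```python
# A one-hidden-layer feedforward network trained by backpropagation on XOR (toy example).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR targets

W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)    # hidden layer
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)    # output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # forward pass: linear projection followed by a pointwise nonlinearity
    h = sigmoid(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # backward pass: gradients of the squared error via the chain rule
    dp = (p - y) * p * (1 - p)
    dh = (dp @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ dp; b2 -= 0.5 * dp.sum(axis=0)
    W1 -= 0.5 * X.T @ dh; b1 -= 0.5 * dh.sum(axis=0)

print(np.round(p.ravel(), 2))  # should approach [0, 1, 1, 0]
```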
Literature: [1] (a very good and comprehensive tutorial on the deep neural network paradigm), [2] (the paper that introduced the backpropagation algorithm), [3], [4] (book available online), Stanford University machine learning lectures: deeplearning.stanford.edu/tutorial/.
4.2 Lecture 2
Very recently (the most important papers on this topic are only a few months old) a new paradigm for computations in neural networks has been proposed. The core idea is to impose a restriction on the structure of the linear projection layer, which boils down to a specific structural restriction on the related matrix. That can lead to a significant reduction in the space needed to store the matrices encoding all linear projections in the DNN pipeline (which opens the way to performing neural network computations directly on mobile devices), and very often also to speeding up the entire computation (a smaller set of parameters that need to be learned, as well as faster matrix-vector computations via tools such as the Fast Fourier Transform). We will discuss all these new advances in detail and explain some theoretical challenges regarding those models. We will introduce all the necessary mathematical machinery, which will involve methods for reducing space dimensionality with small distance distortion (several variants of the so-called Johnson-Lindenstrauss Lemma). We will also talk about binary embeddings, which are related to some of these new techniques and, due to their nonlinearity, can model nonlinear maps (such as the sigmoid function) applied by neural units.
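The key computational point is easy to illustrate: a circulant projection layer is determined by a single length-n vector, and the matrix-vector product can be computed in O(n log n) time with the FFT. Below is a minimal, self-contained sketch of this idea (the sizes and the check against a dense matrix are only for illustration).

```python
# Circulant matrix-vector product via FFT: C x = IFFT(FFT(c) * FFT(x)),
# where c is the first column of the circulant matrix C.
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix with first column c by the vector x in O(n log n)."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

n = 8
rng = np.random.default_rng(1)
c = rng.normal(size=n)   # the only stored parameters of the layer (n numbers, not n^2)
x = rng.normal(size=n)

# sanity check against the explicit dense circulant matrix
C = np.column_stack([np.roll(c, j) for j in range(n)])
assert np.allclose(circulant_matvec(c, x), C @ x)
```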
Literature: [5], [6] (a very compact proof of the basic version of the Johnson-Lindenstrauss Lemma), [7] (a fast circulant version of the Johnson-Lindenstrauss Lemma), [8] (available at Amazon; an excellent mathematical tutorial on structured matrices and the algorithms that may be performed on them).
4.3 Lecture 3
The entire course will consist of four “Open problems” lectures. The goal of each such session is to work on a few open problems regarding the material covered in the last two lectures. Some partial results regarding these problems will be presented. It will be explained where the main challenges are, and students will be encouraged to work together with the lecturer on making some progress. All four sets of problems will be given to students at the beginning of the course, so that those participants who would like to start working on the problems earlier will have an opportunity to do so. It could be the case that some open problems will be added to the list later. The goal of these lectures is to work on some problems on the frontier of modern machine learning and to ignite collaboration between students that can hopefully lead to some publications in the field.
In this lecture some open problems regarding theoretical guarantees of structured approaches in deep learning will be given. In particular, we will work on providing learning guarantees for neural networks using circulant or more general Toeplitz matrices. We will also work on extending new results on binary embeddings to more general nonlinear mappings than the sign map that is used in most of the theoretical results on the subject. This is motivated by the types of nonlinearities applied in neural networks.
Literature: From the previous two lectures.
4.4 Lecture 4
This lecture explains in more detail the connection between neural networks and graph theory. We will present different topologies of neural networks and introduce related models such as the Boltzmann Machine and the Restricted Boltzmann Machine. We will show how general neural networks can be used to approximately solve several difficult graph optimization problems. Finally, we will explain why the Boltzmann machine paradigm is so attractive for researchers working on quantum computing. This lecture will gently introduce the relationship between neural networks and complex physical systems, which will be further explored in the subsequent lecture.
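As a concrete point of reference for the RBM part of the lecture, here is a minimal sketch of training a tiny Restricted Boltzmann Machine with one step of contrastive divergence (CD-1); all sizes, the learning rate and the random toy data are illustrative assumptions.

```python
# A tiny binary RBM trained with CD-1 (toy example, not tuned).
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_visible, n_hidden, lr = 6, 3, 0.1
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
data = rng.integers(0, 2, size=(20, n_visible)).astype(float)  # toy binary dataset

for epoch in range(100):
    for v0 in data:
        # positive phase: sample hidden units conditioned on the data
        p_h0 = sigmoid(v0 @ W + b_h)
        h0 = (rng.random(n_hidden) < p_h0).astype(float)
        # negative phase: one step of Gibbs sampling (reconstruction)
        p_v1 = sigmoid(h0 @ W.T + b_v)
        v1 = (rng.random(n_visible) < p_v1).astype(float)
        p_h1 = sigmoid(v1 @ W + b_h)
        # CD-1 update: difference of data and reconstruction statistics
        W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
        b_v += lr * (v0 - v1)
        b_h += lr * (p_h0 - p_h1)
```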
Literature: [9], Stanford University machine learning lectures: deeplearning.stanford.edu/tutorial/, [10], [11], [12], [13] (the last two references concern restricted Boltzmann machines and quantum computing).
4.5 Lecture 5
This entire lecture will focus on the paper [14], which was published only a few months ago but has already gained huge attention in the machine learning community. This paper establishes a one-to-one mapping between deep neural networks with a specific nonlinear operation performed by the neural units and Ising models heavily exploited by physicists. We will start by explaining the renormalization techniques for the 1D Ising model. This model, in contrast to the 2D Ising model, can be solved exactly with the use of renormalization techniques; we will show how it can be done. Then we will couple the aforementioned renormalization techniques with the multilayer architecture of deep neural networks. Understanding the content of Lecture 4 will be very useful for following this one.
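As a warm-up, the exact decimation step for the zero-field 1D Ising chain is standard textbook material (not specific to [14]) and can be stated in a few lines: summing out every second spin leaves an Ising chain on the remaining spins with a renormalized coupling.

```latex
% Decimation of the 1D Ising chain with coupling K = \beta J and no external field:
% summing over an intermediate spin s_2 must reproduce an Ising interaction between s_1, s_3.
\sum_{s_2 = \pm 1} e^{K s_1 s_2 + K s_2 s_3}
  \;=\; 2\cosh\bigl(K (s_1 + s_3)\bigr)
  \;=\; A \, e^{K' s_1 s_3},
\qquad
K' = \tfrac{1}{2}\ln\cosh(2K), \quad \tanh K' = \tanh^2 K, \quad A = 2\sqrt{\cosh(2K)}.
```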
Literature: [14], [15] (available online, very good introduction to renormalization techniques).
4.6 Lecture 6
This is one of the “Open problems” lectures (see the description of Lecture 3). The open problems of this lecture will concern (restricted) Boltzmann machines, their relation to complex physical systems, and quantum computing with Boltzmann machines. Scale-freeness and the potential connection of autoencoders with objects extracting general properties of massive systems with rich structure (such as graphons, which are limiting objects for several families of graphs) are other topics that can be considered here.
Literature: From the previous two lectures.
4.7 Lecture 7
In this lecture we will introduce several definitions of privacy and show how they can be utilized in algorithms solving practical problems in data mining. We will pay special attention to differential privacy, which is nowadays widely considered the strongest notion of privacy in use. Algorithms preserving differential privacy usually meet all the privacy requirements needed in practice, yet it is usually hard to construct differentially-private algorithms that solve machine learning problems effectively. Other introduced notions of privacy will include l-diversity and k-anonymity. These are weaker than differential privacy, but can be successfully applied in a much wider range of scenarios; they are properties of the data rather than of a specific algorithm, and thus can be applied when the way in which the data will be used is not known in advance.
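For reference, the standard formal definition (after Dwork et al. [16, 17]) and the basic Laplace mechanism can be stated in two lines; the neighboring relation below assumes databases differing in a single record.

```latex
% \epsilon-differential privacy: a randomized mechanism M is \epsilon-differentially private
% if for all neighboring databases D, D' and all sets S of possible outputs
\Pr[\,M(D) \in S\,] \;\le\; e^{\epsilon} \cdot \Pr[\,M(D') \in S\,].
% Laplace mechanism: for a numeric query f with sensitivity
% \Delta f = \max_{D \sim D'} \|f(D) - f(D')\|_1,
% releasing f(D) + \mathrm{Lap}(\Delta f / \epsilon) is \epsilon-differentially private.
```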
Literature: [16], [17], [18], [19], [20], [21].
4.8 Lecture 8
This lecture will be about two very recent successful applications (see [22, 23]) of differential privacy in machine learning: random projection trees, introduced (in the non-differentially-private version) by Dasgupta & Freund, and random decision trees. Both applications give good intuition about the machine learning settings in which differential privacy makes sense. In particular, they show that machine learning algorithms operating on random (or pseudorandom) structures, where the parameters' dependence on the data is significantly limited, are good candidates for achieving strong privacy guarantees. In these cases the analysis of the privacy part is usually very straightforward, but it is hard to understand, from the theoretical point of view, the quality of the proposed solution. Some of these algorithms were for years considered “interesting heuristics”, and only recently have some theoretical guarantees been established (an excellent example here is the random decision tree paradigm). All new concepts (such as decision and projection trees) will be explained at the beginning of the lecture, so no previous knowledge is required.
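To illustrate the general mechanism (this is a simplified sketch of the idea behind [22, 23], not their exact algorithms), note that when the tree structure is chosen independently of the data, only the leaf (bucket) counts depend on the data; a histogram of counts has sensitivity 1 with respect to adding or removing one record, so Laplace noise of scale 1/ε per bucket suffices for ε-differential privacy.

```python
# Differentially-private release of bucket counts for a data-independent random tree.
import numpy as np

def private_bucket_counts(leaf_ids, n_buckets, epsilon, rng):
    """Noisy per-bucket counts; leaf_ids[i] is the leaf index of record i."""
    counts = np.bincount(leaf_ids, minlength=n_buckets).astype(float)
    return counts + rng.laplace(scale=1.0 / epsilon, size=n_buckets)

rng = np.random.default_rng(3)
data = rng.normal(size=(1000, 5))               # toy dataset
# a "random decision tree" of depth 3 chosen independently of the data:
dims = rng.integers(0, 5, size=3)               # random split dimensions
thresholds = rng.normal(size=3)                 # random split thresholds
bits = (data[:, dims] > thresholds).astype(int) # 3 split bits per record
leaf_ids = bits @ np.array([4, 2, 1])           # leaf index in {0, ..., 7}
print(private_bucket_counts(leaf_ids, 8, epsilon=1.0, rng=rng))
```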
Literature: [22], [23], [24] (good introduction to random decision trees), [25], [26], [27], [28]
(non-differentially-private random projection trees algorithm by Dasgupta & Freund), [29] (application of random projection trees in vector quantization).
4.9 Lecture 9
In this “Open problems” lecture we will work on constructing differentially-private versions of important machine learning/data mining algorithms for which no such versions are known. In particular, we will focus on algorithms working on graph data, as well as those that operate in the heterogeneous setting, where different users specify different privacy requirements and thus the state-of-the-art definition of differential privacy needs to be relaxed.
Literature: From the previous two lectures.
4.10 Lecture 10
The goal of this lecture is to present applications of graph theory concepts (such as b-matching or b-edge cover) in machine learning/data mining algorithms achieving k-anonymity/l-diversity guarantees. Our use case will be a recent result [30]. We will show how these new graph theory techniques naturally lead to adaptive privacy, where users get a level of privacy adjusted to their own needs. Furthermore, we explain why these techniques may potentially lead to much more stable machine learning algorithms and how new combinatorial algorithms solving the b-matching problem are applied in some of the fastest known data anonymization pipelines. All necessary graph theory concepts (matching, b-matching, edge cover, etc.) will be explained at the beginning of the lecture. No advanced knowledge of graph theory is required.
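To fix intuition (this is a strong simplification of the setup in [30], shown only to illustrate the degree condition that the b-matching is meant to enforce, not the algorithm itself): after a per-record suppression mask is chosen, each record i should be compatible with at least b_i records, where b_i is user i's own anonymity demand, and compatibility means agreement on every attribute that i did not suppress.

```python
# Verify the per-user anonymity demands for a given suppression mask (illustrative only;
# the algorithm in [30] chooses the mask via b-matching rather than merely verifying it).
import numpy as np

def compatibility_degrees(data, mask):
    """data: (n, d) attribute matrix; mask: (n, d) boolean, True = attribute suppressed."""
    n = data.shape[0]
    degrees = np.zeros(n, dtype=int)
    for i in range(n):
        visible = ~mask[i]
        # record j is compatible with masked record i if they agree on i's visible attributes
        degrees[i] = np.sum(np.all(data[:, visible] == data[i, visible], axis=1))
    return degrees

data = np.array([[1, 0, 2], [1, 1, 2], [1, 0, 3], [2, 0, 2]])
mask = np.array([[False, True, False]] * 4)    # everyone suppresses attribute 1
b = np.array([2, 2, 1, 1])                     # adaptive, per-user anonymity demands
print(compatibility_degrees(data, mask) >= b)  # does the mask meet every demand?
```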
Literature: [30], [31], [32], [33] (references [31]-[33] consider the matching problem in graphs), [34], [35], [36] (the last three references describe applications of the b-matching method in machine learning other than the one presented in [30]).
4.11 Lecture 11
The last standard lecture of this part of the course will be about the limitations of privacy techniques in data mining. We will show very efficient algorithms for breaking database privacy. They are based on different techniques that are interesting in themselves. We start with the celebrated Dinur-Nissim algorithm, which was proposed a few years ago and is based on particularly beautiful ideas involving linear programming and probabilistic analysis. We present several of its extensions that work under much weaker assumptions regarding the database model, query structure, etc. Some of the extensions concern databases that can be described by an underlying graph model. All these variants assume an offline model of querying, where all the answers to the queries are collected first and then the adversary decides how to use them to gain additional information about the database. Later we will consider breaking database privacy in the online setting and show efficient algorithms for breaking privacy in that scenario. All these results, even though at first glance negative, may also serve as machine learning algorithms that reconstruct data from noisy, perturbed statistics, which is the setting of most practical applications. For instance, the aforementioned online algorithms can be applied in adversarial machine learning.
Note: We will not consider here any cryptographic aspects of breaking database privacy. Instead we will analyze this topic from the machine learning point of view.
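The core of the Dinur-Nissim attack is simple enough to sketch in a few lines (a toy instantiation with illustrative parameters and a bounded-noise model; see [38] for the actual guarantees): the secret database is a bit vector d, the adversary sees noisy subset-sum answers, asks a linear program for any fractional vector consistent with them up to the noise bound, and rounds it.

```python
# Toy LP-based reconstruction in the spirit of Dinur-Nissim [38].
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
n, m, E = 40, 400, 2.0                        # records, queries, per-answer noise bound
d = rng.integers(0, 2, size=n)                # secret database (bit vector)
Q = rng.integers(0, 2, size=(m, n))           # random subset queries (indicator rows)
answers = Q @ d + rng.uniform(-E, E, size=m)  # perturbed subset sums, error at most E

# feasibility LP:  answers - E <= Q x <= answers + E,  0 <= x <= 1
A_ub = np.vstack([Q, -Q])
b_ub = np.concatenate([answers + E, -(answers - E)])
res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n, method="highs")

reconstruction = (res.x > 0.5).astype(int)    # round the fractional solution
print("fraction of records recovered:", np.mean(reconstruction == d))
```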
Literature: [37] (the aforementioned online algorithm for breaking database privacy), [38] (original Dinur-Nissim algorithm), [39] (Dwork-McSherry-Talwar algorithm), [40] (Choromanski-Malkin algorithm).
4.12 Lecture 12
This “Open problems” lecture is dedicated to extending the general tools used in the Dinur-Nissim algorithm and the other presented techniques for breaking database privacy, as well as the adaptive anonymity algorithm. In particular, we will work on constructing online algorithms for breaking database privacy that do not rely on any low-dimensionality assumptions regarding the data. Another interesting open problem may involve extending Dinur-Nissim techniques to general graph-database models with minimal requirements regarding the structure of the graph queries used in that model. In the adaptive anonymity setting we will work on adapting the presented b-matching technique to the differential privacy setting. We will also work on proving much stronger theoretical guarantees, even in the k-anonymity scenario, by a careful analysis of the structure of the set of perfect matchings of the corresponding comparability bipartite graph produced by the b-matching algorithm.
References
[1] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning,
2(1):1–127, 2009.
[2] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by backpropagating errors. Nature, 323:533–536, 1986.
[3] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based
learning. In Shape, Contour and Grouping in Computer Vision, page 319, 1999.
[4] David Kriesel. A brief introduction to neural networks.
[5] Yu Cheng, Felix X. Yu, Rogerio S. Feris, Sanjiv Kumar, Alok Choudhary, and Shih-Fu Chang. Fast
neural networks with circulant projections. arXiv:1502.03436.
[6] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22(1):60–65, 2003.
[7] Nir Ailon and Bernard Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302–322, 2009.
[8] Victor Y. Pan. Structured matrices and polynomials: Unified superfast algorithms.
[9] Emile H. L. Aarts and Jan H. M. Korst. Boltzmann machines and their applications. In PARLE, Parallel
Architectures and Languages Europe, Volume I: Parallel Architectures, Eindhoven, The Netherlands,
June 15-19, 1987, Proceedings, pages 34–50, 1987.
[10] Geoffrey E. Hinton. A practical guide to training restricted Boltzmann machines. In Neural Networks: Tricks of the Trade - Second Edition, pages 599–619. 2012.
[11] Geoffrey E. Hinton and Terrence J. Sejnowski. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, pages 282–317. 1986.
[12] Misha Denil and Nando de Freitas. Toward the implementation of a quantum RBM. In NIPS 2011
Deep Learning and Unsupervised Feature Learning Workshop, 2011.
[13] Vincent Dumoulin, Ian J. Goodfellow, Aaron C. Courville, and Yoshua Bengio. On the challenges of physical implementations of RBMs. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27-31, 2014, Québec City, Québec, Canada, pages 1199–1205, 2014.
[14] Pankaj Mehta and David J. Schwab. An exact mapping between the variational renormalization group and deep learning. arXiv:1410.3831, 2014.
[15] William David McComb. Renormalization methods: A guide for beginners. 2004.
[16] C. Dwork. Differential privacy: A survey of results. In Theory and Applications of Models of Computation, 5th International Conference, TAMC 2008, Xi’an, China, April 25-29, 2008. Proceedings, pages
1–19, 2008.
[17] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data
analysis. In TCC, 2006.
[18] Latanya Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty,
Fuzziness and Knowledge-Based Systems, 10(5):557–570, 2002.
[19] Xiaokui Xiao and Yufei Tao. Personalized privacy preservation. In Proceedings of the ACM SIGMOD
International Conference on Management of Data, Chicago, Illinois, USA, June 27-29, 2006, pages
229–240, 2006.
[20] Mingqiang Xue, Panagiotis Karras, Chedy Raïssi, Jaideep Vaidya, and Kian-Lee Tan. Anonymizing
set-valued data by nonreciprocal recoding. In The 18th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD ’12, Beijing, China, August 12-16, 2012, pages 1050–1058,
2012.
[21] Graham Cormode, Divesh Srivastava, Smriti Bhagat, and Balachander Krishnamurthy. Class-based
graph anonymization for social network data. PVLDB, 2(1):766–777, 2009.
[22] Anna Choromanska, Krzysztof Choromanski, Geetha Jagannathan, and Claire Monteleoni.
Differentially-private learning of low dimensional manifolds. In Algorithmic Learning Theory - 24th
International Conference, ALT 2013, Singapore, October 6-9, 2013. Proceedings, pages 249–263, 2013.
[23] Mariusz Bojarski, Anna Choromanska, Krzysztof Choromanski, and Yann LeCun. Differentially- and non-differentially-private random decision trees. arXiv:1410.6973, 2015.
[24] W. Fan, H. Wang, P. S. Yu, and S. Ma. Is random model better? On its accuracy and efficiency. In
ICDM, 2003.
[25] Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical
Association, 101:578-590, 2002.
[26] G. Biau, L. Devroye, and G. Lugosi. Consistency of random forests and other averaging classifiers. J.
Mach. Learn. Res., 9:2015–2033, 2008.
[27] G. Jagannathan, K. Pillaipakkamnatt, and R. N. Wright. A practical differentially private random
decision tree classifier. Trans. Data Privacy, 5(1):273–295, 2012.
[28] Sanjoy Dasgupta and Yoav Freund. Random projection trees and low dimensional manifolds. In
Proceedings of the 40th Annual ACM Symposium on Theory of Computing, Victoria, British Columbia,
Canada, May 17-20, 2008, pages 537–546, 2008.
[29] Sanjoy Dasgupta and Yoav Freund. Random projection trees for vector quantization. IEEE Transactions
on Information Theory, 55(7):3229–3242, 2009.
[30] Krzysztof Choromanski, Tony Jebara, and Kui Tang. Adaptive anonymity via b-matching. In Advances
in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United
States, pages 3192–3200, 2013.
[31] Harold N. Gabow and Robert Endre Tarjan. Faster scaling algorithms for general graph-matching
problems. J. ACM, 38(4):815–853, 1991.
[32] John E. Hopcroft and Richard M. Karp. An n^{5/2} algorithm for maximum matchings in bipartite graphs.
SIAM J. Comput., 2(4):225–231, 1973.
[33] Silvio Micali and Vijay V. Vazirani. An O(√|V| · |E|) algorithm for finding maximum matching in
general graphs. In 21st Annual Symposium on Foundations of Computer Science, Syracuse, New York,
USA, 13-15 October 1980, pages 17–27, 1980.
[34] Tony Jebara and Vlad Shchogolev. B-matching for spectral clustering. In Machine Learning: ECML
2006, 17th European Conference on Machine Learning, Berlin, Germany, September 18-22, 2006, Proceedings, pages 679–686, 2006.
[35] Bert C. Huang and Tony Jebara. Fast b-matching via sufficient selection belief propagation. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS
2011, Fort Lauderdale, USA, April 11-13, 2011, pages 361–369, 2011.
[36] Tony Jebara, Jun Wang, and Shih-Fu Chang. Graph construction and b-matching for semi-supervised
learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML
2009, Montreal, Quebec, Canada, June 14-18, 2009, pages 441–448, 2009.
[37] Krzysztof Choromanski, Afshin Rostamizadeh, and Umar Syed. An Õ(1/√t)-error online algorithm
for retrieving heavily perturbated statistical databases in the low-dimensional querying model. In
arXiv:1504.01117, 2015.
[38] Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In Proceedings of
the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems
(PODS), pages 202–210. ACM, 2003.
[39] Cynthia Dwork, Frank McSherry, and Kunal Talwar. The price of privacy and the limits of LP decoding.
In STOC, pages 85–94, 2007.
[40] Krzysztof Choromanski and Tal Malkin. The power of the dinur-nissim algorithm: breaking privacy of
statistical and graph databases. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012, Scottsdale, AZ, USA, May 20-24, 2012, pages
65–76, 2012.