Advances in Machine Learning
Klaus-Robert Müller et al.
IDA’s Role in BBDC
AP5: Statistical DA & ML (IDA):
Summarize requirements from the application partners: AP1 (material science), AP2 (video mining), AP3 (language processing), AP4 (image analysis in medicine).
AP6: Scalable ML:
Develop new methods to be added to the technology stack.
[Diagram: data characteristics (huge, complex-structured, multi-modal, distributed, streaming, partially observed); trivially parallelizable tasks (pre-processing, linear regression, etc.) vs. non-trivial tasks requiring further scalability and justification; new technologies are handed over to DIMA (AP7/8).]
Requirements & Solutions
Application partners: AP1 (material science), AP2 (video mining), AP3 (language processing), AP4 (image analysis in medicine).
Data: huge, complex-structured, multi-modal, distributed, streaming, partially observed.
Requirements: efficient, distributed, handy, deep, online, transductive + classification, regression, clustering, inference, retrieval.
Solutions: data indexing, deep learning, decomposable optimization, multi-modal analysis.
Projects done
Data indexing:
Multi-purpose Locality Sensitive Hashing (mpLSH)
Deep learning:
Deep Tensor Neural Networks
Explaining Non-linear Machine Learning
Decomposable Optimization:
Multi-class SVM for Extreme Classification
Parallel Matrix Factorization
Polynomial-time Message Passing for High-order Potentials
Performance Guarantee of Approximate Bayesian Learning
Multi-modal Analysis:
Multi-modal Source Power Co-modulation (mSPoC)
Transductive Conditional Random Field Regression (TCRFR)
Data Indexing
A key technology for sub-linear time nearest neighbor (NN) search.
Required new technologies:
Multi-purpose LSH for the following applications:
General-purpose indexing at the data collection phase (without a fixed analysis plan).
Material search with optimized properties (e.g., stability, utility, etc.).
Image/video retrieval with adjustable query (e.g., preference + closeness).
[Diagram: data collection → indexing (time-consuming) → NN search (sub-linear time); existing codes L2-LSH, sign-LSH, simple-LSH; can a single multi-purpose LSH replace them?]
Projects done:
Multi-purpose Locality Sensitive Hashing (mpLSH)
Deep Learning
Neural networks are scalable learning machines for capturing complex structure in big data.
Required new technologies:
Architecture design for new applications
Explaining learning machines
Insights into target problems
Projects done:
Deep Tensor Neural Networks
Explaining Non-linear Machine Learning
Decomposable Optimization
Decomposability is key for efficient/parallel computation.
Required new technologies:
Find an equivalent/approximate decomposable formulation of non-decomposable problems.
Algorithm design for efficient/parallel computation.
Assessment of accuracy of approximation.
[Diagram: the non-decomposable multi-class SVM objective is reformulated for coordinate descent (decomposable); a toy sketch follows below.]
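As a toy illustration of why decomposability matters, the sketch below runs coordinate descent for ridge regression, where each one-dimensional subproblem has a closed-form update; the function name and parameters are illustrative, not from the BBDC code base.

```python
import numpy as np

def ridge_coordinate_descent(X, y, lam=1.0, n_epochs=50):
    """Minimize 0.5*||y - Xw||^2 + 0.5*lam*||w||^2 one coordinate at a time."""
    n, d = X.shape
    w = np.zeros(d)
    residual = y - X @ w                    # equals y initially
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_epochs):
        for j in range(d):                  # each coordinate has a closed-form solution
            residual += X[:, j] * w[j]      # drop coordinate j's contribution
            w[j] = X[:, j] @ residual / (col_sq[j] + lam)
            residual -= X[:, j] * w[j]      # re-add it with the updated value
    return w

# Coordinates that touch disjoint parts of the problem can be updated in parallel;
# the same structure is exploited for the class pairs of the multi-class SVM.
```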
Projects done:
Multi-class SVM for extreme Classification
Parallel Matrix Factorization
Polynomial-time Message Passing for High-order Potentials
Performance Guarantee of Approximate Bayesian Learning
[Figure: pair-wise parallelization.]
Multi-modal Analysis
Big data can be multi-modal and heterogeneous (e.g., different S/N ratios, temporal/spatial resolutions).
Required new technologies:
Information mixing/propagation.
Common (latent) feature extraction.
[Diagram: a modality with high temporal resolution (poor spatial resolution) + a modality with high spatial resolution (poor temporal resolution) → mSPoC pattern; a generic sketch of common latent feature extraction follows below.]
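As a generic, hedged illustration of common latent feature extraction from two modalities, the sketch below computes the first canonical correlation pair with CCA; this is a stand-in for the idea, not the mSPoC or TCRFR algorithms, and all names are illustrative.

```python
import numpy as np

def first_canonical_pair(X, Y, reg=1e-6):
    """Return weight vectors wx, wy maximizing corr(X @ wx, Y @ wy)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whiten each modality, then take the top singular pair of the cross-covariance.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(M)
    wx = np.linalg.solve(Lx.T, U[:, 0])
    wy = np.linalg.solve(Ly.T, Vt[0])
    return wx, wy, s[0]   # s[0] is the canonical correlation

# Usage (toy): two noisy views of the same latent signal.
rng = np.random.default_rng(0)
z = rng.standard_normal(500)
X = np.outer(z, rng.standard_normal(10)) + 0.5 * rng.standard_normal((500, 10))
Y = np.outer(z, rng.standard_normal(5)) + 0.5 * rng.standard_normal((500, 5))
wx, wy, rho = first_canonical_pair(X, Y)   # rho should be close to 1
```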
Projects done:
[Figure: noisy seismic data (densely available) + accurate direct data (sparsely available) → estimated latent structure.]
Multi-modal Source Power Co-modulation (mSPoC)
Transductive Conditional Random Field Regression (TCRFR)
Multi-purpose Locality Sensitive Hashing (mpLSH)
Nearest neighbor search (NNS)
A naïve implementation (linear scan) requires O(N) time.
Locality sensitive hashing (LSH)
LSH enables approximate NNS in O(N^ρ) time for some ρ < 1.
Random projections provide good LSH functions.
Only samples that fall into the same hash bucket need to be evaluated.
Different similarities require different hashing schemes:
L2-LSH for L2 similarity [Datar et al., 2004]
sign-LSH for cosine similarity [Goemans & Williamson, 1995]
simple-LSH for inner product (IP) similarity [Neyshabur & Srebro, 2015]
Multi-purpose similarity: is there a general LSH coding that serves multiple purposes at once?
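A minimal sketch of sign-LSH (random hyperplane hashing for cosine similarity, in the spirit of the Goemans & Williamson construction cited above); class and parameter names are illustrative and not taken from the BBDC implementation.

```python
import numpy as np

class SignLSH:
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.hyperplanes = rng.standard_normal((dim, n_bits))  # random projections

    def codes(self, X):
        """Hash rows of X (n_samples x dim) to n_bits-bit sign codes."""
        return (np.atleast_2d(X) @ self.hyperplanes >= 0)

    def candidates(self, db_codes, query_code):
        """Only samples whose code matches the query's bucket are evaluated."""
        return np.where((db_codes == query_code).all(axis=1))[0]

# Usage: index once, then answer queries by scanning only the matching bucket.
rng = np.random.default_rng(1)
X = rng.standard_normal((10_000, 64))
q = rng.standard_normal(64)

lsh = SignLSH(dim=64, n_bits=12)
db_codes = lsh.codes(X)
cand = lsh.candidates(db_codes, lsh.codes(q)[0])
best = cand[np.argmax(X[cand] @ q / np.linalg.norm(X[cand], axis=1))] if len(cand) else None
```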
Multi-Purpose Locality Sensitive Hashing (mpLSH)
Multi-purpose similarity handled in three steps:
Feature augmentation for metric transform (a common part plus a similarity-dependent part).
LSH coding by random projection.
Multi-metric search in code space by a cover tree.
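To make the "feature augmentation for metric transform" step concrete, here is a hedged sketch of the simple-LSH augmentation of Neyshabur & Srebro (2015), which reduces inner-product search to cosine similarity so that one random-projection code can serve several similarities; it illustrates the principle, not the actual mpLSH augmentation.

```python
import numpy as np

def augment_data(X):
    """P(x) = [x/s, sqrt(1 - ||x/s||^2)], with s the largest row norm, so every row lies in the unit ball."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    s = np.linalg.norm(X, axis=1).max()
    Xs = X / s
    pad = np.sqrt(np.clip(1.0 - (Xs ** 2).sum(axis=1), 0.0, None))
    return np.hstack([Xs, pad[:, None]])

def augment_query(q):
    """Q(q) = [q/||q||, 0]; then <P(x), Q(q)> is proportional to <x, q>."""
    q = np.asarray(q, dtype=float)
    return np.append(q / np.linalg.norm(q), 0.0)

# After augmentation, a single cosine-similarity code (e.g. the SignLSH sketch above)
# also ranks items by inner product with the query.
```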
mpLSH: Theoretical and empirical performance validation
(Conditional) LSH property theoretically guaranteed.
Approximate search performance validated on several real datasets.
Computational and memory efficiency demonstrated on large data (100M samples).
[Plots: the two proposed variants compared with baselines under L2, IP, and mixed similarities.]
Multi-Purpose Locality Sensitive Hashing (mpLSH)
Characteristics
Random projection based (data-independent).
Query-time weight adjustment supported by cover tree.
Similar memory requirement to sign-LSH.
Applications:
General-purpose indexing at the data collection phase (without a fixed analysis plan).
Material search with optimized properties (e.g., stability, utility, etc.)
Image/video retrieval with adjustable query (e.g., preference + closeness).
[Diagram: data collection → indexing (time-consuming) → NN search (sub-linear time); L2-LSH, sign-LSH, and simple-LSH are replaced by the single multi-purpose LSH.]
Demo: Image retrieval system with adjustable objective
http://bbdcdemo.bbdc.tu-berlin.de
Retrieves images similar to a user-provided query (L2-query) while taking a user-preference query (IP-query) into account; the mixing weight is adjusted in real time (see the scoring sketch below).
[Result panels: closest to the L2-query, best match to the IP-query, most relevant to the mixed query.]
Query search in ~100 msec!
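For illustration, a brute-force version of the demo's mixed scoring might look as follows; mpLSH answers the same kind of query in sub-linear time, and the weights shown are placeholders chosen at query time.

```python
import numpy as np

def mixed_query_scores(X, l2_query, ip_query, w_l2=0.5, w_ip=0.5):
    """Score = w_l2 * (negative squared L2 distance) + w_ip * (inner product with the preference query)."""
    closeness = -np.linalg.norm(X - l2_query, axis=1) ** 2   # similarity to the L2-query
    preference = X @ ip_query                                # match with the IP-query
    return w_l2 * closeness + w_ip * preference

# Usage: rank items by the mixed score; the weights can be changed per query.
# top5 = np.argsort(-mixed_query_scores(X, q_l2, q_ip, w_l2=0.7, w_ip=0.3))[:5]
```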
Synergy within BBDC
[Diagram: the mpLSH project combines contributions from Machine Learning (IDA): locality sensitive hashing, approximate NN search, random projection theory, PAC theory, structured indexing, cover tree; Data Management (DIMA): distributed implementation (ongoing work); Computer Vision (HHI): image/video mining, image features, retrieval system.]
Multi-class SVM for Extreme Classification
SVMs for Extreme Classification
Extreme Classification:
• Big Data: many thousands of classes (e.g., > 10^4)!
⇨ E.g. text classification, object recognition, …
⇨ Often a view of a database with categorical features.
• Combined with high-dimensional features (e.g., > 10^5):
⇨ Huge models: hundreds of gigabytes! Need to distribute!
Support Vector Machines (SVMs):
• One of the most successful classification algorithms!
⇨ In the multi-class setting one can use One-vs-Rest (OVR) or All-in-One (AIO) training.
⇨ All-in-One is superior on small problems, yet unexplored on large datasets.
SVMs for Extreme Classification
SVMs for Extreme Classification:
• The increase in the number of classes gives rise to new parallelization schemes.
⇨ We are the first to present model-distributed All-in-One SVMs (Alber et al., 2016).
⇨ This allowed us to compare OVR and AIO SVMs at an unprecedented scale.
Why not OVR, which is embarrassingly parallel?
• AIO has a holistic view on the classification problem.
⇨ Models are an order of magnitude sparser.
⇨ Accuracy is significantly better.
⇨ This suggests AIO is superior to OVR also for larger datasets!
SVMs for Extreme Classification
The All-in-One SVM of Weston and Watkins (WW) decomposes into per-example, per-class subproblems. In its standard form the WW objective is
min_W (1/2) Σ_c ||w_c||² + C Σ_i Σ_{c≠y_i} max(0, 1 − (w_{y_i} − w_c)ᵀ x_i),
so the non-decomposable multi-class SVM is reformulated for coordinate descent (decomposable): the subproblem for example i and class c ≠ y_i only involves w_{y_i} and w_c, hence w_{y_i} and w_c can be updated in parallel across disjoint class pairs!
Smart update schemes for broadest parallelization:
As with soccer teams in a season, each class pair meets exactly once per epoch; a more involved scheme is used in the distributed setting (a round-robin sketch follows below):
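A hedged sketch of such a round-robin schedule (the standard "circle method"); it illustrates the soccer-season analogy and is not necessarily the exact scheme of Alber et al. (2016).

```python
def round_robin_rounds(n_classes):
    """In each round the classes split into disjoint pairs, so all pair subproblems
    of that round can be updated in parallel; over one epoch every pair meets once."""
    classes = list(range(n_classes)) + ([None] if n_classes % 2 else [])  # pad with a bye
    n = len(classes)
    rounds = []
    for _ in range(n - 1):
        pairs = [(classes[i], classes[n - 1 - i]) for i in range(n // 2)]
        rounds.append([(a, b) for a, b in pairs if a is not None and b is not None])
        classes = [classes[0]] + [classes[-1]] + classes[1:-1]  # rotate all but the first
    return rounds

# Example: 4 classes -> 3 rounds of 2 disjoint pairs each, covering all 6 pairs once.
# round_robin_rounds(4) == [[(0, 3), (1, 2)], [(0, 2), (3, 1)], [(0, 1), (2, 3)]]
```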
SVMs for Extreme Classification
Speedup and timing on LSHTC-large:
Explaining Nonlinear Machine Learning
Explaining Predictions Pixel-wise
[Bach, Binder, Klauschen, Montavon, Müller & Samek, PLOS ONE 2015]
Explaining Predictions Pixel-wise
Neural networks
Kernel methods
Application: Comparing Classifiers
Learning Atomistic Representations with Deep Tensor Neural Networks
Kristof Schütt, Farhad Arbabzadah, Stefan Chmiela, Alexandre Tkatchenko, Klaus-Robert Müller
Machine Learning for chemical compound space
Ansatz: learn the mapping from molecular structure {Z_i, R_i} to the energy directly from data, instead of solving the Schrödinger equation for each compound.
[from von Lilienfeld]
Coulomb representation of molecules
Each atom i is described by its nuclear charge Z_i and position R_i. The Coulomb matrix M has entries
M_ii = (1/2) Z_i^2.4
M_ij = Z_i Z_j / |R_i − R_j|   (i ≠ j)
Molecules with fewer atoms are padded with phantom atoms {Z = 0, R}.
Coulomb Matrix (Rupp, Müller et al., 2012, PRL)
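A minimal sketch of computing the Coulomb-matrix descriptor as defined above; variable names and the padding convention are illustrative.

```python
import numpy as np

def coulomb_matrix(Z, R, n_max=None):
    """Z: nuclear charges, R: atomic positions (atomic units); pad to n_max with phantom atoms."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4                          # M_ii = (1/2) Z_i^2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])  # M_ij = Z_i Z_j / |R_i - R_j|
    if n_max is not None and n_max > n:
        padded = np.zeros((n_max, n_max))                            # phantom atoms: Z = 0
        padded[:n, :n] = M
        M = padded
    return M
```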
Kernel ridge regression
Distances between Coulomb matrices M define a Gaussian kernel matrix K, K_ij = exp(−d(M_i, M_j)² / (2σ²)).
Predict the energy as a sum over weighted Gaussians, E(M) = Σ_i α_i exp(−d(M, M_i)² / (2σ²)),
using weights that minimize the error on the training set.
Exact solution: α = (K + λI)⁻¹ E.
As many parameters as molecules, plus 2 global parameters: the characteristic length-scale or "kT" of the system (σ) and the noise level (λ).
[from von Lilienfeld]
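A minimal sketch of the kernel ridge regression step described above, assuming molecules are already represented by fixed-length descriptor vectors (e.g., sorted Coulomb-matrix eigenvalues); the hyperparameter values are placeholders.

```python
import numpy as np

def gaussian_kernel(D1, D2, sigma):
    """D1, D2: rows are descriptor vectors; returns the Gaussian kernel between all pairs."""
    dists = np.linalg.norm(D1[:, None, :] - D2[None, :, :], axis=-1)
    return np.exp(-dists ** 2 / (2 * sigma ** 2))

def krr_fit(D_train, E_train, sigma=100.0, lam=1e-8):
    K = gaussian_kernel(D_train, D_train, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(E_train)), E_train)  # exact solution
    return alpha

def krr_predict(D_test, D_train, alpha, sigma=100.0):
    return gaussian_kernel(D_test, D_train, sigma) @ alpha   # sum over weighted Gaussians
```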
Predicting Energy of small molecules: Results
March 2012: Rupp et al., PRL: 9.99 kcal/mol (kernels + eigenspectrum).
December 2012: Montavon et al., NIPS: 3.51 kcal/mol (neural nets + Coulomb sets).
2015: Hansen et al.: 1.3 kcal/mol, 10 million times faster than the state of the art.
A prediction is considered chemically accurate when the MAE is below 1 kcal/mol.
Dataset available at http://quantum-machine.org
Quantum Chemical Insights
[Schütt et al 2014 PRB, Schütt et al. Nature
Communications]
Conclusions
High national and international visibility, top publications, workshops, etc.
Nine projects done in the four research areas.
Data indexing:
Multi-purpose Locality Sensitive Hashing (mpLSH)
Deep learning:
Deep Tensor Neural Networks
Explaining Non-linear ML
Decomposable Optimization:
Multi-class SVM for Extreme Classification
Parallel Matrix Factorization
Polynomial-time Message Passing for High-order Potentials
Performance Guarantee of Approximate Bayesian Learning
Multi-modal Analysis:
Multi-modal Source Power Co-modulation (mSPoC)
Transductive Conditional Random Field Regression (TCRFR)
Further collaboration will take place in the second half of the BBDC project.
Publications
Alber, M., Zimmert, J., Dogan, U., & Kloft, M. (2016). Distributed Optimization of Multi-Class SVMs. NIPS Extreme
Classification Workshop
Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K. R., & Samek, W. (2015). On pixel-wise explanations
for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7): e0130140.
Bach, S., Binder, A., Montavon, G., Müller, K.-R. & Samek, W. (2016). Analyzing Classifiers: Fisher Vectors and
Deep Neural Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2016.
Bauer, A., Braun, M., & Müller, K. R. (2015). Accurate Maximum-Margin Training for Parsing with Context-Free Grammars. IEEE TNNLS.
Bauer, A., Nakajima, S., and Müller, K. R. (2016). Efficient Exact Inference with Loss Augmented Objective in
Structured Learning. IEEE TNNLS
Dähne, S., Bießmann, F., Samek, W., Haufe, S., Goltz, D., Gundlach, C., Villringer, A., Fazli, S., & Müller, K.-R.
(2015). Multivariate machine learning methods for fusing functional multimodal neuroimaging data.
Proceedings of the IEEE, 103(9), 1507-1530.
Görnitz, N., Lima, L. A., Varella, L. E., Müller, K. R., Nakajima, S. (2015). Transductive Regression for Data with
Latent Dependency Structure. IEEE TNNLS (submitted).
Hansen, S. T., Winkler, I., Hansen, L. K., Müller, K.-R., and Dähne, S. (2015). Fusing simultaneous EEG and fMRI
using functional and anatomical information. In International Workshop on Pattern Recognition in
Neuroimaging, 2015. IEEE
Lapuschkin, S., Binder, A., Montavon, G., Samek, W., & Müller, K. R. (2016). The LRP Toolbox for Artificial Neural
Networks. Journal of Machine Learning Research 17, 1-5.
Montavon, G., Bach, S., Binder, A., Samek, W., & Müller, K. R. (2015). Explaining nonlinear classification decisions
with deep Taylor decomposition. CoRR, abs/1512.02479.
Nakajima, S., Tomioka, R., Sugiyama, M., Babacan, S. D. (2015). Condition for Perfect Dimensionality Recovery
by Variational Bayesian PCA. Journal of Machine Learning Research 16, 3757-3811.
Pronobis, W., Panknin, D., Kirschnick, J., Srinivasan, V., Samek, W., Markl, V., Kaul, M., Müller, K.-R., &
Nakajima, S. (2016). Sharing Hash Codes for Multiple Purposes. arXiv:1609.03219 [stat.ML].
Samek, W., Binder, A., Montavon, G., Bach, S., and Müller, K. R. (2015) Evaluating the visualization of what a
deep neural network has learned. IEEE TNNLS, in Press
Schütt, K. T., Arbabzadah, F., Chmiela, S., Tkatchenko, A., Müller, K. R., (2016) Learning Atomistic
Representations with Deep Tensor Neural Networks, Nature Communications, in Press
Sturm, I., Bach, S., Samek, W., & Müller, K. R. (2016). Interpretable Deep Neural Networks for Single-Trial EEG
Classification. Journal of Neuroscience Methods.