Sharing Some Valuable Points
of Machine Learning
Notes of MLSS2014
Reporter: Xinliang Zhu
June 27, 2014
Contents
• Introduction
• Supervised Learning
• Weakly-supervised Learning
• Unsupervised Learning
Introduction
• What is Machine Learning?
– Traditional Software Process: interview the experts, then create an algorithm that automates their process
– Machine Learning Process: collect input-output examples from the experts, then learn a function to map from the input to the output
Introduction
• Some Concepts of statistical ML
– Supervised Learning
– Semi-supervised Learning
– Weakly-supervised Learning
– Unsupervised Learning
– Reinforcement Learning
Introduction
• 3 essential parts of statistical ML
– Model (a probability distribution or a discriminant function with unresolved parameters)
– Strategy (loss function and risk function)
– Algorithm (optimization algorithms, e.g. the stochastic gradient algorithm)
Statistical ML = Model + Strategy + Algorithm
--Hang Li
Contents
• Introduction
• Supervised Learning
• Weakly-supervised Learning
• Unsupervised Learning
Supervised Learning
• Given: Training examples (xi, f(xi)) for
some unknown function f.
• Find: A good approximation to f.
• Example
– Traffic sign recognition
• x: picture of traffic sign
• f(x): type of traffic sign
– Spam Detection
• x: email message
• f(x): spam or not spam
Supervised Learning
• Training examples are drawn independently
at random according to unknown
probability distribution P(x,y)
• The Learning algorithm analyzes the
examples and produces a classifier f
• Given a new data point (x,y) drawn from P,
the classifier is given x and predicts y
• The loss is then measured
• Goal of the learning algorithm: find the f
that minimizes the expected loss
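In symbols (standard risk-minimization notation; the slides do not show this formula), the expected loss of a classifier f is

R(f) = \mathbb{E}_{(x,y) \sim P}\big[ L(y, f(x)) \big] = \int L(y, f(x)) \, dP(x, y)

and the learning algorithm seeks the f that minimizes R(f).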
The Main Approaches to
Machine Learning
• Learn a classifier: a function f.
• Learn a conditional distribution: a
conditional distribution P(y|x)
• Learn the joint probability distribution:
P(x,y)
• One example of each method:
– Learn a classifier: The Perceptron algorithm
– Learn a conditional distribution: Logistic regression
– Learn the joint distribution: Linear discriminant
analysis
Linear Threshold Units
We assume that each feature xj and each weight wj is a real number; a linear threshold unit classifies x by the sign of the weighted sum wTx (the canonical representation below folds the threshold into w).
We will study three different algorithms for learning linear threshold units:
• Perceptron: learns the classification function directly
• Logistic Regression: learns the conditional distribution P(y|x)
• Linear Discriminant Analysis: learns the joint distribution P(x,y)
A canonical representation
• Given a training example of the form
(<x1,x2,x3,x4>,y)
• Transform it to
(<-1, x1,x2,x3,x4>,y)
• The parameter vector will then be
(w0,w1,w2,w3,w4)
• We will call the unthresholded hypothesis u(x, w) = <w, x> = wTx
• Each hypothesis can be written h(x) = sgn(u(x, w))
• Our goal is to find w
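A minimal NumPy sketch of this canonical representation and the thresholded hypothesis (illustrative names, not from the slides):

import numpy as np

def augment(x):
    # Prepend the constant -1 feature so the threshold w0 becomes part of w
    return np.concatenate(([-1.0], x))

def u(x, w):
    # Unthresholded hypothesis: the inner product <w, x>
    return np.dot(w, x)

def h(x, w):
    # Thresholded hypothesis: the sign of u(x, w) on the augmented input
    return 1 if u(augment(x), w) >= 0 else -1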
Geometrical View
• Consider 3 training examples:
– (<1.0,1.0>,+1)
– (<0.5,3.0>,+1)
– (<2.0,2.0>,-1)
• We want a linear classifier that separates the two positive examples from the negative one:
The Unthresholded Discriminant
Function is a Hyperplane
• The equation u(x,w)=wTx is a plane
Machine Learning==Optimization
• Given:
– A set of N training examples
{(x1, y1), (x2, y2), …, (xN, yN)}
– A loss function L
• Find
– The weight vector w that minimizes the average loss on the training data
Step-wise Constant Loss Function
The 0/1 loss is piecewise constant, so its derivative is either 0 or undefined (infinite at the jumps); it cannot be minimized directly by gradient methods
Approximating the expected loss by a
smooth function
• Simplify the optimization problem by
replacing the original objective function
by a surrogate loss function.
• Hinge loss: L(y, u) = max(0, 1 − y·u), where u = wTx
When y = 1: L = max(0, 1 − u), which is zero once u ≥ 1 and grows linearly as u decreases
Minimizing J by Gradient Descent Search
Batch Perceptron Algorithm
Online Perceptron Algorithm
This is also called stochastic gradient descent because the
overall gradient is approximated by the gradient from
each individual example
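A minimal sketch of this online (stochastic-gradient) perceptron, assuming inputs already in the canonical representation above; illustrative code, not the slide's exact pseudocode:

import numpy as np

def online_perceptron(X, y, epochs=10, eta=1.0):
    # X: (N, d) array of augmented feature vectors; y: (N,) array of +1/-1 labels
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # this example is misclassified (or on the boundary)
                w += eta * yi * xi        # gradient step computed from this single example
    return w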
Logistic Regression
• Learn the conditional probability P(y|x)
• Let py(x; w) be our estimate of P(y|x), where w is a vector of adjustable parameters. Assume only two classes, y = 0 and y = 1, and
p1(x; w) = exp(wTx) / (1 + exp(wTx)),  p0(x; w) = 1 − p1(x; w)
• It is easy to show that this is equivalent to
log [ p1(x; w) / p0(x; w) ] = wTx
• In other words, the log odds of class 1 is a linear function of x
The reason for choosing the exp function
• A linear function ranges from negative infinity to positive infinity, but a probability must be positive and the class probabilities must sum to 1; the exponential (logistic) transformation enforces both
Choosing the Loss Function
• For probabilistic models, we use the log loss:
L(y, x) = − log py(x; w) = − [ y log p1(x; w) + (1 − y) log (1 − p1(x; w)) ]
Compare with 0/1 Loss
Maximum Likelihood Fitting
• To minimize the log loss, we should maximize the probability the model assigns to the training labels
• The likelihood of the data is:
Π_i p_{yi}(xi; w)
• It is easier to work with the log likelihood:
ℓ(w) = Σ_i [ yi log p1(xi; w) + (1 − yi) log (1 − p1(xi; w)) ]
Maximizing the log likelihood via
gradient ascent
• Similar to the gradient descent procedure used for the perceptron, we need to rewrite the log likelihood in terms of p1(xi; w) and ascend its gradient
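A minimal sketch of batch gradient ascent on this log likelihood, whose gradient is Σ_i (yi − p1(xi; w)) xi; illustrative code, not from the slides (assumes roughly normalized features and 0/1 labels):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, epochs=1000, eta=0.01):
    # X: (N, d) augmented feature matrix; y: (N,) array of 0/1 labels
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p1 = sigmoid(X @ w)       # current estimates of P(y=1 | x)
        grad = X.T @ (y - p1)     # gradient of the log likelihood
        w += eta * grad           # ascend the gradient
    return w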
Logistic Regression Implements
a Linear Discriminant Function
• In the two-class 0/1 loss case, we should predict y = 1 if P(y=1|x; w) > 0.5, i.e. if P(y=1|x; w) / P(y=0|x; w) > 1
• Taking the log of both sides gives the linear rule wTx > 0
The Joint Probability Approach:
Linear Discriminant Analysis
• Learn P(x,y). This is called the generative
approach, because we can think of P(x,y) as
a model of how the data is generated
– For example, if we factor the joint distribution into the
form P(x,y)=P(y)P(x|y)
– Generative story
• Draw y ~ P(y) (choose a class)
• Draw x ~ P(x|y) (generate the features for x)
– This can be represented as a probabilistic graphical
model
Linear Discriminant Analysis
• P(y) is a discrete multinomial distribution
• For LDA, we assume that P(x|y) is a multivariate normal distribution with mean μk and covariance matrix Σ
The LDA Model
• Linear discriminant analysis assumes that the joint distribution has the form
P(x, y = k) = P(y = k) · N(x; μk, Σ)
i.e. a class prior times a Gaussian with class-specific mean μk and a covariance Σ shared across classes
Fitting the LDA Model
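The fitting step uses the usual maximum-likelihood estimates: class priors, class means, and a covariance matrix shared across classes. A minimal NumPy sketch under that assumption (illustrative, not the slide's exact equations):

import numpy as np

def fit_lda(X, y):
    # X: (N, d) feature matrix; y: (N,) integer class labels
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}         # P(y = k)
    means = {k: X[y == k].mean(axis=0) for k in classes}   # class means mu_k
    # Shared covariance: pooled deviations of each point from its class mean
    centered = X - np.vstack([means[k] for k in y])
    cov = centered.T @ centered / len(y)
    return priors, means, cov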
LDA learns an LTU
• Because both classes share the covariance Σ, the log posterior ratio log [ P(y=1|x) / P(y=0|x) ] is a linear function of x, so LDA's decision rule is a linear threshold unit
Two Geometric Views of LDA
• View 1: Mahalanobis Distance
• View 2: Most Informative Low-Dimensional Projection
Comparing Perceptron, Logistic
Regression, and LDA
• Statistical Efficiency: If the generative model
P(x,y) is correct, then LDA usually gives the
highest accuracy, particularly when the amount
of training data is small. If the model is correct,
LDA requires 30% less data than Logistic
Regression in theory
• Computational Efficiency: Generative models are typically the easiest to learn. The LDA parameters can be computed directly from the data without using gradient descent.
Comparing Perceptron, Logistic
Regression, and LDA
• Vapnik’s Principle
– If your goal is to minimize 0/1 loss, then you should do that directly, rather than first solving a harder problem (probability estimation)
– This is what Perceptron does
– Other algorithms that follow this principle
• SVM
• Decision Tree
• Neural Networks
Comparing Perceptron, Logistic
Regression, and LDA
• Robustness to model assumptions. The
generative model usually performs poorly when
the assumptions are violated. Logistic
Regression is more robust to model assumptions,
and Perceptron is even more robust.
• Robustness to missing values and noise. In many
applications, some of the features xij may be
missing or corrupted in some of the training
examples. Generative models typically provide a better way of handling this than non-generative models.
Thank you!
Sharing Some Valuable Points
of Machine Learning
Notes of MLSS2014
Reporter: Xinliang Zhu
July 4, 2014
Contents
• Introduction
• Supervised Learning
• Weakly-supervised Learning
• Unsupervised Learning
Weakly-supervised Learning
Answer to a question left from the last lecture
• When is ATB = BTA?
Given: (AB)T = BTAT
Therefore: (ATB)T = BTA
So ATB = BTA holds exactly when (ATB)T = ATB, i.e. when ATB is a symmetric matrix
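A quick numerical check of this fact (illustrative only):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))
B = rng.standard_normal((3, 2))

M = A.T @ B
print(np.allclose(M, B.T @ A), np.allclose(M, M.T))   # False False for generic A, B

B = A                                                 # now A.T @ B is symmetric
print(np.allclose(A.T @ B, B.T @ A))                  # True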
Weakly-supervised Learning
Why weakly-supervised learning?
Weakly-supervised Learning
• Collecting full annotations for all the
images and videos in a large dataset is an
onerous and expensive task
• The largest datasets, consisting of millions of images, provide only image-level labels
• Instead of relying on small fully supervised datasets for complex visual tasks, weakly-supervised learning lets us use these large, inexpensive datasets
Multiple Instance Learning
• Training examples are grouped into bags of instances; labels are given only at the bag level
• Under the standard MIL assumption, a bag is positive if and only if at least one of its instances is positive
Multiple Instance Learning (discriminant)
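As a minimal sketch of this discriminant view under the standard MIL assumption (a bag is positive iff some instance scores above a threshold), with a hypothetical linear instance scorer not taken from the slides:

import numpy as np

def bag_score(instances, w):
    # instances: (m, d) array of instance feature vectors in one bag
    return np.max(instances @ w)      # the bag score is its best instance score

def bag_label(instances, w, threshold=0.0):
    # Positive bag iff at least one instance scores above the threshold
    return 1 if bag_score(instances, w) > threshold else -1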
Multiple Instance Learning Algorithms
• Learning Axis-Parallel Concepts (Dietterich et al., 1997)
• Diverse Density (DD) (Maron and Lozano-Pérez, 1998)
• EM-DD (Zhang and Goldman, 2001)
• Citation kNN (Wang and Zucker, 2000)
• SVM for multi-instance learning (Andrews et al., 2002)
Multiple Instance Learning Applications
• Drug activity prediction
• Content-based image retrieval and
classification
• Text categorization
Contents
• Introduction
• Supervised Learning
• Weakly-supervised Learning
• Unsupervised Learning
Unsupervised Learning
• Clustering
• Change detection
Clustering
• Unlabeled data
• Clustering: the process of grouping a set of objects into classes of similar objects
Issues For Clustering
• Representation for clustering
– similarity/distance
• How many clusters?
– Fixed a priori?
– Completely data driven?
Hard vs. Soft Clustering
• Hard clustering: Each document belongs to exactly
one cluster
– More common and easier to do
• Soft clustering: A document can belong to more than
one cluster.
– You may want to put a pair of sneakers in two clusters: (i) sports
apparel and (ii) shoes
– You can only do that with a soft clustering approach.
Clustering Algorithms
• Flat algorithms
– Usually start with a random (partial) partitioning
– Refine it iteratively
• K means clustering
• (Model based clustering)
• Hierarchical algorithms
– Bottom-up, agglomerative
– (Top-down, divisive)
K-Means
• Assumes data are real-valued vectors.
• Clusters based on centroids (aka the
center of gravity or mean) of points in a
cluster, c:
μ(c) = (1/|c|) Σ_{x ∈ c} x
• Reassignment of instances to clusters is
based on distance to the current cluster
centroids.
K-Means Algorithm
• Select K random points {s1, s2,… sK} as seeds.
• Until clustering converges (or other stopping
criterion):
• For each point di:
• Assign di to the cluster cj such that dist(di, sj) is minimal
• (Next, update the seeds to the centroid of each cluster)
• For each cluster cj: sj = μ(cj)
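A minimal NumPy sketch of this procedure, with random-seed initialization and Euclidean distance (illustrative, not the slide's exact pseudocode; assumes no cluster becomes empty):

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Select K random points as the initial seeds
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(iters):
        # Assign each point to the nearest current center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the centroid of its cluster
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return centers, labels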
How Many Clusters
• Number of clusters K is given
• Finding the “right” number of clusters
is part of the problem
Change Detection
• Goal: Given two sets of samples, we want to compare the probability distributions behind them
• Two approaches:
– Distributional change detection
– Structural change detection
Distributional Change Detection
• Goal: Detect change between the probability distributions behind two sets of samples by estimating a divergence between them
Examples
• ROI detection in images
• Event detection in movies
• Event detection from Twitter
Distances and Divergences
• Kullback-Leibler Divergence
• f-Divergences
• Pearson Divergence
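For reference, the standard definitions behind these slide titles (the slides' own formulas are not reproduced here) are, for densities p and q:

\mathrm{KL}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx

D_f(P \,\|\, Q) = \int q(x) \, f\!\left(\frac{p(x)}{q(x)}\right) dx, \quad f \text{ convex},\; f(1) = 0

\mathrm{PE}(P \,\|\, Q) = \frac{1}{2} \int q(x) \left(\frac{p(x)}{q(x)} - 1\right)^{2} dx

KL is the f-divergence with f(t) = t log t, and the Pearson (χ²) divergence is the f-divergence with f(t) = ½(t − 1)² (some authors omit the factor ½).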
Estimate Densities
• Maximum likelihood estimation
• Bayes estimation
• Kernel density estimation
• Nearest-neighbor density estimation
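As one concrete instance from this list, a minimal sketch of kernel density estimation in one dimension with a Gaussian kernel and a hand-picked bandwidth (illustrative only):

import numpy as np

def kde_gaussian(samples, x, bandwidth=0.5):
    # Estimate the density at query points x from 1-D samples
    samples = np.asarray(samples)[None, :]   # shape (1, N)
    x = np.asarray(x)[:, None]               # shape (M, 1)
    z = (x - samples) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / bandwidth   # (1 / (N h)) * sum of kernels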
Thank you!