
Learning Data Representations
with “Partial Supervision”
Ariadna Quattoni
Outline
• Motivation: Low dimensional representations.
• Principal Component Analysis.
• Structural Learning.
• Vision Applications.
• NLP Applications.
• Joint Sparsity.
• Vision Applications.
Semi-Supervised Learning
“Raw” feature space: $X \subseteq \mathbb{R}^d$. Output space: $Y = \{-1, +1\}$.
Core task: learn a function $F : X \to Y$ from a small labeled dataset $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$.
Classical setting: additionally, a large unlabeled dataset $U = \{x_1, x_2, \ldots, x_u\}$.
Partial supervision setting: a large partially labeled dataset $U = \{(x_1, c_1), (x_2, c_2), \ldots, (x_u, c_u)\}$.
Semi-Supervised Learning
Classical setting:
Unlabeled dataset → Learn representation $G : X \to X'$, with $X' \subseteq \mathbb{R}^h$.
Dimensionality reduction: $G(x) = \theta x$, where $\theta \in \mathbb{R}^{h \times d}$ and $h \ll d$.
Labeled dataset → Train classifier $F : G(X) \to Y$.
Semi-Supervised Learning
Partial supervision setting:
Unlabeled dataset + partial supervision → Learn representation $G : X \to X'$, with $X' \subseteq \mathbb{R}^h$.
Dimensionality reduction: $G(x) = \theta x$, where $\theta \in \mathbb{R}^{h \times d}$ and $h \ll d$.
Labeled dataset → Train classifier $F : G(X) \to Y$.
Why is “learning representations” useful?
• Infer the intrinsic dimensionality of the data.
• Learn the “relevant” dimensions.
• Infer the hidden structure.
Example: Hidden Structure
20 symbols: $S = \{s_1, s_2, \ldots, s_{20}\}$
4 topics: $T_1 = \{s_1, \ldots, s_5\}$, $T_2 = \{s_6, \ldots, s_{10}\}$, $T_3 = \{s_{11}, \ldots, s_{15}\}$, $T_4 = \{s_{16}, \ldots, s_{20}\}$
A subset of 3 symbols, e.g. $\{s_1, s_{10}, s_{11}\}$, is encoded as
$x = [x_1 = \tfrac{1}{3},\, x_2 = 0, \ldots, x_{10} = \tfrac{1}{3},\, x_{11} = \tfrac{1}{3}, \ldots, x_{20} = 0]$
Data Covariance
Generate a datapoint:
• Choose a topic T.
• Sample 3 symbols from T.
[Figure: 20 × 20 data covariance matrix of the generated datapoints.]
Example: Hidden Structure
• Number of latent dimensions = 4.
• Map each x to the topic that generated it.
• Function: $G(x) = \theta x$.
The projection matrix $\theta$ is a $4 \times 20$ block-indicator matrix: row $i$ has ones on the five symbols of topic $T_i$ and zeros elsewhere. Multiplying a datapoint by $\theta$ sums the mass that falls in each topic's block; for example, a datapoint whose three symbols all come from $T_1$ (entries of $\tfrac{1}{3}$ in the first block) is mapped to the latent representation (topic vector) $(1, 0, 0, 0)$.
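As a concrete illustration of this toy model, here is a minimal numpy sketch (names and sizes are illustrative, not from the slides) that generates datapoints by sampling 3 symbols from a random topic and recovers the generating topic with the block-indicator projection θ:

```python
import numpy as np

rng = np.random.default_rng(0)
n_symbols, n_topics, per_topic = 20, 4, 5

# theta: 4 x 20 block-indicator matrix, row i marks the symbols of topic T_i
theta = np.zeros((n_topics, n_symbols))
for i in range(n_topics):
    theta[i, i * per_topic:(i + 1) * per_topic] = 1.0

def sample_point():
    """Choose a topic, sample 3 of its symbols, return (x, topic)."""
    t = rng.integers(n_topics)
    symbols = rng.choice(np.arange(t * per_topic, (t + 1) * per_topic),
                         size=3, replace=False)
    x = np.zeros(n_symbols)
    x[symbols] = 1.0 / 3.0
    return x, t

X, topics = zip(*(sample_point() for _ in range(1000)))
X = np.asarray(X)

# G(x) = theta @ x concentrates all of x's mass on the generating topic
recovered = np.argmax(X @ theta.T, axis=1)
print("topic recovery accuracy:", np.mean(recovered == np.asarray(topics)))  # 1.0
```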
Outline
• Motivation: Low dimensional representations.
• Principal Component Analysis.
• Structural Learning.
• Vision Applications.
• NLP Applications.
• Joint Sparsity.
• Vision Applications.
Classical Setting
Principal Component Analysis
• Rows of $\theta$ as a ‘basis’: $x' = \sum_{j=1}^{4} z_j r_j$
• Examples generated by each topic:
  $T_1$: $z = (\tfrac{1}{3}, 0, 0, 0)$,  $T_2$: $z = (0, \tfrac{1}{3}, 0, 0)$,  $T_3$: $z = (0, 0, \tfrac{1}{3}, 0)$,  $T_4$: $z = (0, 0, 0, \tfrac{1}{3})$
• Low reconstruction error: $\|x' - x\|^2 = 2\left(\tfrac{1}{3}\right)^2$
Minimum Error Formulation
Approximate each high dimensional x with a low dimensional x', expressed in an orthonormal basis ($u_i^T u_j = \delta_{ij}$):
$x'_n = \sum_{i=1}^{m} z_{ni} u_i + \sum_{i=m+1}^{d} b_i u_i$
Error: $J = \frac{1}{N} \sum_{n=1}^{N} \|x_n - x'_n\|^2$
Solution, where S is the data covariance matrix: $S u_i = \lambda_i u_i$
$J = \sum_{i=m+1}^{d} u_i^T S u_i$
Distortion: $J = \sum_{i=m+1}^{d} \lambda_i$
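A minimal numpy sketch of this minimum-error view of PCA (illustrative, not from the slides): project onto the top-m eigenvectors of the data covariance and check that the average reconstruction error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))   # correlated toy data
Xc = X - X.mean(axis=0)

S = Xc.T @ Xc / len(Xc)                      # data covariance (d x d), 1/N normalization
eigvals, eigvecs = np.linalg.eigh(S)         # ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

m = 5
U = eigvecs[:, :m]                           # top-m principal directions
Z = Xc @ U                                   # low dimensional coordinates z
X_rec = Z @ U.T                              # reconstruction x'

distortion = np.mean(np.sum((Xc - X_rec) ** 2, axis=1))
print(distortion, eigvals[m:].sum())         # equal up to floating point error
```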
Principal Component Analysis
[Figure: 2D example showing the projection onto the first principal direction and the resulting error.]
• For uncorrelated variables, $u_i = e_i$ and $\lambda_i = \mathrm{var}(x_i)$: PCA simply discards dimensions according to their variance.
• For PCA to yield a useful low dimensional representation, the variables must be correlated.
Outline
• Motivation: Low dimensional representations.
• Principal Component Analysis.
• Structural Learning.
• Vision Applications.
• NLP Applications.
• Joint Sparsity.
• Vision Applications.
Partial Supervision Setting
[Ando & Zhang, JMLR 2005]
Unlabeled dataset + partial supervision → Create auxiliary tasks → Structure learning → $G : X \to X'$
Partial Supervision Setting
• Unlabeled data + partial supervision:
  • Images with associated natural language captions.
  • Video sequences with associated speech.
  • Documents with associated keywords.
• How could the partial supervision help?
  • It gives a hint for discovering important features.
  • Use the partial supervision to define “auxiliary tasks”.
  • Discover feature groupings that are useful for these tasks.
• Sometimes auxiliary tasks can be defined from unlabeled data alone, e.g. for word tagging, auxiliary tasks that predict substructures of the input.
Auxiliary Tasks
Core task: is this a computer vision or a machine learning article?
The collection contains computer vision papers and machine learning papers, each with associated keywords, e.g.:
• keywords: object recognition, shape matching, stereo
• keywords: machine learning, dimensionality reduction
• keywords: linear embedding, spectral methods, distance learning
To create auxiliary tasks, mask the occurrences of the keywords in the documents.
Auxiliary task: predict “object recognition” from the document content.
Auxiliary Tasks
$U = \{(x_1, c_1), (x_2, c_2), \ldots, (x_u, c_u)\}$
For each keyword $k_i$, build an auxiliary dataset $D_i = \{(x_1, y_1), (x_2, y_2), \ldots, (x_u, y_u)\}$ with
$y_j = +1$ if $k_i \in c_j$, and $y_j = -1$ otherwise.
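A small sketch (illustrative variable names and toy data, not from the slides) of how such auxiliary labels could be derived from the partial supervision:

```python
import numpy as np

# Partially labeled data: each example has a feature vector and a set of keywords.
# Hypothetical toy example; in practice x would be e.g. a bag-of-words vector.
examples = [
    {"x": np.array([1.0, 0.0, 1.0]), "keywords": {"object recognition", "stereo"}},
    {"x": np.array([0.0, 1.0, 0.0]), "keywords": {"machine learning"}},
    {"x": np.array([1.0, 1.0, 0.0]), "keywords": {"object recognition"}},
]

def auxiliary_dataset(examples, keyword):
    """Auxiliary task for one keyword: y = +1 if the keyword is attached, -1 otherwise."""
    X = np.stack([e["x"] for e in examples])
    y = np.array([1 if keyword in e["keywords"] else -1 for e in examples])
    return X, y

X_aux, y_aux = auxiliary_dataset(examples, "object recognition")
print(y_aux)  # [ 1 -1  1]
```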
Structure Learning
• Learning with no prior knowledge: the hypothesis space is the full class of linear predictors, $F = \{f(w, x) = w^T x\}$, and the best hypothesis is $\hat{f} = \arg\min_{f \in F} L(f, D_j)$, learned from the examples alone.
• Learning with prior knowledge: restrict the hypothesis space to $F(\theta) = \{f(v, x) = v^T \theta x\}$.
• Learning from auxiliary tasks: use the hypotheses learned for related tasks to choose the restricted space $F(\theta)$.
Learning Good Hypothesis Spaces
• Class of linear predictors: $f(v, x) = v^T \theta x$, where $\theta$ is an $h \times d$ matrix of structural (shared) parameters and $v$ holds the problem-specific parameters.
• Goal: find the problem-specific parameters $v_j$ and the shared $\theta$ that minimize the joint loss over the training sets $D_j$:
$\sum_{j=1}^{m} \big[\, L(\theta, v_j, D_j) + \mathrm{reg}(v_j) \,\big] + \mathrm{reg}(\theta)$
Algorithm Step 1:
Train classifiers for the auxiliary tasks. For each auxiliary problem $j$:
$w_j^* = \arg\min_w \sum_i l(f(w, x_i), y_i) + \frac{C}{2} \|w\|^2$
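A sketch of Step 1 (assuming scikit-learn is available; toy random data and illustrative names): train one regularized linear classifier per auxiliary task and stack the learned coefficient vectors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, m, n = 30, 20, 200                      # features, auxiliary tasks, examples per task

# Toy auxiliary problems: random data with a random linear ground truth per task
aux_tasks = []
for _ in range(m):
    X = rng.normal(size=(n, d))
    y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))
    aux_tasks.append((X, y))

# Step 1: w*_j = argmin_w sum_i l(f(w, x_i), y_i) + (C/2)||w||^2
# (here: L2-regularized logistic loss as the surrogate)
W = np.column_stack([
    LogisticRegression(C=1.0, max_iter=1000).fit(X, y).coef_.ravel()
    for X, y in aux_tasks
])                                         # W is d x m: one column per auxiliary task
print(W.shape)                             # (30, 20)
```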
Algorithm Step 2:
PCA on the classifier coefficients. Form $W = [w_1^*, w_2^*, \ldots, w_m^*]$ and obtain $\theta \in \mathbb{R}^{h \times d}$ by taking the first h eigenvectors of the covariance matrix $W W^T$.
This gives a linear subspace of dimension h: a good low dimensional approximation to the space of coefficients.
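A sketch of Step 2 (illustrative; W stands in for the d × m coefficient matrix from Step 1): the first h eigenvectors of WWᵀ are the top-h left singular vectors of W.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, h = 30, 20, 4
W = rng.normal(size=(d, m))          # stand-in for the stacked auxiliary coefficients

# Step 2: theta = first h eigenvectors of W W^T, i.e. the top-h left singular vectors of W
U, s, Vt = np.linalg.svd(W, full_matrices=False)
theta = U[:, :h].T                   # theta is h x d

# Sanity check: the rows of theta are orthonormal
print(np.allclose(theta @ theta.T, np.eye(h)))   # True
```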
Algorithm Step 3:
Training on the core task. Project the data, $q(x) = \theta x$, and train
$v^* = \arg\min_v \sum_i l(f(v, q(x_i)), y_i) + \frac{C}{2} \|v\|^2$
This is equivalent to training the core task in the original d-dimensional space with the parameter constraint $w_{core}^* = \theta^T v^*$.
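A sketch of Step 3 (assuming scikit-learn; toy data, and a random orthonormal θ standing in for the output of Step 2): project the labeled core-task data, train v, and recover the equivalent d-dimensional weights.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
d, h, n = 30, 4, 100
theta = np.linalg.qr(rng.normal(size=(d, h)))[0].T   # stand-in h x d projection from Step 2

# Small labeled dataset for the core task
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))

# Step 3: project the data, q(x) = theta x, and train the core classifier on the projection
Q = X @ theta.T                                      # n x h
clf = LogisticRegression(C=1.0, max_iter=1000).fit(Q, y)
v = clf.coef_.ravel()                                # h problem-specific parameters

# Equivalent d-dimensional parameters, constrained to the learned subspace: w = theta^T v
w_core = theta.T @ v
print(np.allclose(Q @ v, X @ w_core))                # True: identical decision scores
```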
Example
Object = { letter, letter, letter }
• An object: abC
• The same object seen in a different font: Abc
• The same object seen in a different font: ABc
• The same object seen in a different font: abC
Example
• Words are objects made of 3 letters; there are 6 letters (topics) and 5 fonts per letter (symbols), so 30 symbols → 30 features.
• 20 words, e.g. the objects “ABC”, “ADE”, “BCF”, “ABD”.
• Auxiliary task: recognize an object (word).
• Example encoding of “acE”: a 30-dimensional indicator vector with a 1 in the slot of the observed font of each letter, e.g.
  A: 1 0 0 0 0 | B: 0 0 0 0 0 | C: 0 0 1 0 0 | ... | E: 0 0 0 0 1
PCA on the data cannot recover the latent structure.
[Figure: 30 × 30 covariance matrix of the data.]
PCA on the coefficients can recover the latent structure.
[Figure: the 30 × 20 weight matrix W of the auxiliary tasks. Rows: features, i.e. fonts; columns: the parameters learned for each auxiliary task, e.g. the parameters for the object “BCD”. The visible groupings of rows correspond to topics, i.e. letters.]
PCA on the coefficients can recover the latent structure.
[Figure: 30 × 30 covariance matrix of W; both axes index the features, i.e. fonts.]
Each block of correlated variables corresponds to a latent topic.
Outline
• Motivation: Low dimensional representations.
• Principal Component Analysis.
• Structural Learning.
• Vision Applications.
• NLP Applications.
• Joint Sparsity.
• Vision Applications.
News domain
Topics: figure skating, ice hockey, golden globes, grammys.
Dataset: news images from the Reuters website.
Problem: predicting news topics from images.
Learning visual representations using images with captions
The Italian team celebrate their gold medal
win during the flower ceremony after the final
round of the men's team pursuit speedskating
at Oval Lingotto during the 2006 Winter
Olympics.
Former U.S. President Bill Clinton speaks
during a joint news conference with
Pakistan's Prime Minister Shaukat Aziz at
Prime Minister house in Islamabad.
Diana and Marshall Reed leave the
funeral of miner David Lewis in Philippi,
West Virginia on January 8, 2006. Lewis
was one of 12 miners who died in the
Sago Mine.
Auxiliary task: predict “team” from image content
Jim Scherr, the US Olympic Committee's
chief executive officer seen here in 2004,
said his group is watching the growing
scandal and keeping informed about the
NHL's investigation into Rick Tocchet,
U.S. director Stephen Gaghan and
his girlfriend Daniela Unruh arrive on
the red carpet for the screening of his
film 'Syriana' which runs out of
competition at the 56th Berlinale
International Film Festival.
Senior Hamas leader Khaled
Meshaal (2nd-R), is surrounded
by his bodyguards after a news
conference in Cairo February 8,
2006.
Learning visual topics
The word ‘games’ might contain the visual topics: people, medals.
The word ‘demonstrations’ might contain the visual topics: people, pavement.
Auxiliary tasks share visual topics: different words can share topics, and each topic can be observed under different appearances.
Experiments
Results
Outline
• Motivation: Low dimensional representations.
• Principal Component Analysis.
• Structural Learning.
• Vision Applications.
• NLP Applications.
• Joint Sparsity.
• Vision Applications.
Chunking
• Named entity chunking:
  Jane lives in New York and works for Bank of New York.
  (entities: PER, LOC, ORG)
• Syntactic chunking:
  But economists in Europe failed to predict that …
  (chunks: NP, PP, NP, VP, SBAR)
Data points: word occurrences.
Labels: Begin-PER, Inside-PER, Begin-LOC, …, Outside.
Example input vector representation
… lives in New York …
Input vector X:
curr-“New” = 1, curr-“in” = 1, left-“in” = 1, left-“lives” = 1, right-“New” = 1, right-“York” = 1
• High-dimensional vectors.
• Most entries are 0.
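A small sketch (illustrative, not the authors' code) of this kind of sparse indicator representation for a word occurrence:

```python
def window_features(tokens, i):
    """Sparse indicator features for the token at position i: current, left and right words."""
    feats = {f'curr-"{tokens[i]}"': 1}
    if i > 0:
        feats[f'left-"{tokens[i - 1]}"'] = 1
    if i < len(tokens) - 1:
        feats[f'right-"{tokens[i + 1]}"'] = 1
    return feats

tokens = "Jane lives in New York".split()
print(window_features(tokens, 3))
# {'curr-"New"': 1, 'left-"in"': 1, 'right-"York"': 1}
```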
Algorithmic Procedure
1. Create m auxiliary problems.
2. Assign auxiliary labels to the unlabeled data.
3. Compute θ (the shared structure) by joint empirical risk minimization over all the auxiliary problems.
4. Fix θ, and minimize the empirical risk on the labeled data for the target task.
Predictor (the θx term provides additional features): $f(\theta, x) = w^T x + v^T \theta x$
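A sketch of step 4's predictor (illustrative names, assuming scikit-learn and a θ already computed from the auxiliary problems): concatenating the original features x with the additional features θx and training a single linear classifier is one way to realize f(θ, x) = wᵀx + vᵀθx.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
d, h, n = 50, 10, 300
theta = rng.normal(size=(h, d))            # stand-in shared structure from the auxiliary problems
X = rng.normal(size=(n, d))                # labeled data for the target task
y = np.sign(X @ rng.normal(size=d))

# f(theta, x) = w^T x + v^T (theta x): train on the augmented features [x ; theta x]
X_aug = np.hstack([X, X @ theta.T])        # n x (d + h)
clf = LogisticRegression(max_iter=1000).fit(X_aug, y)
w, v = clf.coef_.ravel()[:d], clf.coef_.ravel()[d:]
print(w.shape, v.shape)                    # (50,) (10,)
```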
Example auxiliary problems
Split the features into two views: $\Phi_1$ = the current word, $\Phi_2$ = the left and right words.
Predict $\Phi_1$ from $\Phi_2$, compute the shared $\theta$, and add $\theta \Phi_2$ as new features.
Example auxiliary problems:
• Is the current word “New”?
• Is the current word “day”?
• Is the current word “IBM”?
• Is the current word “computer”?
• …
Experiments (CoNLL-03 named entity)
• 4 classes: LOC, ORG, PER, MISC.
• Labeled data: news documents; 204K words (English), 206K words (German).
• Unlabeled data: 27M words (English), 35M words (German).
• Features: a slight modification of ZJ03. Words, POS, character types, 4 characters at the beginning/ending in a 5-word window; words in a 3-chunk window; labels assigned to the two words on the left, bi-gram of the current word and left label; labels assigned to previous occurrences of the current word.
• No gazetteer. No hand-crafted resources.
Auxiliary problems
# of aux. problems | Auxiliary labels | Features used for learning the auxiliary problems
1000               | Previous words   | All but previous words
1000               | Current words    | All but current words
1000               | Next words       | All but next words
3000 auxiliary problems in total.
Syntactic chunking results (CoNLL-00)
method          | description                 | F-measure
supervised      | baseline                    | 93.60
ASO-semi        | +Unlabeled data             | 94.39
Co/self oracle  | +Unlabeled data             | 93.66
KM01            | SVM combination             | 93.91
CM03            | Perceptron in two layers    | 93.74
ZDJ02           | Reg. Winnow                 | 93.57
ZDJ02+          | +full parser (ESG) output   | 94.17
Exceeds the previous best systems (+0.79% over the supervised baseline).
Other experiments
Confirmed effectiveness on:
• POS tagging
• Text categorization (2 standard corpora)
Outline
• Motivation: Low dimensional representations.
• Principal Component Analysis.
• Structural Learning.
• Vision Applications.
• NLP Applications.
• Joint Sparsity.
• Vision Applications.
Notation
Collection of tasks: $D = \{D_1, D_2, \ldots, D_m\}$
$D_k = \{(x_1^k, y_1^k), \ldots, (x_{n_k}^k, y_{n_k}^k)\}$ with $x \in \mathbb{R}^d$ and $y \in \{-1, +1\}$
Joint sparse approximation: collect the task parameters into a $d \times m$ matrix
$W = \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,m} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ w_{d,1} & w_{d,2} & \cdots & w_{d,m} \end{pmatrix}$
where column k holds the parameters of task k.
Single Task Sparse Approximation
• Consider learning a single sparse linear classifier of the form: $f(x) = w \cdot x$
• We want a few features with non-zero coefficients.
• Recent work suggests using L1 regularization:
$\arg\min_w \sum_{(x, y) \in D} l(f(x), y) + Q \sum_{j=1}^{d} |w_j|$
The first term is the classification error; the L1 term penalizes non-sparse solutions.
• Donoho [2004] proved (in a regression setting) that the solution with the smallest L1 norm is also the sparsest solution.
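A quick sketch (assuming scikit-learn; toy data and illustrative names) of an L1-regularized linear classifier and the sparsity it induces:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 200, 100
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = 2.0                       # only 5 relevant features
y = np.sign(X @ w_true + 0.1 * rng.normal(size=n))

# L1-penalized logistic regression (the liblinear solver supports the L1 penalty)
clf = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)
print("non-zero coefficients:", np.sum(clf.coef_ != 0), "out of", d)
```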
Joint Sparse Approximation
• Setting: one linear classifier per task, $f_k(x) = w_k \cdot x$
$\arg\min_{w_1, w_2, \ldots, w_m} \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{(x, y) \in D_k} l(f_k(x), y) + Q\, R(w_1, w_2, \ldots, w_m)$
The first term is the average loss on each training set $D_k$; the regularizer R penalizes solutions that utilize too many features.
Joint Regularization Penalty
• How do we penalize solutions that use too many features?
$W = \begin{pmatrix} W_{1,1} & W_{1,2} & \cdots & W_{1,m} \\ W_{2,1} & W_{2,2} & \cdots & W_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ W_{d,1} & W_{d,2} & \cdots & W_{d,m} \end{pmatrix}$
Row i holds the coefficients for feature i across tasks; column k holds the coefficients of classifier k.
$R(W) = \#\{\text{non-zero rows of } W\}$
• This would lead to a hard combinatorial problem.
Joint Regularization Penalty
• We will use an L1-∞ norm [Tropp 2006]:
$R(W) = \sum_{i=1}^{d} \max_k |W_{ik}|$
• This norm combines:
  • The L∞ norm on each row promotes non-sparsity within the row: tasks share features.
  • An L1 norm over the maximum absolute values across tasks promotes sparsity: few features are used.
• The combination of the two norms results in a solution where only a few features are used, but the features that are used contribute to solving many classification problems.
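In numpy terms (illustrative toy matrix), the combinatorial row-support penalty and its L1-∞ relaxation are one line each:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
W[3:, :] = 0.0                                   # features 3..5 are unused by every task

row_support = np.sum(np.any(W != 0, axis=1))     # R(W) = number of non-zero rows (combinatorial)
l1_inf = np.sum(np.abs(W).max(axis=1))           # R(W) = sum_i max_k |W_ik|   (convex relaxation)
print(row_support, l1_inf)                       # 3, and the L1-inf value
```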
Joint Sparse Approximation
• Using the L1-∞ norm we can rewrite our objective function as:
$\min_W \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{(x, y) \in D_k} l(f_k(x), y) + Q \sum_{i=1}^{d} \max_k |W_{ik}|$
• For any convex loss this is a convex objective.
• For the hinge loss, $l(f(x), y) = \max(0, 1 - y f(x))$, the optimization problem can be expressed as a linear program.
Joint Sparse Approximation
• Linear program formulation (hinge loss):
$\min_{W, \varepsilon, t} \; \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{j=1}^{|D_k|} \varepsilon_j^k + Q \sum_{i=1}^{d} t_i$
• Max-value constraints: for $k = 1:m$ and $i = 1:d$:
$-t_i \le w_{ik} \le t_i$
• Slack variable constraints: for $k = 1:m$ and $j = 1:|D_k|$:
$y_j^k f_k(x_j^k) \ge 1 - \varepsilon_j^k, \qquad \varepsilon_j^k \ge 0$
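A minimal sketch of this LP (assuming the cvxpy modeling library is available; toy data and names are illustrative, not the authors' implementation):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
d, m, n, Q = 20, 3, 50, 0.5
Xs = [rng.normal(size=(n, d)) for _ in range(m)]                 # one dataset per task
ys = [np.sign(X @ rng.normal(size=d)) for X in Xs]

W = cp.Variable((d, m))                                          # one weight column per task
t = cp.Variable(d, nonneg=True)                                  # t_i >= max_k |W_ik|
eps = [cp.Variable(n, nonneg=True) for _ in range(m)]            # hinge-loss slacks

constraints = []
for k in range(m):
    margins = cp.multiply(ys[k], Xs[k] @ W[:, k])
    constraints += [margins >= 1 - eps[k]]                       # slack variable constraints
    constraints += [W[:, k] <= t, W[:, k] >= -t]                 # max-value constraints

objective = cp.Minimize(sum(cp.sum(e) / n for e in eps) + Q * cp.sum(t))
cp.Problem(objective, constraints).solve()
print("rows used:", int(np.sum(np.abs(W.value).max(axis=1) > 1e-6)), "of", d)
```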
An efficient training algorithm
• The LP formulation can be optimized using standard LP solvers.
• The LP formulation is feasible for small problems but becomes intractable for larger datasets with thousands of examples and dimensions.
• We might want a more general optimization algorithm that can handle arbitrary convex losses.
• We developed a simple and efficient global optimization algorithm for training joint models with L1-∞ constraints.
• The total cost is of the order of $O(dm \log(dm))$.
Outline
• Motivation: Low dimensional representations.
• Principal Component Analysis.
• Structural Learning.
• Vision Applications.
• NLP Applications.
• Joint Sparsity.
• Vision Applications.
10 news topics: SuperBowl, Sharon, Danish Cartoons, Academy Awards, Australian Open, Grammys, Trapped Miners, Figure Skating, Golden Globes, Iraq.
• Learn a representation using labeled data from 9 of the 10 topics: learn the matrix W with our transfer algorithm.
• Define the set of relevant features to be: $R = \{r : \max_k(|w_{rk}|) > 0\}$
• Train a classifier for the 10th, held-out topic using the relevant features R only.
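A small sketch of the last two steps (illustrative; W stands in for the joint-sparse matrix learned on the 9 topics, and scikit-learn is assumed for the held-out classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 100
W = rng.normal(size=(d, 9)) * (rng.random((d, 1)) < 0.2)   # stand-in row-sparse W from 9 topics

# Relevant features: rows of W with at least one non-zero coefficient
R = np.where(np.abs(W).max(axis=1) > 0)[0]
print("relevant features:", len(R), "of", d)

# Train the held-out topic's classifier on the relevant features only
X_hold = rng.normal(size=(80, d))
y_hold = np.sign(X_hold @ rng.normal(size=d))              # toy labels for the held-out topic
clf = LogisticRegression(max_iter=1000).fit(X_hold[:, R], y_hold)
print(clf.coef_.shape)                                     # (1, len(R))
```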
Results
[Figure: Asymmetric transfer. Average AUC (y-axis, roughly 0.52 to 0.72) versus the number of training samples (x-axis, 4 to 140), comparing the baseline representation against the transferred representation.]
Future Directions
• Joint sparsity regularization to control inference time.
• Learning representations for ranking problems.