Recitation 2 Slides

Recitation 2 for BigData
Hashing
Jay Gu
Jan 24 2013
Homework1 Patch
• Unknown user’s gender should be 0
– fix DataInstance.java, HashedDataInstance.java
• New lambda range for the “Regularization” part
Outline
• Hash function
• Hash kernel
• Multitask learning
Hash Function
• Collision is bad
– Want: hash values spread uniformly over the bins
– But we do not know the input distribution...
Universal Hashing Family
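A family $H$ of hash functions into $n$ bins such that, for $h$ picked uniformly from $H$ and any given pair $x_1 \neq x_2$:
$\Pr_{h \in H}[h(x_1) = h(x_2)] \le 1/n$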
Simple construction: ax+b
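• Pick a prime number $p$ larger than every key
• Pick $a \in \{1, \dots, p-1\}$ and $b \in \{0, \dots, p-1\}$ uniformly at random
• $h_{a,b}(x) = ((ax + b) \bmod p) \bmod n$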
• Proof Sketch:
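- How many hash functions are in $H$? $p(p-1)$: one for each choice of $a \in \{1, \dots, p-1\}$ and $b \in \{0, \dots, p-1\}$
- Let $u = (a x_1 + b) \bmod p$ and $v = (a x_2 + b) \bmod p$. Then $x_1 \neq x_2 \Rightarrow u \neq v$
- How many $(u,v)$ pairs cause a collision, i.e. $u \neq v$ but $u \bmod n = v \bmod n$? At most $p(\lceil p/n \rceil - 1) \le p(p-1)/n$
- $(u,v)$ is 1-1 mapped to $(a,b)$, which is 1-1 mapped to $h$, so the collision probability is at most $\frac{p(p-1)/n}{p(p-1)} = 1/n$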
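The ax+b construction above can be sketched in a few lines of Java. This is a minimal illustration, not code from the homework: the class name `UniversalHash`, its `random` factory, and the restriction to keys in [0, p) with p at most `Integer.MAX_VALUE` are all assumptions made for the example, and the caller is trusted to pass a prime.

```java
import java.util.Random;

// Illustrative sketch of the family h(x) = ((a*x + b) mod p) mod n.
// All names here are hypothetical; they do not come from the homework code.
final class UniversalHash {
    private final long p; // a prime larger than every key (primality is NOT checked)
    private final long a; // in {1, ..., p-1}
    private final long b; // in {0, ..., p-1}
    private final int n;  // number of bins

    UniversalHash(long p, long a, long b, int n) {
        // p is capped at Integer.MAX_VALUE so that a*x + b cannot overflow a long.
        if (p < 2 || p > Integer.MAX_VALUE) throw new IllegalArgumentException("need 2 <= p <= 2^31 - 1");
        if (a < 1 || a >= p) throw new IllegalArgumentException("need 1 <= a < p");
        if (b < 0 || b >= p) throw new IllegalArgumentException("need 0 <= b < p");
        if (n < 1) throw new IllegalArgumentException("need n >= 1");
        this.p = p;
        this.a = a;
        this.b = b;
        this.n = n;
    }

    // Drawing (a, b) uniformly picks h uniformly among the p(p-1) functions.
    static UniversalHash random(long p, int n, Random rng) {
        if (p < 2 || p > Integer.MAX_VALUE) throw new IllegalArgumentException("need 2 <= p <= 2^31 - 1");
        long a = 1 + rng.nextInt((int) (p - 1)); // uniform on {1, ..., p-1}
        long b = rng.nextInt((int) p);           // uniform on {0, ..., p-1}
        return new UniversalHash(p, a, b, n);
    }

    // Maps a key in [0, p) to a bin in [0, n).
    int hash(long x) {
        if (x < 0 || x >= p) throw new IllegalArgumentException("key must lie in [0, p)");
        return (int) (((a * x + b) % p) % n);
    }
}
```

For example, `UniversalHash.random(2147483647L, 1024, new Random())` draws one function with p = 2^31 - 1, which is prime. Taking a and b as explicit constructor arguments also makes it possible to enumerate the whole family for a small p and count collisions directly.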
How to hash string?
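• Java’s built-in hashCode(): $s[0] \cdot 31^{k-1} + s[1] \cdot 31^{k-2} + \dots + s[k-1]$ in 32-bit int arithmetic, where $k$ is the string length
• MD5 checksum: 128 bits = 16 bytes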
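Both string-hashing options can be tried with the standard library alone. The sketch below is illustrative: the class `StringHashing` and its helper names are hypothetical, and using the first four digest bytes as the bucket source is just one possible choice, not something the slides prescribe.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helpers for turning a string into a bucket index in [0, m).
final class StringHashing {
    // Re-implements the polynomial documented for String.hashCode().
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // int overflow wraps around, as in the JDK
        }
        return h;
    }

    // hashCode() may be negative, so floorMod is used instead of %.
    static int bucketFromHashCode(String s, int m) {
        return Math.floorMod(s.hashCode(), m);
    }

    // MD5 digest of the UTF-8 bytes: always 16 bytes.
    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Assumption: take the first 4 digest bytes (big-endian) as an int, then reduce mod m.
    static int bucketFromMd5(String s, int m) {
        byte[] d = md5(s);
        int v = ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16) | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
        return Math.floorMod(v, m);
    }
}
```

The polynomial hash is cheap but easy to collide on purpose; MD5 is slower and mixes the bits far more thoroughly, which is the usual reason to consider it for feature names.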
Hash Kernel
[Figure: two example word-indicator vectors over the feature names below are passed through the hash. Each feature name carries a sign: Mary (1), Little (-1), Lamb (-1), Obama (1), Care (1), Husky (-1), UW (-1), Big (1), Data (-1), Rock (-1).]
High dimension (infinite) → low dimension (finite)
X: space of feature names, not values!
• Directly learn W in the low-dimensional hashed space
• Implement using only h: instead of hashing into m bins, hash into 2m bins and take the first bit as the sign.
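The 2m-bin trick can be sketched as follows. Everything named here is an assumption for illustration: the class `HashKernel`, the use of `String.hashCode()` as h, and the reading of "first bit" as the lowest-order bit of the 2m-bin index (the slides do not say which bit is meant).

```java
import java.util.Map;

// Hypothetical sketch of a signed hash kernel driven by a single hash function.
final class HashKernel {
    // Index in [0, 2m) obtained from the feature NAME (not its value).
    static int slot(String name, int m) {
        return Math.floorMod(name.hashCode(), 2 * m);
    }

    // Assumption: the lowest bit of the slot is the sign bit...
    static int sign(String name, int m) {
        return (slot(name, m) & 1) == 0 ? 1 : -1;
    }

    // ...and the remaining bits select one of the m bins.
    static int bin(String name, int m) {
        return slot(name, m) >>> 1;
    }

    // phi[i] = sum of sign(name) * value over all features whose name falls in bin i.
    static double[] hashFeatures(Map<String, Double> features, int m) {
        double[] phi = new double[m];
        for (Map.Entry<String, Double> e : features.entrySet()) {
            String name = e.getKey();
            phi[bin(name, m)] += sign(name, m) * e.getValue();
        }
        return phi;
    }

    // The weight vector w lives in the same m-dimensional hashed space.
    static double score(double[] w, double[] phi) {
        double s = 0.0;
        for (int i = 0; i < phi.length; i++) {
            s += w[i] * phi[i];
        }
        return s;
    }
}
```

Because the weights are indexed by bin, the model size stays at m no matter how many distinct feature names appear, and no dictionary from names to indices has to be stored.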
Multitask Learning
• Global part: hash the feature name
• Personalized part: hash the feature name along with the user id
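One way to combine the global and personalized parts is to add both hashed copies of every feature into the same vector. The sketch below is an assumption-laden illustration: the class `MultitaskHashing`, the `userId + "_" + name` key format, the use of `String.hashCode()`, and the lowest-bit sign convention are all choices made for this example.

```java
import java.util.Map;

// Hypothetical sketch: each feature is hashed twice into one shared m-dimensional vector.
final class MultitaskHashing {
    // Adds value to the signed bin chosen by key (2m slots, lowest bit taken as sign).
    static void add(double[] phi, String key, double value) {
        int slot = Math.floorMod(key.hashCode(), 2 * phi.length);
        int sign = (slot & 1) == 0 ? 1 : -1;
        phi[slot >>> 1] += sign * value;
    }

    static double[] hashFeatures(Map<String, Double> features, String userId, int m) {
        double[] phi = new double[m];
        for (Map.Entry<String, Double> e : features.entrySet()) {
            // Global part: keyed by the feature name only, shared by all users.
            add(phi, e.getKey(), e.getValue());
            // Personalized part: keyed by user id plus feature name (separator is an assumption).
            add(phi, userId + "_" + e.getKey(), e.getValue());
        }
        return phi;
    }
}
```

With this layout a single weight vector serves every user: the bins hit by plain feature names capture behavior common to everyone, while the bins hit by user-specific keys capture per-user deviations.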