Canopy Cluster - University of Houston

COSC6376 Cloud Computing
Lecture 5. MapReduce and HDFS
Instructor: Weidong Shi (Larry), PhD
Computer Science Department
University of Houston
Outline
• Homework
• HDFS
• Next week
 Tutorial on Amazon Cloud Services
Netflix movies
• Goal - Cluster the Netflix movies using K-means
clustering. We're given a set of movies, as well
as a list mapping ratings from individual users to
movie titles. We want to output four hundred or
so sets of related movies.
• Input - Data is one entry per line of the form
"movieId, userId, rating, dateRated"
Netflix Prize
• Netflix provided a training data set of 100,480,507
ratings that 480,189 users gave to 17,770 movies
• Netflix internal movie rating predictor: Cinematch
 used for recommending movies
• $1,000,000 award to those who could improve the
prediction by 10% (in terms of root mean squared
error)
• Winner: BellKor's Pragmatic Chaos
 Another team: Ensemble
 Results equally good but submitted 20 minutes later
Competition cancelled
• Researchers demonstrated that individuals can
be identified by matching the Netflix data sets
with film ratings online.
• Netflix users filed a class action lawsuit against
Netflix for privacy violation
 Video Privacy Protection Act
Movie dataset
• The data is in the format
UserID::MovieID::Rating::Timestamp
• 1::1193::5::978300760
• 2::1194::4::978300762
• 7::1123::1::978300760
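As a concrete illustration, here is a minimal parsing sketch in Python; the file name ratings.dat is a placeholder, and the field layout follows the record format above.
# Minimal sketch: parse "UserID::MovieID::Rating::Timestamp" records.
def parse_rating(line):
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), float(rating), int(timestamp)

with open("ratings.dat") as f:          # hypothetical input file
    for line in f:
        user_id, movie_id, rating, ts = parse_rating(line)
        # e.g. 1::1193::5::978300760 -> (1, 1193, 5.0, 978300760)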
Kmeans clustering
• Clustering problem description:
iterate {
  Compute the distance from all points to all k centers
  Assign each point to the nearest k-center
  Compute the average of all points assigned to
  each k-center
  Replace the k-centers with the new averages
}
• Good survey:
 A. K. Jain et al., Data Clustering: A Review, ACM
Computing Surveys, 1999
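A minimal single-machine sketch of this loop in Python (squared Euclidean distance; the points, k, and iteration count are placeholders):
import random

def kmeans(points, k, iterations=20):
    # points: list of equal-length numeric tuples
    centers = random.sample(points, k)           # pick k initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assign each point to the nearest center
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        for i, members in enumerate(clusters):    # replace centers with member averages
            if members:
                centers[i] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers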
Kmeans illustration
• Randomly select k centroids
• Assign a cluster label to each point according to
its distance to the centroids
Kmeans illustration
• Recalculate the centroids
• Recluster the points
• Repeat until the cluster labels do not change, or
the centroids move only very slightly
Summary of kmeans
• Determine the value of k
• Determine the initial k centroids
• Repeat until convergence
 - Determine membership: assign each point to
the closest centroid
 - Update centroid positions: compute the average
of the assigned members
The setting
• The dataset is stored in HDFS
• We use a MapReduce implementation of k-means to
compute the clustering result
 Implement each iteration as one MapReduce job
• Pass the k centroids to the Mappers
• Map: assign a label to each record according to its
distances to the k centroids; emit <cluster id,
record>
• Reduce: calculate the mean of each cluster, and
replace the centroid with the new mean
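A sketch of one such iteration as a map/reduce pair in Python (streaming style; the centroid list, record format, and distance function are illustrative assumptions):
# CENTROIDS would be shipped to every mapper, e.g. via the job
# configuration or a side file; here it is a hypothetical in-memory list.
CENTROIDS = [(0.0, 0.0), (5.0, 5.0)]

def kmeans_map(record):
    # Emit (nearest-centroid-id, record) for one input vector.
    dists = [sum((a - b) ** 2 for a, b in zip(record, c)) for c in CENTROIDS]
    yield dists.index(min(dists)), record

def kmeans_reduce(cluster_id, records):
    # Average all records assigned to one cluster: this is the new centroid.
    records = list(records)
    new_centroid = tuple(sum(x) / len(records) for x in zip(*records))
    yield cluster_id, new_centroid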
Complexity
• The complexity is high:
 k * n * O(distance metric) * num(iterations)
• Moreover, it can be necessary to send a large
amount of data to each mapper node.
• Depending on the available bandwidth and memory,
this may be impractical.
Furthermore
• There are three big ways a data set can be
large:
 There are a large number of elements in the set.
 Each element can have many features.
 There can be many clusters to discover
• Conclusion – Clustering can be huge, even when
you distribute it.
Canopy clustering
• A preliminary step that helps parallelize the
computation.
• Clusters the data into overlapping canopies using a
very cheap distance metric.
• Efficient
• Accurate
Canopy clustering
While there are unmarked points {
  pick a point which is not strongly marked
  call it a canopy center
  mark all points within some loose threshold of
  it as being in its canopy
  strongly mark all points within some tighter
  threshold
}
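A minimal single-machine sketch of this loop in Python; cheap_distance, t_loose, and t_tight are placeholders for the cheap metric and the two thresholds described above:
def canopy_clustering(points, cheap_distance, t_loose, t_tight):
    # Return a list of (center, canopy_member_indices) pairs; canopies may overlap.
    assert t_tight < t_loose
    canopies = []
    strongly_marked = set()                 # points removed from consideration
    candidates = list(range(len(points)))
    while candidates:
        center = points[candidates[0]]      # pick a not-strongly-marked point
        members = []
        for i, p in enumerate(points):
            d = cheap_distance(center, p)
            if d < t_loose:                 # loose threshold: joins this canopy
                members.append(i)
            if d < t_tight:                 # tight threshold: strongly marked
                strongly_marked.add(i)
        canopies.append((center, members))
        candidates = [i for i in candidates if i not in strongly_marked]
    return canopies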
After the canopy clustering…
• Run K-means clustering as usual.
• Treat objects in separate canopies as being at
infinite distance from each other.
MapReduce implementation:
• Problem – Efficiently partition a large data set
(say… movies with user ratings!) into a fixed
number of clusters using Canopy Clustering, K-Means
Clustering, and a Euclidean distance measure.
• The Distance Metric
 The Canopy Metric ($)
 The K-Means Metric ($$$)
Steps
• Get Data into a form you can use (MR)
• Picking Canopy Centers (MR)
• Assign Data Points to Canopies (MR)
• Pick K-Means Cluster Centers
• K-Means algorithm (MR)
• Iterate!
Canopy distance function
• Canopy selection requires a simple, cheap distance
function
• Here: the number of rater IDs two movies have in
common
• Close and far distance thresholds
 Close distance threshold: 8 rater IDs in common
 Far distance threshold: 2 rater IDs in common
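A sketch of this cheap metric in Python; representing each movie as a set of rater IDs is an assumption about the massaged data layout:
# Cheap canopy "distance": overlap in rater IDs between two movies.
# Larger overlap means the movies are closer, so the thresholds are
# expressed as minimum numbers of shared raters (8 = close, 2 = far).
CLOSE_THRESHOLD = 8
FAR_THRESHOLD = 2

def shared_raters(movie_a_raters, movie_b_raters):
    # movie_*_raters: sets of user IDs that rated each movie
    return len(movie_a_raters & movie_b_raters)

def is_close(a, b):
    return shared_raters(a, b) >= CLOSE_THRESHOLD

def is_within_canopy(a, b):
    return shared_raters(a, b) >= FAR_THRESHOLD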
K-means distance metric
• The set of ratings for a movie given by a set of
users can be thought of as a vector
A = [user1_score, user2_score, ..., userN_score]
• To evaluate the distance between two movies, A
and B, use the similarity metric below:
Similarity(A, B) = sum(A_i * B_i) /
(sqrt(sum(A_i^2)) * sqrt(sum(B_i^2)))
where the sums run over all i, 0 <= i < n
• Vector(A) = 1111000
• Vector(B) = 0100111
• Vector(C) = 1110010
• Similarity(A, B) = Vector(A) · Vector(B) / (||A|| * ||B||)
• Vector(A) · Vector(B) = 1
• ||A|| * ||B|| = 2 * 2 = 4
• 1/4 = 0.25
• Similarity(A, B) = 0.25
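The same computation as a small Python sketch; the binary vectors come straight from the example above:
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

A = [1, 1, 1, 1, 0, 0, 0]
B = [0, 1, 0, 0, 1, 1, 1]
print(cosine_similarity(A, B))   # 1 / (2 * 2) = 0.25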
Data Massaging
• Convert the data into the required format.
• In this case the data is converted into
<MovieId, List of Users> pairs
• <MovieId, List<userId, rating>>
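A streaming-style sketch of this conversion step in Python, grouping the parsed records by movie; the map/reduce signatures are illustrative, not a specific framework API:
# Massage "UserID::MovieID::Rating::Timestamp" records into
# <MovieId, List<(userId, rating)>> pairs.
def massage_map(line):
    user_id, movie_id, rating, _ = line.strip().split("::")
    yield movie_id, (user_id, float(rating))

def massage_reduce(movie_id, user_ratings):
    # All (userId, rating) pairs for one movie arrive at the same reducer.
    yield movie_id, list(user_ratings)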
Canopy Cluster – Mappers and Reducer
• Figure: Mapper A picks red canopy centers and
Mapper B picks green canopy centers, each using the
threshold value; the Reducer combines their outputs.
• Redundant centers within the threshold of each
other: add a small error => Threshold + ξ
• So far we have found only the canopy centers.
• Run another MR job to find the points that belong
to each canopy center.
• The canopy clusters are ready when the job is
completed.
• What would it look like?
Canopy Cluster – Before the MR job
• Figure: a sparse movie/user matrix.
Canopy Cluster – After the MR job
• Figure: cells with value 1 are grouped together and
users are moved from their original locations.
K-Means Clustering
• The output of canopy clustering becomes the input
of K-means clustering.
• Apply the cosine similarity metric to find similar
users.
• To compute cosine similarity, create a vector in
the format <UserId, List<Movies>>
• <UserId, {m1,m2,m3,m4,m5}>
User A: Toy Story, Avatar, Jumanji, Heat
User B: Avatar, GoldenEye, Money Train, Mortal Kombat
User C: Toy Story, Jumanji, Money Train, Avatar

         Toy Story  Avatar  Jumanji  Heat  GoldenEye  Money Train  Mortal Kombat
User A       1        1        1      1       0           0             0
User B       0        1        0      0       1           1             1
User C       1        1        1      0       0           1             0
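A sketch of building these binary user vectors and comparing two users with the cosine metric; the movie order follows the table above:
import math

MOVIES = ["Toy Story", "Avatar", "Jumanji", "Heat",
          "GoldenEye", "Money Train", "Mortal Kombat"]

def user_vector(watched):
    # Binary vector over MOVIES: 1 if the user rated the movie, else 0.
    return [1 if m in watched else 0 for m in MOVIES]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

user_a = user_vector({"Toy Story", "Avatar", "Jumanji", "Heat"})
user_b = user_vector({"Avatar", "GoldenEye", "Money Train", "Mortal Kombat"})
print(cosine(user_a, user_b))   # 1 / (2 * 2) = 0.25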
• Find the k-nearest neighbors from the same canopy
cluster.
• Do not take points from another canopy cluster if
you want a small number of neighbors
• # of K-means clusters > # of canopy clusters
• After a couple of map-reduce jobs, the K-means
clusters are ready
All points – before clustering
Canopy clustering
Canopy clustering and K-means clustering
Amazon Elastic MapReduce
Elastic MapReduce
• Based on a Hadoop AMI
• Data stored on S3
• Work is submitted as a "job flow"
Example
elastic-mapreduce --create --stream \
  --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
  --input s3://elasticmapreduce/samples/wordcount/input \
  --output s3://my-bucket/output \
  --reducer aggregate
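For reference, a minimal streaming mapper in the spirit of the wordSplitter.py sample above (this is a sketch, not the actual sample script; it assumes Hadoop streaming's aggregate reducer, which sums values for keys prefixed with LongValueSum):
#!/usr/bin/env python
# Sketch of a streaming word-count mapper compatible with "--reducer aggregate".
import re
import sys

WORD_RE = re.compile(r"\w+")

for line in sys.stdin:
    for word in WORD_RE.findall(line):
        # "LongValueSum:" asks the aggregate reducer to sum the counts per word.
        print("LongValueSum:%s\t1" % word.lower())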
HDFS
Global Picture
Goals
• Understand the underlying distributed file
system for large data processing
 HDFS is an open-source implementation of the
Google File System (GFS)
• Understand tradeoffs in system design
Assumptions
• Inexpensive components that often fail
• Large files
• Large streaming reads and small random reads
• Large sequential writes
• Multiple users append to the same file
• High bandwidth is more important than low latency
Failure Trends in a Large Disk Drive Population
• The data are broken down by the age a drive was when it failed.
Architecture
• Chunks
 File → chunks → location of chunks (replicas)
• Master server
 Single master
 Keeps metadata
 Accepts requests on metadata
 Most management activities
• Chunk servers
 Multiple
 Keep chunks of data
 Accept requests on chunk data
Design decisions
• Single master
 Simplifies design
 Single point of failure
 Limited number of files
• Metadata kept in memory
• Large chunk size: e.g., 64 MB
 Advantages
• Reduces client-master traffic
• Reduces network overhead – fewer network interactions
• Smaller chunk index
 Disadvantages
• Does not favor small files
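A quick back-of-envelope illustration in Python of why the large chunk size keeps the chunk index small; the file sizes are made-up examples:
# With 64 MB chunks, even very large files map to a modest number of
# chunk index entries that the single master must keep in memory.
CHUNK_SIZE = 64 * 1024 * 1024             # 64 MB

for file_size_gb in (1, 100, 1024):       # hypothetical file sizes
    file_size = file_size_gb * 1024 ** 3
    num_chunks = -(-file_size // CHUNK_SIZE)   # ceiling division
    print(f"{file_size_gb:>5} GB file -> {num_chunks} chunks")
# 1 GB -> 16 chunks, 100 GB -> 1600 chunks, 1 TB -> 16384 chunks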
Master: metadata
• Metadata is stored in memory
• Namespaces
 Directory → physical location
• Files → chunks → chunk locations
• Chunk locations
 Not stored by the master; reported by chunk servers
• Operation log
Master Operations
• All namespace operations
 Name lookup
 Create/remove directories/files, etc.
• Manage chunk replicas
 Placement decisions
 Create new chunks & replicas
 Balance load across all chunkservers
 Garbage collection
Master: chunk replica placement
• Goals: maximize reliability, availability, and
bandwidth utilization
• Physical location matters
 Lowest cost within the same rack
 "Distance": # of network switches
• In practice (Hadoop)
 If we have 3 replicas
 Two chunks in the same rack
 The third one in another rack
• Choice of chunkservers
 Low average disk utilization
 Limited # of recent writes → distributes write traffic
• Re-replication
 Replicas get lost for many reasons
 Prioritized: low # of replicas, live files, actively
used chunks
 New replicas placed following the same principles
• Rebalancing
 Redistribute replicas periodically
• Better disk utilization
• Load balancing
Master: garbage collection
• Lazy mechanism
 Mark the file as deleted immediately
 Reclaim resources later
• Regular namespace scan
 For deleted files: remove metadata after three days
(full deletion)
 For orphaned chunks: let chunkservers know they are
deleted
• Stale replicas
 Detected using chunk version numbers
Read/Write
File Read (HDFS client, NameNode, DataNodes)
1. The client opens the file (DistributedFileSystem)
2. Get the block locations from the NameNode
3. The client reads from an FSDataInputStream
4. Read from the closest data node
5. Read the next block from the 2nd closest data node
6. Close the stream
File Write (HDFS client, NameNode, DataNodes)
1. The client creates the file (DistributedFileSystem)
2. Create the file entry at the NameNode
3. The client writes to an FSDataOutputStream
4. Get a list of 3 data nodes (the replica pipeline)
5. Write packets along the pipeline of data nodes
6. Ack packets flow back along the pipeline
7. Close the stream
8. Report completion to the NameNode
If a data node crashes, the crashed node is removed, the
current block receives a newer id (so that the partial data
on the crashed node can be deleted later), and the NameNode
allocates another node.
System Interactions
• Mutation
 The master assigns a "lease" to one replica – the
primary
 The primary determines the order of mutations
Consistency
• It is expensive to maintain strict consistency
• GFS uses a relaxed consistency model
 Better support for appending
 Checkpointing
Fault Tolerance
• High availability
 Fast recovery
 Chunk replication
 Master replication: inactive backup
• Data integrity
 Checksumming
 Incremental checksum updates to improve
performance
• A chunk is split into 64 KB blocks
• The checksum is updated after a block is appended
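A simplified sketch of per-64 KB-block checksumming in Python, using CRC32 as a stand-in for the actual checksum function; the point is that appending a new block only requires checksumming that block:
import zlib

BLOCK_SIZE = 64 * 1024                    # 64 KB checksum blocks

class ChunkChecksums:
    # Keep one checksum per 64 KB block of a chunk.
    def __init__(self):
        self.block_checksums = []

    def append_block(self, data: bytes):
        # Incremental update: only the newly appended block is checksummed.
        assert len(data) <= BLOCK_SIZE
        self.block_checksums.append(zlib.crc32(data))

    def verify_block(self, index: int, data: bytes) -> bool:
        return zlib.crc32(data) == self.block_checksums[index]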
Discussion
• Advantages
 Works well for large data processing
 Uses cheap commodity servers
• Tradeoffs
 Single-master design
 Optimized for workloads that mostly read and append
• Latest upgrades (GFS II)
 Distributed masters
 Introduce the "cell" – a number of racks in the same
data center
 Improved performance of random reads/writes