COSC6376 Cloud Computing
Lecture 5: MapReduce and HDFS
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston

Outline
• Homework
• HDFS
• Next week: tutorial on Amazon Cloud Services

Netflix movies
• Goal - Cluster the Netflix movies using K-means clustering. We are given a set of movies, together with a list mapping ratings from individual users to movie titles. We want to output roughly four hundred sets of related movies.
• Input - Data is one entry per line of the form "movieId, userId, rating, dateRated"

Netflix Prize
• Netflix provided a training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies
• Netflix's internal movie rating predictor, Cinematch, is used for recommending movies
• $1,000,000 award to those who could improve the prediction by 10% (in terms of root mean squared error)
• Winner: BellKor's Pragmatic Chaos
  - Another team, The Ensemble, had equally good results but submitted 20 minutes later

Competition cancelled
• Researchers demonstrated that individuals can be identified by matching the Netflix data sets with film ratings posted online
• Netflix users filed a class action lawsuit against Netflix for privacy violations under the Video Privacy Protection Act

Movie dataset
• The data is in the format UserID::MovieID::Rating::Timestamp
  1::1193::5::978300760
  2::1194::4::978300762
  7::1123::1::978300760

K-means clustering
• Clustering problem description:
  iterate {
    Compute distance from all points to all k-centers
    Assign each point to the nearest k-center
    Compute the average of all points assigned to each k-center
    Replace the k-centers with the new averages
  }
• Good survey: A. K. Jain et al., "Data Clustering: A Review," ACM Computing Surveys, 1999

K-means illustration
• Randomly select k centroids
• Assign a cluster label to each point according to its distance to the centroids
• Recalculate the centroids and recluster
• Repeat until the cluster labels do not change, or the changes of the centroids are very small

Summary of K-means
• Determine the value of k
• Determine the initial k centroids
• Repeat until convergence
  - Determine membership: assign each point to the closest centroid
  - Update centroid positions: compute the average of the assigned members

The setting
• The dataset is stored in HDFS
• We use a MapReduce K-means to get the clustering result, implementing each iteration as one MapReduce job (see the sketch at the end of this section)
• Pass the k centroids to the Maps
• Map: assign a label to each record according to its distances to the k centroids, emitting <cluster id, record>
• Reduce: calculate the mean for each cluster, and replace the centroid with the new mean

Complexity
• The complexity is fairly high: k * n * O(distance metric) * num(iterations)
• Moreover, it can be necessary to send large amounts of data to each mapper node
• Depending on the available bandwidth and memory, this can become impractical

Furthermore
• There are three big ways a data set can be large:
  - There are a large number of elements in the set
  - Each element can have many features
  - There can be many clusters to discover
• Conclusion - Clustering can be huge, even when you distribute it
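To make the Map/Reduce split above concrete, here is a minimal sketch of one K-means iteration against the Hadoop Java MapReduce API. The record format (comma-separated feature vectors) and the configuration key used to pass the centroids are illustrative assumptions, not part of the lecture; a driver program (not shown) would rerun the job until the centroids stop moving.

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class KMeansIteration {

    // Map: assign each record to the nearest centroid, emit <cluster id, record>.
    public static class AssignMapper
        extends Mapper<LongWritable, Text, IntWritable, Text> {
      private final List<double[]> centroids = new ArrayList<>();

      @Override
      protected void setup(Context context) {
        // Centroids are passed to every mapper; here they arrive through the
        // (hypothetical) configuration key "kmeans.centroids" as "x1,y1;x2,y2;...".
        for (String c : context.getConfiguration().get("kmeans.centroids").split(";")) {
          centroids.add(parse(c));
        }
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        double[] p = parse(value.toString());
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.size(); i++) {
          double d = squaredDistance(p, centroids.get(i));
          if (d < bestDist) { bestDist = d; best = i; }
        }
        context.write(new IntWritable(best), value);
      }
    }

    // Reduce: average all records assigned to a cluster to get the new centroid.
    public static class RecomputeReducer
        extends Reducer<IntWritable, Text, IntWritable, Text> {
      @Override
      protected void reduce(IntWritable clusterId, Iterable<Text> records, Context context)
          throws IOException, InterruptedException {
        double[] sum = null;
        long count = 0;
        for (Text r : records) {
          double[] p = parse(r.toString());
          if (sum == null) sum = new double[p.length];
          for (int i = 0; i < p.length; i++) sum[i] += p[i];
          count++;
        }
        StringBuilder centroid = new StringBuilder();
        for (int i = 0; i < sum.length; i++) {
          if (i > 0) centroid.append(',');
          centroid.append(sum[i] / count);
        }
        context.write(clusterId, new Text(centroid.toString()));
      }
    }

    static double[] parse(String line) {
      String[] parts = line.split(",");
      double[] v = new double[parts.length];
      for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i].trim());
      return v;
    }

    static double squaredDistance(double[] a, double[] b) {
      double s = 0;
      for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
      return s;
    }
  }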
Canopy clustering
• Preliminary step to help parallelize computation
• Clusters data into overlapping canopies using a very cheap distance metric
• Efficient
• Accurate

Canopy clustering algorithm
  While there are unmarked points {
    pick a point which is not strongly marked; call it a canopy center
    mark all points within some threshold of it as in its canopy
    strongly mark all points within some tighter threshold
  }
  (see the code sketch at the end of this section)

After the canopy clustering...
• Run K-means clustering as usual
• Treat objects in separate canopies as being at infinite distances

MapReduce implementation
• Problem - Efficiently partition a large data set (say... movies with user ratings!) into a fixed number of clusters using canopy clustering, K-means clustering, and a Euclidean distance measure
• The distance metrics
  - The canopy metric ($)
  - The K-means metric ($$$)

Steps
• Get data into a form you can use (MR)
• Pick canopy centers (MR)
• Assign data points to canopies (MR)
• Pick K-means cluster centers
• K-means algorithm (MR)
• Iterate!

Canopy distance function
• Canopy selection requires a simple distance function
• Number of rater IDs in common
• Close and far distance thresholds
  - Close distance threshold: 8 rater IDs in common
  - Far distance threshold: 2 rater IDs in common

K-means distance metric
• The set of ratings for a movie given by a set of users can be thought of as a vector
  A = [user1_score, user2_score, ..., userN_score]
• To evaluate the distance between two movies, A and B, use the similarity metric below:
  Similarity(A, B) = sum(A_i * B_i) / (sqrt(sum(A_i^2)) * sqrt(sum(B_i^2)))
  where the sum(...) functions range over all A_i or B_i for 0 <= i < n
• Example:
  - Vector(A) = 1111000
  - Vector(B) = 0100111
  - Vector(C) = 1110010
  - Similarity(A, B) = Vector(A) . Vector(B) / (||A|| * ||B||)
  - Vector(A) . Vector(B) = 1
  - ||A|| * ||B|| = 2 * 2 = 4
  - Similarity(A, B) = 1/4 = 0.25

Data massaging
• Convert the data into the required format
• In this case the converted data is keyed by movie: <MovieId, List of Users>
• <MovieId, List<userId, rating>>

Canopy cluster - Mapper and Reducer
[Illustration: each mapper picks canopy centers locally using the threshold value (mapper A - red centers, mapper B - green centers); the reducer merges redundant centers that fall within the threshold of each other, allowing a small error => Threshold + ξ]
• So far we have found only the canopy centers
• Run another MR job to find the points that belong to each canopy center
• The canopy clusters are ready when that job completes
• What would it look like?

Canopy cluster - before and after the MR job
[Illustration: a sparse matrix before the MR job; after the MR job, cells with value 1 are grouped together and users are moved from their original locations]

K-means clustering
• The output of canopy clustering becomes the input of K-means clustering
• Apply the cosine similarity metric to find similar users
• To compute cosine similarity, create a vector in the format <UserId, List<Movies>>
  <UserId, {m1, m2, m3, m4, m5}>
• Example:
  - User A: Toy Story, Avatar, Jumanji, Heat
  - User B: Avatar, GoldenEye, Money Train, Mortal Kombat
  - User C: Toy Story, Jumanji, Money Train, Avatar

            Toy Story  Avatar  Jumanji  Heat  GoldenEye  Money Train  Mortal Kombat
  User A        1         1       1      1       0           0             0
  User B        0         1       0      0       1           1             1
  User C        1         1       1      0       0           1             0

• Find the k neighbors from the same canopy cluster
• Do not take points from another canopy cluster if you want a small number of neighbors
• # of K-means clusters > # of canopy clusters
• After a couple of MapReduce jobs, the K-means clusters are ready

[Illustrations: all points before clustering; canopy clustering; canopy clustering and K-means clustering together]
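Below is a minimal single-machine sketch of the canopy center selection loop above, using the "number of rater IDs in common" metric and the close/far thresholds from the slides. The data layout (a map from movie ID to the set of rater IDs) and the example values are assumptions for illustration; assigning the remaining points to canopies is done by a later MR job and is therefore only noted in a comment. In the MapReduce version this loop runs independently inside each mapper and the reducer merges redundant centers.

  import java.util.ArrayList;
  import java.util.HashSet;
  import java.util.LinkedHashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.Set;

  // Canopy center selection with the cheap "rater IDs in common" metric.
  // Points sharing >= CLOSE_THRESHOLD raters with a center are strongly marked
  // and can no longer become centers themselves; points sharing >= FAR_THRESHOLD
  // raters would belong to the canopy (handled by a later MR job).
  public class CanopySelection {
    static final int CLOSE_THRESHOLD = 8;  // rater IDs in common (strong mark)
    static final int FAR_THRESHOLD = 2;    // rater IDs in common (canopy membership)

    // Cheap distance surrogate: how many raters two movies share.
    static int ratersInCommon(Set<Integer> a, Set<Integer> b) {
      int count = 0;
      for (Integer rater : a) if (b.contains(rater)) count++;
      return count;
    }

    // movies: movieId -> set of rater IDs. Returns the chosen canopy center IDs.
    static List<Integer> pickCanopyCenters(Map<Integer, Set<Integer>> movies) {
      List<Integer> centers = new ArrayList<>();
      Set<Integer> stronglyMarked = new HashSet<>();
      for (Map.Entry<Integer, Set<Integer>> candidate : movies.entrySet()) {
        if (stronglyMarked.contains(candidate.getKey())) continue;  // not a center
        centers.add(candidate.getKey());
        for (Map.Entry<Integer, Set<Integer>> other : movies.entrySet()) {
          int common = ratersInCommon(candidate.getValue(), other.getValue());
          if (common >= CLOSE_THRESHOLD) {
            stronglyMarked.add(other.getKey());  // too close to ever become a center
          }
          // common >= FAR_THRESHOLD would place "other" in this canopy;
          // that assignment happens in the follow-up MR job described above.
        }
      }
      return centers;
    }

    public static void main(String[] args) {
      Map<Integer, Set<Integer>> movies = new LinkedHashMap<>();
      movies.put(1, new HashSet<>(List.of(1, 2, 3, 4, 5, 6, 7, 8, 9)));
      movies.put(2, new HashSet<>(List.of(1, 2, 3, 4, 5, 6, 7, 8, 20)));
      movies.put(3, new HashSet<>(List.of(30, 31, 32)));
      // Movies 1 and 2 share 8 raters, so only one of them becomes a center.
      System.out.println("Canopy centers: " + pickCanopyCenters(movies));
    }
  }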
Amazon Elastic MapReduce

Elastic MapReduce
• Based on a Hadoop AMI
• Data stored on S3
• Work is submitted as a "job flow"

Example
  elastic-mapreduce --create --stream \
    --mapper  s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --input   s3://elasticmapreduce/samples/wordcount/input \
    --output  s3://my-bucket/output \
    --reducer aggregate

HDFS Global Picture

Goals
• Understand the underlying distributed file system for large data processing
  - Hadoop DFS is an open source implementation of the GFS
• Understand tradeoffs in system design

Assumptions
• Inexpensive components that often fail
• Large files
• Large streaming reads and small random reads
• Large sequential writes
• Multiple users append to the same file
• High bandwidth is more important than low latency

Failure Trends in a Large Disk Drive Population
[Charts: disk failure rates, broken down by the age a drive was when it failed]

Architecture
• Chunks
  - File chunks
  - Locations of chunks (replicas)
• Master server
  - Single master
  - Keeps metadata
  - Accepts requests on metadata
  - Most management activities
• Chunk servers
  - Multiple
  - Keep chunks of data
  - Accept requests on chunk data

Design decisions
• Single master
  - Simplifies the design
  - Single point of failure
  - Limited number of files
• Metadata kept in memory
• Large chunk size, e.g., 64 MB
  - Advantages
    • Reduces client-master traffic
    • Reduces network overhead - fewer network interactions
    • Chunk index is smaller
  - Disadvantages
    • Does not favor small files

Master: metadata
• Metadata is stored in memory
• Namespaces
  - Directory -> physical location
• Files -> chunks -> chunk locations
• Chunk locations
  - Not stored by the master; reported by the chunk servers
• Operation log

Master operations
• All namespace operations
  - Name lookup
  - Create/remove directories/files, etc.
• Manage chunk replicas
  - Placement decisions
  - Create new chunks & replicas
  - Balance load across all chunkservers
  - Garbage collection

Master: chunk replica placement
• Goals: maximize reliability, availability and bandwidth utilization
• Physical location matters
  - Lowest cost within the same rack
  - "Distance": number of network switches
• In practice (Hadoop), with 3 replicas
  - Two chunks in the same rack
  - The third one in another rack
• Choice of chunkservers
  - Low average disk utilization
  - Limited number of recent writes, to distribute write traffic
• Re-replication
  - Replicas are lost for many reasons
  - Prioritized: low number of replicas, live files, actively used chunks
  - Placement follows the same principles
• Rebalancing
  - Redistribute replicas periodically
  - Better disk utilization
  - Load balancing

Master: garbage collection
• Lazy mechanism
  - Mark deletion at once
  - Reclaim resources later
• Regular namespace scan
  - For deleted files: remove metadata after three days (full deletion)
  - For orphaned chunks: let chunkservers know they are deleted
• Stale replicas
  - Detected using chunk version numbers

Read/Write
• File read (client JVM: DistributedFileSystem, FSDataInputStream; NameNode; DataNodes)
  1. The client opens the file via DistributedFileSystem
  2. DistributedFileSystem gets the block locations from the NameNode
  3. The client reads through an FSDataInputStream
  4. Data is read from the closest DataNode
  5. The next block is read from the 2nd closest node
  6. The client closes the stream
• File write (client JVM: DistributedFileSystem, FSDataOutputStream; NameNode; DataNodes)
  1. The client creates the file via DistributedFileSystem
  2. DistributedFileSystem asks the NameNode to create the file
  3. The client writes through an FSDataOutputStream
  4. The stream gets a list of 3 DataNodes from the NameNode
  5. Packets are written along the DataNode pipeline
  6. Acknowledgement packets flow back
  7. The client closes the stream
  8. DistributedFileSystem tells the NameNode the file is complete
• If a DataNode crashes during a write, the crashed node is removed from the pipeline, the current block receives a new id (so the partial data on the crashed node can be deleted later), and the NameNode allocates another node
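The read and write paths above are what the HDFS Java client API drives under the hood. Below is a minimal client-side sketch using the standard org.apache.hadoop.fs classes named in the diagrams; the NameNode URI and file path are placeholders for illustration, not values from the lecture.

  import java.io.InputStream;
  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
      // Placeholder NameNode URI and path, for illustration only.
      String nameNode = "hdfs://namenode:8020";
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(URI.create(nameNode), conf);
      Path file = new Path("/user/demo/ratings.txt");

      // Write: create() contacts the NameNode (write steps 1-2), the data then
      // streams through the DataNode pipeline (steps 4-6); close() completes the file.
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.writeBytes("1::1193::5::978300760\n");
      }

      // Read: open() fetches block locations from the NameNode (read steps 1-2),
      // then bytes are pulled from the closest DataNodes (steps 3-5).
      try (InputStream in = fs.open(file)) {
        IOUtils.copyBytes(in, System.out, 4096, false);
      }

      fs.close();
    }
  }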
System interactions
• Mutation
  - The master assigns a "lease" to one replica - the primary
  - The primary determines the order of mutations

Consistency
• It is expensive to maintain strict consistency
• GFS uses a relaxed consistency model
  - Better support for appending
  - Checkpointing

Fault tolerance
• High availability
  - Fast recovery
  - Chunk replication
  - Master replication: inactive backup
• Data integrity
  - Checksumming
  - Incremental checksum updates to improve performance
    • A chunk is split into 64 KB units
    • The checksum is updated after a unit is added
  (see the checksum sketch at the end of this section)

Discussion
• Advantages
  - Works well for large data processing
  - Uses cheap commodity servers
• Tradeoffs
  - Single master design
  - Optimized for mostly-read, mostly-append workloads
• Latest upgrades (GFS II)
  - Distributed masters
  - Introduce the "cell" - a number of racks in the same data center
  - Improved performance of random reads/writes
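As a small illustration of the per-unit checksumming idea in the Fault tolerance slide, the sketch below splits a chunk's data into 64 KB units and keeps one CRC32 checksum per unit, so appending or verifying data only touches the unit concerned rather than the whole chunk. The 64 KB unit size follows the lecture; the class, the choice of CRC32, and the method names are a hypothetical illustration, not GFS or HDFS code.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.zip.CRC32;

  // Sketch: per-64KB-unit checksums for a chunk, so verifying or extending the
  // chunk only requires recomputing the checksum of the affected unit.
  public class ChunkChecksums {
    static final int UNIT_SIZE = 64 * 1024;  // 64 KB units, as in the lecture

    // One CRC32 value per 64 KB unit of the chunk.
    static List<Long> checksumUnits(byte[] chunkData) {
      List<Long> checksums = new ArrayList<>();
      for (int offset = 0; offset < chunkData.length; offset += UNIT_SIZE) {
        int len = Math.min(UNIT_SIZE, chunkData.length - offset);
        CRC32 crc = new CRC32();
        crc.update(chunkData, offset, len);
        checksums.add(crc.getValue());
      }
      return checksums;
    }

    // Verify a single unit on read: recompute its checksum and compare.
    static boolean verifyUnit(byte[] chunkData, int unitIndex, long expected) {
      int offset = unitIndex * UNIT_SIZE;
      int len = Math.min(UNIT_SIZE, chunkData.length - offset);
      CRC32 crc = new CRC32();
      crc.update(chunkData, offset, len);
      return crc.getValue() == expected;
    }

    public static void main(String[] args) {
      byte[] chunk = new byte[3 * UNIT_SIZE + 123];  // a chunk spanning 4 units
      List<Long> sums = checksumUnits(chunk);
      System.out.println("units: " + sums.size()
          + ", unit 0 ok: " + verifyUnit(chunk, 0, sums.get(0)));
    }
  }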