Application Development for the Cloud: A Paradigm Shift
Ramesh Rangachar, Intelsat
© 2012 by Intelsat. Published by The Aerospace Corporation with permission.

Motivation for the Paradigm Shift
• The performance of traditional applications is becoming inadequate
• Data is growing faster than Moore's Law [IDC study]; the new buzzword is Big Data
• Application performance and scalability are limited by the performance and scalability of storage (disk I/O) and the network
• Applications must capitalize on the benefits provided by the cloud environment
• There are no significant improvements left in "scaling up" (scaling vertically by upgrading a single node)
• "Scaling out" (scaling horizontally by adding more nodes) is a better approach

How Are Others Solving the Big Data Problem?
• Google invented a software framework called MapReduce to support distributed computing on large data sets across clusters of computers
  – Google also invented Bigtable, a distributed storage system for managing petabytes of data
• Hadoop: an open-source Apache project inspired by the work at Google; a software framework for reliable, scalable, distributed computing
  – There are a number of other Hadoop-related projects at Apache
  – Major contributors include Yahoo!, Facebook, and Twitter
• S4 Distributed Stream Computing Platform: an Apache incubator project supporting development of applications that process continuous, unbounded streams of data
• Many more …
• We'll review some technologies in the Hadoop ecosystem

What Is Apache Hadoop?
• The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing
  – A framework for the distributed processing of large data sets across clusters of computers using a simple programming model
  – Designed to scale from a single server to thousands of servers, each offering local computation and storage
• The Hadoop core consists of:
  – Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data
  – Hadoop MapReduce: a software framework for distributed processing of large data sets on clusters

MapReduce Data Flow
[Figure: MapReduce data flow. The master node runs the JobTracker; each worker node runs a TaskTracker.]

MapReduce Example: Word Count
[Figure: word-count data flow. Input files hold the phrases "cloud computing overview", "heterogeneous high performance cloud computing", "cloud computing applications in space system acquisitions", "applications development in the cloud", and "space flight dynamics as a service". Map workers tokenize their input splits and emit (word, 1) pairs into intermediate files; reduce workers then sum the pairs into final counts such as cloud,4; computing,3; applications,2; in,2; space,2.]
The code sketch below shows how this example maps onto the Hadoop API.
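The figure's word count is the canonical MapReduce program. The following is a minimal, illustrative sketch against the org.apache.hadoop.mapreduce API, not code from the original presentation; Job.getInstance is the Hadoop 2.x idiom (older releases constructed the Job directly), and the input/output paths come from the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in this worker's input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the 1s for each word into a final count.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The combiner pre-aggregates map output locally, shrinking the
    // intermediate files before the shuffle to the reducers.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that the application supplies only the two small functions; the framework handles splitting the input, scheduling workers, shuffling intermediate pairs, and retrying failed tasks.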
HDFS
• A very large distributed file system on commodity hardware
• Designed to scale to petabytes of storage
• A single namespace for the entire cluster
  – Optimized for batch processing
  – Write-once-read-many access model
  – Read, write, delete; append-only (no in-place updates)
• The NameNode handles replication, deletion, and creation
• DataNodes handle data retrieval

HDFS
• Files are broken up into blocks
  – The block size is 64 MB (configurable)
  – Blocks are replicated on at least 3 nodes
  – If a node fails, its blocks are automatically re-replicated to other nodes
[Figure: a NameNode, with a BackupNode behind it, coordinating five DataNodes that hold the replicated blocks.]
Clients see none of this block management; a sketch of the client-side filesystem API follows.
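Applications reach HDFS through an ordinary-looking filesystem API. A minimal sketch, assuming a Hadoop 2.x-era client library on the classpath; the NameNode address and file path are invented for illustration:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address (the key was "fs.default.name" on Hadoop 1.x).
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/telemetry/pass-0001.log"); // illustrative path

    // Write once: the NameNode records block locations; DataNodes store
    // the replicated blocks. There are no in-place updates later.
    try (FSDataOutputStream out = fs.create(file)) {
      out.write("2012-03-01T00:00:30Z bus.voltage 28.1\n"
          .getBytes(StandardCharsets.UTF_8));
    }

    // Read many: any client in the cluster can stream the file back.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }

    // The replication factor can be raised per file beyond the cluster default.
    fs.setReplication(file, (short) 3);
  }
}
```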
Hadoop Cluster at Yahoo!
• Largest cluster: 4,000 nodes, 16 PB of raw disk, 64 TB of RAM

Who Uses Hadoop?
• Adobe, AOL, Disney, eBay, Facebook, Google, Hulu, IBM, LinkedIn, New York Times, Powerset/Microsoft, Rackspace, StumbleUpon, Twitter, Yahoo!
• Full list: http://wiki.apache.org/hadoop/PoweredBy

Examples
• New York Times
  – Converted public-domain articles from 1851-1922 to PDF
  – Stored 4 TB of scanned images as input on Amazon S3
  – Ran 100 Amazon EC2 instances for around 24 hours
  – Produced 11 million articles (1.5 TB of PDF files) as output
  – http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/
• Visa
  – Processed 73 billion transactions, amounting to 36 TB of data
  – Traditional methods: 1 month; Hadoop: 13 minutes
  – "How Hadoop Tames Enterprises' Big Data." Sreedhar Kajeepeta, InformationWeek Reports, Feb 2012

Sorting Benchmarks
• Hadoop sorts a petabyte in 16.25 hours and a terabyte in 62 seconds
  – http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/

The Hadoop Ecosystem
• HBase™: initiated by Powerset. A scalable, distributed database that supports structured data storage for large tables
• Hive™: initiated by Facebook. A data warehouse infrastructure that provides data summarization and ad hoc querying
• Pig™: initiated by Yahoo!. A high-level data-flow language and execution framework for parallel computation
• ZooKeeper™: initiated by Yahoo!. A high-performance coordination service for distributed applications
• Avro™: a data serialization system

HBase
• An open-source project modeled after Google's Bigtable
• A distributed, large-scale data store
• Efficient at random reads and writes (see the client sketch after this list)
• Use HBase when you:
  – need to store large amounts of data
  – need high write throughput
  – need efficient random access within large data sets
  – need to scale gracefully with data
  – need to store time-series ordered data
  – don't need full RDBMS capabilities (cross-row/cross-table transactions, joins, etc.)
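A sketch of what a random write and read look like to an HBase client, using the HTable/Put API of the 0.9x-era client (later releases renamed parts of it); the "events" table, its "d" column family, and the row keys are invented for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath for cluster addresses.
    Configuration conf = HBaseConfiguration.create();
    // Assumes an "events" table with a "d" column family already exists.
    HTable table = new HTable(conf, "events");
    try {
      byte[] row = Bytes.toBytes("sat7|2012-03-01T00:00:30Z");
      byte[] family = Bytes.toBytes("d");

      // Random write: one row per event, keyed so that a satellite's
      // events sort together in time order.
      Put put = new Put(row);
      put.add(family, Bytes.toBytes("severity"), Bytes.toBytes("INFO"));
      put.add(family, Bytes.toBytes("msg"), Bytes.toBytes("AOS ground station 3"));
      table.put(put);

      // Random read: fetch the row straight back by its key.
      Result result = table.get(new Get(row));
      System.out.println(Bytes.toString(
          result.getValue(family, Bytes.toBytes("msg"))));
    } finally {
      table.close();
    }
  }
}
```

Because HBase stores rows sorted by key, a key that begins with the entity being queried keeps related rows adjacent on disk; the time-series pattern on the following slides exploits exactly this.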
HBase Use Cases
• Mozilla's Socorro crash-reporting system
  – 2.5 million crash reports per day
• Facebook
  – Real-time messaging system: HBase stores 135+ billion messages a month
  – Real-time analytics system: HBase processes 20 billion events per day
• Twitter, Yahoo!, TrendMicro, StumbleUpon, …
• See http://wiki.apache.org/hadoop/Hbase/PoweredBy

OpenTSDB
• A distributed, scalable time-series database developed by StumbleUpon
  – An open-source monitoring system built on HBase
  – Also a data-plotting system
  – Used to monitor the computers in StumbleUpon's datacenter
  – Collects 600 million data points per day, over 70 billion data points per year
  – Data points are stored forever, retaining full granularity
• Generating custom graphs and correlating events is easy
[Figure: example graph; 15,464 points retrieved, 932 points plotted in 100 ms.]

Typical Use Cases in the Ground Systems Domain
• Hadoop can be used to scale out applications that fit the "write-once-read-many" model
• Typical use cases (a storage sketch follows this list):
  – Telemetry analysis and visualization
  – Real-time displays
  – Storage and retrieval of spacecraft and ground-system events
  – Payload data processing
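Telemetry is a natural fit for the OpenTSDB pattern above: one row per parameter and time bucket, so plotting a channel's history becomes a single sequential range scan. A purely hypothetical sketch of such a row key follows; the field layout, widths, and IDs are invented, loosely modeled on OpenTSDB's metric-then-timestamp key design, and are not from the original presentation:

```java
import java.nio.ByteBuffer;

// Hypothetical helper: one way telemetry samples could be keyed for HBase.
// Rows sort by parameter, then time, so one channel's history is contiguous.
public class TelemetryRowKey {

  // Illustrative layout: [parameterId: 4 bytes][base time: 4 bytes][satelliteId: 2 bytes]
  static byte[] rowKey(int parameterId, long epochSeconds, short satelliteId) {
    // Align the timestamp to the hour so all samples from the same hour
    // share one row; offsets within the hour live in column qualifiers.
    int baseTime = (int) (epochSeconds - (epochSeconds % 3600));
    return ByteBuffer.allocate(10)
        .putInt(parameterId)
        .putInt(baseTime)
        .putShort(satelliteId)
        .array();
  }

  public static void main(String[] args) {
    // e.g. parameter 42 = "bus.voltage", sampled 2012-03-01T00:00:30Z, satellite 7
    byte[] key = rowKey(42, 1330560030L, (short) 7);
    System.out.println("row key length: " + key.length); // 10
  }
}
```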
Summary
• To capitalize on the benefits of the cloud environment, applications must be designed to "scale out" rather than "scale up"
• Robust open-source tools that support computing on a cluster of computers are available and in use today
• The Hadoop ecosystem is used in many enterprises
• The performance and scalability of ground systems can be improved by using the growing ecosystem of Hadoop tools and technologies
• Applications that fit the "write-once-read-many" model and process large amounts of data benefit from this approach

References
• Jeffrey Dean and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Communications of the ACM, Volume 51, Issue 1, pp. 107-113, 2008
• Apache Hadoop: http://hadoop.apache.org/
• "Apache Hadoop: An Introduction." Todd Lipcon, May 27, 2010
• "Hadoop Update." Eric Baldeschwieler, VP Hadoop Development, Yahoo!, Open Cirrus Summit 2009
• Cloudera training: http://www.cloudera.com/hadoop-training
• "Realtime Apache Hadoop at Facebook." Dhruba Borthakur, UC Berkeley graduate course, Oct 5, 2011
• Apache HBase: http://hbase.apache.org/
• http://www.slideshare.net/kevinweil/nosql-attwitter-nosql-eu-2010
• http://www.slideshare.net/ydn/3-hadooppigattwitterhadoopsummit2010
• "OpenTSDB, the Distributed, Scalable, Time Series Database." Benoît "tsuna" Sigoure, StumbleUpon