VasaviCollegeofEngineering Ibrahimbagh‐31 DepartmentofComputerScience&Engineering AReporton“TheBigData&itsroleinCloudComputing” ByA.Nagarajuconductedon24thApril,2014 (Conductedunder–Co‐curricularActivity) AGuestLectureon“TheBigData&itsrolein Cloud Computing” was delivered by Mr. A. Nagaraju. A Certified Project Management Professional and Microsoft Certified TechnologySpecialistwithmorethan19years of diversified experience in the areas of Project/Program Management, Service Management, Education/Training and Information Technology in providing services for world class clients. He has come down to traintheM.TechCSEstudentsontheBigData &itsroleinCloudComputing. Large Volume of Data has been processing in the current IT situation with new terminology hintingtheevolution 1Giga Byte (10^9) <1 Tera Byte (10^12) < 1 Peta Byte (10^15)<1 Exabyte(10^18) <1 Zetabyte(10^21) Traditionalsystemscanstoreandprocess200‐ 400Terabytesofdata.Morethanthisitfails. BigData‐Journey 1990‐‐‐‐Store 1400MB, transfer speed‐ 4.5MB/sec‐‐‐read entire drive in 5min (per head) 2010‐‐‐‐Store1TB,speed‐100MB/sec,readin 3hrs Hadoop ‐‐‐100 drives working at same time canread1TBofdatain2minutes.Processesin blocksofsize64bytes. 2010‐‐Information data corporation (IDC) estimates1.2ZB 2011‐‐FB 6 billion messages per day, 400 PB permonth,googlecreates250‐300PB/month. Hadoop ‐ An open source framework for storageandlarge‐scaleprocessingofdata‐sets onclustersofcommodityhardware. coding‐Java,Python,RubyonRails Hadoop Cluster: Name Node (Master) ‐ Data Node(Slaves)‐Task(Job)Tracker CanunderstandonlyMapReduce Current Challenges: 3V’s Velocity, Volume & Varietyofdata HadoopAdv:Scalable,Available,Reliable,Cost Effective,Flexible,FastandResilienttofailure Stream access ‐ usually sequential rather than Random(TraditionalDB’sthroughindexes) Cloudera is being used by 70% companies, MapRinUS20% Hadoop provides 4 key breakthroughs comparedtotraditionalsolutions: 1. Overcome traditional limitations of storageandcompute 2. Leverage inexpensive commodity hardwareastheplatform 3. Provides linear scalability from 1 to 4000servers 4. Lowcost,opensourcesoftware SectorsUsing: Searchengines SocialNetworking Finance/Banking RetailIndustries E‐mailservices Security Govt.&Publicsectors Future:33%ofcompaniesusingHadoop,22% ofcompanieshavestarted 58%CAGRofHadoopusage By 2018, $2.18 billions will be sent on Hadoop Hadoop‐‐‐ WriteonceReadanynumberoftimes⇒ We cannotopen a filein write mode in HDFS Itfollowsbatchprocessingmechanism. HadoopEcosystem: HDFS:Distributedfilesystem MapReduce: Data processing model and executionenvironment Pig: a data flow language or script language andexecutionenvironment‐Piglatinlooksas sql. Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a querylanguagebasedonSQL‐HiveQL. Sqoop: A tool for efficiently moving data between relational databases and HDFS. Hive andsqoopgotogether. HBase: A distributed column‐ oriented database. HBase uses HDFS for its underlying storage and supports both batch‐ stylecomputationsusingMapReduceandpoint queries. NoSQL (Google Bigtable). Similar to Mongo DB or Cassandra or Couch DB. It supportsOLTP.soupdatesarepossible. Zookeeper: Comes with Hbase is a highly available coordination service, provides distributedlocks. Oozie:isservicethatrunsontomcatwhichisa workflowschedulersystemtomanageHadoop jobs Flume: a distributed and available service for efficiently collecting aggregating and moving largeamountsoflogdata. Integrations:Hive<‐>HBaseorMapReduce<‐ >HiveorMapReduce<‐>Hbase Datasests:CarEvaluationdataset http://archive.ics.uci.edu/ml/ http://archive.ics.uci.deu/ml/machine‐ learning‐databases/car/ Studentsweremadetoworkonasmall pre configured Hadoop cluster for executing a smalldemoprogram.Thesessionconcludedby thespeakeremphasizinghowcloudcomputing has been involved with handling such large amounts of Data Sets and how it takes the studentstobereadyforthejobsaheadtoface intheirplacements.
© Copyright 2026 Paperzz