BigTable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
OSDI 2006
Presenter: Nghia Vo

Outline
01 Motivation
02 Model
03 Overview & API
04 Infrastructure
05 Refinement
06 Performance

Motivation
• Google scale:
  • Lots of requests
  • Needs suitable system to back services like Webpages, Emails, Maps…
• No commercial service big enough by then
• In-house means well suited for future scaling
• Built on top, relies on existing infrastructure (GFS, Chubby, etc…)

Model
• Multidimensional map
  • (Row string, column string, timestamp integer) as key
  • Value is an arbitrary string
• Sparse row, column oriented
• Distributed and persistent
• Rows (string):
  • Ordered lexicographically
  • Row read/write is atomic
  • Exclusive lock
  • Range of rows is called tablet
• Tablet:
  • Dynamically partitioned
  • Unit for distribution

Model cont.
• Column (string)
  • Family:qualifier naming
  • Family is unit for access control.
  • Client might only access some families within a row.
  • Small number of families and rarely change.
  • Unbounded number of columns.

Model cont.
• Timestamp (int)
  • Store each version of the data
  • Can set to store last k-values
  • Can set to store values within the last few hours, etc…

Model cont.
• Webtable (example):
  • Row: reverse URL
    • Helps with locality
  • Column: content, languages, anchor, etc…
  • Keeps contents for different timestamps

Service/APIs
• Metadata operations:
  • Create/delete tables, define column families, change metadata such as access control
• Read:
  • Scanner API provides reads (batch mode)
• Filtering:
  • Supports regex filters for columns
  • Client-supplied scripts for more filtering options
• Write:
  • Set data in some cells of a row
  • Delete some or all cells of a row

• One master
  • Metadata operations
  • Monitors tablet servers
  • Lightly loaded
• Multiple tablet servers
  • Handle read/write requests
  • Split overgrown tablets
• Library linked into clients
  • Caches tablet locations

Infrastructure
• SSTable, Tablet and GFS:
  • How BigTable is formatted/stored
• Chubby and Tablet Server:
  • How to access BigTable

Infrastructure: SSTable & Tablet
• SSTable:
  • Immutable map for key-value storage
  • Provides key range lookup
• Tablet:
  • Consists of multiple SSTables
  • Contains a range of rows
  • Tablets do not overlap
• Unit of distribution:
  • Stored in GFS
  • Must be assigned to a tablet server before being served

Tablet directory
• 3-level B+ tree design
  • Leaf level: actual data tablets
  • 1st and 2nd levels are METADATA tablets
• METADATA tablet stores:
  • Key range -> tablet location
• Root tablet's location is stored in Chubby

• Request for a key
• Ask for tablet location
  -> root tablet (location stored in Chubby)
  -> METADATA tablet
  -> user table
• A tablet's location is stored under a key built from its table ID & its end row
• Client caches the tree for performance
• When the cache goes stale, trace back and repeat.
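The lookup-and-cache path above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not BigTable's client library: `TabletLocator`, `metadata_lookup`, the `METADATA` list, and the `ts-*` server names are all hypothetical, and the Chubby -> root tablet -> METADATA tablet walk is collapsed into a single function call. Cached tablets are indexed by end row, so a binary search finds the candidate tablet for a row key.

```python
import bisect

# Hypothetical METADATA contents: (end_row, server), sorted by end row.
# A tablet covers keys after the previous end row, up to its own end row.
METADATA = [("g", "ts-1"), ("p", "ts-2"), ("\uffff", "ts-3")]


def metadata_lookup(row_key):
    """Stand-in for the Chubby -> root tablet -> METADATA tablet walk."""
    ends = [end for end, _ in METADATA]
    i = bisect.bisect_left(ends, row_key)
    start = ends[i - 1] if i > 0 else ""
    end, server = METADATA[i]
    return start, end, server


class TabletLocator:
    """Client-side cache of tablet locations (illustrative only)."""

    def __init__(self, lookup):
        self._lookup = lookup
        self._ends = []       # sorted end rows of cached tablets
        self._entries = {}    # end_row -> (start_row, server)
        self.metadata_reads = 0

    def _cached_index(self, row_key):
        # The first cached tablet whose end row is >= row_key is the only
        # candidate; it covers the key only if its start row is below it.
        i = bisect.bisect_left(self._ends, row_key)
        if i < len(self._ends) and self._entries[self._ends[i]][0] < row_key:
            return i
        return None

    def locate(self, row_key):
        i = self._cached_index(row_key)
        if i is not None:
            return self._entries[self._ends[i]][1]   # cache hit
        # Cache miss: walk the hierarchy and remember the answer.
        self.metadata_reads += 1
        start, end, server = self._lookup(row_key)
        if end not in self._entries:
            bisect.insort(self._ends, end)
        self._entries[end] = (start, server)
        return server

    def invalidate(self, row_key):
        """Drop a stale entry so the next locate() repeats the walk."""
        i = self._cached_index(row_key)
        if i is not None:
            del self._entries[self._ends.pop(i)]
```

A client would call `locate("com.cnn.www")` before each request and call `invalidate` when the contacted server reports that it no longer serves the tablet.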
Client's perspective

Tablet assignment
• Each tablet is assigned to one tablet server
• Master server's job:
  • Track tablet -> server
  • Remember unassigned tablets
  • Remember live servers
  • Assign whenever possible

Tablet operation
• Master handles:
  • Create tablets
  • Merge tablets
  • Delete tablets
• Tablet server handles:
  • Split tablet:
    • Write new tablet information to parent METADATA tablet
    • Send notification to master

Tablet server liveness
• Tablet server asks Chubby:
  • Gets exclusive lock on a uniquely named file
  • Holds lock as long as it serves tablets
• Tablet server is considered bad:
  • File isn't locked on Chubby
  • Master can acquire the server's lock but can't ping the tablet server
• Result:
  • File is deleted, server terminates.

Master server liveness
• When master starts:
  • Ask Chubby for a master lock
  • Scan Chubby server directory for live tablet servers
  • Ask tablet servers for tablets being served
  • Scan METADATA tablet for existing tablets
  • Assign accordingly

Serving tablet
• Tablet server needs:
  • GFS stores existing tablets
  • In-memory memtable
    • Constructed from tablet log in GFS
    • Stores committed updates from redo point
    • Due to immutability of SSTable

• Requires authorization from Chubby
• Read:
  • Merge SSTables in GFS with memtable
• Write:
  • Write to tablet log, then to memtable
• What happens when memtable gets large?
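The read and write paths above can be sketched as below. This is a toy in-memory model under stated assumptions: `TabletSketch`, its list-based `log`, and the dict-based SSTables are hypothetical stand-ins for the tablet log and SSTable files in GFS, and authorization, timestamps, and the redo point are left out (recovery here simply replays the whole log).

```python
class TabletSketch:
    """Toy tablet: commit log + mutable memtable + read-only SSTables."""

    def __init__(self, sstables=None):
        self.log = []                          # stand-in for the tablet log
        self.memtable = {}                     # recently committed writes
        self.sstables = list(sstables or [])   # oldest first, never modified

    def write(self, key, value):
        # Log first, so the memtable can be rebuilt after a crash.
        self.log.append((key, value))
        self.memtable[key] = value

    def read(self, key):
        # Merged view: the newest data wins, so check the memtable first,
        # then the SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None

    @classmethod
    def recover(cls, sstables, log):
        # Replay logged writes to reconstruct the memtable.
        tablet = cls(sstables)
        for key, value in log:
            tablet.write(key, value)
        return tablet
```

Because SSTables are never modified, every update lands in the memtable, which is why the memtable keeps growing until it is converted into a new SSTable.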
Read and Write

Compaction
• Why compaction:
  • SSTable is immutable
  • Must store commit log for write requests
  • Memtable size grows as well

Minor compaction
• When memtable gets large:
  • Converted to an SSTable
  • Written to GFS
  • Incoming reads/writes are still allowed

Merging compaction
• Periodically executed
• Reads converted memtables and related SSTables
• Merges them into one SSTable
• Safely deletes old SSTables

Major compaction
• Merging compaction that:
  • Rewrites all SSTables within a tablet into one
  • Contains no deleted information or data

Major compaction
• Use case:
  • Sensitive data might exist
  • Its delete command is written to GFS
  • Major compaction gets rid of the data and frees resources

Refinement
• Locality Group:
  • By client
  • Combine families that are frequently accessed together into an SSTable
• Compressing each LG's SSTable separately:
  • Reduces size (10-to-1)
  • Only decompress one SSTable when accessing that LG
• Declare LG as "load-to-memory":
  • Load SSTable lazily into tablet server memory
• Caching:
  • By tablet server
  • Cache K/V pairs
  • Cache SSTable blocks, fast access to neighbors

Refinement
• Bloom Filter:
  • Client specifies that Bloom filters be created for a locality group's SSTables
  • Check whether a K/V pair might exist in an SSTable
  • Helpful to avoid disk reads
• Immutability of SSTable can be exploited
  • Example:
    • When a tablet splits, its children can use its SSTables
• Recovery speedup:
  • Tablet can be assigned to a new server
  • Perform minor compactions on the old server before transferring

Refinement
• Commit-log implementation:
  • Tablets on the same server share a commit log.
  • Reduces number of files on GFS
  • Complicates recovery, since tablets can be assigned to different servers

Performance
• Setup:
  • N = # of clients = # of tablet servers
  • N <= 500
  • 1786 machines:
    • Each runs a GFS server
    • Some run a tablet server, client process, other jobs' processes
  • Choose # of row keys so that each test reads/writes ~1 GB to each tablet server

Throughput
• Number of 1000-byte values read/written per second
• Sequential
  • Read/write key from range
• Random
  • Read/write key uniformly chosen
• Read (mem)
  • Locality group in-memory optimization
• Scan
  • Sequential read with RPC optimization

• Increasing but not linearly!
• CPU (bottleneck of tablet server)
• Network
  • Reads a 64 KB SSTable block per 1 KB value R/W
  • Only good for sequential R/W and random read (mem)
• Load balancing doesn't work perfectly
  • Load is dynamic as the benchmark goes
  • Limited due to tablet movement
Aggregated throughput

Conclusion
• Scales well
  • But credit is due to the refinements as well
• HBase (open source) is modeled on BigTable
• Thanks to Google's infrastructure:
  • Easy to optimize at the lower level
• Well suited for Google's projects:
  • Maps
  • Personalized Search
  • Etc…

Q&A

Vs Relational DB (SQL)
• Hard to compare.
• From model perspective:
  • By default, indexed lexicographically on row key
  • Divided into ranges, similar to B+ tree
  • Tighter control on column family
  • Versioning-focused by timestamp key
  • RDBMS is more structured, supports queries, data aggregation
• From distributed system perspective:
  • Load balancing on tablet/tablet server (by GFS)
  • Tablet serving (by GFS and SSTable)
  • Concurrency control on tablet/tablet server (by Chubby)

Quiz
• Under which dimension (row/column/time) is BigTable lexicographically ordered?
• What is a tablet again?
• How exactly is a tablet distributed and served to clients?

Quiz
• In a few words:
  • What does GFS store?
  • What is the relationship between Chubby and the tablet server?
  • What does the master server do?