Bigtable slides

BigTable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
OSDI 2006
Presenter: Nghia Vo
Outline
01 Motivation
02 Model
03 Overview & API
04 Infrastructure
05 Refinement
06 Performance
Motivation
• Google scale:
  • Lots of requests
  • Needs a suitable system to back services like Web pages, Email, Maps…
• No commercial service was big enough at the time
• In-house means well suited for future scaling
• Built on top of, and relies on, existing infrastructure (GFS, Chubby, etc.)
Model
• Multidimensional map
  • (Row string, column string, timestamp integer) as key
  • Value is an arbitrary string
• Sparse rows, column oriented
• Distributed and persistent
• Rows (string):
  • Ordered lexicographically
  • Row read/write is atomic
  • Exclusive lock
  • Range of rows is called a tablet
• Tablet:
  • Dynamically partitioned
  • Unit for distribution
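
The model above can be pictured with a small in-memory sketch. This is only an illustration of a map keyed by (row, column, timestamp) with lexicographically ordered rows; the names `TinyTable`, `scan_rows`, and `split_into_tablets` are hypothetical and are not part of any BigTable API.

```python
from bisect import bisect_left


class TinyTable:
    """Toy model: (row, column, timestamp) -> value, rows kept sorted."""

    def __init__(self):
        self.cells = {}  # (row, column, timestamp) -> value (sparse)
        self.rows = []   # distinct row keys in lexicographic order

    def put(self, row, column, timestamp, value):
        i = bisect_left(self.rows, row)
        if i == len(self.rows) or self.rows[i] != row:
            self.rows.insert(i, row)  # keep row keys sorted
        self.cells[(row, column, timestamp)] = value

    def get(self, row, column):
        """Return the value with the highest timestamp, or None."""
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        return max(versions, key=lambda tv: tv[0])[1] if versions else None

    def scan_rows(self, start, end):
        """Row keys in the half-open range [start, end), in sorted order."""
        lo = bisect_left(self.rows, start)
        hi = bisect_left(self.rows, end)
        return self.rows[lo:hi]


def split_into_tablets(sorted_rows, max_rows):
    """Cut the sorted row space into contiguous, non-overlapping ranges."""
    return [sorted_rows[i:i + max_rows]
            for i in range(0, len(sorted_rows), max_rows)]
```

Because the rows are sorted, a range of neighboring keys always falls into one tablet or a few adjacent ones, which is what makes the tablet a natural unit for distribution.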
Model cont.
• Column (string)
  • Family:qualifier naming
  • Family is the unit for access control.
  • Client might only access some families within a row.
  • Small number of families, which rarely change.
  • Unbounded number of columns.
Model cont.
• Timestamp (int)
  • Store each version of the data
  • Can be set to store the last k values
  • Can be set to store values within the last few hours, etc.
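
The two retention settings can be sketched as simple filters over one cell's versions. This is a minimal illustration, assuming versions are held as `(timestamp, value)` pairs; the function names are hypothetical.

```python
def keep_last_k(versions, k):
    """Keep the k most recent (timestamp, value) pairs, newest first."""
    newest_first = sorted(versions, key=lambda tv: tv[0], reverse=True)
    return newest_first[:k]


def keep_newer_than(versions, now, max_age):
    """Keep versions whose age (now - timestamp) is at most max_age."""
    fresh = [tv for tv in versions if now - tv[0] <= max_age]
    return sorted(fresh, key=lambda tv: tv[0], reverse=True)
```

For example, with versions at timestamps 10, 20, and 30, `keep_last_k(versions, 2)` keeps the ones at 30 and 20.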
Model cont.
• Webtable:
  • Row: reverse URL
  • Helps with locality
  • Column: content, languages, anchor, etc.
  • Keep contents for different timestamps
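
A short sketch of why reversed URLs help locality: reversing the hostname components makes pages from the same domain sort next to each other. The helper name `row_key` is hypothetical, and the input is assumed to be a URL without a scheme.

```python
def row_key(url):
    """Reverse the hostname components; keep any path unchanged."""
    host, sep, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + sep + path


urls = ["maps.google.com/index.html", "www.cnn.com",
        "mail.google.com", "www.google.com"]
keys = sorted(row_key(u) for u in urls)
# All google.com pages are now contiguous in the sorted row space.
```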
Example
Service/APIs
• Metadata operation:
  • Create/delete table, define column families, change metadata like access control
• Read:
  • Scanner API provides read (batch mode)
• Filtering:
  • Support regex filter for columns
  • Client-supplied script for more filtering options
• Write:
  • Set data in some cells on a row
  • Delete some cells or all on a row
• One master
  • Meta operation
  • Monitor tablet servers
  • Lightly loaded
• Multiple tablet servers
  • Handle read/write requests
  • Split overgrown tablets
• Library linked into each client
  • Cache tablet location
Infrastructure
• SSTable, Tablet and GFS:
  • How BigTable is formatted/stored
• Chubby and Tablet Server:
  • How to access BigTable
Infrastructure
SSTable & Tablet
• SSTable:
  • Immutable map for key-value store
  • Provides key range lookup
• Tablet:
  • Consists of multiple SSTables
  • Contains a range of rows
  • Tablets do not overlap
• Unit of distribution:
  • Stored in GFS
  • Must be assigned to a tablet server before being served
Tablet directory
• 3-level B+ tree design
  • Leaf level: actual data tablets
  • 1st and 2nd levels are METADATA tablets
• METADATA tablet stores:
  • Key range -> tablet location
• Location of the root tablet is stored in a Chubby file
Client’s perspective
• Request for a key
• Ask for tablet location
  -> Chubby file (holds the root tablet location)
  -> root tablet
  -> METADATA tablet
  -> user table
• Tablet location is keyed by table ID & the tablet’s end row
• Caches tree for performance
• When the cache goes stale, trace back and repeat.
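
The lookup-and-cache behavior can be sketched as follows. This is a simplified single-table illustration, not the real client library: `Metadata`, `Client`, `find_server`, and `invalidate` are hypothetical names, the multi-level walk is collapsed into one `locate` call, and the last tablet is assumed to use `"\uffff"` as a sentinel end row.

```python
from bisect import bisect_left


class Metadata:
    """Stand-in for METADATA: tablets indexed by their end row."""

    def __init__(self, tablets):
        # tablets: list of (end_row, server); a tablet covers rows after
        # the previous tablet's end row, up to and including its own.
        self.tablets = sorted(tablets)
        self.end_rows = [end for end, _ in self.tablets]
        self.lookups = 0  # counts round trips, for illustration only

    def locate(self, row):
        self.lookups += 1
        i = bisect_left(self.end_rows, row)  # first end_row >= row
        prev_end = self.end_rows[i - 1] if i > 0 else ""
        end_row, server = self.tablets[i]
        return prev_end, end_row, server


class Client:
    def __init__(self, metadata):
        self.metadata = metadata
        self.cache = {}  # end_row -> (prev_end, server)

    def find_server(self, row):
        for end, (prev_end, server) in self.cache.items():
            if prev_end < row <= end:
                return server  # cache hit, no metadata round trip
        prev_end, end, server = self.metadata.locate(row)
        self.cache[end] = (prev_end, server)
        return server

    def invalidate(self, row):
        """Drop the cached entry covering row, e.g. after a failed request."""
        self.cache = {end: (prev, srv)
                      for end, (prev, srv) in self.cache.items()
                      if not (prev < row <= end)}
```

After a cache miss or a stale entry, the client simply repeats the lookup; subsequent requests for rows in the same tablet are answered from the cache.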
Tablet assignment
• Each tablet is assigned to one tablet server
• Master server’s job:
  • Track tablet -> server
  • Remember unassigned tablets
  • Remember live servers
  • Assign whenever possible
Tablet operation
• Master handles:
  • Create tablets
  • Merge tablets
  • Delete tablets
• Tablet server handles:
  • Split tablet:
    • Write new tablet information to parent METADATA tablet
    • Send notification to master
Tablet server liveness
• Tablet server asks Chubby:
  • Gets exclusive lock on a uniquely named file
  • Holds lock as long as it serves tablets
• Tablet server is considered bad:
  • File isn’t locked on Chubby
  • Master can acquire the server’s lock but can’t ping the tablet server
• Result:
  • File is deleted, server terminates.
Master server liveness
• When master starts:
  • Ask Chubby for a master lock
  • Scan Chubby server directory for live tablet servers
  • Ask tablet servers for tablets being served
  • Scan METADATA tablet for existing tablets
  • Assign accordingly
Serving tablet
• Tablet server needs:
  • GFS stores existing tablets
  • In-memory memtable
    • Constructed from tablet log in GFS
    • Stores committed updates from redo point
    • Due to immutability of SSTable
Read and Write
• Requires authorization from Chubby
• Read:
  • Merge SSTables in GFS with memtable
• Write:
  • Write to tablet log, then to memtable
• What happens when the memtable gets large?
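
The read and write paths can be sketched with plain dictionaries. This is a toy under stated assumptions: Python lists and dicts stand in for the tablet log and SSTables in GFS, authorization is omitted, and `TabletSketch` and `recover` are hypothetical names.

```python
class TabletSketch:
    def __init__(self, sstables=None):
        self.log = []       # stand-in for the tablet's commit log
        self.memtable = {}  # recent committed writes, held in memory
        # SSTables, oldest first; treated as immutable once created.
        self.sstables = list(sstables or [])

    def write(self, key, value):
        self.log.append((key, value))  # 1. append to the log first
        self.memtable[key] = value     # 2. then apply to the memtable

    def read(self, key):
        """Merged view: memtable first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None

    @classmethod
    def recover(cls, sstables, log):
        """Rebuild the memtable by replaying logged writes in order."""
        tablet = cls(sstables)
        for key, value in log:
            tablet.write(key, value)
        return tablet
```

Because SSTables are never modified in place, every update lands in the memtable, which is why the memtable keeps growing until something flushes it.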
Compaction
• Why compaction:
  • SSTable is immutable
  • Must store commit log for write requests
  • Memtable size grows as well
Minor compaction
• When memtable gets large:
  • Converts to SSTable
  • Writes to GFS
  • Allows incoming reads/writes
Merging compaction
• Periodically executed
• Reads converted memtables and related SSTables
• Merges into one SSTable
• Safely deletes old SSTables
Major compaction
• Merging compaction that:
  • Rewrites all SSTables into one within a tablet
  • Contains no deleted information or data
Major compaction
• Use case:
  • There might be sensitive data
  • Its delete command is written to GFS
  • Major compaction gets rid of the data and frees resources
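
The three kinds of compaction can be contrasted in a small sketch. It assumes SSTables are dicts ordered oldest first and that a delete is recorded as a marker object (`TOMBSTONE`); all names here are hypothetical.

```python
TOMBSTONE = object()  # marker written when a cell is deleted


def minor_compaction(memtable, sstables):
    """Freeze the memtable into a new SSTable; return an empty memtable."""
    return {}, sstables + [dict(memtable)]


def merging_compaction(sstables, n):
    """Merge the newest n SSTables (n >= 1) into one; keep delete markers,
    because they may still hide values in older SSTables."""
    older, newest = sstables[:-n], sstables[-n:]
    merged = {}
    for sstable in newest:  # oldest to newest, so newer values win
        merged.update(sstable)
    return older + [merged]


def major_compaction(sstables):
    """Rewrite everything into one SSTable and drop deleted entries."""
    merged = {}
    for sstable in sstables:
        merged.update(sstable)
    return [{k: v for k, v in merged.items() if v is not TOMBSTONE}]
```

Only the major compaction can physically discard deleted data, since it is the only one that sees every SSTable of the tablet at once.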
Refinement
• Locality Group:
  • By client
  • Combine families that are frequently accessed together into an SSTable
• Compressing each LG’s SSTable separately:
  • Reduce size (10-to-1)
  • Only decompress one SSTable when accessing that LG
• Declare LG as “load-to-memory”:
  • Load SSTable lazily into tablet server memory
• Caching:
  • By tablet server
  • Cache K/V pairs
  • Cache SSTable blocks, fast access to neighbors
Refinement
• Bloom Filter:
  • Client requests Bloom filters for the SSTables of a locality group
  • Check whether an SSTable might contain a given K/V
  • Helpful to avoid disk reads
• Immutability of SSTable can be exploited
  • Example:
    • When a tablet splits, its children can use its SSTables
• Recovery speedup:
  • Tablet can be assigned to a new server
  • Perform minor compactions on the old server before transferring
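
A minimal Bloom filter sketch shows how a lookup can skip an SSTable without touching disk. The sizes and the use of SHA-256 are arbitrary choices for illustration, not what BigTable uses, and the class name is hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all((self.bits >> p) & 1 for p in self._positions(key))
```

A tablet server would consult the filter before reading an SSTable: a negative answer is always correct, so the disk read can be skipped, while a positive answer may occasionally be a false positive and still requires the read.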
Refinement
• Commit-log implementation:
  • Tablets on the same server share a commit log.
  • Reduces number of files on GFS
  • Complicates recovery, since tablets can be assigned to different servers
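
One way to see why a shared log complicates recovery: entries for many tablets are interleaved, so a new server has to pick out only its tablet's mutations. The paper addresses this by first sorting the log by (table, row, sequence number); the sketch below illustrates that idea with hypothetical names and an assumed entry layout of `(table, row, seq, mutation)`.

```python
def sort_log_for_recovery(entries):
    """Group interleaved entries so each tablet's mutations are contiguous."""
    return sorted(entries, key=lambda e: (e[0], e[1], e[2]))


def mutations_for_tablet(sorted_entries, table, start_row, end_row):
    """Entries for rows in (start_row, end_row] of one table, in replay order."""
    return [e for e in sorted_entries
            if e[0] == table and start_row < e[1] <= end_row]
```

With the log sorted once, each recovering server reads one contiguous run instead of scanning the whole shared log.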
Performance
• Setup:
  • N = # of clients = # of tablet servers
  • N <= 500
  • 1786 machines:
    • Each runs a GFS server
    • Some run a tablet server, client process, other jobs’ processes
  • Choose # of row keys so that each test reads/writes ~1 GB to each tablet server
Throughput
• Number of 1000-byte values r/w per second
• Sequential
  • Read/write key from range
• Random
  • Read/write key uniformly chosen
• Read (mem)
  • Locality group in-memory optimization
• Scan
  • Sequential read with RPC optimization
Aggregated throughput
• Increasing but not linearly!
• CPU (bottleneck of tablet server)
• Network
  • Read a 64 KB SSTable block per 1 KB value on random reads
  • Only good for sequential R/W and random read (mem)
• Load balancing doesn’t work perfectly
  • Load is dynamic as benchmark goes
  • Limited due to tablet movement
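
A quick back-of-the-envelope calculation shows how much network traffic the 64 KB block fetch implies. The rate of roughly 1200 random reads per second per tablet server is the approximate figure reported in the paper; treat it as an assumed input here, and the function names as hypothetical.

```python
BLOCK_BYTES = 64 * 1024  # SSTable block fetched from GFS per random read
VALUE_BYTES = 1000       # size of the value the client actually wants


def gfs_traffic_mb_per_s(reads_per_s, block_bytes=BLOCK_BYTES):
    """Data pulled from GFS per second, in MB (1 MB = 1024 * 1024 bytes)."""
    return reads_per_s * block_bytes / (1024 * 1024)


def read_amplification(block_bytes=BLOCK_BYTES, value_bytes=VALUE_BYTES):
    """Bytes transferred per byte of useful data."""
    return block_bytes / value_bytes


traffic = gfs_traffic_mb_per_s(1200)  # assumed per-server read rate
amplification = read_amplification()
```

At that rate a single tablet server pulls about 75 MB/s from GFS while using only a small fraction of each block, which is consistent with the network being a bottleneck for random reads.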
Conclusion
• Scales well
  • But credit is due to the refinements as well
• HBase (open source) exists and is modeled on BigTable
• Thanks to Google’s infrastructure:
  • Easy to optimize at the lower level
• Well suited for Google’s projects:
  • Maps
  • Personalized Search
  • Etc.
Q&A
Vs Relational DB (SQL)
• Hard to compare.
• From model perspective:
  • By default, indexed lexicographically on row key
  • Divided into ranges, similar to B+ tree
  • Tighter control on column family
  • Versioning-focused by timestamp key
  • RDBMS is more structured, supports queries and data aggregation
• From distributed system perspective:
  • Load balancing on tablet/tablet server (by the master)
  • Tablet serving (by GFS and SSTable)
  • Concurrency control on tablet/tablet server (by Chubby)
Quiz
• Under which dimension (row/column/time) is BigTable lexicographically ordered?
• What is a tablet again?
• How exactly is a tablet distributed and served to the client?
Quiz
• In a few words:
  • What does GFS store?
  • What is the relationship between Chubby and tablet server?
  • What does the master server do?