Parallelism and Concurrency
COS 326
David Walker
Princeton University

slides copyright 2013-2015 David Walker and Andrew W. Appel
permission granted to reuse these slides for non-commercial educational purposes

Parallelism
• What is it?
• Today's technology trends.
• Then:
  – Why is it so much harder to program?
    • (Is it actually so much harder to program?)
  – Some preliminary linguistic constructs
    • thread creation
    • our first parallel functional abstraction: futures

PARALLELISM: WHAT IS IT?

Parallelism
• What is it?
  – doing many things at the same time instead of sequentially (one-after-the-other).

Flavors of Parallelism
Data Parallelism
  – the same computation performed on a collection of independent items
  – e.g., adding two vectors of numbers
Task Parallelism
  – different computations/programs running at the same time
  – e.g., running a web server and a database
Pipeline Parallelism
  – assembly line: a sequential stage maps f over all items, feeding a sequential stage that maps g over all items

Parallelism vs. Concurrency
Parallelism: performs many tasks simultaneously
• purpose: improve throughput
• mechanism:
  – many independent computing devices
  – decrease the running time of a program by utilizing multiple cores or computers
• e.g.: running your web crawler on a cluster versus one machine.
Concurrency: mediates multi-party access to shared resources
• purpose: decrease response time
• mechanism:
  – switch between different threads of control
  – work on one thread when it can make useful progress; when it can't, suspend it and work on another thread
• e.g.: running your clock, editor, and chat at the same time on a single CPU.
  – the OS gives each of these programs a small time-slice (~10 msec)
  – this often slows throughput due to the cost of switching contexts
• e.g.: don't block while waiting for an I/O device to respond; let another thread do useful CPU computation

Parallelism vs. Concurrency
Parallelism: performs several independent tasks simultaneously
Concurrency: mediates/multiplexes access to a shared resource
[figure: many jobs fanning out across many CPUs (parallelism) versus many jobs funneling into one resource, such as a CPU, disk, server, or data structure (concurrency)]
many efficient programs use some parallelism and some concurrency

UNDERSTANDING TECHNOLOGY TRENDS

Moore's Law
• Moore's Law: the number of transistors you can put on a computer chip doubles (approximately) every couple of years.
• Consequence for most of the history of computing: all programs doubled in speed every couple of years.
  – Why? Hardware designers are wicked smart.
  – They were able to use those extra transistors to (for example) double the number of instructions executed per time unit, thereby doubling the processing speed of programs.
• Consequence for application writers:
  – watch TV for a while and your programs optimize themselves!
  – perhaps more importantly: new applications once thought impossible became possible because of increased computational power

CPU Clock Speeds from 1993-2005
[figure: clock speeds climbing steadily ("next year's machine is twice as fast!") and then abruptly flattening ("Oops!")]

CPU Power 1993-2005
[figure: CPU power consumption rising in step with clock speed]
But power consumption is only part of the problem… cooling is the other!

The Heat Problem
[figure: a tiny 1993 Pentium heat sink next to a massive 2005 cooler]

Cray-4: 1994
• Up to 64 processors
• Running at 1 GHz
• 8 Megabytes of RAM
• Cost: roughly $10M
The CRAY 2, 3, and 4 CPU and memory boards were immersed in a bath of electrically inert cooling fluid: water cooled!

Power Dissipation
[figure: power to the chip peaking]
Darn! Intel engineers no longer optimize my programs while I watch TV!

But look: Moore's Law still holds, so far, for transistors-per-chip. What do we do with all those transistors?
1. Multicore!
2. System-on-chip with specialized coprocessors (such as a GPU)
Both of those are PARALLELISM.

Parallelism
Why is it particularly important (today)?
– Roughly every other year, a chip from Intel would:
  • halve the feature size (the size of transistors, wires, etc.)
  • double the number of transistors
  • double the clock speed
  • this drove the economic engine of the IT industry (and the US!)
– No longer able to double the clock or cut the voltage: a processor won't get any faster!
  • (so why should you buy a new laptop, desktop, etc.?)
  • power and heat are limitations on the clock
  • errors and variability (noise) are limitations on the voltage
  • but we can still pack a lot of transistors on a chip… (at least for another 10 to 15 years)

Multi-core h/w – common L2
[figure: two cores, each with its own ALUs and L1 cache, sharing an L2 cache connected to main memory]

Today… (actually 9 years ago!)
[photo: a contemporary multicore chip]

GPUs
• There's nothing like video gaming to drive progress in computation!
• GPUs can have hundreds or even thousands of cores.
• Three of the 5 most powerful supercomputers in the world take advantage of GPU acceleration.
• Scientists use GPUs for simulation and modelling
  – e.g.: protein folding and fluid dynamics
[photo: John Danskin, PhD Princeton 1994, Vice President for GPU architecture, Nvidia (what he does with his spare time… he built this car himself)]

So…
Instead of trying to make your CPU go faster, Intel's just going to pack more CPUs onto a chip.
– a few years ago: dual core (2 CPUs).
– a little more recently: 4, 6, 8 cores.
– Intel is testing 48-core chips with researchers now.
– Within 10 years, you'll have ~1024 Intel CPUs on a chip.
In fact, that's already happening with graphics chips (e.g., Nvidia).
– really good at simple data parallelism (many deep pipes)
– but the cores are much dumber than an Intel core.
– and right now, they chew up a lot of power.
– watch for GPUs to get "smarter" and more power efficient, while CPUs become more like GPUs.

STILL MORE PROCESSORS: THE DATA CENTER

Data Centers: Generation Z Supercomputers
[photo: a modern data center]

Data Centers: Lots of Connected Computers!
[photo: racks of connected machines]

Data Centers
• 10s or 100s of thousands of computers
• All connected together
• Motivated by new applications and scalable web services:
  – let's catalogue all N billion web pages in the world
  – let's allow anyone in the world to search for the page he or she needs
  – let's process that search in less than a second
• It's Amazing!
• It's Magic!

Data Centers: Lots of Connected Computers
[photo: computer containers for plug-and-play parallelism]

Sounds Great!
• So my old programs will run 2x, 4x, 48x, 256x, 1024x faster?
  – no way!
  – to upgrade from the Intel 386 to the 486, the app writer and compiler writer did not have to do anything (much)
    • the 486 interpreted the same sequential stream of instructions; it just did it faster
    • this is why we could watch TV while Intel engineers optimized our programs for us
  – to upgrade from the 486 to a dual core, we need to figure out how to split a single stream of instructions into two streams of instructions that collaborate to complete the same task.
    • without work & thought, our programs don't get any faster at all
    • it takes ingenuity to generate efficient parallel algorithms from sequential ones

What's the answer?
In part: Functional Programming!
[logos: Naiad, Pig, Dryad]

PARALLEL AND CONCURRENT PROGRAMMING

Multicore Hardware & Data Centers
[figure: the two-core diagram again, cores with private L1 caches sharing an L2 cache and main memory]

Speedup
• Speedup: the ratio of sequential program execution time to parallel execution time.
• If T(p) is the time it takes to run a computation on p processors:
    speedup(p) = T(1) / T(p)
• A parallel program has perfect speedup (aka linear speedup) if
    T(1) / T(p) = speedup(p) = p
• Bad news: not every program can be effectively parallelized.
  – in fact, very few programs will scale with perfect speedups.
  – we certainly can't achieve perfect speedups automatically.
  – limited by sequential portions, data transfer costs, ...
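That last bullet deserves to be made precise: it is Amdahl's law (not named on the slide). If a fraction s of a program's running time is inherently sequential and the rest parallelizes perfectly, then speedup(p) can be at most 1 / (s + (1 - s) / p). A minimal OCaml sketch of the bound; the function name is ours:

  (* Amdahl's law: the best possible speedup when a fraction s of
     the running time is inherently sequential and the remaining
     (1 - s) parallelizes perfectly across p processors. *)
  let amdahl_speedup (s : float) (p : int) : float =
    1.0 /. (s +. (1.0 -. s) /. float_of_int p)

  let () =
    (* Even with 1024 cores, a program that is 5% sequential can
       speed up by less than a factor of 20. *)
    Printf.printf "s = 0.05, p = 1024: speedup <= %.1f\n"
      (amdahl_speedup 0.05 1024)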
Most Troubling…
Most, but not all, parallel and concurrent programming models are far harder to work with than sequential ones:
• They introduce nondeterminism
  – the root of (almost all) evil
  – program parts suddenly have many different outcomes
    • they have different outcomes on different runs
    • debugging requires considering all of the possible outcomes
    • horrible heisenbugs are hard to track down
• They are nonmodular
  – module A implicitly influences the outcomes of module B
• They introduce new classes of errors
  – race conditions, deadlocks
• They introduce new performance/scalability problems
  – busy-waiting, sequentialization, contention, ...

Informal Error Rate Chart
[bar chart: the regularity with which you shoot yourself in the foot, rising from "heaven on earth" through manual memory management; null pointers, paucity of types, inheritance; kitchen sink + manual memory; up to unstructured parallel or concurrent programming]

Solid Parallel Programming Requires
1. Good sequential programming skills.
   – all the things we've been talking about: use modules, types, ...
2. Deep knowledge of the application.
3. Picking a correct-by-construction parallel programming model
   – whenever possible, a parallel model whose semantics coincides with the sequential semantics
     • whenever possible, reuse well-tested libraries that hide the parallelism
   – whenever possible, a model that cuts down non-determinism
   – whenever possible, a model with fewer possible concurrency bugs
   – if bugs can arise, know and use safe programming patterns
4. Careful engineering to ensure scaling.
   – unfortunately, there is sometimes a tradeoff:
     • reduced nondeterminism can lead to reduced resource utilization
   – synchronization and communication costs may need optimization

OUR FIRST PARALLEL PROGRAMMING MODEL: THREADS

Threads: A Warning
• Concurrent threads with locks: the classic shoot-yourself-in-the-foot concurrent programming model
  – all the classic error modes
• Why threads?
  – almost all programming languages have a threads library
    • OCaml in particular!
  – you need to know where the pitfalls are
  – threads are the assembly language of concurrent programming paradigms
    • we'll use threads to build several higher-level programming models

Threads
• Threads: an abstraction of a processor.
  – the programmer (or compiler) decides that some work can be done in parallel with some other work, e.g.:

      let _ = compute_big_thing () in
      let y = compute_other_big_thing () in
      ...

  – we fork a thread to run the first computation in parallel, e.g.:

      let t = Thread.create compute_big_thing () in
      let y = compute_other_big_thing () in
      ...

Intuition in Pictures

  let t = Thread.create f () in
  let y = g () in
  ...

[figure: a timeline. At time 1, processor 1 runs Thread.create while processor 2 does nothing; from time 2 on, processor 1 executes g () while processor 2 executes f ()]

Of Course…
Suppose you have 2 available cores and you fork 4 threads. In a typical multi-threaded system,
– the operating system provides the illusion that there are an infinite number of processors.
  • not really: each thread consumes space, so if you fork too many threads the process will die.
– it time-multiplexes the threads across the available processors.
  • about every 10 msec, it stops the current thread on a processor and switches to another thread.
  • so a thread is really a virtual processor.

OCaml, Concurrency and Parallelism
Unfortunately, even if your computer has 2, 4, 6, or 8 cores, OCaml cannot exploit them. It multiplexes all threads over a single core.
[figure: many threads funneling into one core]
Hence, OCaml provides concurrency, but not parallelism. Why? Because OCaml (like Python) has no parallel "runtime system" or garbage collector. Other functional languages (Haskell, F#, ...) do.
Fortunately, when thinking about program correctness, it doesn't matter that OCaml is not parallel -- I will often pretend that it is.
You can hide I/O latency, do multiprocess programming, or distribute tasks amongst multiple computers in OCaml.
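Here is a minimal runnable sketch of that kind of concurrency, using the standard Thread module (Thread.create, Thread.yield, Thread.join). The worker function and its output are our own demo; on the single-core runtime described above, the two threads simply take turns on one core. (Build with, e.g., ocamlfind ocamlopt -thread -package threads.posix -linkpkg.)

  (* Two threads "running at once": the scheduler interleaves them,
     so the order of the printed lines can vary from run to run. *)
  let worker (name : string) () =
    for i = 1 to 3 do
      Printf.printf "%s: step %d\n%!" name i;
      Thread.yield ()   (* offer the core to the other thread *)
    done

  let () =
    let t1 = Thread.create (worker "a") () in
    let t2 = Thread.create (worker "b") () in
    Thread.join t1;
    Thread.join t2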
Coordination

  Thread.create : ('a -> 'b) -> 'a -> Thread.t

  let t = Thread.create f () in
  let y = g () in
  ...

How do we get back the result that t is computing?

First Attempt

  let r = ref None
  let t = Thread.create (fun _ -> r := Some (f ())) () in
  let y = g () in
  match !r with
  | Some v -> (* compute with v and y *)
  | None -> ???

What's wrong with this?

Second Attempt

  let r = ref None
  let t = Thread.create (fun _ -> r := Some (f ())) () in
  let y = g () in
  let rec wait () =
    match !r with
    | Some v -> v
    | None -> wait ()
  in
  let v = wait () in
  (* compute with v and y *)

Two Problems
First, we are busy-waiting.
• consuming CPU without doing anything useful.
• the processor could instead be running a useful thread/program or powering down.
Second, an operation like r := Some v may not be atomic.
• r := Some v requires us to copy the bytes of Some v into the ref r.
• we might see part of the bytes (corresponding to Some) before we've written the other parts (e.g., v).
• so the waiter might see the wrong value.

Atomicity
Consider the following:

  let inc (r : int ref) = r := !r + 1

and suppose two threads are incrementing the same ref r:

  Thread 1:        Thread 2:
  inc r; !r        inc r; !r

If r initially holds 0, then what will Thread 1 see when it reads r?

The problem is that we can't see exactly what instructions the compiler might produce to execute the code. For each thread, it might look like this:

  EAX := load(r);
  EAX := EAX + 1;
  store EAX into r
  EAX := load(r)

But a clever compiler might optimize this, for example by deleting the final load and reusing the value already in EAX. Furthermore, we don't know when the OS might interrupt one thread and run the other. (The situation is similar, but not quite the same, on multiprocessor systems.)

The Happens-Before Relation
We don't know exactly when each instruction will execute, but there are some constraints: the happens-before relation.
Rule 1: Given two expressions (or instructions) in sequence, e1; e2, we know that e1 happens before e2.
Rule 2: Given a program

  let t = Thread.create f x in
  ...
  Thread.join t;
  e

we know that (f x) happens before e.

Atomicity
Now consider the possible interleavings of the two threads' instruction sequences. If Thread 1 runs to completion before Thread 2 starts, Thread 1 reads 1. If both threads load r before either stores to it, both store 1 (one increment is lost) and Thread 1 again reads 1. If Thread 2's increment completes between Thread 1's store and its final load, Thread 1 reads 2.
Moral: The system is responsible for scheduling the execution of instructions.
Moral: This can lead to an enormous degree of nondeterminism.
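That nondeterminism is easy to observe. Below is a sketch of the racy counter in runnable form, using the same Thread module as before; the Thread.yield in the middle of the read-modify-write is our addition, there only to widen the race window so that lost updates show up reliably even under a single-core scheduler:

  (* A deliberately racy increment: another thread may run between
     the read of r and the write back, stomping on its update. *)
  let racy_inc (r : int ref) =
    let v = !r in
    Thread.yield ();
    r := v + 1

  let () =
    let r = ref 0 in
    let ts =
      List.init 4 (fun _ ->
        Thread.create (fun () -> for _i = 1 to 1000 do racy_inc r done) ())
    in
    List.iter Thread.join ts;
    (* 4 threads x 1000 increments "should" print 4000, but lost
       updates typically leave the count far short of that. *)
    Printf.printf "final count = %d (expected 4000)\n" !r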
Atomicity
In fact, today's multicore processors don't treat memory in a sequentially consistent fashion. That means we can't even assume that what we see corresponds to some interleaving of the threads' instructions!
[figure: four cores, each with its own ALU and L1 cache, over a shared L2 cache. When Core 1 stores to "memory," the store lazily propagates to Core 2's L1 cache; a load at Core 2 might not see it unless there is explicit synchronization.]
Beyond the scope of this class! But the take-away is this: it's not a good idea to use ordinary loads/stores to synchronize threads; you should use explicit synchronization primitives so the hardware and optimizing compiler don't optimize them away.

Summary: Interleaving & Race Conditions
Calculate the possible outcomes for a program by considering all of the possible interleavings of the atomic actions performed by each thread.
– Subject to the happens-before relation.
  • can't have a child thread's actions happen before its parent forks it.
  • can't have later instructions execute earlier in the same thread.
– Here, atomic means indivisible actions.
  • For example, on most machines reading or writing a 32-bit word is atomic.
  • But writing a multi-word object is usually not atomic.
  • Most operations like "b := b - w" are implemented as a series of simpler operations such as

      r1 = read(b); r2 = read(w); r3 = r1 - r2; write(b, r3)

Reasoning about all interleavings is hard; it is just about impossible for people.
– The number of interleavings grows exponentially with the number of statements.
– It's hard for us to tell what is and isn't atomic in a high-level language.
– YOU ARE DOOMED TO FAIL IF YOU HAVE TO WORRY ABOUT THIS STUFF!
WARNING: if you see people talk about interleavings, BEWARE! It probably means they're assuming "sequential consistency," which is an oversimplified, naïve model of what the parallel computer really does. It's actually more complicated than that.

A conventional solution for shared-memory parallelism

  let inc (r : int ref) = r := !r + 1

  Thread 1:           Thread 2:
  lock(mutex);        lock(mutex);
  inc r;              inc r;
  !r                  !r
  unlock(mutex);      unlock(mutex);

The lock/unlock pairs are synchronization: they guarantee mutual exclusion of these critical sections.
This solution works (even for real machines that are not sequentially consistent), but…
It is complex to program, subject to deadlock, prone to bugs, not fault-tolerant, and hard to reason about.
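Rendered in OCaml with the standard Mutex module, the conventional solution looks roughly like this (the demo harness around it is ours):

  let m = Mutex.create ()

  (* The lock/unlock pair makes the read-modify-write indivisible
     with respect to every other thread that takes the same mutex. *)
  let safe_inc (r : int ref) =
    Mutex.lock m;
    r := !r + 1;
    Mutex.unlock m

  let () =
    let r = ref 0 in
    let ts =
      List.init 4 (fun _ ->
        Thread.create (fun () -> for _i = 1 to 1000 do safe_inc r done) ())
    in
    List.iter Thread.join ts;
    Printf.printf "final count = %d\n" !r   (* reliably 4000 *)

Note one of the bug modes the slide warns about: if the critical section raised an exception between the lock and the unlock, the mutex would be left held forever and every later Mutex.lock would deadlock.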
Another approach to the coordination problem

  Thread.create : ('a -> 'b) -> 'a -> Thread.t

  let t = Thread.create f () in
  let y = g () in
  ...

How do we get back the result that t is computing?

One Solution (using join)

  let r = ref None
  let t = Thread.create (fun _ -> r := Some (f ())) () in
  let y = g () in
  Thread.join t;
  match !r with
  | Some v -> (* compute with v and y *)
  | None -> failwith "impossible"

Thread.join t causes the current thread to wait until the thread t terminates. The join is synchronization: after the join, we know that all of the operations of t have completed, so the match cannot see None.

In Pictures

  Thread 1:          Thread 2:
  t = create f x
  inst1,1;           inst2,1;
  inst1,2;           inst2,2;
  inst1,3;           inst2,3;
  ...                ...
  inst1,n;           inst2,m;
  join t

• Within each thread, earlier instructions must happen before later ones: for instance, inst1,1 must happen before inst1,2.
• The fork must happen before the first instruction of the second thread.
• Thanks to the join, all of the instructions of the second thread must be completed before the join finishes.
• However, in general, we do not know whether inst1,i executes before or after inst2,j.

In general, synchronization instructions like fork and join reduce the number of possible interleavings. Synchronization cuts down nondeterminism. In the absence of synchronization we don't know anything…
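The create/join/ref pattern above is exactly the "future" abstraction promised at the start of the deck. Here is a minimal sketch of how one might package it up; the type and function names are ours, not a standard library interface:

  type 'a future = { thread : Thread.t; cell : 'a option ref }

  (* Fork (f x) in its own thread and return a ticket for the result. *)
  let future (f : 'a -> 'b) (x : 'a) : 'b future =
    let cell = ref None in
    let thread = Thread.create (fun () -> cell := Some (f x)) () in
    { thread; cell }

  (* Block until the result is ready. The join happens before the
     read of the cell, so the match can never see None. *)
  let force (fut : 'a future) : 'a =
    Thread.join fut.thread;
    match !(fut.cell) with
    | Some v -> v
    | None -> failwith "impossible"

  let () =
    let ft = future (fun n -> n * n) 17 in   (* runs concurrently *)
    let y = 25 in                            (* other useful work  *)
    Printf.printf "%d\n" (force ft + y)      (* prints 314 *)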