Introduction to Scientific Computing
Doug Sondak
[email protected]
Boston University, Scientific Computing and Visualization

Outline
• Introduction
• Software
• Parallelization
• Hardware

Introduction
• What is Scientific Computing?
  – Need for speed
  – Need for memory
• Simulations tend to grow until they overwhelm available resources
  – If I can simulate 1000 neurons, wouldn't it be cool if I could do 2000? 10000? 10^87?
• Example – flow over an airplane
  – It has been estimated that, if a teraflop machine were available, it would take about 200,000 years to solve (resolving all scales)
• If Homo erectus had had a teraflop machine, we could be getting the result right about now.

Introduction (cont'd)
• Optimization
  – Profile the serial (1-processor) code
    • Tells you where most of the time is consumed
  – Is there any low-hanging fruit?
    • Faster algorithm
    • Optimized library
    • Wasted operations
• Parallelization
  – Break the problem up into chunks
  – Solve the chunks simultaneously on different processors

Software

Compiler
• The compiler is your friend (usually)
• Optimizers are quite refined
  – Always try the highest optimization level
    • Usually -O3
    • Sometimes -fast, -O5, …
• Loads of flags, many of them for optimization
• Good news – many compilers will automatically parallelize for shared-memory systems
• Bad news – this usually doesn't work well

Software
• Libraries
  – The solver is often a major consumer of CPU time
  – Numerical Recipes is a good book, but many of its algorithms are not optimal
  – LAPACK is a good resource
  – Libraries are often available that have been optimized for the local architecture
    • Disadvantage – not portable

Parallelization

Parallelization
• Divide and conquer!
  – divide the operations among many processors
  – perform the operations simultaneously
  – if a serial run takes 10 hours and we hit the problem with 5000 processors, it should take about 7 seconds to complete, right?
    • not so easy, of course

Parallelization (cont'd)
• problem – some calculations depend upon previous calculations
  – can't be performed simultaneously
  – sometimes tied to the physics of the problem, e.g., time evolution of a system
• want to maximize the amount of parallel code
  – occasionally easy
  – usually requires some work

Parallelization (3)
• method used for parallelization may depend on the hardware
• distributed memory
  – each processor has its own address space
  – if one processor needs data from another processor, the data must be explicitly passed
• shared memory
  – common address space
  – no message passing required

Parallelization (4)
[Diagrams: distributed memory – processors 0–3 each with their own memory (mem 0–3); shared memory – processors 0–3 sharing a single memory; mixed memory – proc 0 and 1 sharing mem 0, proc 2 and 3 sharing mem 1]

Parallelization (5)
• MPI
  – for both distributed and shared memory
  – portable
  – freely downloadable
• OpenMP
  – shared memory only
  – must be supported by the compiler (most compilers do)
  – usually easier than MPI
  – can be implemented incrementally

MPI
• Computational domain is typically decomposed into regions
  – One region assigned to each processor
• A separate copy of the program runs on each processor

MPI
• Discretized domain to solve flow over an airfoil
• A system of coupled PDEs is solved at each point

MPI
• Decomposed domain for 4 processors

MPI
• Since points depend on adjacent points, information must be transferred after each iteration
• This is done with explicit calls in the source code
  – e.g., a central difference couples neighboring points: du/dx at point i ≈ (u[i+1] − u[i−1]) / (2Δx)
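As an illustration of the kind of explicit calls referred to above, here is a minimal sketch (not taken from the original slides) of a one-dimensional "halo" exchange: each process owns a chunk of an array plus two ghost points, and swaps one boundary value with each neighbor after an iteration. The array name x, the chunk size NLOCAL, and the use of MPI_Sendrecv are illustrative choices, not part of the course code.

  /* compile with an MPI wrapper, e.g.  mpicc halo.c -o halo */
  #include <mpi.h>

  #define NLOCAL 100                     /* interior points owned by each process */

  int main(int argc, char **argv)
  {
      double x[NLOCAL + 2];              /* two extra ghost points: x[0], x[NLOCAL+1] */
      int rank, nprocs, left, right, i;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      for (i = 1; i <= NLOCAL; i++)      /* fill the interior with some data */
          x[i] = (double)(rank * NLOCAL + i);

      /* neighbors; MPI_PROC_NULL turns the exchange into a no-op at the ends */
      left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
      right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

      /* send my last interior point right, receive the left neighbor's into x[0] */
      MPI_Sendrecv(&x[NLOCAL], 1, MPI_DOUBLE, right, 0,
                   &x[0],      1, MPI_DOUBLE, left,  0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      /* send my first interior point left, receive the right neighbor's into x[NLOCAL+1] */
      MPI_Sendrecv(&x[1],          1, MPI_DOUBLE, left,  1,
                   &x[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      MPI_Finalize();
      return 0;
  }

In a real solver this exchange would sit inside the iteration loop, so each process always computes with up-to-date values from its neighbors.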
MPI
• Diminishing returns
  – Sending messages can get expensive
  – Want to maximize the ratio of computation to communication
• Parallel speedup and parallel efficiency:
  – Speedup: S = T1 / Tn
  – Parallel efficiency: E = S / n = T1 / (n Tn)
  – where T1 = time on 1 processor, Tn = time on n processors, n = number of processors

OpenMP
• Usually loop-level parallelization

  for(i=0; i<N; i++){
    do lots of stuff
  }

• An OpenMP directive is placed in the source code before the loop (see the sketch below)
  – Assigns a subset of the loop indices to each processor
  – No message passing, since each processor can "see" the whole domain
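For concreteness, here is a minimal sketch of what such a directive looks like. The array a, the loop body, and the gcc -fopenmp compile line are illustrative assumptions, not taken from the original slides.

  /* compile with OpenMP enabled, e.g.  gcc -fopenmp loop.c -o loop */
  #include <omp.h>
  #include <stdio.h>

  #define N 1000000

  int main(void)
  {
      static double a[N];                /* static so the large array is not on the stack */
      int i;

      /* the directive splits the loop iterations among the available threads */
      #pragma omp parallel for
      for (i = 0; i < N; i++)
          a[i] = 2.0 * (double)i;        /* iterations are independent, so this is safe */

      printf("a[%d] = %f (ran with up to %d threads)\n",
             N - 1, a[N - 1], omp_get_max_threads());
      return 0;
  }

Because each iteration touches only its own element a[i], the parallel result matches the serial one; the next slide shows what goes wrong when iterations are not independent.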
OpenMP
• Can't guarantee the order of operations

  for(i = 0; i < 7; i++) a[i] = 1;
  for(i = 1; i < 7; i++) a[i] = 2*a[i-1];

• Example of how to do it wrong! Parallelize the second loop on 2 processors:

   i   a[i] (serial)   a[i] (parallel)
   0        1               1
   1        2               2    (Proc. 0)
   2        4               4    (Proc. 0)
   3        8               8    (Proc. 0)
   4       16               2    (Proc. 1)
   5       32               4    (Proc. 1)
   6       64               8    (Proc. 1)

• Proc. 1 reads a[3] before Proc. 0 has updated it, so the parallel results differ from the serial ones

Hardware

Hardware
• A faster processor is obviously good, but:
  – Memory access speed is often a big driver
• Cache – a critical element of the memory system
• Processors have internal parallelism such as pipelines and multiply-add instructions

Cache
• Cache is a small chunk of fast memory between the main memory and the registers
  [Diagram of the memory hierarchy: registers – primary cache – secondary cache – main memory]

Cache (cont'd)
• Variables are moved from main memory to cache in lines
  – L1 cache line sizes on our machines:
    • Opteron (blade cluster): 64 bytes
    • Power4 (p-series): 128 bytes
    • PPC440 (Blue Gene): 32 bytes
    • Pentium III (Linux cluster): 32 bytes
• If variables are used repeatedly, the code will run faster, since cache memory is much faster than main memory

Cache (cont'd)
• Why not just make the main memory out of the same stuff as cache?
  – Expensive
  – Runs hot
  – This was actually done in Cray computers
    • Liquid cooling system

Cache (cont'd)
• Cache hit
  – Required variable is in cache
• Cache miss
  – Required variable is not in cache
  – If the cache is full, something else must be thrown out (sent back to main memory) to make room
  – Want to minimize the number of cache misses

Cache example
• A "mini" cache that holds 2 lines, 4 words each
• The loop being executed:

  for(i=0; i<10; i++) x[i] = i;

  [Diagram: main memory holds x[0]–x[9], a, b, …; the cache is initially empty]

Cache example (cont'd)
• We will ignore i for simplicity
• Need x[0], not in cache → cache miss
• Load the line from memory into cache
• The next 3 loop indices result in cache hits
  [Diagram: the cache now holds the line x[0] x[1] x[2] x[3]]

Cache example (cont'd)
• Need x[4], not in cache → cache miss
• Load the line from memory into cache
• The next 3 loop indices result in cache hits
  [Diagram: the cache now holds the lines x[0]–x[3] and x[4]–x[7]]

Cache example (cont'd)
• Need x[8], not in cache → cache miss
• Load the line from memory into cache
• No room in cache! Replace the old line
  [Diagram: the line x[0]–x[3] is evicted and replaced by the line x[8] x[9] a b]

Cache (cont'd)
• Contiguous access is important
• In C, a multidimensional array is stored in memory as
  a[0][0] a[0][1] a[0][2] …

Cache (cont'd)
• In Fortran and Matlab, a multidimensional array is stored the opposite way:
  a(1,1) a(2,1) a(3,1) …

Cache (cont'd)
• Rule: always order your loops appropriately

  C:
  for(i=0; i<N; i++){
    for(j=0; j<N; j++){
      a[i][j] = 1.0;
    }
  }

  Fortran:
  do j = 1, n
    do i = 1, n
      a(i,j) = 1.0
    enddo
  enddo

SCF Machines

p-series
• Shared memory
• IBM Power4 processors
• 32 KB L1 cache per processor
• 1.41 MB L2 cache per pair of processors
• 128 MB L3 cache per 8 processors

  machine     processor speed   # processors   memory
  kite        1.3 GHz           32             32 GB
  pogo        1.3 GHz           32             32 GB
  frisbee     1.3 GHz           32             32 GB
  domino      1.3 GHz           16             16 GB
  twister     1.1 GHz           8              16 GB
  scrabble    1.1 GHz           8              16 GB
  marbles     1.1 GHz           8              16 GB
  crayon      1.1 GHz           8              16 GB
  litebrite   1.1 GHz           8              16 GB
  hotwheels   1.1 GHz           8              16 GB

Blue Gene
• Distributed memory
• 2048 processors
  – 1024 2-processor nodes
• IBM PowerPC 440 processors
  – 700 MHz
• 512 MB memory per node (per 2 processors)
• 32 KB L1 cache per node
• 2 MB L2 cache per node
• 4 MB L3 cache per node

BladeCenter
• Hybrid memory
• 56 processors
  – 14 4-processor nodes
• AMD Opteron processors
  – 2.6 GHz
• 8 GB memory per node (per 4 processors)
  – Each node has shared memory
• 64 KB L1 cache per 2 processors
• 1 MB L2 cache per 2 processors

Linux Cluster
• Hybrid memory
• 104 processors
  – 52 2-processor nodes
• Intel Pentium III processors
  – 1.3 GHz
• 1 GB memory per node (per 2 processors)
  – Each node has shared memory
• 16 KB L1 cache per 2 processors
• 512 KB L2 cache per 2 processors

For More Information
• SCV web site: http://scv.bu.edu/
• Today's presentations are available at http://scv.bu.edu/documentation/presentations/ under the title "Introduction to Scientific Computing and Visualization"

Next Time
• G & T code
• Time it
  – Look at the effect of compiler flags
• Profile it
  – Where is the time consumed?
• Modify it to improve serial performance
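As a possible starting point for the timing exercise above, here is a minimal sketch of one way to time a section of C code. The dummy loop and the use of the POSIX gettimeofday call are illustrative assumptions, not a prescribed method.

  #include <stdio.h>
  #include <sys/time.h>

  int main(void)
  {
      struct timeval t0, t1;
      double sum = 0.0, elapsed;
      int i;

      gettimeofday(&t0, NULL);           /* wall-clock time before the work */
      for (i = 0; i < 100000000; i++)    /* stand-in for the section of interest */
          sum += 0.5 * (double)i;
      gettimeofday(&t1, NULL);           /* wall-clock time after the work */

      elapsed = (double)(t1.tv_sec - t0.tv_sec)
              + 1.0e-6 * (double)(t1.tv_usec - t0.tv_usec);
      printf("sum = %e   elapsed = %f seconds\n", sum, elapsed);
      return 0;
  }

Compiling the same source with different flags (e.g., -O0 versus -O3) and comparing the elapsed times gives a quick picture of what the optimizer buys you.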