Introduction to Scientific Computing

Introduction to Scientific Computing
Doug Sondak
[email protected]
Boston University
Scientific Computing and Visualization
Outline
• Introduction
• Software
• Parallelization
• Hardware
Introduction
• What is Scientific Computing?
– Need for speed
– Need for memory
• Simulations tend to grow until they
overwhelm available resources
– If I can simulate 1000 neurons, wouldn’t it be
cool if I could do 2000? 10000? 10^87?
• Example – flow over an airplane
– It has been estimated that if a teraflop machine
were available, it would take about 200,000 years
to solve (resolving all scales).
• If Homo Erectus had a teraflop machine, we could be
getting the result right about now.
Introduction (cont’d)
• Optimization
– Profile serial (1-processor) code
• Tells where most time is consumed
– Is there any “low-hanging fruit”?
• Faster algorithm
• Optimized library
• Wasted operations (a sketch follows below)
• Parallelization
– Break problem up into chunks
– Solve chunks simultaneously on different
processors
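As an illustration of the “wasted operations” point above, here is a minimal C sketch (the function and variable names are invented for this example) of hoisting a loop-invariant computation out of a loop:

#include <math.h>

/* Wasteful version would recompute sqrt(a*a + b*b) on every
   iteration; computing it once removes the wasted operations.
   (Many compilers do this automatically at higher -O levels.)  */
void scale(double *x, int n, double a, double b)
{
    double r = sqrt(a*a + b*b);   /* loop-invariant: compute once */
    for (int i = 0; i < n; i++)
        x[i] /= r;                /* reuse the precomputed value  */
}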
Software
Compiler
• The compiler is your friend (usually)
• Optimizers are quite refined
– Always try highest level
• Usually –O3
• Sometimes –fast, -O5, …
• Loads of flags, many for optimization
• Good news – many compilers will
automatically parallelize for shared-memory systems
• Bad news – this usually doesn’t work well
Software
• Libraries
– Solver is often a major consumer of CPU
time
– Numerical Recipes is a good book, but many
algorithms are not optimal
– LAPACK is a good resource (a call sketch follows below)
– Libraries are often available that have been
optimized for the local architecture
• Disadvantage – not portable
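As a hedged illustration of using an optimized library from C, here is a small program that solves a 2×2 linear system with LAPACK’s dgesv. It assumes a Fortran LAPACK is installed and linked (e.g. with -llapack) and uses the usual Fortran calling convention (column-major storage, trailing underscore); it is not part of the original slides.

#include <stdio.h>

/* Fortran LAPACK routine: solves A x = b, overwriting b with x */
extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                   int *ipiv, double *b, int *ldb, int *info);

int main(void)
{
    /* 2x2 system [3 1; 1 2] x = [9; 8], stored column-major */
    double a[4] = { 3.0, 1.0, 1.0, 2.0 };
    double b[2] = { 9.0, 8.0 };
    int n = 2, nrhs = 1, lda = 2, ldb = 2, ipiv[2], info;

    dgesv_(&n, &nrhs, a, &lda, ipiv, b, &ldb, &info);

    if (info == 0)
        printf("x = %f %f\n", b[0], b[1]);   /* expect 2 and 3 */
    return 0;
}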
Parallelization
Parallelization
• Divide and conquer!
– divide operations among many processors
– perform operations simultaneously
– if serial run takes 10 hours and we hit the
problem with 5000 processors, it should take
about 7 seconds to complete, right?
• not so easy, of course
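For reference, the ideal-scaling arithmetic behind that estimate: 10 hours = 36,000 seconds, and 36,000 s ÷ 5000 processors ≈ 7.2 s, i.e., about 7 seconds assuming perfect speedup.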
Parallelization (cont’d)
• problem – some calculations depend upon
previous calculations
– can’t be performed simultaneously
– sometimes tied to the physics of the
problem, e.g., time evolution of a system
• want to maximize amount of parallel code
– occasionally easy
– usually requires some work
Parallelization (3)
• method used for parallelization may depend
on hardware
• distributed memory
– each processor has own address space
– if one processor needs data from another
processor, must be explicitly passed
• shared memory
– common address space
– no message passing required
Parallelization (4)
• Distributed memory: proc 0, proc 1, proc 2, proc 3, each attached to its own mem 0, mem 1, mem 2, mem 3
• Shared memory: proc 0 – proc 3 all attached to a single mem
• Mixed memory: proc 0 and proc 1 share mem 0; proc 2 and proc 3 share mem 1
Parallelization (5)
• MPI
– for both distributed and shared memory
– portable
– freely downloadable
• OpenMP
– shared memory only
– must be supported by compiler (most do)
– usually easier than MPI
– can be implemented incrementally
MPI
• Computational domain is typically decomposed
into regions
– One region assigned to each processor
• Separate copy of program runs on each
processor
MPI
• Discretized domain to solve flow over airfoil
• System of coupled PDEs solved at each point
MPI
• Decomposed domain for 4 processors
MPI
• Since points depend on adjacent points, must transfer
information after each iteration
• This is done with explicit calls in the source code (a sketch follows below)
i i 1  i 1

x
2x
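The exact calls depend on the program, but a minimal MPI sketch of a 1-D ghost-point exchange between neighboring processors (the array name, local size, and message tags are invented for this example) might look like:

#include <mpi.h>
#include <stdio.h>

#define NLOC 100   /* local points per processor (made-up size) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    double u[NLOC + 2];          /* u[0] and u[NLOC+1] are ghost points */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < NLOC + 2; i++)   /* placeholder local data */
        u[i] = (double) rank;

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    /* ... update interior points u[1..NLOC] here ... */

    /* send last interior point right, receive left ghost */
    MPI_Sendrecv(&u[NLOC], 1, MPI_DOUBLE, right, 0,
                 &u[0],    1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send first interior point left, receive right ghost */
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  1,
                 &u[NLOC + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: ghosts = %g %g\n", rank, u[0], u[NLOC + 1]);
    MPI_Finalize();
    return 0;
}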
MPI
• Diminishing returns
– Sending messages can get expensive
– Want to maximize ratio of computation to
communication
• Parallel speedup, parallel efficiency
Speedup:             S = T_1 / T_n
Parallel efficiency: T_1 / (n T_n) = S / n
(T = time, n = number of processors)
OpenMP
• Usually loop-level parallelization
for(i=0; i<N; i++){
    do lots of stuff
}
• An OpenMP directive is placed in the source
code before the loop
– Assigns a subset of loop indices to each processor (see the sketch below)
– No message passing since each processor can “see”
the whole domain
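A minimal sketch of such a loop-level directive in C (the array name and loop body are invented for this example; compile with the compiler’s OpenMP flag, e.g. -fopenmp for gcc):

#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N];

    #pragma omp parallel for      /* split loop iterations among threads */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;           /* stand-in for "lots of stuff"        */

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}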
OpenMP
• Can’t guarantee order of operations
for(i = 0; i < 7; i++)
    a[i] = 1;
for(i = 1; i < 7; i++)
    a[i] = 2*a[i-1];
Example of how to do it wrong!
Parallelize this loop on 2 processors (a corrected sketch follows the table below)
 i   a[i] (serial)   a[i] (parallel)
 0        1                1
 1        2                2        Proc. 0
 2        4                4        Proc. 0
 3        8                8        Proc. 0
 4       16                2        Proc. 1
 5       32                4        Proc. 1
 6       64                8        Proc. 1
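One way to handle this particular case (a sketch, not from the original slides): parallelize only the first loop, whose iterations are independent, and leave the recurrence serial:

#include <stdio.h>

int main(void)
{
    int a[7];

    /* iterations are independent, so they can safely be
       split among threads                                */
    #pragma omp parallel for
    for (int i = 0; i < 7; i++)
        a[i] = 1;

    /* loop-carried dependence: a[i] needs a[i-1], so this
       loop is left serial                                 */
    for (int i = 1; i < 7; i++)
        a[i] = 2 * a[i - 1];

    for (int i = 0; i < 7; i++)
        printf("a[%d] = %d\n", i, a[i]);   /* 1 2 4 8 16 32 64 */
    return 0;
}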
Hardware
Hardware
• A faster processor is obviously good, but:
– Memory access speed is often a big driver
• Cache – a critical element of memory
system
• Processors have internal parallelism such
as pipelines and multiply-add instructions
Cache
• Cache is a small chunk of fast memory
between the main memory and the
registers
memory hierarchy (fastest to slowest): registers ↔ primary cache ↔ secondary cache ↔ main memory
Cache (cont’d)
• Variables are moved from main memory
to cache in lines
– L1 cache line sizes on our machines
• Opteron (blade cluster): 64 bytes
• Power4 (p-series): 128 bytes
• PPC440 (Blue Gene): 32 bytes
• Pentium III (Linux cluster): 32 bytes
• If variables are used repeatedly, code
will run faster since cache memory is
much faster than main memory
Cache (cont’d)
• Why not just make the main memory out
of the same stuff as cache?
– Expensive
– Runs hot
– This was actually done in Cray computers
• Liquid cooling system
Cache (cont’d)
• Cache hit
– Required variable is in cache
• Cache miss
– Required variable not in cache
– If cache is full, something else must be
thrown out (sent back to main memory) to
make room
– Want to minimize number of cache misses
Cache example
“mini” cache
holds 2 lines, 4 words each
main memory: x[0] x[1] x[2] … x[9]  a  b  …

for(i = 0; i < 10; i++)
    x[i] = i;
Cache example (cont’d)
• We will ignore i for simplicity
• Need x[0], not in cache → cache miss
• Load line from memory into cache
• Next 3 loop indices result in cache hits
cache (2 lines): x[0] x[1] x[2] x[3] | (empty)
main memory: x[0] x[1] x[2] … x[9]  a  b  …

for(i = 0; i < 10; i++)
    x[i] = i;
Cache example (cont’d)
cache (2 lines): x[0] x[1] x[2] x[3] | x[4] x[5] x[6] x[7]
main memory: x[0] x[1] x[2] … x[9]  a  b  …

for(i = 0; i < 10; i++)
    x[i] = i;
• Need x[4], not in cache → cache miss
• Load line from memory into cache
• Next 3 loop indices result in cache hits
Cache example (cont’d)
cache (2 lines): x[8] x[9] a b | x[4] x[5] x[6] x[7]
main memory: x[0] x[1] x[2] … x[9]  a  b  …

for(i = 0; i < 10; i++)
    x[i] = i;
• Need x[8], not in cache → cache miss
• Load line from memory into cache
• No room in cache! → replace old line
Cache (cont’d)
• Contiguous access is important
• In C, multidimensional array is stored in
memory as
a[0][0]
a[0][1]
a[0][2]
…
Cache (cont’d)
• In Fortran and Matlab, multidimensional
array is stored the opposite way:
a(1,1)
a(2,1)
a(3,1)
…
Cache (cont’d)
• Rule: order nested loops so that the innermost loop
runs over the contiguous index (a timing sketch follows
the examples below)
for(i=0; i<N; i++){
    for(j=0; j<N; j++){
        a[i][j] = 1.0;
    }
}
C

do j = 1, n
    do i = 1, n
        a(i,j) = 1.0
    enddo
enddo
Fortran
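As a rough way to see the effect in C (the array size and the use of clock() are choices made for this sketch, not part of the slides), the two loop orders can be timed directly:

#include <stdio.h>
#include <time.h>

#define N 2000
static double a[N][N];

int main(void)
{
    clock_t t0, t1;

    /* good order for C: innermost loop runs over the contiguous index j */
    t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;
    t1 = clock();
    printf("i-outer, j-inner: %f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    /* bad order for C: strides through memory, causing many cache misses */
    t0 = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = 1.0;
    t1 = clock();
    printf("j-outer, i-inner: %f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return 0;
}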
SCF Machines
p-series
• Shared memory
• IBM Power4 processors
• 32 KB L1 cache per processor
• 1.41 MB L2 cache per pair of processors
• 128 MB L3 cache per 8 processors
p-series
machine     proc. speed   # procs   memory
kite        1.3 GHz       32        32 GB
pogo        1.3 GHz       32        32 GB
frisbee     1.3 GHz       32        32 GB
domino      1.3 GHz       16        16 GB
twister     1.1 GHz       8         16 GB
scrabble    1.1 GHz       8         16 GB
marbles     1.1 GHz       8         16 GB
crayon      1.1 GHz       8         16 GB
litebrite   1.1 GHz       8         16 GB
hotwheels   1.1 GHz       8         16 GB
Blue Gene
• Distributed memory
• 2048 processors
– 1024 2-processor nodes
• IBM PowerPC 440 processors
– 700 MHz
• 512 MB memory per node (per 2
processors)
• 32 KB L1 cache per node
• 2 MB L2 cache per node
• 4 MB L3 cache per node
BladeCenter
• Hybrid memory
• 56 processors
– 14 4-processor nodes
• AMD Opteron processors
– 2.6 GHz
• 8 GB memory per node (per 4 processors)
– Each node has shared memory
• 64 KB L1 cache per 2 processors
• 1 MB L2 cache per 2 processors
Linux Cluster
• Hybrid memory
• 104 processors
– 52 2-processor nodes
• Intel Pentium III processors
– 1.3 GHz
• 1 GB memory per node (per 2 processors)
– Each node has shared memory
• 16 KB L1 cache per 2 processors
• 512 KB L2 cache per 2 processors
For More Information
• SCV web site
http://scv.bu.edu/
• Today’s presentations are available at
http://scv.bu.edu/documentation/presentations/
under the title “Introduction to Scientific Computing
and Visualization”
Next Time
• G & T code
• Time it
– Look at effect of compiler flags
• Profile it
– Where is time consumed?
• Modify it to improve serial performance