Parallel Genetic Algorithm on the CUDA Architecture
Petr Pospı́chal, Jiřı́ Jaroš and Josef Schwarz
{ipospichal,jarosjir,schwarz}@fit.vutbr.cz
Brno University of Technology
Faculty of Information Technology, Department of Computer Systems
Overview

1. Motivation: genetic algorithms; problem; GPU vs. CPU
2. Proposed solution: GPGPU possibilities and disadvantages; CUDA hardware model; mapping the CUDA hardware model to the software model
3. Results: speedup; quality; limitations
4. Conclusion
Genetic algorithms 1/2

- stochastic optimization technique
- employs a population of candidate solutions
- black-box approach ⇒ minimal problem knowledge required
- robust, wide area of applications
- inherently parallel, but slow

[Flowchart: (1) initial random population; (2) evaluation; if the terminating conditions are met, the best individual is the solution and the run ends; otherwise (3) a new iteration builds the offspring population from the parent population via selection, optional crossover and optional mutation.]

⇒ This work is focused on acceleration
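The loop in the flowchart can be sketched in plain Python. This is illustrative only — the actual implementation runs as a CUDA kernel, and all function names and parameter values here are our own choices, not the paper's:

```python
import random

def evaluate(pop, fitness):
    return [fitness(ind) for ind in pop]

def genetic_algorithm(fitness, dim, pop_size=32, generations=100,
                      p_cross=0.9, p_mut=0.1, seed=0):
    rng = random.Random(seed)
    # 1) initial random population
    pop = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):            # 3) new iteration
        scores = evaluate(pop, fitness)     # 2) evaluation (minimization)
        offspring = []
        while len(offspring) < pop_size:
            # binary tournament selection of two parents
            a, b = rng.sample(range(pop_size), 2)
            p1 = pop[a] if scores[a] < scores[b] else pop[b]
            a, b = rng.sample(range(pop_size), 2)
            p2 = pop[a] if scores[a] < scores[b] else pop[b]
            child = list(p1)
            if rng.random() < p_cross and dim > 1:   # crossover?
                cut = rng.randrange(1, dim)
                child = p1[:cut] + p2[cut:]
            if rng.random() < p_mut:                 # mutation?
                i = rng.randrange(dim)
                child[i] += rng.gauss(0, 0.3)
            offspring.append(child)
        pop = offspring
    scores = evaluate(pop, fitness)
    return min(zip(scores, pop))            # best individual = solution

best_score, best = genetic_algorithm(lambda x: sum(v * v for v in x), dim=2)
print(best_score)
```

Run on the 2-D sphere function, the loop drives the best fitness close to zero within 100 generations.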
Genetic algorithms 2/2: Island Model with migrations

- several independent populations (islands)
- better convergence towards different suboptima
- new operator: migration – occasionally transfers good genetic material between islands

[Diagram: islands 1–7 arranged in a ring, with migrations between neighbouring islands.]
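The ring topology in the figure fixes each island's migration target. A minimal sketch (the helper name is ours):

```python
def ring_neighbor(island, n_islands):
    """Destination of migrants in a unidirectional ring topology."""
    return (island + 1) % n_islands

# with 7 islands (as in the figure), island 6 wraps around to island 0
routes = [ring_neighbor(i, 7) for i in range(7)]
print(routes)  # → [1, 2, 3, 4, 5, 6, 0]
```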
Motivation

Problem: genetic algorithms are effective in solving many practical problems, BUT their execution usually takes a long time.

Possible solutions:
- hardware accelerators (e.g. FPGA) – difficult
- multicore parallelization – small speedup
- grid computing – expensive

Simple, cheap, available to everyone? ⇒ graphics cards – ideal?
GPU compute capability history

[Timeline 1999–2010: from the 1999 Voodoo 3 with fixed pixel and vertex units (shader version 1.0, fewer than 10 executed instructions) through programmable shaders (Cg, shader versions 1.3–4.1, roughly 100 to 10000 instructions) to unified shader units with CUDA/OpenCL compute and C++ support (GeForce GTX 295, 2009; DirectX 11, shader version 5.0, no instruction limit).]

⇒ compute capability is sufficient (there are some minor limitations)
Graphics Processing Units (GPUs) vs. CPUs

[Two plots spanning 2004–2010, comparing ATi/AMD GPUs, Nvidia GPUs and Intel CPUs: theoretical performance in GFLOPs (GPUs approaching 3000) and memory bandwidth in GB/s (GPUs approaching 180).]

⇒ raw performance is huge (memory can be limiting)
Why (not) GPUs

Advantages:
- huge raw FP power (GTX 280: 240 cores ≈ 1 TFLOP)
- good performance/price and performance/Watt ratio
- hardware thread scheduler (little overhead during switching)
- fast on-chip memory (user-managed L1 cache)
- external adapter, scalability (multi-GPU solutions)

Disadvantages:
- SIMD hardware – bad branching
- limited data types, double precision is slow
- low performance per thread, requires massively parallel tasks
- PCI-Express bus bottleneck

⇒ Challenge
Compute Unified Device Architecture (CUDA)

- Nvidia framework for general-purpose computation on GPUs (GPGPU)
- works on GeForce 8 (the first unified-shader generation) and later, under both Windows and Linux
- good control over hardware, allows direct utilization of on-chip shared memory
- consists of a hardware and a software model
- best GPGPU results so far

⇒ Selected for our implementation
GPGPU execution datapath

[Diagram: at start, the application sends initialization, the program and data through the CUDA driver and the OS to the GPU (GTX 285) over the PCI Express 16x bus (4 GB/s); the GPU works against its VRAM at 159 GB/s; results travel back the same way at finish.]
GPGPU execution: drawback

- the PCI Express 16x bus (4 GB/s) is a bottleneck with high latency, compared to the 159 GB/s VRAM bandwidth on the card

[Stacked bar chart: percentage of execution time spent in GPU execution, GPU data transfer and the rest of the program for 1, 10, 100, 1000, 10000 and 100000 generations (total runtimes 0.14, 0.14, 0.16, 0.19, 0.74 and 5.99 s) – data transfer dominates for short runs.]

⇒ the whole GA should be executed on the graphics card
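A back-of-envelope model shows why round-tripping the population over the bus every generation is so costly. Only the two bandwidth figures come from the diagram; the population size, traffic multiplier and helper name are illustrative assumptions of ours, and bus latency is not modeled at all:

```python
PCIE_BW = 4e9      # bytes/s, PCI Express 16x (from the diagram)
VRAM_BW = 159e9    # bytes/s, GTX 285 memory bandwidth (from the diagram)

def per_generation_times(pop_bytes, mem_traffic_per_gen):
    """Seconds per generation spent moving the population over PCIe vs.
    reading/writing it in VRAM, if each generation round-trips through
    the host (hypothetical workload model)."""
    transfer = 2 * pop_bytes / PCIE_BW            # copy there and back
    compute_mem = mem_traffic_per_gen / VRAM_BW   # on-card memory traffic
    return transfer, compute_mem

# e.g. 1024 islands x 256 individuals x 10 floats, touched ~10x per generation
pop = 1024 * 256 * 10 * 4
transfer, on_card = per_generation_times(pop, 10 * pop)
print(transfer / (transfer + on_card))  # fraction of time lost to the bus
```

Even under this generous model the bus consumes the large majority of each generation, which is exactly why the slide concludes the whole GA must stay on the card.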
CUDA Hardware model

- the GPU is divided into SIMD multiprocessors
- multiprocessors contain fast shared memory
- main memory is very slow
- (only) processors within a multiprocessor can be synchronized easily
- GTX 280 GPU: 30 × 8 processors, 1 GB main memory, 16 KB shared memory per multiprocessor

[Diagram: each of the N SIMD multiprocessors has M processors with registers, a shared instruction unit with a hardware scheduler, shared memory, and constant and texture caches; the main graphics memory (VRAM) holds local data, textures and constants and exchanges input and output with the host system.]

⇒ effective mapping of the GA is required
Mapping CUDA hardware model to software model: step 1

[Diagram: the CUDA software model of the genetic-algorithm kernel is placed next to the hardware model – input and output reside in the main graphics memory, and blocks are desynchronised from each other. Open questions: what should blocks and threads be mapped to?]
Mapping CUDA hardware model to software model: step 2

- multiprocessors are mapped to islands
- processors (threads T0…TN) are mapped to individuals
- = SIMD-friendly massive parallelism

[Diagram: each SIMD multiprocessor runs one GA island whose threads are the island's individuals; islands communicate only through the main graphics memory, with block desynchronisation between them.]
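The mapping above boils down to CUDA-style index arithmetic over `blockIdx`/`threadIdx`. A Python mock of that arithmetic (the function name is ours):

```python
def locate(block_idx, thread_idx, island_size):
    """CUDA-style index arithmetic: block -> island, thread -> individual.
    Returns (island, individual within island, flat index into the
    global population array in graphics memory)."""
    island = block_idx          # one block per island
    individual = thread_idx     # one thread per individual
    flat = block_idx * island_size + thread_idx
    return island, individual, flat

# block 3, thread 5, 32 individuals per island -> flat index 101
print(locate(3, 5, 32))  # → (3, 5, 101)
```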
Mapping CUDA hardware model to software model: step 3

- each island's population is stored in the multiprocessor's shared memory
- = high performance

[Diagram: GA islands 0 and 1 run on SIMD multiprocessors with their populations held in fast shared memory; only input, output and migrating individuals pass through the main graphics memory.]
Complete GPU GA kernel

- island GA with asynchronous unidirectional ring migration
- threads = individuals; the whole GA is executed on the GPU in parallel
- fast shared memory is used to maintain the population
- usage of the slow main memory is minimized
- GA parameters are passed as compiler constants/macros

[Diagram: each block (= island) loops over fitness-function evaluation, an optional migration step, a barrier within the block, selection, crossover and mutation until the end test is met; the input population, migrating individuals and parameters live in main graphics memory, the island population in shared memory within the multiprocessor.]
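The control flow of the per-island kernel can be sketched in Python. This is a stand-in, not the CUDA code: the list `pop` plays the role of shared memory, the `migrate_out`/`migrate_in` callables play the role of the main-memory migration buffers, crossover is omitted for brevity, and all names and constants are ours:

```python
import random

def island_kernel(pop, fitness, generations, migrate_out, migrate_in,
                  migration_interval=10, rng=None):
    """Structure of one island's GA loop from the diagram (minimization)."""
    rng = rng or random.Random(0)
    n = len(pop)
    for gen in range(generations):
        scores = [fitness(ind) for ind in pop]       # fitness evaluation
        if gen % migration_interval == 0:            # migration?
            best = min(range(n), key=lambda i: scores[i])
            migrate_out(list(pop[best]))             # export the best
            immigrant = migrate_in()
            if immigrant is not None:                # replace the worst
                worst = max(range(n), key=lambda i: scores[i])
                pop[worst] = immigrant
            # (a block-wide barrier would go here on the GPU)
        nxt = []
        for _ in range(n):                           # selection (tournament)
            a, b = rng.sample(range(n), 2)
            parent = pop[a] if scores[a] < scores[b] else pop[b]
            child = [g + rng.gauss(0, 0.1) for g in parent]  # mutation
            nxt.append(child)
        pop[:] = nxt
    return pop

rng = random.Random(1)
pop = [[rng.uniform(-5, 5) for _ in range(2)] for _ in range(16)]
result = island_kernel(pop, lambda x: sum(v * v for v in x), 50,
                       migrate_out=lambda ind: None, migrate_in=lambda: None)
best = min(sum(v * v for v in ind) for ind in result)
print(best)
```

With a no-op migration channel this degenerates to a single-island GA; wiring several of these loops to shared buffers gives the ring-migration island model.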
Testing environment

CPU hardware: Core i7 920
CPU software: single-threaded GAlib, no elitism, custom tournament
GPU hardware: 8800 GTX (128 cores), GTX 260 (216 cores), GTX 285 (240 cores)
GPU software: the presented custom GA

GA parameter            value
number of generations   1000
individuals per island  varying from 2 to 256
number of islands       varying from 1 to 1024
selection               tournament (N=2)
mutation                Gaussian
fitness                 Rosenbrock, Michalewicz and Griewank
genome length           varying from 2 to 10

⇒ speedup and quality
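For reference, the three benchmark fitness functions in their standard textbook forms, transcribed by us into Python (the slide does not give its exact formulas; in particular, the Michalewicz steepness m=10 is the conventional choice and an assumption here):

```python
import math

def rosenbrock(x):
    """Rosenbrock's valley; global minimum 0 at (1, ..., 1)."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))

def michalewicz(x, m=10):
    """Michalewicz's function (minimization form); m=10 is conventional."""
    return -sum(math.sin(xi) * math.sin((i + 1) * xi ** 2 / math.pi) ** (2 * m)
                for i, xi in enumerate(x))

def griewank(x):
    """Griewank's function; global minimum 0 at the origin."""
    s = sum(xi ** 2 for xi in x) / 4000.0
    p = math.prod(math.cos(x[i] / math.sqrt(i + 1)) for i in range(len(x)))
    return s - p + 1.0

print(rosenbrock([1.0, 1.0]), griewank([0.0, 0.0]))  # → 0.0 0.0
```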
Speedup

- performance of the GPU highly depends on population size – from 2 to 256 individuals per island
- ⇒ a population-size-independent performance unit: IIGG = (Island population size × number of Islands × Genotype length × number of Generations) per second
- migration cost for migrating 100% of individuals every generation: about 30%
- mean value from 5 runs (max. about 5% difference), chromosome length = 2

(min – max) IIGG·10^6 per second:

CPU:
  Rosenbrock            2.6 – 2.8
  Michalewicz           1.8 – 2.5
  Griewank              2.5 – 2.8

GPU:
  fitness function      8800 GTX       GTX 260        GTX 285
  Rosenbrock            14.2 – 8877    12.0 – 13094   14.3 – 18669
  Rosenbrock-FastMath   18.5 – 11914   15.5 – 17318   18.5 – 24288
  Michalewicz           6.9 – 5893     5.8 – 8850     7.0 – 12937
  Michalewicz-FastMath  11.7 – 9894    9.8 – 13692    11.6 – 19400
  Griewank              9.6 – 7108     8.0 – 10515    9.9 – 14496
  Griewank-FastMath     15.9 – 10507   13.3 – 15360   15.8 – 20920
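The IIGG unit defined above is a straightforward product over the run parameters divided by wall-clock time; the runtime in the example is hypothetical, chosen only to show the arithmetic:

```python
def iigg_per_second(island_pop, n_islands, genotype_len, generations, runtime_s):
    """IIGG/s: (Island population size x number of Islands x Genotype
    length x number of Generations) processed per second."""
    return island_pop * n_islands * genotype_len * generations / runtime_s

# hypothetical run: 256 individuals x 64 islands, genome length 2,
# 1000 generations finished in 2.0 s
print(iigg_per_second(256, 64, 2, 1000, 2.0) / 1e6)  # → 16.384 (in the table's 10^6 units)
```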
Speedup – cont.

Speedup of kernel execution against the CPU (Griewank):

[Two surface plots of speedup over the number of islands (blocks, 1–1024) and individuals per island (threads per block, 2–256): (a) 8800 GTX – min 5.653× at [1 island, 2 individuals], max 3735.318× at [256 islands, 64 individuals]; (b) GTX 285 – min 5.617× at [1, 2], max 7437.170× at [1024, 128].]

⇒ 99% GPU utilization at peak performance, up to 8000 times faster execution compared to single-threaded GAlib
Quality comparison

- same parameters and 32 individuals per island
- CPU and GPU1 use a single island
- GPU2: fully utilized GPU, 1024 islands, 10% of individuals are migrated every 10 generations

Quality comparison, mean value from 100 runs (mean best fitness from the whole population):

genes | Rosenbrock (CPU / GPU1 / GPU2)  | Michalewicz (CPU / GPU1 / GPU2) | Griewank (CPU / GPU1 / GPU2)
  2   | 0.086 / 3.468 / 7.57·10^-7      | -1.022 / -1.768 / -1.801        | 0.0005 / 0.0020 / 3.99·10^-12
  3   | 1.897 / 4.996 / 0.447           | -1.220 / -2.336 / -2.760        | 0.0051 / 0.0048 / 1.06·10^-8
  4   | 8.900 / 4.997 / 0.494           | -1.459 / -2.748 / -3.696        | 0.0156 / 0.0188 / 1.22·10^-7
  5   | 22.112 / 17.332 / 2.042         | -1.684 / -3.184 / -4.628        | 0.0246 / 0.0414 / 0.0001
  6   | 48.450 / 56.045 / 4.313         | -1.817 / -3.654 / -5.440        | 0.0408 / 0.0570 / 0.0005
  7   | 83.455 / 42.509 / 6.903         | -2.035 / -3.646 / -6.163        | 0.0479 / 0.0620 / 0.0012
  8   | 128.710 / 155.233 / 9.257       | -2.120 / -3.805 / -6.659        | 0.0650 / 0.1360 / 0.0027
  9   | 167.329 / 131.737 / 12.045      | -2.176 / -4.830 / -7.136        | 0.0749 / 0.1444 / 0.0042
 10   | 233.364 / 184.370 / 15.379      | -2.391 / -5.009 / -7.649        | 0.0805 / 0.1758 / 0.0058

GPU1 is 20% better ⇒ the GPU is able to optimise simple numerical functions, and to do it better in the same time
Limitations

- too many simulated islands? ⇒ the GPU can instead simulate multiple independent runs at the same time
- the fitness function must be executed on the GPU
- limited shared memory per island (currently 16 KB)

The new GPU generation has bigger shared memory, an L2 cache and C++ support ⇒ the future is bright for GAs on GPUs
Conclusions

- we have used the GPU to accelerate genetic algorithms and published the results at EvoStar 2010
- the GPU has great potential for accelerating the optimization of simple numerical functions!
- GAlib is not very well optimized; hand-tuning on CUDA leads to a massive speedup
- up to 400 times better performance/Watt ratio, on consumer-level hardware

Future work:
- grammatical evolution
- more complex GAs
Thank you

Thank you for your attention – feel free to ask questions!
Appendix

Appendix A: Genetic operators: Tournament selection

- threads use shared memory to work in pairs: for each pair, random individuals are drawn from the original population, random weights are generated, and the winners are written into a shared array of selected individuals
- uses the fast, parallel HybridTaus PRNG from the GPU Gems 3 book series (mutation uses the Box-Muller PRNG)
- selected individuals are then crossed over in pairs; if no crossover takes place, the original individuals are copied unchanged into the offspring population

[Diagram: threads T0…TN-1 within a block (island) use two shared arrays in the multiprocessor's shared memory to select individuals and produce the offspring population.]
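A sequential Python sketch of the binary tournament (the kernel performs this with thread pairs in shared memory; the helper name and test population are ours):

```python
import random

def tournament_select(pop, scores, n_select, rng, n=2):
    """Binary tournament (N=2): repeatedly draw n random individuals
    and keep the one with the better (lower) fitness."""
    selected = []
    for _ in range(n_select):
        contestants = rng.sample(range(len(pop)), n)
        winner = min(contestants, key=lambda i: scores[i])
        selected.append(pop[winner])
    return selected

rng = random.Random(42)
pop = ["a", "b", "c", "d"]
scores = [4.0, 1.0, 3.0, 2.0]
picked = tournament_select(pop, scores, 100, rng)
print(picked.count("a"))  # → 0: the worst individual never wins a tournament
```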
Appendix B: Genetic operators: Migration

- unidirectional ring topology; asynchronous, but works well in practice
- each island writes its M best individuals to main graphics memory; they replace the M worst individuals of the neighbouring island
- uses main memory for the chromosomes and the Bitonic-Merge sort implementation from the GPU Gems book

[Diagram: blocks (islands) exchange M migrated individuals through the island populations stored in main graphics memory; threads correspond to individuals.]
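The M-best-replace-M-worst scheme can be sketched sequentially in Python. Lists stand in for the kernel's main-memory buffers and plain sorting for its Bitonic-Merge sort; the function name and toy data are ours:

```python
def migrate(islands, fitness, m):
    """One migration step on a unidirectional ring: each island's m best
    individuals (lowest fitness) overwrite the m worst individuals of
    the next island."""
    n = len(islands)
    # snapshot emigrants first, so the step does not depend on update order
    emigrants = [sorted(pop, key=fitness)[:m] for pop in islands]
    for i in range(n):
        islands[i].sort(key=fitness)                 # worst go to the back
        islands[i][-m:] = [list(ind) for ind in emigrants[(i - 1) % n]]
    return islands

# three islands of two one-gene individuals; fitness = the gene itself
islands = [[[0.1], [5.0]], [[0.2], [6.0]], [[0.3], [7.0]]]
migrate(islands, fitness=lambda ind: ind[0], m=1)
print(islands)  # → [[[0.1], [0.3]], [[0.2], [0.1]], [[0.3], [0.2]]]
```

Each island's worst individual has been replaced by a copy of the previous island's best, closing the ring.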