Challenges in solving Large Scale
Optimization Problems
HEMERA — Challenge COPs
Leaders: B. Derbel and N. Melab
Participants: T-T. Vu (PhD, Hemera), A. Ali (PostDoc, Hemera), M. Djamai (PhD, Univ. Lille 1)
DOLPHIN — INRIA Lille Nord Europe
General Context

Combinatorial Optimization
▪ Applications: logistics, supply chain, telecommunications, clouds, green IT, etc.
▪ NP-hard problems
▪ Resolution methods are computing-intensive

Computing Resources
▪ Aggregated resources: clusters, grids, clouds, etc.
▪ New architectures: multi-cores, many-cores, etc.
▪ Impressive computing capability (in theory)
General objective
▪ Solve Combinatorial Optimization Problems
efficiently on large scale computing resources
Branch-and-Bound
[Figure: B&B search tree enumerating permutations of {1,2,3}; each node carries its remaining job set and computed bound]
▪ Branching: divide a problem into several sub-problems
▪ Bounding: compute a lower/upper bound estimating the optimal solution
▪ Selection: tree exploration strategy (DFS, BFS, etc.)
▪ Pruning: eliminate unpromising branches
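To make the four ingredients concrete, here is a minimal, self-contained B&B sketch. The 3×3 cost matrix and all names are illustrative only (not the flowshop instances evaluated later); the lower bound sums, for each unfilled position, the cheapest job still available.

```python
# Illustrative B&B: assign one job per position, minimizing total cost.
# COST[i][j] = hypothetical cost of placing job j at position i.
COST = [[4, 2, 9],
        [8, 1, 6],
        [3, 7, 10]]

def lower_bound(partial_cost, pos, remaining):
    # Bounding: partial cost plus, for each unfilled position, the
    # cheapest remaining job -- an optimistic (never over-) estimate.
    bound = partial_cost
    for p in range(pos, len(COST)):
        bound += min(COST[p][j] for j in remaining)
    return bound

def branch_and_bound():
    best = [float("inf"), None]          # incumbent = current upper bound

    def explore(perm, cost, remaining):  # Selection: depth-first strategy
        if not remaining:                # leaf: a complete permutation
            if cost < best[0]:
                best[0], best[1] = cost, perm
            return
        pos = len(perm)
        for j in sorted(remaining):      # Branching: one child per job
            sub_cost = cost + COST[pos][j]
            rest = remaining - {j}
            # Pruning: skip subtrees whose bound cannot beat the incumbent
            if lower_bound(sub_cost, pos + 1, rest) < best[0]:
                explore(perm + (j,), sub_cost, rest)

    explore((), 0, frozenset(range(len(COST))))
    return best[0], best[1]

print(branch_and_bound())  # -> (11, (1, 2, 0))
```

On this toy matrix the search visits only a fraction of the 3! = 6 permutations, since the bound prunes branches that cannot improve on the incumbent.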
Large Scale Heterogeneous Systems

[Figure: multi-cores, GPUs, cluster(s), grid + P2P]

▪ Large scale systems
▪ Heterogeneity
  ▪ Node-level: compute power, programming paradigm, etc.
  ▪ Network-level: latency, bandwidth, etc.
Contributions

▪ Efficient parallel B&B load balancing
▪ Efficient parallel B&B on node-heterogeneous systems
▪ Efficient parallel B&B on link-heterogeneous systems
Irregularity of B&B

➔ The workload of a processing unit varies dynamically
➔ Work stealing is a reference approach
Tree-based B&B Work Stealing

▪ Tree-based stealing strategy: 2 steals in parallel
  ▪ Synchronous steals to children or parent
  ▪ Asynchronous steals to remote neighbors
▪ Attempt to cluster idle peers
▪ The amount of stolen work is adjusted in a distributed way, based on subtree sizes

[Figure: synchronous vs. asynchronous stealing in the tree overlay]
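The adaptive work-amount idea can be sketched as follows. This is an illustration, not the thesis' actual protocol: the `share` parameter stands in for the distributed subtree-size estimate, and queue length is used as a crude proxy for remaining work.

```python
from collections import deque

def steal(victim: deque, thief: deque, share: float = 0.5) -> int:
    """Move a fraction of the victim's pending subproblems to the thief.

    Stealing from the left (old) end hands over the shallowest nodes,
    i.e. the largest unexplored subtrees, which keeps transfers coarse-
    grained. `share` would be tuned from subtree-size estimates.
    """
    amount = int(len(victim) * share)
    for _ in range(amount):
        thief.append(victim.popleft())
    return amount

victim, thief = deque(range(8)), deque()
print(steal(victim, thief))  # moves 4 subproblems
```

Stealing old (shallow) nodes rather than fresh (deep) ones is the standard way to amortize the communication cost of a steal over a large chunk of future work.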
Experimental Evaluation

▪ Application settings
  ▪ Taillard's flowshop instances (Ta20*20)
  ▪ Permutation FSP: 20 jobs on 20 machines
  ▪ Generic UTS benchmark
▪ Baseline algorithms
  ▪ H-MW: Hierarchical Adaptive Master-Worker (B&B-specific) [Bendjoudi et al., FGCS'12, IEEE TC'13]
  ▪ MW: Master-Worker (B&B-specific) [Mezmaz et al., IPDPS'07]
  ▪ RWS: (distributed) Random Work Stealing [Dinan et al., SC'09]
Our Approach vs. RWS vs. MW

Large scale (1000 peers)
▪ MW suffers from a master bottleneck when scaling the system
▪ RWS suffers from overly fine-grained parallelism at large scale
Contributions

▪ Efficient parallel B&B load balancing
▪ Efficient parallel B&B on node-heterogeneous systems
▪ Efficient parallel B&B on link-heterogeneous systems
Towards heterogeneous B&B

[Figure: CPU-only and CPU+GPU nodes connected through distributed networks]
Towards heterogeneous B&B

▪ How to benefit from current node-heterogeneous computing platforms in B&B computations?

▪ Three main challenges:
  ▪ How to map B&B parallelism onto hardware parallelism?
  ▪ How to deal with B&B workload irregularity?
  ▪ How to deal with huge differences in compute power?
Heterogeneous parallel B&B

▪ The 2MBB approach: Multi-CPU Multi-GPU B&B

▪ The 3MBB approach: Multi-Core Multi-CPU Multi-GPU B&B
  ▪ Host-device parallelism in a single CPU-GPU pair
  ▪ Adaptive workload transfer
  ▪ Hybrid stealing in multi-core systems
  ▪ Lock-free work queues
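One way to picture adaptive workload transfer across units with very different compute power is proportional splitting. This is a hedged sketch under the assumption that per-unit throughput is measured at runtime; the function name and rounding policy are illustrative, not the 2MBB/3MBB implementation.

```python
def split_work(n_units: int, throughputs: list) -> list:
    """Distribute n_units of work proportionally to observed throughputs.

    A GPU that processes subproblems 3x faster than a CPU core should
    receive roughly 3x the work; any rounding remainder goes to the
    fastest unit so no work is lost.
    """
    total = sum(throughputs)
    shares = [int(n_units * t / total) for t in throughputs]
    shares[throughputs.index(max(throughputs))] += n_units - sum(shares)
    return shares

# A unit measured 3x faster gets 3x the subproblems:
print(split_work(100, [1, 3]))    # -> [25, 75]
print(split_work(10, [1, 1, 1]))  # remainder handed to one unit
```

In practice the throughputs would be re-estimated periodically, since B&B node costs are irregular and device speeds (as in the experiments below, with GPUs throttled to half or quarter capacity) vary widely.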
2MBB: Experimental Results

▪ CPUs: fixed to 128
  ▪ 64 CPUs at 2.27 GHz
  ▪ 64 CPUs at 2.5 GHz
▪ GPUs: scale up to 20
  ▪ ½ of the GPUs at full capacity
  ▪ ¼ at half capacity
  ▪ ¼ at quarter capacity
[Figure: speedup vs. #CPUs (1 to 256) with 1 GPU and 2 GPUs; curves: Steal 1/2, Adaptive, and linear GPU-normalized speedup]

▪ Moderate scales:
  ▪ We scale close to the linear speedup
  ▪ The baseline suffers from node-heterogeneity
▪ Largest scales: we are still far from the linear speedup
3MBB: Experimental Results

[Figure: speedup vs. #CPUs (1 to 512) with 0, 1, 2, and 4 GPUs; curves: 2MBB, 3MBB, and linear GPU-normalized speedup]
Contributions

▪ Efficient parallel B&B load balancing
▪ Efficient parallel B&B on node-heterogeneous systems
▪ Efficient parallel B&B on link-heterogeneous systems
Heterogeneous links

▪ Steal requests through WAN links are expensive
State-of-the-art approaches

[Figure: existing work-stealing algorithms positioned from platform-generic to platform-dependent: RWS [J. ACM], PWS [Europar], CRS [PPoPP], ACRS [J. SC]]
Link Heterogeneous Work Stealing

▪ Local steals: based on preferred neighbors and a non-uniform adaptive probability
▪ Local and remote neighbors are learned at runtime
  ▪ K-means clustering returns the 2 sets of neighbors
▪ Remote steals: controlled by a timing window
  ▪ If the window expires and no work has been found, remote stealing is enabled
  ▪ Window size is controlled adaptively (Additive Increase Multiplicative Decrease)
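The timing-window mechanism can be sketched as an AIMD controller. This is an illustration only: the class name, constants, and the exact direction of each adjustment are assumptions, since the slide states the AIMD principle but not its parameters.

```python
class AIMDWindow:
    """Additive-Increase Multiplicative-Decrease timing window.

    The window is the time a peer waits, stealing locally, before an
    expensive remote (WAN) steal is allowed. All constants are
    hypothetical placeholders, not tuned values from the thesis.
    """

    def __init__(self, initial=1.0, add=1.0, mult=2.0, max_window=64.0):
        self.window = initial
        self.add = add            # additive increase step
        self.mult = mult          # multiplicative decrease factor
        self.max_window = max_window

    def on_local_success(self):
        # Local work found: widen the window additively, further
        # delaying costly WAN steals.
        self.window = min(self.window + self.add, self.max_window)

    def on_local_failure(self):
        # Window expired without local work: shrink multiplicatively so
        # remote stealing kicks in sooner next time.
        self.window = max(self.window / self.mult, 1.0)

    def remote_allowed(self, idle_time: float) -> bool:
        return idle_time >= self.window
```

The AIMD shape mirrors TCP congestion control: cautious linear growth while local stealing suffices, and a fast backoff toward remote stealing when local neighborhoods run dry.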
Performance Assessment

▪ Experimentation methodology: emulation
  ▪ Deploy Distem on top of Grid'5000
  ▪ The network configuration is artificially modified by Distem

▪ Broad range of network configurations
  ▪ Flat: one-level communication hierarchy
    ▪ Latency between peers
  ▪ Grid: two-level communication hierarchy
    ▪ Latency between clusters
    ▪ Number of clusters
Flat Configuration

▪ LWS improves performance by up to 40%
Grid Configuration

▪ PWS and ACRS performance depends on the application
▪ LWS is platform- and application-independent
Conclusion

▪ Design and experimental evaluation of new parallel B&B algorithms for large scale heterogeneous environments

▪ In the future:
  ▪ Investigate more complex compute systems
  ▪ Investigate more complex optimization problems and other algorithmic paradigms
Thank You !
Questions?
Keywords: combinatorial optimization, large scale instances, work distribution, overlay mapping, computational resources (clusters, grids, clouds, virtualized environments), load balancing, heterogeneity, faults