Challenges in solving Large Scale
Optimization Problems
HEMERA — Challenge COPs
Leaders: B. Derbel and N. Melab
Participants: T-T. Vu (PhD, Hemera), A. Ali (PostDoc, Hemera), M. Djamai (PhD, Univ. Lille 1)
DOLPHIN — INRIA Lille Nord Europe
General Context

Combinatorial Optimization
▪ Applications: logistics, supply chain, telecommunications, clouds, green IT, etc.
▪ NP-hard problems
▪ Resolution methods are computing-intensive

Computing Resources
▪ Aggregated resources: clusters, grids, clouds, etc.
▪ New architectures: multi-cores, many-cores, etc.
▪ Impressive computing capability (in theory)
General objective
▪ Solve Combinatorial Optimization Problems
efficiently on large scale computing resources
Branch-and-Bound
[Figure: B&B search tree enumerating permutations of {1,2,3}; each node carries its remaining job set and computed bound]
▪ Branching: divide a problem into several sub-problems
▪ Bounding: compute a lower/upper bound estimating the optimal solution
▪ Selection: tree exploration strategy (DFS, BFS, etc.)
▪ Pruning: eliminate unpromising branches
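To make the four ingredients concrete, here is a minimal, self-contained B&B sketch. The 3×3 cost matrix and all names are illustrative only (not the flowshop instances evaluated later); the lower bound sums, for each unfilled position, the cheapest job still available.

```python
# Illustrative B&B: assign one job per position, minimizing total cost.
# COST[i][j] = hypothetical cost of placing job j at position i.
COST = [[4, 2, 9],
        [8, 1, 6],
        [3, 7, 10]]

def lower_bound(partial_cost, pos, remaining):
    # Bounding: partial cost plus, for each unfilled position, the
    # cheapest remaining job -- an optimistic (never over-) estimate.
    bound = partial_cost
    for p in range(pos, len(COST)):
        bound += min(COST[p][j] for j in remaining)
    return bound

def branch_and_bound():
    best = [float("inf"), None]          # incumbent = current upper bound

    def explore(perm, cost, remaining):  # Selection: depth-first strategy
        if not remaining:                # leaf: a complete permutation
            if cost < best[0]:
                best[0], best[1] = cost, perm
            return
        pos = len(perm)
        for j in sorted(remaining):      # Branching: one child per job
            sub_cost = cost + COST[pos][j]
            rest = remaining - {j}
            # Pruning: skip subtrees whose bound cannot beat the incumbent
            if lower_bound(sub_cost, pos + 1, rest) < best[0]:
                explore(perm + (j,), sub_cost, rest)

    explore((), 0, frozenset(range(len(COST))))
    return best[0], best[1]

print(branch_and_bound())  # -> (11, (1, 2, 0))
```

On this toy matrix the search visits only a fraction of the 3! = 6 permutations, since the bound prunes branches that cannot improve on the incumbent.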
Large Scale Heterogeneous Systems

[Figure: multi-cores, GPUs, cluster(s), grid + P2P]

▪ Large scale systems
▪ Heterogeneity
  ▪ Node-level: compute power, programming paradigm, etc.
  ▪ Network-level: latency, bandwidth, etc.
Contributions

▪ Efficient parallel B&B load balancing
▪ Efficient parallel B&B on node-heterogeneous systems
▪ Efficient parallel B&B on link-heterogeneous systems
Irregularity of B&B

➔ The workload of a processing unit varies dynamically
➔ Work stealing is a reference approach
Tree-based B&B Work Stealing

▪ Tree-based stealing strategy: 2 steals in parallel
  ▪ Synchronous steals to children or parent
  ▪ Asynchronous steals to remote neighbors
▪ Attempt to cluster idle peers
▪ The amount of stolen work is adjusted in a distributed way, based on subtree sizes

[Figure: synchronous vs. asynchronous stealing in the tree overlay]
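The adaptive work-amount idea can be sketched as follows. This is an illustration, not the thesis' actual protocol: the `share` parameter stands in for the distributed subtree-size estimate, and queue length is used as a crude proxy for remaining work.

```python
from collections import deque

def steal(victim: deque, thief: deque, share: float = 0.5) -> int:
    """Move a fraction of the victim's pending subproblems to the thief.

    Stealing from the left (old) end hands over the shallowest nodes,
    i.e. the largest unexplored subtrees, which keeps transfers coarse-
    grained. `share` would be tuned from subtree-size estimates.
    """
    amount = int(len(victim) * share)
    for _ in range(amount):
        thief.append(victim.popleft())
    return amount

victim, thief = deque(range(8)), deque()
print(steal(victim, thief))  # moves 4 subproblems
```

Stealing old (shallow) nodes rather than fresh (deep) ones is the standard way to amortize the communication cost of a steal over a large chunk of future work.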
Experimental Evaluation

▪ Application settings
  ▪ Taillard's flowshop instances (Ta20*20)
  ▪ Permutation FSP: 20 jobs on 20 machines
  ▪ Generic UTS benchmark
▪ Baseline algorithms
  ▪ H-MW: Hierarchical Adaptive Master-Worker (B&B-specific) [Bendjoudi et al., FGCS'12, IEEE TC'13]
  ▪ MW: Master-Worker (B&B-specific) [Mezmaz et al., IPDPS'07]
  ▪ RWS: (distributed) Random Work Stealing [Dinan et al., SC'09]
Our Approach vs. RWS vs. MW

Large scale (1000 peers)
▪ MW suffers from a master bottleneck when scaling the system
▪ RWS suffers from overly fine-grained parallelism at large scale
Contributions

▪ Efficient parallel B&B load balancing
▪ Efficient parallel B&B on node-heterogeneous systems
▪ Efficient parallel B&B on link-heterogeneous systems
Towards heterogeneous B&B

[Figure: CPU-only and CPU+GPU nodes connected through distributed networks]
Towards heterogeneous B&B

▪ How to benefit from current node-heterogeneous computing platforms in B&B computations?

▪ Three main challenges:
  ▪ How to map B&B parallelism onto hardware parallelism?
  ▪ How to deal with B&B workload irregularity?
  ▪ How to deal with huge differences in compute power?
Heterogeneous parallel B&B

▪ The 2MBB approach: Multi-CPU Multi-GPU B&B

▪ The 3MBB approach: Multi-Core Multi-CPU Multi-GPU B&B
  ▪ Host-device parallelism in a single CPU-GPU pair
  ▪ Adaptive workload transfer
  ▪ Hybrid stealing in multi-core systems
  ▪ Lock-free work queues
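One way to picture adaptive workload transfer across units with very different compute power is proportional splitting. This is a hedged sketch under the assumption that per-unit throughput is measured at runtime; the function name and rounding policy are illustrative, not the 2MBB/3MBB implementation.

```python
def split_work(n_units: int, throughputs: list) -> list:
    """Distribute n_units of work proportionally to observed throughputs.

    A GPU that processes subproblems 3x faster than a CPU core should
    receive roughly 3x the work; any rounding remainder goes to the
    fastest unit so no work is lost.
    """
    total = sum(throughputs)
    shares = [int(n_units * t / total) for t in throughputs]
    shares[throughputs.index(max(throughputs))] += n_units - sum(shares)
    return shares

# A unit measured 3x faster gets 3x the subproblems:
print(split_work(100, [1, 3]))    # -> [25, 75]
print(split_work(10, [1, 1, 1]))  # remainder handed to one unit
```

In practice the throughputs would be re-estimated periodically, since B&B node costs are irregular and device speeds (as in the experiments below, with GPUs throttled to half or quarter capacity) vary widely.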
2MBB: Experimental Results

▪ CPUs: fixed to 128
  ▪ 64 CPUs at 2.27 GHz
  ▪ 64 CPUs at 2.5 GHz
▪ GPUs: scale up to 20
  ▪ ½ of the GPUs at full capacity
  ▪ ¼ at half capacity
  ▪ ¼ at quarter capacity
[Figure: speedup vs. #CPUs (1 to 256) with 1 GPU and 2 GPUs; curves: Steal 1/2, Adaptive, and linear GPU-normalized speedup]

▪ Moderate scales:
  ▪ We scale close to the linear speedup
  ▪ The baseline suffers from node-heterogeneity
▪ Largest scales: we are still far from the linear speedup
3MBB: Experimental Results

[Figure: speedup vs. #CPUs (1 to 512) with 0, 1, 2, and 4 GPUs; curves: 2MBB, 3MBB, and linear GPU-normalized speedup]
Contributions

▪ Efficient parallel B&B load balancing
▪ Efficient parallel B&B on node-heterogeneous systems
▪ Efficient parallel B&B on link-heterogeneous systems
Heterogeneous links

▪ Steal requests through WAN links are expensive
State-of-the-art approaches

[Figure: existing work-stealing algorithms positioned from platform-generic to platform-dependent: RWS [J. ACM], PWS [Europar], CRS [PPoPP], ACRS [J. SC]]
Link Heterogeneous Work Stealing

▪ Local steals: based on preferred neighbors and a non-uniform adaptive probability
▪ Local and remote neighbors are learned at runtime
  ▪ K-means clustering returns the 2 sets of neighbors
▪ Remote steals: controlled by a timing window
  ▪ If the window expires and no work has been found, remote stealing is enabled
  ▪ Window size is controlled adaptively (Additive Increase Multiplicative Decrease)
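The timing-window mechanism can be sketched as an AIMD controller. This is an illustration only: the class name, constants, and the exact direction of each adjustment are assumptions, since the slide states the AIMD principle but not its parameters.

```python
class AIMDWindow:
    """Additive-Increase Multiplicative-Decrease timing window.

    The window is the time a peer waits, stealing locally, before an
    expensive remote (WAN) steal is allowed. All constants are
    hypothetical placeholders, not tuned values from the thesis.
    """

    def __init__(self, initial=1.0, add=1.0, mult=2.0, max_window=64.0):
        self.window = initial
        self.add = add            # additive increase step
        self.mult = mult          # multiplicative decrease factor
        self.max_window = max_window

    def on_local_success(self):
        # Local work found: widen the window additively, further
        # delaying costly WAN steals.
        self.window = min(self.window + self.add, self.max_window)

    def on_local_failure(self):
        # Window expired without local work: shrink multiplicatively so
        # remote stealing kicks in sooner next time.
        self.window = max(self.window / self.mult, 1.0)

    def remote_allowed(self, idle_time: float) -> bool:
        return idle_time >= self.window
```

The AIMD shape mirrors TCP congestion control: cautious linear growth while local stealing suffices, and a fast backoff toward remote stealing when local neighborhoods run dry.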
Performance Assessment

▪ Experimentation methodology: emulation
  ▪ Deploy Distem on top of Grid'5000
  ▪ The network configuration is artificially modified by Distem

▪ Broad range of network configurations
  ▪ Flat: one-level communication hierarchy
    ▪ Latency between peers
  ▪ Grid: two-level communication hierarchy
    ▪ Latency between clusters
    ▪ Number of clusters
Flat Configuration

▪ LWS improves performance by up to 40%
Grid Configuration

▪ PWS and ACRS performance depends on the application
▪ LWS is platform- and application-independent
Conclusion

▪ Design and experimental evaluation of new parallel B&B algorithms for large scale heterogeneous environments

▪ In the future:
  ▪ Investigate more complex compute systems
  ▪ Investigate more complex optimization problems and other algorithmic paradigms
Thank You !
Questions?
Keywords: combinatorial optimization, large scale instances, work distribution, overlay mapping, computational resources (clusters, grids, clouds, virtualized environments), load balancing, heterogeneity, faults