MACRo 2015- 5th International Conference on Recent Achievements in
Mechatronics, Automation, Computer Science and Robotics
Synchronization and Load Distribution Strategies for
Parallel Implementations of P-graph Optimizer
Anikó BARTOS1, Botond BERTÓK2
1 Department of Computer Science and Systems Technology, Faculty of Information
Technology, University of Pannonia, Veszprem, Hungary, e-mail: [email protected]
2 Department of Computer Science and Systems Technology, Faculty of Information
Technology, University of Pannonia, Veszprem, Hungary, e-mail: [email protected]
Manuscript received January 12, 2015, revised February 9, 2015.
Abstract: Process Network Synthesis aims at determining the optimal or n-best
process structures of a production system as well as the optimal volumes of the
constituting operating units. In the P-graph framework, algorithm ABB provides the n-best structurally different process networks. It is widely applied to the optimal design of manufacturing systems as well as business processes and supply chains; thus its effective implementation is essential in practice. The present work introduces a novel cooperative parallel implementation of algorithm ABB and compares it to former load distribution strategies.
Keywords: P-graph, search strategy, parallel computing, optimization
1. Process Network Synthesis
Process Network Synthesis (PNS) was introduced formally by Friedler and Fan in 1992 [1]. The problem can be formulated as a MILP; however, incorporating application-field-specific logical implications into the decision procedure may lead to faster optimization software than a general MILP solver and results in more practical solutions. The logical implications rely on an unambiguous graphical representation of the process structures by P-graphs [2].
2. Search strategy
Numerous search strategies exist to find the optimal solution of a PNS problem without necessarily enumerating each of the structures that can be constructed from the building blocks, called operating units, given in the problem definition [2-4]. The accelerated branch-and-bound or ABB algorithm reduces
DOI: 10.1515/macro-2015-0030
both the number and the complexity of the subproblems to be visited during the search by logical implications, due to the mathematical foundations of the P-graph framework [2-4]. These foundations include an unambiguous graph representation of the structural properties of process networks by P-graphs, and the expression, by axioms, of the necessary conditions for process networks to be structurally or combinatorially feasible.
The algorithm ABB follows the branch-and-bound technique with disjoint branches. In contrast to general-purpose solvers, algorithm ABB provides the n-best suboptimal structures or flowsheets in addition to the optimal one, where n is given by the user before executing the algorithm. A structure is considered suboptimal if it does not include a better substructure [5]. Algorithm ABB constructs a process structure in the retrosynthetic direction, i.e., backward from the products to the raw materials. Decisions are made on the sets of operating units producing a product or a consumed intermediate material. After each decision, logical implications are applied to reduce the number and size of the subproblems. In the worst case, algorithm ABB visits every structurally feasible structure.
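The retrosynthetic decision step described above can be illustrated by enumerating, for a single material still to be produced, the candidate sets of operating units able to produce it. The following is a minimal sketch under our own toy data structures, not the framework's actual representation:

```python
from itertools import combinations

def producer_subsets(material, operating_units):
    """Enumerate the decision alternatives for one material: every
    nonempty subset of the operating units able to produce it.
    `operating_units` maps a unit name to its set of produced materials.
    (Illustrative only; the P-graph framework prunes these alternatives
    by its axioms rather than enumerating them all.)"""
    producers = [u for u, outputs in operating_units.items()
                 if material in outputs]
    for r in range(1, len(producers) + 1):
        for subset in combinations(producers, r):
            yield set(subset)

# Toy network: two units can produce material "B".
units = {"reactor": {"B"}, "separator": {"B", "C"}, "mixer": {"D"}}
alternatives = list(producer_subsets("B", units))
# Alternatives: {reactor}, {separator}, {reactor, separator}
```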
Figure 1: Search tree.
The algorithm InsideOut [4] differs from the original ABB in that decisions are made only on those operating units that appear with nonzero volume in the continuous relaxation of the actual problem. In the worst case, algorithm InsideOut visits only those structures that are feasible according to the relaxed continuous model of the problem. It is called InsideOut because in the original ABB the combinatorial part controls the search and the analysis of the relaxed model has lower priority, while in InsideOut the search is controlled according to the relaxed model and logical implications are then executed by combinatorial analysis. The algorithm InsideOut examines a subproblem in
each of its iterations until the container of the open subproblems becomes empty. If the subproblem of interest is not a solution, it is branched, i.e., two subproblems are generated by the inclusion or the exclusion of a selected operating unit. Both newly generated subproblems are examined for feasibility and, if feasible, included in the storage of the subproblems. If the subproblem is a solution, then it is analyzed whether it is better or worse than the previously stored solutions. Both the container of the open subproblems and the container of the solutions are revised according to the new solution. Fig. 1 shows a binary tree illustrating the steps of the search; each of its vertices represents a subproblem. The search goes top-down. The width and depth of the search tree depend on the problem to be solved and the search strategy followed.
The search time is expected to be reduced by a parallel implementation of the algorithm. The forthcoming sections present the parallel realization of algorithm InsideOut and the best parameter settings for its efficient execution.
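The iteration just described can be sketched as a sequential n-best branch-and-bound loop. The helper callables (`branch`, `is_solution`, `is_feasible`, `bound_of`) are our hypothetical stand-ins for the combinatorial and LP machinery of the framework, not the authors' actual code:

```python
def insideout_sketch(root, n_best, branch, is_solution, is_feasible, bound_of):
    """Sequential n-best branch-and-bound sketch. `branch` splits a
    subproblem in two by including/excluding an operating unit;
    `bound_of` returns a lower estimate of the objective value."""
    open_subproblems = [root]      # container of the open subproblems
    solutions = []                 # the n best solutions found so far
    bound = float("inf")           # worst acceptable objective estimate
    while open_subproblems:        # iterate until the container is empty
        sub = open_subproblems.pop()
        if bound_of(sub) >= bound:          # bounding: prune dominated work
            continue
        if is_solution(sub):
            # Revise the solution container and tighten the bound.
            solutions = sorted(solutions + [sub], key=bound_of)[:n_best]
            if len(solutions) == n_best:
                bound = bound_of(solutions[-1])
            continue
        for child in branch(sub):  # inclusion/exclusion of a selected unit
            if is_feasible(child):
                open_subproblems.append(child)
    return solutions
```

A toy instance with two binary decisions and the sum of the decisions as the objective returns the two cheapest assignments.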
3. Cooperative Parallel Implementation of Algorithm InsideOut
A parallel version of algorithm ABB (ABBP) for multi-thread execution was introduced by Varga et al. in 1995, following the master-slave synchronization scheme [4]. The topology of the parallel processing elements at that time limited the realization of the information flow: the slaves form a ring topology, as shown in Fig. 2.
Figure 2: Ring topology for master-slave implementation of algorithm ABB.
In each block, the transfer and the multiplexer receive and send the messages. It is easy to see that sending a message from the master to the last slave is slow, because the information has to go through all the slaves to arrive at the last one.
Similarly, when one of the middle slaves sends data back to the master, it has to travel a long path, because none of the slaves in the middle is connected to the master directly. Furthermore, the master's only task is to control the slaves' work.
During the last decade, more and more cores have appeared in processors with equal access to the same memory. This has enabled the creation of a parallel algorithm with a flexible topology and load distribution.
The two major problems of parallel implementation are how to balance the loads of the cores and, at the same time, how to minimize the need for communication between them. Frequent communication can slow down the algorithm because of the communication time, but if communication is rare, unnecessary calculations are made due to the lack of information. The reason is the following. A so-called bounding procedure helps to eliminate useless subproblems, i.e., it can be proven that subproblems with a worse estimated objective value than the bound never yield better solutions than the ones already available. In parallel execution, however, while one thread finds a solution resulting in an update of the bound, the other threads still analyze their own subproblems and can only eliminate some of them once they receive information about the updated bound.
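This trade-off can be made concrete with a shared bound protected by a lock; the sketch below is an illustrative realization (class and method names are ours, not the paper's). Readers may see a slightly stale bound between updates, which is safe: pruning against a stale bound merely prunes less than possible, never incorrectly.

```python
import threading

class SharedBound:
    """Global objective bound shared by all worker threads.

    The lock serializes updates; checking against the bound is cheap,
    and a stale read only delays pruning, it never discards a needed
    subproblem."""
    def __init__(self):
        self._lock = threading.Lock()
        self.value = float("inf")

    def tighten(self, candidate):
        # Only improve (decrease) the bound, never loosen it.
        with self._lock:
            if candidate < self.value:
                self.value = candidate
                return True
            return False

    def allows(self, estimate):
        # A subproblem survives only if its estimate beats the bound.
        return estimate < self.value
```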
Figure 3: Architecture of the shared memory implementation for parallel
implementation of algorithm InsideOut.
The parallelization requires some modifications to the sequential InsideOut algorithm. Instead of a single subproblem container, each thread has its own (Fig. 3). If such a local subproblem container becomes empty, the thread sends a request to the others. There is a common storage, called the postbox, where the
threads put subproblems addressed to the others. When a thread accepts a request, it shares a subproblem, i.e., deletes the subproblem from its own subproblem container and puts it into the postbox. When a thread receives a subproblem, it moves it from the postbox to its own container. It is important that only one thread can modify the postbox at a time. The solution container and the bounds are shared as well, i.e., their access is also limited to a single thread at a time. A thread shares a subproblem only if the number of subproblems in its local container exceeds a predefined minimum. Fig. 4 depicts the state diagram illustrating the behavior of each thread during the cooperative search. The search ends when the number of requests equals the number of threads, i.e., every local subproblem container is empty.
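The postbox and the termination rule can be sketched as follows; this is our illustrative realization (a condition variable replaces the paper's unspecified synchronization primitive), mirroring the rule that only one thread may modify the shared storage at a time:

```python
import threading
from collections import deque

class Postbox:
    """Shared storage through which threads exchange subproblems.

    All fields are guarded by one condition variable. The search is
    over when every thread has an empty local container and a pending
    request, i.e. the number of open requests equals the number of
    threads."""
    def __init__(self, n_threads):
        self._cond = threading.Condition()
        self._items = deque()
        self._waiting = 0
        self.n_threads = n_threads

    def share(self, subproblem):
        with self._cond:
            self._items.append(subproblem)
            self._cond.notify()

    def request(self):
        """Called by a thread whose local container ran dry. Returns a
        subproblem, or None on global termination."""
        with self._cond:
            self._waiting += 1
            while not self._items:
                if self._waiting == self.n_threads:
                    self._cond.notify_all()   # everyone idle: terminate
                    return None
                self._cond.wait()
            self._waiting -= 1
            return self._items.popleft()
```

With a single thread, a shared subproblem is returned immediately, and a second request signals termination; with several threads, all of them receive None once every container is empty.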
Figure 4: Control logic of the cooperative parallel implementation of algorithm InsideOut.
Fig. 5 and Fig. 6 show the computation times of the parallel InsideOut algorithm on test problems. Fig. 5 illustrates the computation time required to determine a single optimal solution when executing the algorithm sequentially and on up to four threads. In Fig. 6 the test problems and algorithms are the same, but the generation of the 10 best optimal and suboptimal solutions is required. The results show that parallelization increases the running time when going from the sequential version to a single-thread parallel run, but executing the algorithm on more than one thread accelerates it. The acceleration is greater when the algorithm solves
more complex problems, e.g., problem_2 and problem_3 are more complex
than the other two.
Figure 5: Decrease of the computation time on multiple threads generating the optimal solution.
Figure 6: Decrease of the computation time on multiple threads generating the 10 best solutions.
The search tree of the parallel algorithm executed on four threads is illustrated in Fig. 7, where different colors are assigned to different threads. In this execution, a thread shared a subproblem only if it was requested and there were more than two subproblems in its local container. It is easily noticeable that the threads are equally loaded.
Figure 7: Multithread search tree.
4. Parameter settings
More acceleration can be achieved when the parameter settings are optimal. The first parameter is minimum_remaining_subproblem, which sets how many subproblems are required to remain in the local storage before sending a subproblem to another thread.
Figure 8: Computation time with different minimum_remaining_subproblem values.
Fig. 8 shows the expected results: the sooner a subproblem is shared, the more the overall computation time is reduced. If this value is higher, it causes waiting and the algorithm becomes slower and slower. In Fig. 8, the value '1' means that more than one subproblem has to remain in the thread's own stack, i.e., if there are two subproblems, the thread will share one of them when one of the other threads is waiting for it.
The other question of the subproblem-sharing strategy is what to share. The two options are the LocalNext and the GlobalNext subproblems. The difference between these two strategies is illustrated by the search trees in Figs. 9 and 10.
Figure 9: With the GlobalNext sharing strategy the algorithm shares a subproblem from a higher level.
Figure 10: With the LocalNext sharing strategy the algorithm shares a subproblem from a lower level.
Each thread performs a depth-first search. The green part has already been discovered by a thread, and the purple denotes the open subproblems available in the thread's own storage. With the GlobalNext sharing strategy (Fig. 9), the thread shares the subproblem available at the highest level of the search tree. With the LocalNext sharing strategy, the thread shares the subproblem from the lowest discovered level of the search tree (Fig. 10).
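Both strategies fall out naturally if the open subproblems live in a double-ended queue; the following sketch (the deque realization and all names except the two strategy names are our illustrative choices) also honors the minimum_remaining_subproblem parameter:

```python
from collections import deque

class LocalStore:
    """Per-thread container of open subproblems as a deque.

    The owning thread continues depth-first from the back. When asked
    to share, GlobalNext hands over the front element (oldest, i.e.
    highest in the search tree), while LocalNext hands over the back
    element (lowest discovered level). Sharing is refused if it would
    leave fewer than minimum_remaining subproblems locally."""
    def __init__(self, minimum_remaining=1):
        self._items = deque()
        self.minimum_remaining = minimum_remaining

    def push(self, sub):
        self._items.append(sub)

    def next_to_solve(self):
        return self._items.pop()          # depth-first for the owner

    def share(self, strategy):
        if len(self._items) <= self.minimum_remaining:
            return None                   # keep enough local work
        if strategy == "GlobalNext":
            return self._items.popleft()  # from the highest level
        return self._items.pop()          # LocalNext: lowest level
```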
Figure 11: Computation time with the LocalNext and GlobalNext sharing strategies.
The results (Fig. 11) show that if a small number of solutions is required, the LocalNext sharing strategy is more effective; however, if numerous solutions are requested, it is better to use the GlobalNext sharing strategy to decrease the running time. What counts as a small number depends on the size of the
problem. This is explained by the fact that the LocalNext strategy accelerates the depth-first search, while the GlobalNext strategy facilitates the breadth-first search.
[Figure 12 chart: computation time (sec), 0-140, for worse, better, and average parameter settings.]
Figure 12: Computation time when the parameter settings are improper, optimal, and average.
Fig. 12 illustrates that if the parameters are set improperly, the running time may even double. The difference between the worst and the best results demonstrates how important it is to choose the parameters properly.
The results in Table 1 show how much acceleration can be achieved by parallelization when solving different problems for different required numbers of best structures, provided that the parameter settings are optimal. It can be seen that the acceleration is higher for the larger problems (60-70%), because for the smaller ones the original algorithm is already very fast.
Table 1: Computation time and acceleration on different problems and different required numbers of best structures

Problem name            Solutions   Original time (s)   4-thread time (s)   Acceleration
Denmark 3.in                    1                 0.0                 0.0             0%
                               10                 0.1                 0.1             0%
                              100                 0.3                 0.2            33%
route_vp_307_2auto.in           1                98.3                41.8            57%
                               10               169.2               101.0            40%
                              100               884.2               418.3            52%
SNS2_v4.in                      1                 0.0                 0.0             0%
                               10                 8.2                 1.9            77%
                              100               133.3                48.6            64%
Example72.in                    1                 1.5                 0.6            60%
                               10                 7.5                 4.2            44%
                              100                39.0                28.9            26%
Average:                                                                            38%
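The acceleration column in Table 1 is consistent with the relative time saving 100 * (t_original - t_parallel) / t_original, rounded to whole percent. A quick spot-check (the function name is ours):

```python
def acceleration(original_s, parallel_s):
    """Percentage of the original running time eliminated by the
    4-thread run: 100 * (t_original - t_parallel) / t_original."""
    return round(100 * (original_s - parallel_s) / original_s)

# Values taken from Table 1 (route_vp_307_2auto.in, 1 solution):
print(acceleration(98.3, 41.8))   # 57
```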
5. Conclusion
A cooperative shared-memory parallel implementation of the InsideOut algorithm for process network synthesis has been presented herein. Initial tests show that the loads of the threads can be balanced with a relatively low frequency of communication, and that executing the algorithm on more processor cores is faster than on a single one. The results also show that the more complex a problem is, the more its solution can be accelerated by parallel execution. The proposed algorithm has numerous parameters to be fine-tuned for different fields of application, e.g., the minimal number of subproblems in the local subproblem containers before sharing tasks with other threads, or the subproblem-sharing strategy.
Acknowledgements
Publication of this paper has been supported by the European Union and Hungary and co-financed by the European Social Fund through the project TÁMOP-4.2.2.C-11/1/KONV-2012-0004 - National Research Center for Development and Market Introduction of Advanced Information and Communication Technologies.
References
[1] Friedler F., Tarjan K., Huang Y.W., Fan L.T., Combinatorial Algorithms for Process Synthesis, Computers Chem. Engng., 16, S313-S320 (1992).
[2] Friedler F., Varga J.B., Feher E., Fan L.T., Combinatorially Accelerated Branch-and-Bound Method for Solving the MIP Model of Process Network Synthesis, Nonconvex Optimization and Its Applications, State of the Art in Global Optimization, Computational Methods and Applications, 609-626 (1996).
[3] Varga J.B., Friedler F., Fan L.T., Parallelization of the Accelerated Branch-and-Bound Algorithm of Process Synthesis: Application in Total Flowsheet Synthesis, Acta Chimica Slovenica, 42, 15-20 (1995).
[4] Illés T., Nagy Á., Sufficient Optimality Criteria for Linearly Constrained, Separable Concave Minimization Problems, Journal of Optimization Theory and Applications, 125(3), 559-575 (2005).
[5] Bertok B., Barany M., Friedler F., Generating and Analyzing Mathematical Programming Models of Conceptual Process Design by P-graph Software, Industrial & Engineering Chemistry Research, 52(1), 166-171 (2013).