Parallelization of DC Analysis through Multiport Decomposition ∗
Gaurav Trivedi, Madhav P. Desai, H. Narayanan
{trivedi,madhav,hn}@ee.iitb.ac.in
Department of Electrical Engineering,
Indian Institute of Technology, Bombay,
Mumbai, 400 076,
India
Abstract
Physical problems offer scope for macro-level parallelization of their solution by virtue of their essential structure. For parallelizing electrical network simulation, the most natural structure-based method is Multiport Decomposition. In this paper this method is used for the simulation of electrical networks consisting of resistances, voltage sources and current sources on a distributed cluster of weakly coupled processors. Equations are solved at two levels in this method: in the first scheme we use sparse LU factorization at both levels; in the second scheme we use sparse LU at the inner level and Conjugate Gradient at the outer level. Results are presented for planar networks for the cases of 1 and 2 slave processors, and for circuit sizes up to 8.2 million nodes and 16.4 million edges using 8 slave processors. We use a cluster of Pentium IV processors linked through a 10/100 Mbps Ethernet switch.
1 Introduction

The usual means for overcoming limitations of time and space in the computational solution of large scale problems is to adopt a parallelization strategy. Parallelization at the micro level is currently being studied with great intensity. However, effective utilization of this technique needs expensive infrastructure. A very good compromise is macro (high) level parallelization. The present day work environment invariably consists of networked computers. High level parallelization is suited to such a situation, since this strategy allows us to assign subtasks to different processors which communicate infrequently and in bursts. In this paper we examine the high level parallelization of DC analysis.

Our primary motivation is the approximate solution of the min cost flow problem through electrical engineering methods. This problem is extremely important from the point of view of applications and has also been well studied through the algorithmic methods of computer science [1]. Basically, the solution of the min cost flow problem is equivalent to the solution of a network made up of ideal diodes, voltage sources and current sources (a DVJ network). To solve a DVJ network approximately, we replace the ideal diodes by practical diodes (with characteristic $i = I_s(e^{v/V_T} - 1)$). To solve the latter network through the Newton-Raphson procedure, at the innermost loop we need to solve resistor, voltage source and current source (RVJ) circuits. In this paper we study the solution of large scale RVJ circuits through parallelization using the multiport decomposition method.

Figure 1. Electrical equivalent of a branch in a flow graph (two diodes D1, D2, a voltage source ε and a current source J)

When one converts the min cost flow problem to a DVJ circuit, a typical flow edge converts to a composite electrical branch made up of two diodes, one voltage source and one current source (Figure 1). We are therefore interested in the solution of RVJ circuits where the number of voltage sources is 25% of the total number of devices. If we use modified nodal analysis for such a circuit we would have a coefficient matrix of size (n + E − 1) × (n + E − 1), where n ≡ number of nodes and E ≡ number of voltage sources.

∗ This project is supported by the WEBOPT project sponsored by the European Union's Asia IT & C programme under grant ASI/B7–301/97/0126–73.
20th International Conference on VLSI Design (VLSID'07)
0-7695-2762-0/07 $20.00 © 2007
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on December 3, 2008 at 06:55 from IEEE Xplore. Restrictions apply.
Also, the matrix will have E diagonal zeros. The presence of such a large number of diagonal zeros rules out iterative methods which depend on positive definiteness. On the other hand, if we adopt the elementary method called the 2-Graph technique [2], we get a coefficient matrix of size (n − E − 1) × (n − E − 1) in the equations to be
solved. This technique breaks the RVJ circuit solution into
two steps. In Step (a) we solve an RJ circuit whose graph
G1 is obtained from the original graph G by short circuiting the voltage sources and where the current sources are
obtained by appropriately transforming the original current
and voltage sources. This yields the currents and voltages of
the resistors. The second step (Step (b)) builds a network by
short circuiting certain edges of the original network. It then
uses the current in the resistors in the RJ circuit of Step (a)
to compute the current of the voltage sources through KCL.
Step (b) is graph theoretic, linear time and requires very little storage. For example, our DC analyzer takes only 1.17 seconds to find the two graphs associated with an RVJ electrical network, perform KCL verification and compute currents in all the edges, but it takes 16.69 seconds to solve the RJ electrical network generated during Step (a) of the 2-Graph technique. Step (a) is the major
computational effort requiring substantial storage and this
is the step we parallelize. So in this paper we only consider
the parallelization of an RJ circuit even though our primary
interest is in the solution of large RVJ circuits.
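The Newton-Raphson step mentioned above, which turns a diode network into an RVJ circuit at each iteration, can be illustrated by a small companion-model sketch. This is our own illustration, not code from the paper; the function name and parameter values are hypothetical defaults.

```python
import math

def diode_companion(v0, Is=1e-12, VT=0.025):
    """Linearize the diode i = Is*(exp(v/VT) - 1) around operating point v0.

    Returns (geq, Ieq): a conductance in parallel with a current source,
    so that near v0 the diode current is approximately geq*v + Ieq.
    This linearized element is what makes each NR iteration an RVJ circuit.
    """
    i0 = Is * (math.exp(v0 / VT) - 1.0)   # diode current at v0
    geq = (Is / VT) * math.exp(v0 / VT)   # slope di/dv at v0
    Ieq = i0 - geq * v0                   # intercept of the tangent line
    return geq, Ieq

geq, Ieq = diode_companion(0.6)
# The tangent model reproduces the diode current exactly at v0:
assert abs(geq * 0.6 + Ieq - 1e-12 * (math.exp(0.6 / 0.025) - 1.0)) < 1e-9
```

At each iteration the operating point v0, and hence geq and Ieq, changes, but the circuit topology does not; this is the property exploited later when the sparse LU ordering is computed only once.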
Section 2 briefly recalls the Conjugate Gradient method. Section 3 gives a brief introduction to Multiport Decomposition. Section 4 describes the two solution schemes used in parallelizing the DC analyzer. Section 5 discusses experimental results. Section 6 concludes.
2 Conjugate Gradient

The Conjugate Gradient (CG) method is an effective and well known method for linear equations of the form Ax = b, where A is symmetric positive definite [4]. We omit a description of this method, but remark that, computationally, the step requiring the most effort in each iteration of the algorithm is that of finding, given a vector x, the vector Ax. In our case, in the scheme we have called the LU-CG method, Ax is computed by implicit means without explicitly computing and storing the dense matrix A.

3 Multiport Decomposition

As mentioned before, in the 2-Graph technique, parallelizing RVJ circuit analysis reduces to parallelizing RJ circuit analysis. A whole range of methods is available for the parallelization of electrical circuit analysis [8]. One of the most natural of these is the method of multiport decomposition, which indeed goes back to the time of Thevenin's Theorem. However, it is often not clear from the literature that this method is essentially topological and independent of the devices present in the network [9]. Essentially the method consists of the following steps.

1. Step 1: Decompose the network into multiports and a port connection diagram whose edges are the ports of the multiports.

2. Step 2: Find the port behaviour at the multiports.

3. Step 3: Use the port behaviour of Step 2 as the device characteristic for the port connection diagram network and solve it. This amounts to matching the port conditions of the multiports. We thus know the port voltages and currents of the multiports at the end of this step.

4. Step 4: Impose the port edge voltages/currents of Step 3 onto the multiports to obtain all voltages and currents of the multiports.

When the circuit is nonlinear, such as the one that arises in the case of the min cost flow problem, at each NR iteration we encounter an RVJ circuit whose topology remains the same. The multiport decomposition is done only once, but the values of the resistors, current sources and voltage sources change with the iteration. Since the topology does not change, the zero-nonzero structure of the matrices does not change with the iteration. So the ordering in the sparse LU routines does not change and therefore has to be done only once.

Figure 2. Original electrical network of w × h nodes

Figure 3. Decomposed multiports (multiports A and B with port sources)
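The CG method recalled in Section 2 can be sketched in a few lines. This is our own minimal illustration, not the paper's code; note that the matrix enters only through products Ax (here supplied as a function), which is exactly the property the LU-CG scheme exploits later.

```python
import numpy as np

def cg(matvec, b, tol=1e-10, maxiter=1000):
    """Conjugate Gradient for A x = b, A symmetric positive definite.

    `matvec` computes A @ x; A itself never needs to be formed explicitly.
    """
    x = np.zeros_like(b)
    r = b - matvec(x)      # initial residual
    p = r.copy()           # initial search direction
    rs = r @ r
    for _ in range(maxiter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Small SPD test system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg(lambda v: A @ v, b)
assert np.allclose(A @ x, b)
```

In the LU-CG scheme the call `matvec` stands for an implicit computation that involves solving the multiport equations, as described in Section 4.1.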
Let $N_{orig}$ be the original RJ circuit with graph G, with reduced incidence matrix $A_o$ and device characteristic

$(i - J) = Gv$    (1)

Let E be the edge set of G and let it be partitioned into $E_1, \cdots, E_k$. We now decompose $N_{orig}$ into the multiports $N_{1P_1}, \cdots, N_{kP_k}$ and the port connection diagram $N_P$. The multiports $N_{jP_j}$, $j = 1, \cdots, k$ are on graphs $G_{jP_j}$ with edge sets $E_j \cup P_j$, where the $P_j$ are the port edges, and $N_P$ is built on $G_P$ with edge set $P_1 \cup \cdots \cup P_k$. The number of port edges depends upon the partition $E_1, E_2, \cdots, E_k$. Given the partition, building the multiports and the port connection diagram is essentially linear time ([7], [8], [9]). However, as is to be expected, finding the best partition (in the sense of minimizing the total number of port edges) is NP-hard. We find a good partition heuristically by using the partitioner METIS [5]. In the present work we have considered planar networks, which invariably decompose into multiports with few ports. Further, in the planar case, the construction of the multiports is intuitive and trivial.

Figure 4. Port Connection Diagram (port nodes joined by the port edges)

Formally, at the end of Step 1, the constraints of $N_{orig}$,

$A_o i = 0$    (2)

$A_o^T v_n = v$    (3)

$(i - J) = Gv$    (4)

are transformed into the equivalent constraints (in the variables i, v)

$\begin{bmatrix} A_{E_j} & A_{P_j} \end{bmatrix} \begin{bmatrix} i_j \\ i_{P_j} \end{bmatrix} = 0$    (5)

$\begin{bmatrix} A_{E_j}^T \\ A_{P_j}^T \end{bmatrix} v_{nP_j} = \begin{bmatrix} v_j \\ v_{P_j} \end{bmatrix}$    (6)

$G_j v_j = (i_j - J_j), \quad j = 1, \cdots, k$    (7)

$\begin{bmatrix} \hat{A}_{P_1} & \cdots & \hat{A}_{P_k} \end{bmatrix} \begin{bmatrix} i_{P_1} \\ \vdots \\ i_{P_k} \end{bmatrix} = 0$    (8)

$\begin{bmatrix} \hat{A}_{P_1}^T \\ \vdots \\ \hat{A}_{P_k}^T \end{bmatrix} v_{nP} = \begin{bmatrix} v_{P_1} \\ \vdots \\ v_{P_k} \end{bmatrix}$    (9)

where $[A_{E_j} \ A_{P_j}]$ is the reduced incidence matrix of the j-th multiport and $[\hat{A}_{P_1} \cdots \hat{A}_{P_k}]$ is the reduced incidence matrix of the port connection diagram.
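To make the incidence-matrix notation concrete, the nodal constraints (2)-(4) combine into the familiar system $A_o G A_o^T v_n = -A_o J$. A small self-contained check on a hypothetical 3-node resistive circuit (our own values, purely illustrative):

```python
import numpy as np

# Hypothetical example: 3 nodes + ground, 4 resistive edges, one current source.
# Reduced incidence matrix Ao (rows: non-ground nodes, cols: edges; +1 tail, -1 head).
Ao = np.array([
    [ 1,  0,  1,  0],   # node 1
    [-1,  1,  0,  1],   # node 2
    [ 0, -1,  0,  0],   # node 3
], dtype=float)
G = np.diag([1.0, 2.0, 0.5, 1.5])    # edge conductances (siemens)
J = np.array([1.0, 0.0, 0.0, 0.0])   # source current on edge 0

# Solve Ao G Ao^T vn = -Ao J for the node potentials vn.
vn = np.linalg.solve(Ao @ G @ Ao.T, -Ao @ J)

# Recover branch voltages and currents, then verify KCL: Ao i = 0.
v = Ao.T @ vn
i = G @ v + J                        # device characteristic (i - J) = G v
assert np.allclose(Ao @ i, 0)
```

Equations (5)-(9) state exactly this system, rewritten block by block so that the blocks $A_{E_j}$ can be handled on separate processors.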
In Step 2, each multiport is treated as a network with the ports $P_j$ taken as voltage sources $v_{P_j}$. For instance, for the multiport $N_{1P_1}$, if we take $[A_{E_1} \ A_{P_1}]$ to be row equivalent to

$\begin{bmatrix} A_{1E_1} & 0 \\ A_{2E_1} & A_{2P_1} \end{bmatrix}$,

we can get the 2-graph equations in the form

$A_{1E_1} G_1 A_{1E_1}^T v_{nP_1} = -A_{1E_1} J_1 + T_{1P_1} v_{P_1}$    (10)

$\begin{bmatrix} A_{2E_1} & A_{2P_1} \end{bmatrix} \begin{bmatrix} i_{E_1} \\ i_{P_1} \end{bmatrix} = 0$    (11)

(where all the coefficient matrices on the LHS and RHS of the equations can be obtained by scaling with conductance values and by linear time graph theoretic algorithms).

From Equations 10 and 11 for each multiport, we can obtain the port behaviour

$i_{P_1} = G_{P_1} v_{P_1} + b_{P_1}, \quad \cdots, \quad i_{P_k} = G_{P_k} v_{P_k} + b_{P_k}$    (12)

This completes Step 2. If we now use Equations 8 and 9, we get

$\begin{bmatrix} \hat{A}_{P_1} & \cdots & \hat{A}_{P_k} \end{bmatrix} \begin{bmatrix} G_{P_1} & & \\ & \ddots & \\ & & G_{P_k} \end{bmatrix} \begin{bmatrix} \hat{A}_{P_1}^T \\ \vdots \\ \hat{A}_{P_k}^T \end{bmatrix} v_{nP} = -(\hat{A}_{P_1} b_{P_1} + \cdots + \hat{A}_{P_k} b_{P_k})$    (13)

We then solve Equation 13 to obtain $v_{nP}$ and use the latter in Equation 9 to obtain $v_{P_1}, \cdots, v_{P_k}$. This completes Step 3.
Next we substitute these in the multiport equations
(Equations 10 and 11) and obtain all node potentials of
the multiports and therefore all voltages and currents in the
multiports. We note that the non port variables among the
latter are the same as the voltages and currents of all the
edges of the original network. This completes Step 4.
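Steps 1-4 can be checked numerically on a toy circuit. The sketch below is our own illustration in the equivalent node-tearing (Schur complement) view, with a hypothetical 4-node chain split into two blocks sharing one boundary ("port") node; it verifies that the decomposed solve matches a direct nodal solve.

```python
import numpy as np

# Hypothetical chain gnd-n1-n2-n3-gnd, all conductances 1 S, 1 A injected at n1.
# Direct nodal solve over [n1, n2, n3]:
L = np.array([[2., -1., 0.], [-1., 2., -1.], [0., -1., 2.]])
J = np.array([1., 0., 0.])
v_direct = np.linalg.solve(L, J)

def port_behaviour(Lj, Jj, internal, boundary):
    """Eliminate a block's internal nodes (Schur complement).

    Returns (S, r): after elimination, the block's KCL at its boundary
    (port) nodes reads S @ v_b = r + (current entering from other blocks).
    """
    Lii = Lj[np.ix_(internal, internal)]
    Lib = Lj[np.ix_(internal, boundary)]
    Lbi = Lj[np.ix_(boundary, internal)]
    Lbb = Lj[np.ix_(boundary, boundary)]
    S = Lbb - Lbi @ np.linalg.solve(Lii, Lib)
    r = Jj[boundary] - Lbi @ np.linalg.solve(Lii, Jj[internal])
    return S, r

# Step 1: partition the edges into two blocks sharing node n2 (the "port").
LA, JA = np.array([[2., -1.], [-1., 1.]]), np.array([1., 0.])  # edges gnd-n1, n1-n2
LB, JB = np.array([[1., -1.], [-1., 2.]]), np.array([0., 0.])  # edges n2-n3, n3-gnd

# Step 2: port behaviour of each block (done in parallel by the slaves).
SA, rA = port_behaviour(LA, JA, internal=[0], boundary=[1])
SB, rB = port_behaviour(LB, JB, internal=[1], boundary=[0])

# Step 3: solve the (tiny) port connection system on the master.
v_port = np.linalg.solve(SA + SB, rA + rB)

# Step 4: impose the port voltage on each block and recover internal nodes.
v_n1 = np.linalg.solve(LA[:1, :1], JA[:1] - LA[:1, 1:] @ v_port)
v_n3 = np.linalg.solve(LB[1:, 1:], JB[1:] - LB[1:, :1] @ v_port)

assert np.allclose([v_n1[0], v_port[0], v_n3[0]], v_direct)
```

Only the small port system in Step 3 couples the blocks; everything else is independent per block, which is what makes the master/slave division below natural.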
While using the above technique for parallelizing circuit analysis, it is convenient to think of a single master and several slave processors. Steps 1 and 3 can be done by the master, and Step 2, where individual multiports are solved repeatedly, can be done by the slave processors. In this scheme the master and slave processors usually do not have to operate concurrently. Therefore, if memory constraints are not stringent, one of the slaves can itself function as the master.
4 Two schemes
We have studied two options after adopting multiport decomposition:
(a) Solve both the multiports and the port connection diagram network using sparse LU factorization. We call this the LU-LU method.
(b) Solve the multiports using sparse LU factorization but the port connection diagram network using the Conjugate Gradient method. We call this the LU-CG method; it is elaborated below.
Method (a) is fast but more memory intensive, while method (b) is slower but has minimal memory requirements for the master.
4.1 LU-CG
The essential difference between LU-LU and this method is in the way the solution of the port connection diagram is handled. Suppose the port behaviour for all the ports together is

$i_p = G_p v_p + b_p$    (14)

and Equation 13 for the port connection diagram network is written more compactly as

$A_{rp} G_p A_{rp}^T v_{np} = -A_{rp} b_p$    (15)

The key step in the CG algorithm (for solving Equation 15) is to obtain $(A_{rp} G_p A_{rp}^T)x$ given x. This can be done without explicitly computing $G_p$ or $(A_{rp} G_p A_{rp}^T)$ by proceeding as follows:

(i) Compute $A_{rp}^T x$, i.e., in the port connection diagram network compute the edge voltages given the node potential vector x. Call the resulting vector $v_{px}$.

(ii) To compute $G_p v_{px}$ we do not explicitly use Equation 14; rather, in Equations 10 and 11 we put $v_p = v_{px}$ and the internal sources to zero, solve, and then obtain $i_{px} = G_p v_{px}$. In other words, if in the multiports we set all internal sources to zero and the port voltages to $v_{px}$, we get the port currents $i_{px} = G_p v_{px}$. Note that the LU factorization of the multiport equations is done only once; this computation corresponds to solving the multiport equations for different right hand sides corresponding to different $v_{px}$. Explicit computation of $G_p$ is not required. Observe that for each iteration of the CG routine we have to compute one solution of the multiport equations.

(iii) Next we compute $A_{rp} i_{px}$. This may be interpreted as the current leaving the nodes of a graph where the currents in the branches are given by $i_{px}$. This is a simple node by node computation.

We thus see that in computing $(A_{rp} G_p A_{rp}^T)x$ the bulk of the effort is in computing $G_p v_{px} = G_p A_{rp}^T x$. The size of the port connection diagram is not very important, in contrast with the situation in the LU-LU method. What does matter is the number of CG iterations required for convergence.
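The implicit product described in (i)-(iii) can be sketched as follows. This is our own minimal version: the per-block solves stand in for the slaves' prefactored sparse LU solves, and all names and the one-port sanity check are hypothetical.

```python
import numpy as np

def make_implicit_matvec(Arp, blocks):
    """Return x -> (Arp Gp Arp^T) x without forming the dense Gp.

    `blocks` is a list of (Lii, Lib, Lbi, Lbb, cols): per-multiport
    sub-blocks plus the port indices that multiport owns. Each call
    solves the multiport's internal system (in the paper, this reuses
    a one-time sparse LU factorization held on a slave processor).
    """
    def matvec(x):
        vpx = Arp.T @ x                  # (i) port voltages from node potentials
        ipx = np.zeros_like(vpx)
        for Lii, Lib, Lbi, Lbb, cols in blocks:
            vb = vpx[cols]
            # (ii) port currents with the block's internal sources set to zero
            vi = np.linalg.solve(Lii, -Lib @ vb)
            ipx[cols] = Lbb @ vb + Lbi @ vi
        return Arp @ ipx                 # (iii) net current out of each port node
    return matvec

# One-port sanity check: a block whose port conductance works out to 0.5 S.
Arp = np.array([[1.0]])
blocks = [(np.array([[2.0]]), np.array([[-1.0]]),
           np.array([[-1.0]]), np.array([[1.0]]), [0])]
mv = make_implicit_matvec(Arp, blocks)
assert np.allclose(mv(np.array([2.0])), [1.0])   # 0.5 S * 2 V -> 1 A
```

Handing `mv` to a CG routine in place of an explicit matrix gives the LU-CG scheme: the master never stores the dense port coefficient matrix.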
5 Results
Parallelization of the DC Analyzer was implemented on a network of PIV 3.0 GHz processors, each having 1.0 GB RAM, connected through a 10/100 Mbps switch. In dealing with large circuits, memory swapping between RAM and disk can cause excessive delays. So, as far as possible, we have attempted to bring the required circuit entirely within RAM and perform all the calculations there.
To save space, we confine our detailed discussion to the case where the network is divided into 8 multiports and one master and one or two slave processors are used. We have, however, given plots showing the variation of speed with the number of slave processors, including the cases of four and eight slave processors. Our experiments have been performed on large planar rectangular grid type circuits and on general nonplanar circuits. We chose grids among planar circuits because they are harder to analyze than circuits which are closer to trees in the sense of having few cotree branches. Our results on nonplanar circuits are preliminary and primarily show the direction in which research must proceed in order to handle such circuits.
Tables 1 and 2 show the simulation results for the LU-LU and LU-CG cases. The notation used in the tables is as follows. The superscripts LL, LC refer respectively to the LU-LU and LU-CG methods. $t_s$, $t_m$ are the maximum time taken by a single slave processor and the time taken by the master processor, respectively. $t_{com}$ is the total communication time. $t_{tot}$ is the total time taken in the solution of the electrical network. $m_s$, $m_m$ are the memory used by a slave and by the master processor, respectively, in megabytes. All times are in seconds; we have deliberately used actual 'wall clock time' in place of CPU time. To illustrate the notation, consider the row corresponding to g100k in Table 2. The superscript LL refers to the LU-LU case and LC to the LU-CG case; the values for LU-CG are given within brackets. For the LU-LU case, $t_s^{LL}$ = 14 secs (the greater of the times of the two slave processors), $t_m^{LL}$ = 5 secs, $t_{com}^{LL}$ = 2 secs, $t_{tot}^{LL}$ = (max slave time + master time) = 19 secs, $m_s^{LL}$ = 33 MB and $m_m^{LL}$ = 40 MB.
It can be seen that communication time is not a significant factor in the total time in either the LU-LU or the LU-CG case. The slave time $t_s$ is the maximum amongst the slaves (the master has to wait this long before it can get all the results for further processing). The plots in Figures 5-6 give the time taken for solving circuits of various sizes when the number of slave processors is 2, 4 or 8. In the present case of rectangular grids the slave times vary approximately inversely with the number of processors. In general, it is not feasible to partition the network so that the multiports are exactly equal in size and have exactly equal number of
Circuit | t_s^LL (t_s^LC) | t_m^LL (t_m^LC) | t_com^LL (t_com^LC) | t_tot^LL (t_tot^LC) | m_s^LL (m_s^LC) | m_m^LL (m_m^LC)
g100k   | 39 (22)   | 6 (6)     | 2 (1)     | 45 (28)    | 33 (96)    | 36 (6)
g500k   | 152 (128) | 16 (44)   | 12 (6)    | 168 (172)  | 160 (480)  | 79 (52)
g1M     | 294 (288) | 29 (118)  | 25 (14)   | 323 (406)  | 335 (928)  | 132 (103)
g1.5M   | 486 (-)   | 61 (-)    | 57 (-)    | 547 (-)    | 460 (-)    | 189 (-)
g2M     | 742 (-)   | 115 (-)   | 111 (-)   | 857 (-)    | 633 (-)    | 240 (-)

Table 1. Simulation results with one master processor and one slave processor
Circuit | t_s^LL (t_s^LC) | t_m^LL (t_m^LC) | t_com^LL (t_com^LC) | t_tot^LL (t_tot^LC) | m_s^LL (m_s^LC) | m_m^LL (m_m^LC)
g100k   | 14 (11)   | 5 (4)     | 1 (1)     | 19 (15)    | 33 (44)    | 40 (6)
g500k   | 71 (64)   | 12 (21)   | 7 (3)     | 83 (85)    | 163 (224)  | 80 (52)
g1M     | 148 (141) | 23 (44)   | 17 (7)    | 171 (185)  | 315 (464)  | 132 (103)
g1.5M   | 241 (221) | 33 (73)   | 23 (10)   | 274 (294)  | 476 (660)  | 185 (163)
g2M     | 417 (427) | 63 (120)  | 49 (15)   | 480 (547)  | 646 (848)  | 240 (218)

Table 2. Simulation results with one master processor and two slave processors
ports. Hence, slave times, in general, cannot be expected to
vary inversely as the number of processors. This is essentially a feature of high level parallelization.
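The near-inverse scaling can be read off directly from the published totals. A small script of our own, using the LU-LU totals from Tables 1 and 2 (one slave vs. two slaves):

```python
# Total times (seconds) from Tables 1 and 2, LU-LU method.
ttot_one_slave = {"g100k": 45, "g500k": 168, "g1M": 323, "g1.5M": 547, "g2M": 857}
ttot_two_slaves = {"g100k": 19, "g500k": 83, "g1M": 171, "g1.5M": 274, "g2M": 480}

# Speedup of two slaves over one; ideal is close to 2 when the slave
# (multiport) phase dominates and the partition is well balanced.
speedup = {c: ttot_one_slave[c] / ttot_two_slaves[c] for c in ttot_one_slave}
for circuit, s in speedup.items():
    print(f"{circuit}: {s:.2f}x")
```

For the rectangular grids the ratios stay close to 2, consistent with the balanced partitions those grids admit.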
Figure 5. Plain grid type electrical network simulation by the LU-LU method (time in seconds vs. circuit size in thousands of nodes, for 1, 2, 4 and 8 slave processors)

Figure 6. Plain grid type electrical network simulation by the LU-CG method (time in seconds vs. circuit size in thousands of nodes, for 1, 2, 4 and 8 slave processors)
In the LU-LU method, the memory usage corresponds to storing all the defining data (topology + device characteristics) of the multiports assigned to the slave processor, plus one multiport LU factorization and port behaviour computation. Once this latter information is passed on to the master, it is deleted from the slave. During Step 4, the LU factorization for the multiport is done once more by the slave. In the LU-CG method this technique is not feasible, since we need to compute port currents from port voltages in each iteration of CG. Therefore slave memory requirements are greater in LU-CG than in LU-LU if the number of multiports per slave is more than one. Thus we see from Table 1 that in the single master + single slave case we are only able to solve sizes up to a million nodes with LU-CG, while with LU-LU we reach up to two million nodes. However, the master memory requirement in LU-CG is much less than that for LU-LU, since the dense coefficient matrix for the port connection diagram is not stored explicitly. This fact becomes critically important in the case of nonplanar networks, where the number of port branches can increase substantially and LU-LU is essentially infeasible beyond tens of thousands of nodes. We note that direct LU fails beyond 400,000 nodes because of memory requirements if a single computer of the kind used above (1 GB RAM) is employed. However, the LU-LU strategy permits us to use a single computer as both master and slave and reach up to 1.5 million nodes (note that in Table 1, $m_s + m_m$ is less than 1 GB). Finally, by using a master + 8 slave processors we have been able to reach sizes up to 8.2 million nodes and 16.4 million edges with an
LU-CG scheme (where $t_s^{LC}$ = 2222 secs, $t_m^{LC}$ = 253 secs, $t_{com}^{LC}$ = 97 secs, $t_{tot}^{LC}$ = 2475 secs, $m_s^{LC}$ = 1071 MB and $m_m^{LC}$ = 916 MB; number of blocks = 56).
Sparse LU routines do poorly on nonplanar circuits. Further, when we decompose nonplanar circuits the number of ports can become very large, which in turn can lead to the port connection diagram being very large. Because of these
Circuit | t_s^LL (t_s^LC) | t_m^LL (t_m^LC) | t_com^LL (t_com^LC) | t_tot^LL (t_tot^LC) | m_s^LL (m_s^LC) | m_m^LL (m_m^LC)
g1k     | 1 (1)    | 6 (0)     | 0 (0)    | 7 (1)     | 2 (<1)    | 45 (<1)
g3k     | 6 (3)    | 565 (0)   | 2 (1)    | 571 (3)   | 10 (1.6)  | 400 (2)
g10k    | - (15)   | - (3)     | - (1)    | - (18)    | - (4)     | - (4)
g50k    | - (91)   | - (23)    | - (17)   | - (114)   | - (14)    | - (25)
g100k   | - (208)  | - (49)    | - (40)   | - (257)   | - (19)    | - (53)
g200k   | - (267)  | - (103)   | - (55)   | - (370)   | - (70)    | - (108)

Table 3. Nonplanar simulation results with one master processor and eight slave processors
reasons, LU-LU is infeasible for nonplanar circuits of size beyond a few tens of thousands of nodes, and LU-CG beyond about 200,000 nodes, unless care is taken to decompose the network into planar multiports. Further research on nonplanar circuits is planned only with the LU-CG method and with the decomposed multiports being planar.

Figure 7. Nonplanar electrical network simulation by the LU-LU method (time in seconds vs. circuit size in thousands of nodes, 8 slave processors)

Figure 8. Nonplanar electrical network simulation by the LU-CG method (time in seconds vs. circuit size in thousands of nodes, 8 slave processors)

6 Conclusion

In this paper we have shown that the multiport decomposition method is an effective high level parallelization technique for the DC analysis of large resistor, current source, voltage source circuits (up to 8.2 million nodes and 16.4 million edges) using freely available systems of networked computers. We have described two schemes within this method, which we have called LU-LU and LU-CG. The LU-CG scheme permits us to handle very large sizes because it places a lesser memory burden on the master (supervisor) computer.

References
[1] R. K. Ahuja, T. L. Magnanti and J. B. Orlin, Network Flows: Theory, Algorithms and Applications, Prentice-Hall, Englewood Cliffs, New Jersey, 1993.

[2] S. H. Batterywala and H. Narayanan, "Efficient DC Analysis of RVJ Circuits for Moment and Derivative Computations of Interconnect Networks," 12th International Conference on VLSI Design, 1999, pp. 169-174.

[3] J. B. Dennis, Mathematical Programming and Electrical Networks, John Wiley & Sons, New York, 1959.

[4] G. Golub and C. Van Loan, Matrix Computations, 2nd edition, Johns Hopkins University Press, Baltimore, 1989.

[5] G. Karypis and V. Kumar, "A fast and high quality multilevel scheme for partitioning irregular graphs," International Conference on Parallel Processing, 1995, pp. 113-122.

[6] LEDA libraries, Algorithmic Solutions Software GmbH, Germany, 2005.

[7] H. Narayanan, Submodular Functions and Electrical Networks, Annals of Discrete Mathematics, Volume 54, North Holland, Amsterdam, The Netherlands, 1997.

[8] H. Narayanan, "Topological transformations of electrical networks," International Journal of Circuit Theory and Applications, Vol. 15, 1987, pp. 211-233.

[9] H. Narayanan, "On the decomposition of vector spaces," Linear Algebra and its Applications, 76, 1986, pp. 61-98.