Modeling and Analyzing of Blocking Time Effects on Power Consumption in
Network-on-Chips
Arghavan Asad, Amir Ehsani Zonouz, Mehrdad Seyrafi, Mohsen Soryani, and Mahmood Fathy
Department of Computer Engineering, Iran University of Science and Technology
{arghavan_asad, ae_zonouz, mehseyrafi}@comp.iust.ac.ir, {soryani, mahfathy}@iust.ac.ir
Abstract-- Networks-on-Chip (NoCs) have been proposed as an
efficient and scalable solution for providing global on-chip
communication in large VLSI designs. At the same time, power
dissipation has grown so important that it now constrains
attainable performance. The large magnitude of this power
consumption, relative to the active power, can therefore have
serious implications for the feasibility of deploying NoCs.
If NoCs are to be accepted, their full power implications need
to be known, and these power characteristics must be accurately
understood across the large design space of NoCs. Blocking time
is one of the factors that affect NoC power consumption. In this
paper we present a Markovian model for evaluating the power
dissipated by packet blocking and show the effect of blocking
time on the total power consumption of on-chip networks.
Keywords- Network on Chip (NoC); High Traffic Regions;
Packet Blocking Power Consumption; Blocking Time
I. INTRODUCTION
Large systems on chip (SoCs) are interconnect-limited because of
high area, power, and delay costs. Network on Chip (NoC) is an
efficient on-chip communication architecture for SoCs: it is
structured, reusable, scalable, and high-performance, and it
enables the integration of a large number of computational and
storage blocks on a single chip.
Researchers and designers have recognized the importance of
low-power computing even for very high-end microprocessors. Clock
rates and die sizes increase with technology scaling; power
dissipation is therefore predicted to become the key limiting
factor on the performance of single-chip microprocessors.
Nowadays, power efficiency is one of the most important concerns
in NoC architecture design. Consider a 10×10 tile-based NoC with
a regular mesh topology, 32-bit link width, 0.18 µm technology,
and minimal spacing: under 100 Mbit/s pair-wise communication
demands, the interconnects dissipate 290 W of power [2]. Thus,
reducing the power consumption of global interconnects is key to
the success of NoC designs.
It is not clear which NoC-based architecture is best suited for a
specific application. The trade-off between performance and power
consumption in interconnection networks is an open question. The
trade-off analysis can be performed by varying the following
design parameters: topology, length of physical links, width of
physical links, buffer allocation, switching techniques, routing
algorithms, and levels of service. Innovative performance
evaluation models are required to address the design challenges
of NoC-based interconnection architectures, particularly in
modeling and analyzing power consumption [1].
One of the best-known and well-designed topologies in
interconnection networks is the torus. Torus networks have high
path diversity, offering many alternative paths between each
source and destination. Many studies, e.g. [3], [5], have
investigated 2D torus NoC architectures (as shown in Fig. 1).
Wormhole switching has been widely used as the dominant switching
technique in contemporary multicomputers, so wormhole switching
is assumed in this research. With wormhole switching, as network
traffic increases, messages may experience large delays in
crossing the network because of chains of blocked channels. To
overcome this, the flit buffers associated with a given physical
channel are organized into several virtual channels, each
representing a “logical” channel with its own buffer and flow
control logic. Virtual channels are allocated independently to
different messages and compete with each other for the physical
bandwidth. We further assume that deterministic routing is used
to direct packets across the network. This form of routing
results in a simpler router implementation and has been used in
many practical systems. For deadlock-free routing, two
organizations of virtual channel allocation can be considered in
the context of deterministic routing: Dally’s methodology and
Duato’s methodology. In this research we use the second
organization, based on Duato’s methodology. In this scheme the V
virtual channels of a given physical channel are split into two
sets: VC1 = {v3, v4, …, vV} and VC2 = {v1, v2}. Since the routing
is deterministic, messages cross dimensions in a fixed,
predefined order. At each node a message can choose any of the
(V – 2) virtual channels in VC1. If all of these virtual channels
are busy, the message crosses v1 if ci < di; otherwise it crosses
v2.
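The virtual channel selection rule just described can be sketched in a few lines of Python. This is only an illustration: the function name and the `busy` list are hypothetical, and because the text does not define ci and di, the sketch assumes they are the current node's and the destination's coordinates in dimension i.

```python
from typing import Optional, Sequence


def select_virtual_channel(busy: Sequence[bool], c_i: int, d_i: int) -> Optional[int]:
    """Return the 1-based index of the virtual channel to use, or None if blocked.

    busy[j] is True when virtual channel v_(j+1) is occupied.
    VC2 = {v1, v2} and VC1 = {v3, ..., vV}, as in the text.
    c_i and d_i are ASSUMED to be the current and destination
    coordinates in dimension i (the text leaves them undefined).
    """
    V = len(busy)
    if V < 2:
        raise ValueError("need at least the two channels of VC2")
    # First try any free channel in VC1 = {v3, ..., vV}.
    for j in range(2, V):
        if not busy[j]:
            return j + 1
    # VC1 is full: only one channel of VC2 is permitted for this message.
    fallback = 0 if c_i < d_i else 1  # list index of v1 or v2
    return None if busy[fallback] else fallback + 1
```

A message for which the function returns None is blocked at that hop, which is the event whose probability the model later calls PBx or PBy.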
Depending on the complexity of the interconnection network and
the resources available, simulation may take a long time. An
analytical model of the system is an alternative approach: it can
provide the values of the desired performance parameters of a
specific on-chip network in a fraction of the time that
simulation would take.
Figure 1. A 4×4 Torus network topology
Proceedings of the 5th IEEE International Conference on ReConFigurable Computing and FPGAs (ReConfig’09) © IEEE 2009
The rest of this paper is organized as follows. A brief summary
of related work and our distinct contributions is given in
Section 2. Section 3 proposes an analytical model for packet
blocking power consumption and analyzes the effects of blocking
time on NoC power dissipation. Section 4 validates the proposed
model using simulation. Finally, Section 5 concludes this paper.
II. RELATED WORK
A thorough review of a range of NoC architectures and platforms
as a scalable solution for on-chip communication has already been
presented in [6] and [7]. The network topology and the routing
algorithm used in the underlying on-chip communication network
are the two most important aspects that distinguish the various
proposed NoC architectures [8], [9].
Performance analysis techniques that can be used in optimization
loops are extremely important. Several analytical models in the
literature provide an average packet latency model for different
interconnection networks under different traffic patterns [3],
[5], [10]. The authors of [11] present power and throughput
models in terms of traffic rate parameters for the most popular
traffic models, i.e. Uniform, Local, Hotspot, and First Matrix
Transpose (FMT). A novel approach to high-level power modeling
based on latency for WK-recursive and mesh NoC topologies is
presented in [12]. In [12] the power consumption is calculated
for two different operating regions of the NoC, namely the low
and high traffic regions. The low traffic region of a NoC is the
region in which there is no packet contention or contention is
rare. The high traffic region is defined as the region in which
packet blocking occurs frequently but there is no packet deadlock
and the network does not enter saturation.
Four factors affect NoC power consumption: 1. the length of the
link between adjacent routers, 2. the distance between source and
destination in terms of the number of links, 3. the number of
channel monitoring operations in a cycle, and 4. blocking time.
This paper focuses on the last of these, blocking time.
Near and in high traffic regions, the energy consumption is
affected by packet contention. In this case part of the
dissipated energy comes from packet blocking (energy consumed for
buffering during the blocking time), which is absent under low
traffic load. The other forms of energy dissipation, in wires,
switching hardware, and general buffering, are almost the same as
under low traffic load. In this paper, we propose an analytical
model to compute the packet blocking power consumption of torus
networks.
The main contribution of this work is a high-level approach for
modeling and studying the power consumption of packet blocking in
high traffic regions. With this model we can identify which
design parameters have the greatest effect on packet blocking
power consumption and try to reduce it.
III. THE PACKET BLOCKING POWER CONSUMPTION MODEL
A. Model Assumptions
The model is based on the following assumptions, which
are widely used in the literature [3], [4], [5], [10]:
• Nodes generate traffic independently of each other,
following a Poisson process with a mean rate of
λg messages/cycle/node or λ packets/cycle/node.
• Message length is fixed (M flits) and packet length
is fixed (L flits).
• Message destinations are uniformly distributed
across the network nodes.
• V virtual channels per physical channel are used.
• The local queue in the source node has infinite
capacity. Moreover, messages at the destination
node are transferred to the local PE as soon as they
arrive at their destinations through the ejection
channel.
B. Model Description
For modeling the power consumption of packet
blocking, we first model the packet blocking energy
consumption as follows, and then derive the power
consumption model from calculated energy model. To
model the packet blocking energy consumption, first, the
mean power consumption of a blocking message,
Pblocking _ message , is determined. Then the mean blocking time,
T blocking , which is the average blocking time of a message, is
evaluated. Finally the total number of transmitted messages
that successfully arrive at their destinations in a uniform
traffic pattern, N transmitted _ messages , is calculated. Therefore, the
mean packet blocking energy consumption can be written as
Energy consumption =
P blocking _ message × T blocking × N transmitted _ messages
(1)
In each router, the FIFO buffer at each port has a capacity of B
flits. The mean power consumed by buffering at each router port
is denoted $P_{buffering}$. Therefore, the mean power consumption
of a blocked message can be written as
$$P_{blocking\_message} = \frac{M \times P_{buffering}}{B} \tag{2}$$
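As a small worked illustration of equations (1) and (2), the sketch below evaluates both for made-up inputs. The function names and the numeric values are hypothetical and are not taken from the paper's experiments.

```python
def blocking_message_power(M: float, B: float, p_buffering: float) -> float:
    """Equation (2): mean power of a blocked message of M flits
    held in port buffers of B flits, each drawing p_buffering."""
    return M * p_buffering / B


def blocking_energy(p_message: float, t_blocking: float, n_messages: float) -> float:
    """Equation (1): power per blocked message x mean blocking time
    x number of transmitted messages."""
    return p_message * t_blocking * n_messages


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
p_msg = blocking_message_power(M=32, B=4, p_buffering=0.5)
energy = blocking_energy(p_msg, t_blocking=10.0, n_messages=100)
```

With these inputs a 32-flit message spans 32/4 = 8 buffers, so `p_msg` is 8 × 0.5 = 4.0 and `energy` is 4.0 × 10 × 100 = 4000.0 in whatever units `p_buffering` and the time base imply.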
In the 2-D torus network, where k is the number of nodes in each
dimension, the average number of hops that a message makes along
one dimension, $\bar{k}$, and the average number of hops that a
message traverses before reaching its destination, $\bar{d}$, are
given by [13]
$$\bar{k} = \frac{k}{4} \tag{3}$$
$$\bar{d} = \frac{k}{2} \tag{4}$$
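The value k/4 in equation (3) can be checked numerically. The sketch below assumes bidirectional wrap-around links, an even k, and an average taken over all k positions in a dimension (including the source's own position); these assumptions are ours and are not spelled out in the text.

```python
def ring_distance(a: int, b: int, k: int) -> int:
    """Shortest hop count between positions a and b on a k-node ring
    with links in both directions (assumed)."""
    forward = (b - a) % k
    return min(forward, k - forward)


def average_hops_per_dimension(k: int) -> float:
    """Average ring distance from node 0 to every position 0..k-1."""
    return sum(ring_distance(0, t, k) for t in range(k)) / k


for k in (4, 8, 16):
    k_bar = average_hops_per_dimension(k)
    d_bar = 2 * k_bar  # two dimensions, X then Y
    print(k, k_bar, d_bar)
```

For even k the brute-force average equals k/4 per dimension, and doubling it for the two dimensions gives k/2, matching equation (4).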
Under the uniform traffic pattern, messages arrive at network
channels at a uniform rate. The channel arrival rate can be found
by dividing the total channel arrival rate by the number of
channels in the network. Each PE generates, on average,
$\lambda_g$ messages per cycle, resulting in a total of
$k^2 \lambda_g$ newly generated messages per cycle in the
network. Since each message traverses, on average, $\bar{d}$ hops
to cross the network, and the four output channels of each node
are equally utilized, the arrival rate of messages at any network
channel, denoted $\lambda_c$, is
$$\lambda_c = \frac{k^2 \lambda_g \bar{d}}{4k^2} = \frac{\lambda_g \bar{d}}{4} \tag{5}$$
A message in dimension X traverses $\bar{k}$ hops, on average,
and then moves to the next dimension, Y. At each hop there is one
cycle to transfer the header flit over a channel, plus some delay
due to blocking in the network. Therefore the average network
message blocking time, $t_{blocking}$, can be written as
$$t_{blocking} = \bar{k} \times \bar{W}_x \times P_{Bx} + \bar{k} \times \bar{W}_y \times P_{By} \tag{6}$$
where $\bar{W}_x$ and $\bar{W}_y$ are the average waiting times
for acquiring a virtual channel in dimensions X and Y, and
$P_{Bx}$ and $P_{By}$ are the probabilities of a message being
blocked at a hop of dimension X and Y, respectively. To compute
$P_{Bx}$ and $P_{By}$ two cases should be considered: (i) V
virtual channels are busy, which means all virtual channels in
sets VC1 and VC2 are busy, and (ii) V-1 virtual channels are
busy, which means all virtual channels in the first set VC1 are
busy and the virtual channel to be used by the message in the
second set VC2 is busy too.
Let $P_{i,v}$, 0 ≤ v ≤ V, represent the probability that v
virtual channels at a physical channel in dimension i (i = X, Y)
are busy. Then
$$P_{Bi} = P_{i,V} + \frac{P_{i,V-1}}{V}; \quad i = X, Y \tag{7}$$
The probability $P_{i,v}$ can be determined using the Markovian
model shown in Fig. 2. State $\pi_v$ corresponds to v virtual
channels being requested. The transition rate out of state
$\pi_v$ to state $\pi_{v+1}$ is $\lambda_c$, and the rate out of
state $\pi_v$ to state $\pi_{v-1}$ is $1/\bar{S}_i$, i = X, Y.
The probability that v virtual channels are busy, 0 ≤ v < V, is
the probability of being in state $\pi_v$, i.e.
$P_{i,v} = \Pr(\pi_v)$. However, the probability that V virtual
channels are busy is the sum of the probabilities of being in
states $\pi_v$, V ≤ v < ∞, i.e.
$P_{i,V} = \sum_{l=V}^{\infty} \Pr(\pi_l)$. The steady-state
solution of the Markovian model yields the probability $P_{i,v}$
as
$$P_{i,v} = \begin{cases} (1 - \lambda_c \bar{S}_i)(\lambda_c \bar{S}_i)^v; & 0 \le v < V \\ (\lambda_c \bar{S}_i)^V; & v = V \end{cases} \tag{8}$$
Figure 2. Markov model for computing occupying and releasing virtual
channels probabilities at dimension i (i=X, Y)
For computing $P_{i,v}$, the average service time of a channel in
dimensions X and Y is needed. The average service time of a
channel in dimension i (i = X, Y), $\bar{S}_i$, can be expressed
as
$$\bar{S}_i = \begin{cases} M + \bar{k}(1 + \bar{W}_x P_{Bx}); & i = X \\ M + \bar{k}(1 + \bar{W}_x P_{Bx}) + \bar{k}(1 + \bar{W}_y P_{By}); & i = Y \end{cases} \tag{9}$$
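The channel arrival rate, the steady-state occupancy probabilities of the Markov chain, and the resulting blocking probability can be sketched as follows. The sketch assumes a birth rate equal to the channel arrival rate and a death rate of one over the mean service time, which is what the geometric steady-state form implies; function names and the sample numbers are hypothetical.

```python
from typing import List


def channel_arrival_rate(lam_g: float, k: int) -> float:
    """Channel arrival rate with d_bar = k/2 and four output channels per node."""
    d_bar = k / 2.0
    return lam_g * d_bar / 4.0


def vc_busy_probabilities(lam_c: float, S_i: float, V: int) -> List[float]:
    """Return [P_0, ..., P_V]: probability that v virtual channels are busy.

    States v >= V of the infinite chain are lumped into P_V.
    """
    rho = lam_c * S_i
    if not 0.0 <= rho < 1.0:
        raise ValueError("steady state requires 0 <= lam_c * S_i < 1")
    probs = [(1.0 - rho) * rho ** v for v in range(V)]
    probs.append(rho ** V)
    return probs


def blocking_probability(probs: List[float]) -> float:
    """P_B = P_V + P_(V-1) / V: all V channels busy, or V-1 busy with the
    single free channel being the VC2 channel this message may not use."""
    V = len(probs) - 1
    return probs[V] + probs[V - 1] / V


probs = vc_busy_probabilities(lam_c=0.01, S_i=40.0, V=3)
p_block = blocking_probability(probs)
```

Lumping the tail into the last state keeps the probabilities summing to one, since (1 - rho)(1 + rho + ... + rho^(V-1)) + rho^V = 1.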
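To determine the mean waiting times, $\bar{W}_x$ and
$\bar{W}_y$, to acquire a virtual channel in dimensions X and Y,
an M/G/1 queue is used, with the mean waiting time given by [14].
Using $\bar{S}_x$ and $\bar{S}_y$ from equation (9), $\bar{W}_x$
and $\bar{W}_y$ can therefore be calculated as
$$\bar{W}_x = \frac{\lambda_c \left( \bar{S}_x^2 + (\bar{S}_x - M)^2 \right)}{2(1 - \lambda_c \bar{S}_x)} \tag{10}$$
$$\bar{W}_y = \frac{\lambda_c \left( \bar{S}_y^2 + (\bar{S}_y - \bar{S}_x)^2 \right)}{2(1 - \lambda_c \bar{S}_y)} \tag{11}$$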
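The two waiting-time expressions share one form, which the sketch below captures in a single helper. It assumes the M/G/1 form in which the service-time variance is approximated by the squared gap between the mean service time and a minimum service time (M for dimension X, the X service time for dimension Y); the function name and sample values are hypothetical.

```python
def mg1_wait(lam: float, S: float, S_min: float) -> float:
    """Mean M/G/1 waiting time with arrival rate lam, mean service time S,
    and variance approximated as (S - S_min)**2 (assumed form)."""
    rho = lam * S
    if rho >= 1.0:
        raise ValueError("queue is saturated: lam * S must be below 1")
    return lam * (S ** 2 + (S - S_min) ** 2) / (2.0 * (1.0 - rho))


# Hypothetical values: M = 32 flits, S_x = 40 cycles, S_y = 50 cycles.
W_x = mg1_wait(lam=0.01, S=40.0, S_min=32.0)   # dimension X
W_y = mg1_wait(lam=0.01, S=50.0, S_min=40.0)   # dimension Y
```

The guard on `rho` reflects that the formula is only meaningful below saturation, where the denominator stays positive.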
The local queue in the source node is modeled as an M/G/1 queue
with mean arrival rate $\lambda_g / V$ (recalling that a message
in the source node can enter the network through any of the V
virtual channels). A message originating from a given source node
sees the waiting time, $W_s$, as follows [3]
$$W_s = \frac{(\lambda_g / V) \left( \bar{S}_y^2 + (\bar{S}_y - M)^2 \right)}{2(1 - (\lambda_g / V) \bar{S}_y)} \tag{12}$$
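Because the service times depend on the waiting times and blocking probabilities, which in turn depend on the service times, the equations up to this point are mutually dependent. The paper does not say how they are solved; one common choice, assumed here, is fixed-point iteration from the zero-load service times. The sketch is illustrative only and all names in it are hypothetical.

```python
def solve_blocking_model(lam_g, k, M, V, tol=1e-9, max_iter=10000):
    """Iterate the service-time equations to a fixed point (assumed method)."""
    k_bar = k / 4.0                      # average hops per dimension
    lam_c = lam_g * (k / 2.0) / 4.0      # channel arrival rate
    lam_s = lam_g / V                    # source-queue arrival rate

    def wait(lam, S, S_min):
        # M/G/1 waiting time form used for W_x, W_y and W_s
        if lam * S >= 1.0:
            raise ValueError("queue is saturated (lam * S >= 1)")
        return lam * (S ** 2 + (S - S_min) ** 2) / (2.0 * (1.0 - lam * S))

    def p_block(S):
        # P_V + P_(V-1) / V with geometric steady-state probabilities
        rho = lam_c * S
        return rho ** V + (1.0 - rho) * rho ** (V - 1) / V

    Sx, Sy = M + k_bar, M + 2.0 * k_bar  # zero-load starting point
    for _ in range(max_iter):
        Wx, Wy = wait(lam_c, Sx, M), wait(lam_c, Sy, Sx)
        PBx, PBy = p_block(Sx), p_block(Sy)
        new_Sx = M + k_bar * (1.0 + Wx * PBx)
        new_Sy = new_Sx + k_bar * (1.0 + Wy * PBy)
        delta = abs(new_Sx - Sx) + abs(new_Sy - Sy)
        Sx, Sy = new_Sx, new_Sy
        if delta < tol:
            break
    else:
        raise RuntimeError("fixed-point iteration did not converge")

    # Recompute dependent quantities from the converged service times.
    Wx, Wy = wait(lam_c, Sx, M), wait(lam_c, Sy, Sx)
    PBx, PBy = p_block(Sx), p_block(Sy)
    Ws = wait(lam_s, Sy, M)              # waiting time at the source queue
    return {"Sx": Sx, "Sy": Sy, "Wx": Wx, "Wy": Wy,
            "PBx": PBx, "PBy": PBy, "Ws": Ws}


result = solve_blocking_model(lam_g=0.001, k=8, M=32, V=3)
```

At low load the iteration settles within a few steps; near saturation the `wait` helper raises instead of returning a negative or unbounded value.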
When multiple virtual channels are used per physical channel,
they share the bandwidth in a time-multiplexed manner. The
average degree of multiplexing of virtual channels that takes
place at a physical channel in dimension i (i = X, Y) can be
estimated by [5]
$$\bar{V}_i = \frac{\sum_{v=1}^{V} v^2 P_{i,v}}{\sum_{v=1}^{V} v P_{i,v}} \tag{13}$$
Averaging over the two dimensions, X and Y, the average degree of
multiplexing of virtual channels in the network is given by
$$\bar{V} = \frac{\bar{V}_x + \bar{V}_y}{2} \tag{14}$$
Therefore, the mean blocking time, $\bar{T}_{blocking}$, which is
the average blocking time of a message, can be written as
$$\bar{T}_{blocking} = (t_{blocking} + W_s) \bar{V} \tag{15}$$
As stated in the model assumptions, nodes generate traffic
independently of each other, following a Poisson process with a
mean rate of $\lambda_g$ messages/cycle/node or $\lambda$
packets/cycle/node, and message destinations are uniformly
distributed across the network nodes. We also assume that the
total number of packets injected into the network by all of the
PEs is N (the total number of packets produced by the nodes). The
overall time that a fixed workload of N packets requires to
arrive successfully at its destinations under a uniform traffic
pattern, $T_{tpp}$ (the overall time required for all produced
packets to reach their destination nodes), can be written as
$$T_{tpp} = \frac{N}{k^2} \times \frac{1}{\lambda} \tag{16}$$
Therefore the total number of transmitted messages under a
uniform traffic pattern, $N_{transmitted\_messages}$, can be
calculated as
$$N_{transmitted\_messages} = \frac{N \lambda_g}{k^2 \lambda} \tag{17}$$
Note that $\lambda L = \lambda_g M$, so we can write a simple
form for $N_{transmitted\_messages}$ as follows
$$N_{transmitted\_messages} = \frac{N L}{k^2} \tag{18}$$
Finally, the overall required packet blocking energy is modeled
as
$$E_{blocking} = \frac{M \times P_{buffering}}{B} \times \left( (\bar{k} \times \bar{W}_x \times P_{Bx} + \bar{k} \times \bar{W}_y \times P_{By}) + \frac{(\lambda_g / V) \left( \bar{S}_y^2 + (\bar{S}_y - M)^2 \right)}{2(1 - (\lambda_g / V) \bar{S}_y)} \right) \times \left( \frac{\bar{V}_x + \bar{V}_y}{2} \right) \times \left( \frac{N L}{k^2} \right) \tag{19}$$
So far we have modeled the overall packet blocking energy
required at high packet injection rates and in high traffic
regions for torus network-on-chip topologies under a uniform
traffic pattern. The main goal of this paper, however, is to
model and analyze the packet blocking power consumption. Using
equation (20) and dividing $E_{blocking}$ by $T_{tpp}$, we obtain
the total packet blocking power consumption as equation (21)
$$E_{blocking} = P_{blocking} \times T_{tpp} \tag{20}$$
$$P_{blocking} = \frac{M \times P_{buffering}}{B} \times \left( (\bar{k} \times \bar{W}_x \times P_{Bx} + \bar{k} \times \bar{W}_y \times P_{By}) + \frac{(\lambda_g / V) \left( \bar{S}_y^2 + (\bar{S}_y - M)^2 \right)}{2(1 - (\lambda_g / V) \bar{S}_y)} \right) \times \left( \frac{\bar{V}_x + \bar{V}_y}{2} \right) \times (\lambda L) \tag{21}$$
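The last steps of the derivation, the virtual channel multiplexing degree and the final blocking power, can be sketched as below. The sketch takes the waiting times and blocking probabilities as inputs rather than computing them, and follows the power expression in the form in which it is printed; the function names, the fallback value for an idle channel, and the sample numbers are all hypothetical.

```python
from typing import Sequence


def multiplexing_degree(probs: Sequence[float]) -> float:
    """Average multiplexing degree from probs = [P_0, ..., P_V].

    Returns 1.0 when no channel is ever busy (an assumption made here
    to avoid dividing by zero at zero load).
    """
    num = sum(v * v * p for v, p in enumerate(probs))
    den = sum(v * p for v, p in enumerate(probs))
    return num / den if den > 0.0 else 1.0


def blocking_power(M, B, p_buffering, k, Wx, PBx, Wy, PBy, Ws,
                   Vx_bar, Vy_bar, lam, L):
    """Total packet blocking power in the form the paper prints it."""
    k_bar = k / 4.0
    t_blocking = k_bar * Wx * PBx + k_bar * Wy * PBy   # network blocking time
    V_bar = (Vx_bar + Vy_bar) / 2.0                    # averaged over X and Y
    T_blocking = (t_blocking + Ws) * V_bar             # mean blocking time
    return (M * p_buffering / B) * T_blocking * (lam * L)


def total_time(N, k, lam):
    """Time for a workload of N packets at lam packets/cycle/node."""
    return N / (k * k * lam)


# Hypothetical inputs chosen so the arithmetic is easy to check by hand.
P = blocking_power(M=32, B=4, p_buffering=0.5, k=8, Wx=1.0, PBx=0.5,
                   Wy=1.0, PBy=0.5, Ws=2.0, Vx_bar=1.0, Vy_bar=1.0,
                   lam=0.01, L=8)
E = P * total_time(N=6400, k=8, lam=0.01)
```

Here the network blocking time is 2 cycles, the mean blocking time is 4 cycles, and the result is 4.0 × 4 × 0.08 = 1.28 in the units of `p_buffering`; multiplying by the total time recovers the blocking energy.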
IV. VALIDATION OF THE MODEL
The proposed analytical model has been validated using a NoC
simulation platform developed in SystemC, coupled with the Orion
simulator [15] as a separate plug-in power analysis tool for our
network simulator. In each simulation experiment, a total of
10000 messages are delivered. The simulator uses the same
assumptions as the analysis, and some of these assumptions are
detailed here to make the network operation clearer. The network
cycle time is defined as the transmission time of a single flit
from one router to the next. Traffic sources generate 8-flit
packets with exponentially distributed inter-arrival times, whose
parameter depends on the packet injection rate. The FIFO buffers
have a capacity of 4 flits. Destination nodes are determined
using a uniform random number generator. An 8×8 2D torus topology
has been considered for all experiments, and the packet size is
fixed at 8 flits.
Numerous validation experiments have been performed for several
message lengths. Figs. 3 and 4 depict the total packet blocking
power dissipation predicted by the above model, plotted against
the results provided by the simulator, for message lengths M=32
and M=64 flits. The number of virtual channels per physical
channel was set to V = 3 for the experiments. The horizontal axis
in the figures shows the traffic generation rate at each node,
while the vertical axis shows the total packet blocking power
consumption to drain a fixed amount of traffic.
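The traffic pattern used in the experiments can be illustrated with a short Python sketch. This is not the authors' SystemC simulator: it only generates a list of injection events with exponential inter-arrival times and uniformly random destinations. The function name is hypothetical, and excluding the source node from the destination choice is an assumption.

```python
import random
from typing import List, Tuple


def generate_uniform_poisson_traffic(k: int, rate_per_node: float,
                                     n_packets: int,
                                     seed: int = 0) -> List[Tuple[float, int, int]]:
    """Return n_packets events (time, source, destination) for a k x k network.

    Independent Poisson sources with equal rates are merged into one
    Poisson stream of rate k*k*rate_per_node, and each event is then
    assigned a uniformly random source node.
    """
    if k < 2 or rate_per_node <= 0:
        raise ValueError("need k >= 2 and a positive injection rate")
    rng = random.Random(seed)
    n_nodes = k * k
    total_rate = n_nodes * rate_per_node
    t = 0.0
    events = []
    for _ in range(n_packets):
        t += rng.expovariate(total_rate)   # exponential inter-arrival time
        src = rng.randrange(n_nodes)
        dst = rng.randrange(n_nodes - 1)   # uniform over the other nodes
        if dst >= src:
            dst += 1
        events.append((t, src, dst))
    return events


events = generate_uniform_poisson_traffic(k=8, rate_per_node=0.01,
                                          n_packets=10000, seed=1)
```

Node indices run from 0 to 63 for the 8×8 case; mapping an index to torus coordinates is left to whatever simulator consumes the events.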
Figure 3. Total packet blocking power consumption predicted by the
model against simulation results for 8×8 torus topology with message
length M=32, and virtual channel number V= 3
Figure 4. Total packet blocking power consumption predicted by the
model against simulation results for 8×8 torus topology with message
length M=64, and virtual channel number V= 3
V. CONCLUSION
This paper has described an analytical model to compute the
packet blocking power consumption in wormhole-routed 2D torus
topologies with the XY routing algorithm. Simulation experiments
have revealed that the total packet blocking power consumption
predicted by the analytical model is in good agreement with that
obtained through simulation. The packet blocking power
consumption of other well-known topologies, such as mesh and
hypercube, can be modeled in the same way. Because of the
similarity of the symmetric mesh structure to the torus, the
results of this analysis extend easily to the mesh topology.
Our overall conclusion is that, owing to the wrap-around links of
the torus topology and its small average distance, fewer nodes
are involved in blocking than in other topologies. As the packet
blocking power model and the simulation results show, packet
blocking power consumption is very small in torus networks and is
negligible relative to other power consumption factors. The model
can also be explained as follows: a further increase in the
packet injection rate does not result in more traffic being
injected into the network, and packets simply remain queued at
the network interface of the source node. The interconnection
network therefore soon enters saturation, and the effect of
blocking time on torus power consumption diminishes.
Since mesh topologies are similar to torus topologies, packet
blocking power consumption also approaches zero in mesh
topologies as the total number of packets injected into the NoC
increases.
This is a novel result, showing that although performance (such
as packet latency and throughput) might be seriously degraded at
high congestion, there is no considerable direct impact on packet
dynamic energy. Any significant energy impact would come from
second-order effects, for example a reduction in the time
available to apply standby power minimization techniques because
of increased packet traversal time.
We therefore suggest that when energy (power) consumption is a
constraining criterion for a specific application, it is better
to use the torus topology.
Our next objective is to develop an analytical model for the
total power consumption of an interconnection network required to
drain a fixed amount of traffic on a single chip. Predicting and
modeling the total power consumption of an interconnection
network could prove decisive for battery-powered mobile devices.
It can also be very useful for the microarchitect in determining
a suitable routing policy for an application-specific traffic
pattern, or in balancing workload among network nodes to avoid
power and performance hotspots.
ACKNOWLEDGEMENT
This work was supported in part by a research grant
from IRAN Telecommunication Research Center (ITRC).
REFERENCES
[1] N. Banerjee, P. Vellanki, and K. S. Chatha, “A Power and
Performance Model for Network-on-Chip Architectures,” in DATE
’04: Proceedings of the Conference on Design, Automation and Test
in Europe, page 21250, Washington, DC, USA, 2004.
[2] Y. Hu, H. Chen, Y. Zhu, A. A. Chien, and C. Cheng, “Physical
Synthesis of Energy-Efficient Networks-on-Chip Through Topology
Exploration and Wire Style Optimizations,” Proceedings of the
International Conference on Computer Design (ICCD), pp. 111-118,
2005.
[3] H. Sarbazi-Azad, A. Khonsari, and M. Ould-Khaoua,
“Performance Analysis of Deterministic Routing in Wormhole K-ary
n-cubes with Virtual-Channels,” Journal of Interconnection
Networks, 2002.
[4] X. Chen and L. Peh, “Leakage Power Modeling and Optimization
in Interconnection Networks,” in Proceedings of the International
Symposium on Low Power Electronics and Design, pp. 90-95, 2003.
[5] H. H. Najaf-abadi and H. Sarbazi-Azad, “An Accurate
Combinational Model for Performance Prediction of Deterministic
Wormhole Routing in Torus Multicomputer Systems,” in Proceedings
of the International Conference on Computer Design, pp. 548-553,
2004.
[6] W. Dally and B. Towles, “Route Packets, Not Wires: On-Chip
Interconnection Networks,” Proc. of the Design Automation
Conference, pp. 684-689, Jun. 2001.
[7] D. Bertozzi, A. Jalabert, S. Murali, et al., “NoC Synthesis
Flow for Customized Domain Specific Multi-Processor
Systems-on-Chip,” IEEE Transactions on Parallel and Distributed
Systems, vol. 16, no. 2, pp. 113-129, 2005.
[8] A. Ehsani Zonouz, M. Seyrafi, A. Asad, M. Fathy, et al., “A
Fault Tolerant NoC Architecture for Reliability Improvement and
Latency Reduction,” accepted in DSD2009.
[9] J. Hu and R. Marculescu, “DyAD-Smart Routing for
Networks-on-Chip,” DAC 2004, pp. 260-263, San Diego, California,
USA, 2004.
[10] M. Ould-Khaoua and H. Sarbazi-Azad, “An Analytical Model of
Adaptive Wormhole Routing in Hypercubes in the Presence of Hot
Spot Traffic,” IEEE Trans. Parallel Distrib. Syst. 12(3),
pp. 283-292, 2001.
[11] S. Koohi, M. Mirza-Aghatabar, S. Hessabi, and M. Pedram,
“High-Level Modeling Approach for Analyzing the Effects of
Traffic Models on Power and Throughput in Mesh-Based NoCs,” 21st
International Conference on VLSI Design (VLSI Design 2008),
India, 4-8 January 2008, pp. 415-420.
[12] D. Rahmati, A. E. Kiasari, S. Hessabi, and H. Sarbazi-Azad,
“A Performance and Power Analysis of WK-Recursive and Mesh
Networks for Network-on-Chips,” IEEE International Conference on
Computer Design (ICCD 2006), San Jose, CA, USA, Oct. 2006.
[13] J. T. Draper and J. Ghosh, “A Comprehensive Analytical Model
for Wormhole Routing in Multicomputer Systems,” Journal of
Parallel and Distributed Computing, vol. 23, no. 2, pp. 202-214,
1994.
[14] L. Kleinrock, Queueing Systems, Part 1, Wiley, New York,
1975.
[15] H. Wang, X. Zhu, L.-S. Peh, and S. Malik, “Orion: A
Power-Performance Simulator for Interconnection Networks,” Proc.
MICRO, pp. 294-305, 2002.
