Embedding Fault-Tolerance with Dual-Level Agents in Many

Noname manuscript No.
First MEDIAN
(MEDIAN
2012)
(will be Workshop
inserted by
the editor)
Embedding Fault-Tolerance with Dual-Level Agents
in Many-Core Systems
Liang Guang · Syed M. A. H. Jafri · Bo
Yang · Juha Plosila · Hannu Tenhunen
the date of receipt and acceptance should be inserted later
Abstract Dual-level fault-tolerance is presented on many-core systems, provided by the software-based system agent and hardware-based local agents.
The system agent performs fault-triggered energy-aware remapping with bandwidth constraints, addressing coarse-grained processor failures. The local agents
achieve fine-grained link-level fault tolerance against transient and permanent
errors. The paper concisely presents the architecture, dual-level fault-tolerant
techniques and experiment results.
1 Introduction
Due to the increasing influence of PVT variations and complex application
scenarios, many-core systems rely on run-time optimization to achieve the
expected performance, energy-efficiency and dependability. In particular, as
errors and failures may occur unpredictably at various architectural levels,
fault-tolerance needs to be provided adaptively both in coarse and fine granularity. Ad-hoc manners of fault management may lead to non-compatible
system architectures which are not scalable nor portable. Instead, we propose
a generic system architecture with dual-level agents, supporting fault tolerance
in many-core systems with diverse granularities.
2 Dual-Level Agents for Run-Time Management Services
A design layer for monitoring and reconfiguration services is added upon the
conventional many-core architectures (Fig. 1). The layer is composed of one
Liang Guang, Bo Yang, Juha Plosila
University of Turku, Finland
(liagua,boyan,juplos)@utu.fi
Syed M. A. H. Jafri, Hannu Tenhunen
Royal Institute of Technology, Sweden
(jafri,hannu)@kth.se
41
First MEDIAN Workshop (MEDIAN 2012)
2
Liang Guang et al.
system agent, which is the general manager of all monitoring and reconfiguration operations, and distributed local agents, which are delegates of the system agent to actuate the operations. Such organizational partition of agents
is designed to support both coarse and fine-grained services, including faulttolerance. The system agent monitors the global system status, performance
and errors, and handles resource allocation. The local agent only monitors the
fine-grained status of individual component, such as a router. It reconfigures
the local setting, e.g. the replacement of broken wires. As monitors themselves
also suffer from errors, the system agent is monitoring the error status of the
local agents. The system agent is designed with higher reliability, which will
be addressed in the future work.
7<#+"/(40"-+
!"#$%&'()#*
6"*"0)+"
7$8+2)&"(
*+$$,$"&-$&.#.,/(%#.&0*-/#11-*1
!"#$%&'"()**$')+,$.)%*+(/)-)0"/"-+
1$2"&(/)-)0"&
9&)::"&
3$')*(
40"-+
3$')*(
40"-+
5$6"
5$6"
2(*.3(*#
;)-<=>$&"(
?)'@A$-"
4::*,')+,$!
!
!
!
B/)0"(
1&$'"##,-0
C,6"$(
"-'$6,-0
D"-"&)*(
6,0,+)*(
:&$'"##,-0
EEE
!
!
!
!
!
!
!
!
Fig. 1 System and Local Agents for Scalable Monitoring and Reconfiguration
[Asad et al(2012)Asad, Guang, Jantsch, Paul, Hemani, and Tenhunen]
3 Coarse and Fine-Grained Fault-Tolerance with Energy Awareness
We consider a platform where multiple applications can be dynamically loaded,
which is a very likely scenario for the thousand-core SoC used as the processing
engine in cloud computing. As illustrated in Fig. 2, the system agent, running
software instructions, determines the mapping of applications on available processors. To account for fault-tolerance, a portion of processors are left as spares.
In case one processor fails, the most suitable spare is utilized by the software
agent to replace the broken one. The suitability is chosen, in this case, based on
the minimal potential energy consumption of inter-processor communication,
given the bandwidth constraint is met (Section 4). While the system agent
addresses the processor failure, the local agent, which is a hardware module
attached to each router, addresses the link errors. Each local agent detects
transient link errors with Hamming codes. If the error persists after repetitive
checkings, a permanent fault is identified on a certain wire. The local agent
then replaces the broken wire with a spare wire.
42
First MEDIAN Workshop (MEDIAN 2012)
Dual-Level Fault Tolerance
3
+""'(,#-(.)&/&0+/1
+""'(,#-(.)&
2&0+21
!
+""'(,#-(.)&4&0+41
+""'(,#-(.)&
3&0+31
!
567,&8%23&#,
!"##$#%&'())$#%&(#*&
+,'())$#%&$#-.+"/.$0#)*"(&4%1./&%
5
9&:(*1&8&#,%
;*88"#3%
1.4&/04&1.4&/
5
5
!"#$%&'()*
!"#$%<*+(,
!.1*(%
/&1.#<"3+/*,".#
!"#$%!&'&(%)*+(,%
-.#,/.((&/%0%!.1*(%23&#,
Fig. 2 Dual-Level Fault Tolerance on Many-Core Systems
4 Initial Experiments
For experimental purposes, four applications (one for image processing, one
MPEG encoding, two parallel kernels) are mapped on a 10*9 many-core system. The initial mapping is done with multi-application multi-step mapping
algorithms [Yang et al(2010)Yang, Guang, Canhao, Yin, Säntti, and Plosila].
81 processor cores are used for the initial application mapping (26, 21, 18 and
16 respectively), leaving 9 spare cores. With random choice of processor errors,
the fault-triggered remapping searches for the most energy-efficient replacement. Considering that the number of spares is limited, an exhaustive searching
is considered a proper tradeoff between the solution’s optimality and run-time
overhead. The energy consumption is represented by the weighted communication volume (WCA; mathematical modeling as V olume × HopCounts), which
is proportional to the dynamic energy consumption. As the maximal channel bandwidth can not be violated after remapping, the bandwidth constraint
is checked for every possible replacement. Table 1 lists four cases of replacements. WCA is reported as normalized to the WCA of the initial mapping.
As shown in Table 1, the energy consumption is increased after remapping,
as a processor needs to be replaced by a spare more distant from other utilized processors. When the bandwidth constraint is considered, the remapping
may have to utilize a spare even further in distance to avoid communication
congestion, thus with higher WCA.
The local agent integrates a state machine to automate the link-level fault
management (Fig. 3). If the Hamming decoder identifies transient bit errors,
the local agent will repetitively check the link. If a bit error persists, the
corresponding wire is considered permanently broken. Then a spare wire is
used for the replacement. Based on the availability of spares, a number of
43
First MEDIAN Workshop (MEDIAN 2012)
4
Liang Guang et al.
Table 1 Exemplified Results of Fault-Triggered Energy-Aware Remapping
Failed
Nodes
(23,3,22)
(23,3,22)
(72,3,22)
(72,3,22)
Replacement
(35,5,25)
(35,5,45)
(75,5,25)
(75,5,35)
Normalized
WCA
1.48
1.64
1.48
1.56
Bandwidth
Constraint
No
Yes
No
Yes
wire errors can be tolerated. If all spare wires are used but more errors occur,
a request is sent to the system agent for configuring rerouting or remapping
(details beyond the scope of this paper). The local agent is coded in VHDL and
synthesized with 65nm CMOS library. Each router (with 136b unidirectional
links) is 33806um2 . Each packet has 128bits, in addition to 8 bits as spare
wires. Each local agent (including Hamming coder, decoder and control state
machine) requires 9098um2 (26.9% area overhead).
After N permanent error detections
Remap
req
synd==0
No
error
synd!=0
synd==0
wchd
ft2
synd!=0
count > thresh
synd!=0
count<thresh
synd==0
wchd
ft1
Tempft
1
synd!=0
synd==0
Tempft
2
synd==0
wchd
Switch
wire
synd!=0
count<thresh
synd!=0
count>thresh
synd!=0
After single permanent error detection
Before permanent fault detection
Fig. 3 State Transition of Local Agents for Link-Level Fault-Tolerance
5 Conclusion and Future Work
The dual-level agent-based management architecture is modular and scalable.
The system agent is exclusively dealing with major failures influencing global
performance. The fine-grained local fault detection and recovery are handled
by the distributed local agents. The architecture has been experimented with
expected performance, with moderate hardware overhead. It is an on-going
research, with future work focusing on the completion of fault detection and
recovering techniques on each agent level.
References
[Asad et al(2012)Asad, Guang, Jantsch, Paul, Hemani, and Tenhunen] Asad M, Guang L,
Jantsch A, Paul KP, Hemani A, Tenhunen H (2012) Self-adaptive noc power management
with dual-level agents - architecture and implementation. In: PECCS 2012, pp 1–9
[Yang et al(2010)Yang, Guang, Canhao, Yin, Säntti, and Plosila] Yang B, Guang L, Canhao XT, Yin AW, Säntti T, Plosila J (2010) Multi-application multi-step mapping
method for many-core network-on-chips. In: Proc. of Norchip 2010, pp 1–6
44