First MEDIAN Workshop (MEDIAN 2012)

Embedding Fault-Tolerance with Dual-Level Agents in Many-Core Systems

Liang Guang · Syed M. A. H. Jafri · Bo Yang · Juha Plosila · Hannu Tenhunen

Liang Guang, Bo Yang, Juha Plosila
University of Turku, Finland
(liagua,boyan,juplos)@utu.fi

Syed M. A. H. Jafri, Hannu Tenhunen
Royal Institute of Technology, Sweden
(jafri,hannu)@kth.se

Abstract This paper presents dual-level fault tolerance for many-core systems, provided by a software-based system agent and hardware-based local agents. The system agent performs fault-triggered, energy-aware remapping under bandwidth constraints, addressing coarse-grained processor failures. The local agents provide fine-grained link-level fault tolerance against transient and permanent errors. The paper concisely presents the architecture, the dual-level fault-tolerance techniques and experimental results.

1 Introduction

Due to the increasing influence of process, voltage and temperature (PVT) variations and complex application scenarios, many-core systems rely on run-time optimization to achieve the expected performance, energy efficiency and dependability. In particular, as errors and failures may occur unpredictably at various architectural levels, fault tolerance needs to be provided adaptively at both coarse and fine granularity. Ad hoc fault management may lead to incompatible system architectures that are neither scalable nor portable. Instead, we propose a generic system architecture with dual-level agents, supporting fault tolerance in many-core systems at diverse granularities.

2 Dual-Level Agents for Run-Time Management Services

A design layer for monitoring and reconfiguration services is added on top of the conventional many-core architecture (Fig. 1). The layer is composed of one
system agent, which is the general manager of all monitoring and reconfiguration operations, and distributed local agents, which act as delegates of the system agent and actuate the operations. This organizational partition of agents is designed to support both coarse- and fine-grained services, including fault tolerance. The system agent monitors the global system status, performance and errors, and handles resource allocation. A local agent monitors only the fine-grained status of an individual component, such as a router, and reconfigures the local settings, e.g. by replacing broken wires. As monitors themselves also suffer from errors, the system agent monitors the error status of the local agents. The system agent is to be designed with higher reliability, which will be addressed in future work.

Fig. 1 System and Local Agents for Scalable Monitoring and Reconfiguration [Asad et al(2012)]

3 Coarse and Fine-Grained Fault-Tolerance with Energy Awareness

We consider a platform where multiple applications can be dynamically loaded, a likely scenario for thousand-core SoCs used as processing engines in cloud computing. As illustrated in Fig. 2, the system agent, running software instructions, determines the mapping of applications onto the available processors. To account for fault tolerance, a portion of the processors is left as spares. If a processor fails, the system agent replaces it with the most suitable spare.
Suitability is determined, in this case, by the minimal potential energy consumption of inter-processor communication, provided that the bandwidth constraint is met (Section 4). While the system agent addresses processor failures, the local agent, a hardware module attached to each router, addresses link errors. Each local agent detects transient link errors with Hamming codes. If an error persists after repeated checks, a permanent fault is identified on a specific wire. The local agent then replaces the broken wire with a spare wire.

Fig. 2 Dual-Level Fault Tolerance on Many-Core Systems

4 Initial Experiments

For experimental purposes, four applications (one image-processing application, one MPEG encoder and two parallel kernels) are mapped onto a 10×9 many-core system. The initial mapping is done with a multi-application multi-step mapping algorithm [Yang et al(2010)]. 81 processor cores are used for the initial application mapping (26, 21, 18 and 16 respectively), leaving 9 spare cores. Processor failures are chosen at random, and the fault-triggered remapping searches for the most energy-efficient replacement. Since the number of spares is limited, an exhaustive search is considered a proper tradeoff between solution optimality and run-time overhead. The energy consumption is represented by the weighted communication volume (WCA), modeled as Volume × HopCounts, which is proportional to the dynamic energy consumption.
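The exhaustive search over spares can be illustrated with a short sketch. This is not the authors' implementation: it assumes a 2D mesh with (x, y) core coordinates, Manhattan hop counts, XY routing for the bandwidth check, and communication volume used directly as link load. All function names and the toy placement are hypothetical.

```python
def hops(a, b):
    """Manhattan hop count between two (x, y) mesh coordinates (assumed metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def wca(placement, traffic):
    """Weighted communication volume: sum of volume * hop count over all flows."""
    return sum(v * hops(placement[s], placement[d]) for (s, d), v in traffic.items())


def xy_route(src, dst):
    """Directed links traversed under XY routing (assumed routing policy)."""
    links = []
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def bandwidth_ok(placement, traffic, capacity):
    """True if no directed link carries more than `capacity` total volume."""
    load = {}
    for (s, d), v in traffic.items():
        for link in xy_route(placement[s], placement[d]):
            load[link] = load.get(link, 0) + v
            if load[link] > capacity:
                return False
    return True


def remap(placement, traffic, failed_task, spares, capacity=None):
    """Try every spare core; return (best_spare, best_wca), or None if none is feasible."""
    best = None
    for spare in spares:
        trial = dict(placement)
        trial[failed_task] = spare  # move the task of the failed core to the spare
        if capacity is not None and not bandwidth_ok(trial, traffic, capacity):
            continue
        cost = wca(trial, traffic)
        if best is None or cost < best[1]:
            best = (spare, cost)
    return best


# Toy instance (hypothetical data): the core hosting t1 fails.
placement = {"t0": (0, 0), "t1": (0, 1), "t2": (2, 0), "t3": (0, 2)}
traffic = {("t0", "t1"): 4, ("t0", "t2"): 3, ("t1", "t3"): 2}
spares = [(1, 0), (0, 3)]

initial = wca(placement, traffic)
unconstrained = remap(placement, traffic, "t1", spares)
constrained = remap(placement, traffic, "t1", spares, capacity=6)
```

In this toy instance the unconstrained search selects the spare at (1, 0) with a WCA of 16 against an initial WCA of 12, while a link capacity of 6 rules that spare out and forces the more distant spare at (0, 3) with a WCA of 20. The cost of the search grows linearly with the number of spares, which is why exhaustive enumeration stays cheap when only a few spares exist.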
As the maximal channel bandwidth must not be exceeded after remapping, the bandwidth constraint is checked for every possible replacement. Table 1 lists four cases of replacement. WCA is reported normalized to the WCA of the initial mapping. As shown in Table 1, the energy consumption increases after remapping, because the failed processor has to be replaced by a spare that is more distant from the other utilized processors. When the bandwidth constraint is considered, the remapping may have to use a spare that is even further away to avoid communication congestion, resulting in a higher WCA.

Table 1 Exemplified Results of Fault-Triggered Energy-Aware Remapping
Failed Nodes          (23,3,22)  (23,3,22)  (72,3,22)  (72,3,22)
Replacement           (35,5,25)  (35,5,45)  (75,5,25)  (75,5,35)
Normalized WCA        1.48       1.64       1.48       1.56
Bandwidth Constraint  No         Yes        No         Yes

The local agent integrates a state machine to automate link-level fault management (Fig. 3). If the Hamming decoder identifies a transient bit error, the local agent checks the link repeatedly. If the bit error persists, the corresponding wire is considered permanently broken, and a spare wire is used as its replacement. The number of wire errors that can be tolerated depends on the availability of spares. If all spare wires are used and further errors occur, a request is sent to the system agent to configure rerouting or remapping (details are beyond the scope of this paper). The local agent is coded in VHDL and synthesized with a 65 nm CMOS library. Each router (with 136-bit unidirectional links) occupies 33806 µm². Each packet has 128 bits, in addition to 8 bits serving as spare wires. Each local agent (including Hamming coder, decoder and control state machine) requires 9098 µm² (26.9% area overhead).
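The link-level policy described above can be modeled in software as a small state machine. The following is a simplified behavioral sketch, not the VHDL design: it collapses the states of Fig. 3 into three, uses a single retry counter, and assumes that a non-zero syndrome identifies the faulty wire. The class name, the threshold value and the syndrome-to-wire interpretation are all hypothetical.

```python
from enum import Enum


class State(Enum):
    NO_ERROR = "no error"
    TEMP_FAULT = "temporary fault"
    REMAP_REQ = "remap requested"


class LocalAgent:
    """Behavioral model of link-level fault handling (simplified assumption)."""

    def __init__(self, spare_wires=8, threshold=3):
        self.spares_left = spare_wires  # 8 spare wires per link, as in the text
        self.threshold = threshold      # retry limit; value is an assumption
        self.count = 0                  # consecutive erroneous checks
        self.replaced = []              # wires already switched to spares
        self.state = State.NO_ERROR

    def step(self, syndrome):
        """Process one check result; syndrome 0 means the word was clean."""
        if self.state is State.REMAP_REQ:
            return self.state           # handed over to the system agent
        if syndrome == 0:
            self.count = 0              # error vanished: it was transient
            self.state = State.NO_ERROR
            return self.state
        self.count += 1
        if self.count <= self.threshold:
            self.state = State.TEMP_FAULT   # keep re-checking the link
            return self.state
        # count > threshold: the fault is treated as permanent
        self.count = 0
        if self.spares_left > 0:
            self.spares_left -= 1
            self.replaced.append(syndrome)  # switch the broken wire to a spare
            self.state = State.NO_ERROR
        else:
            self.state = State.REMAP_REQ    # no spares left: ask the system agent
        return self.state


# Usage: one spare wire, retry threshold of 2 (hypothetical values).
agent = LocalAgent(spare_wires=1, threshold=2)
trace = [agent.step(s) for s in (5, 0, 5, 5, 5, 7, 7, 7, 0)]
```

With one spare, the first persistent error (three consecutive non-zero syndromes) consumes the spare and returns the link to normal operation. The second persistent error finds no spare left and raises the remap request, after which the agent stays in that state until the system agent intervenes.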
[State diagram. States: No error, Tempft 1, Switch wire, Tempft 2, Remap req. Transition labels: synd==0, synd!=0, count<thresh, count>thresh, wchd ft1, wchd ft2. Phases: before permanent fault detection, after single permanent error detection, after N permanent error detections.]

Fig. 3 State Transition of Local Agents for Link-Level Fault-Tolerance

5 Conclusion and Future Work

The dual-level agent-based management architecture is modular and scalable. The system agent deals exclusively with major failures that influence global performance. Fine-grained local fault detection and recovery are handled by the distributed local agents. The architecture has been evaluated experimentally and shows the expected performance with moderate hardware overhead. This is ongoing research, with future work focusing on completing the fault detection and recovery techniques at each agent level.

References

[Asad et al(2012)] Asad M, Guang L, Jantsch A, Paul KP, Hemani A, Tenhunen H (2012) Self-adaptive NoC power management with dual-level agents - architecture and implementation. In: PECCS 2012, pp 1–9

[Yang et al(2010)] Yang B, Guang L, Canhao XT, Yin AW, Säntti T, Plosila J (2010) Multi-application multi-step mapping method for many-core network-on-chips. In: Proc. of Norchip 2010, pp 1–6