A STUDY OF THE IMPACT OF TEMPERATURE ON FPGA-BASED TMR DESIGNS

Amr A. Ahmadain, Karen A. Tomko
Department of Electrical and Computer Engineering and Computer Science
University of Cincinnati, Cincinnati, OH 45220

Abstract

TMR is one of the most common fault-tolerance methods for mission-critical systems, but it comes at the cost of increased power consumption and temperature levels. Temperature is one of the most critical factors that can potentially lead to failures in electronic systems. Testing at a steady-state temperature is the most common method of testing integrated circuits, although it is not the only type of stress that electronic devices are exposed to during their operational lifetime. Temperature cycles, temperature gradients and even random changes in temperature can all affect the reliability of integrated circuits and electronic devices. In this study, we argue that using steady-state temperature as the only stress factor, rather than more realistic temperature-lifetime stress relationships, could easily lead to pessimistic results and hence to overly conservative decisions. We relax the assumption of a constant failure rate by using an inhomogeneous Markov chain and explore the preliminary relationship between TMR-based designs, different temperature-lifetime models and the overall impact on system reliability. We show through Markov-based modeling that using steady-state temperature as the only method of stress testing could result in reliability estimates that are up to a factor of 8 more pessimistic than those of other types of temperature-lifetime models.

1. Introduction

In current high-performance sub-90 nm technology, leakage power is rising dramatically with ever-shrinking feature sizes and channel lengths [1].
It is well established that leakage power, with its two components of subthreshold and gate leakage, increases exponentially with junction temperature. A recent study has shown that leakage power is electrothermally coupled with junction temperature [2]. The temperature-driven increase in leakage power in turn raises the junction temperature, so that the two form a positive feedback loop that can potentially lead to thermal runaway.

FPGA-based Triple Modular Redundancy (TMR) designs have long been used to harden a design against Single Event Upsets (SEUs), which can flip a bit in the configuration SRAM of an FPGA and lead to serious logical errors. TMR, however, has high costs that should be considered in future designs. Since it triples all combinatorial and sequential design components, TMR results in a considerable increase in total design area. TMR also reduces performance due to the overhead of the voting and error detection and correction circuitry. Worst of all, a recent study has demonstrated that TMR designs consume more than triple the power consumed by simplex designs [3]. This increase in total power dissipation strengthens the positive feedback loop and feeds back into system reliability, leading to an exponential reduction in the Mean Time to Failure [4].

Reliability prediction methods for electronic systems, such as those in [5] and [6], have traditionally considered only steady-state temperature and constant failure rates, which might lead to errors in reliability prediction. The goal of this study is to develop a reliability prediction model that accurately captures the evolution of the system with time and temperature at each phase of its lifetime, with the assumptions of steady-state temperature and constant failure rate relaxed.

P217 MAPLD 2005
2. Notation

n       discrete time step
t       continuous time step
αm      Weibull shape parameter of a TMR module
αv      Weibull shape parameter of the TMR voter
R(n)    reliability as a function of both time and temperature
z(t)    hazard function as a function of both time and temperature
s(n)    temperature as a function of time
eA      activation energy of the Arrhenius equation
B       parameter of the Arrhenius relationship (eA/KB)
KB      Boltzmann's constant
C       parameter of the Arrhenius relationship that depends on product geometry and other factors

3. Related Work

Recent work has sought to alleviate the cost of a full TMR design. Selective TMR is a technique that hardens a design by selectively inserting TMR in the sensitive gates, as determined by signal probabilities [7]. In another effort to mitigate the cost of excessive area increase, partial TMR applies TMR to a subset of circuit components [8]. In that study, the number of persistent cross sections for a partially triple-modular-redundant design has been shown to quadruple as compared to an unmitigated design.

4. Study Motivations

This study has two key motivations. The first is that the widely used assumption of a constant failure rate may cause errors in reliability predictions [9]. Looking at the bathtub curve (Figure 1), which depicts the typical reliability life cycle of an electronic device, we notice that a constant failure rate is assumed during the "useful" life phase of the device.

[Figure 1: The Bathtub Curve. Failure rate versus lifetime, showing the infant mortality phase, the normal (useful) life phase with a "constant" failure rate, and the wear-out phase.]

We argue that device failure rates might change even during the useful phase of the life cycle. The second motivation is that temperature, much like the failure rate, might also vary with time. These arguably erroneous assumptions of constant failure rate and steady-state temperature could lead to an overly conservative prediction instead of a realistic one, which would ultimately result in an unbalanced resource allocation.

5. Solution Approach

In this paper, we employ Markov modeling to relax the two assumptions of constant failure rate and steady-state temperature and to capture the system's progress with time. We use a non-homogeneous (non-stationary) Markov chain to model the different system states and transitions. A non-homogeneous Markov chain is a chain whose transition probabilities are a function of time [10]. By allowing the transition probabilities to vary with time, the assumption of a constant failure rate is dropped and replaced by a failure rate that can change over time. Using the Markov chain, the TMR system reliability can then be calculated as a function of the varying failure rate. We assume that the TMR system has a typical configuration containing three redundant modules and a majority voter, as shown in Figure 2.

[Figure 2: The TMR Configuration Model. Modules M1, M2 and M3 feed a majority voter.]

There are three additional key assumptions in this study. The first is that the module failure rates are time and temperature dependent. The second is that failure rates are stochastically independent, which means that the failure of one module does not in any way affect the failure rate of any other module in the system. The third is that the majority voter has a different hazard rate (zv) than the TMR modules and hence a lower value of the Weibull distribution shape parameter (αv). This is a valid assumption, as the majority voter is typically more reliable than each of the redundant modules.

It is important to note that the TMR model implemented in this paper is generic and not specific to FPGAs. It can easily be used to analyze the reliability of any custom integrated circuit-based system.

We use the well-known Arrhenius equation to model the relationship between life and temperature:

$$\mathrm{MTTF} = C\, e^{e_A / (K_B T)}$$

where the MTTF is the Mean Time To Failure of a system module or of the majority voter. The activation energy in the above equation is a measure of the effect that temperature has on the system dynamics.

The Arrhenius-Weibull distribution is assumed to be the life distribution of the TMR system modules [11]. The probability density function (PDF) of the Arrhenius-Weibull distribution is expressed as follows:

$$f(t, s(t)) = \frac{\alpha}{C e^{B/s(t)}} \left( \frac{t}{C e^{B/s(t)}} \right)^{\alpha - 1} \exp\left[ -\left( \frac{t}{C e^{B/s(t)}} \right)^{\alpha} \right]$$

5.1 Derivation of the Reliability Model

5.1.1 Definition of System States

Before the Markov model is derived, system states and transitions have to be defined. The system is modeled with three distinct states. State '0', the initial state, is the state in which all the system modules are functional. State '1' is the state in which the system has one failed module while the other two modules are still functional. State '2' is the system failure state. The system fails when two or more modules have failed, sequentially or simultaneously, or when the majority voter fails.

5.1.2 Calculation of State Transition Probabilities

A state transition diagram is a symbolic representation of the states, transitions and transition probabilities of a Markov chain. In the reliability model derived in this paper, two state transition diagrams are defined: a continuous-time diagram, where the time-step interval approaches zero, and a discrete-time diagram, where the time step is a discrete interval of time. Two diagrams are defined because of the complexity involved in calculating the n-step transition probability matrix for a continuous-time Markov chain. By approximating the continuous-time process with its discrete-time equivalent, closed-form solutions can be derived. We follow this same approach in deriving the reliability model. For details on the exact approximation steps, the reader is referred to [10].
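As a rough illustration of how an Arrhenius-Weibull hazard could be turned into discrete-time transition probabilities, the Python sketch below may help. The function names, the midpoint-rule integration, the conversion 1 - exp(-∫z dt) and the way the three-state matrix is assembled from per-module and voter failure probabilities are all assumptions made for this example; the source does not give these expressions, and they are not the authors' formulas.

```python
import math

def scale(temp_k, C, B):
    """Arrhenius scale parameter C * exp(B / s) for a temperature s in kelvin."""
    return C * math.exp(B / temp_k)

def hazard(t, temp_k, alpha, C, B):
    """Arrhenius-Weibull hazard z(t, s) for t > 0."""
    eta = scale(temp_k, C, B)
    return (alpha / eta) * (t / eta) ** (alpha - 1.0)

def step_failure_prob(n, stress, alpha, C, B, dt=1.0, substeps=200):
    """Assumed discretization: 1 - exp(-integral of z over time step n).

    `stress` maps time to temperature in kelvin. The midpoint rule never
    evaluates z at t = 0, where it diverges for alpha < 1.
    """
    h = dt / substeps
    cumulative = 0.0
    for i in range(substeps):
        t_mid = n * dt + (i + 0.5) * h
        cumulative += hazard(t_mid, stress(t_mid), alpha, C, B) * h
    return 1.0 - math.exp(-cumulative)

def tmr_transition_matrix(q_m, q_v):
    """Hypothetical one-step matrix over states 0, 1, 2 (each row sums to 1)."""
    ok_m, ok_v = 1.0 - q_m, 1.0 - q_v
    p00 = ok_v * ok_m ** 3                # all modules and the voter survive
    p01 = ok_v * 3.0 * q_m * ok_m ** 2    # exactly one module fails
    p11 = ok_v * ok_m ** 2                # both remaining modules survive
    return [[p00, p01, 1.0 - p00 - p01],
            [0.0, p11, 1.0 - p11],
            [0.0, 0.0, 1.0]]

# Example with the Table 2 constants at a steady 373 K; the shape values
# 1.4 (module) and 0.8 (voter) are illustrative choices only.
C, B = 2.4e-9, 8117.82
steady = lambda t: 373.0
q_m = step_failure_prob(0, steady, alpha=1.4, C=C, B=B)
q_v = step_failure_prob(0, steady, alpha=0.8, C=C, B=B)
P0 = tmr_transition_matrix(q_m, q_v)
```

Because `stress` is an arbitrary function of time, the same helper accepts a constant, cyclic or ramped temperature profile, and the resulting matrix changes from step to step, which is what makes the chain non-homogeneous.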
[Figure 3: System State Transition Diagram. (a) Continuous-time diagram with transition rates A(t), B(t) and C(t); (b) discrete-time diagram with transition probabilities A(n), B(n) and C(n) and self-loop probabilities 1 - [A(n) + B(n)] on state 0, 1 - C(n) on state 1 and 1.0 on state 2.]

In the above two diagrams, A(n), B(n) and C(n) are a concise way of expressing the mathematically intricate state transition probabilities. All three are expressed in terms of the hazard function of the Arrhenius-Weibull distribution as given in [11]:

$$z(t, s(t)) = \frac{\alpha}{C e^{B/s(t)}} \left( \frac{t}{C e^{B/s(t)}} \right)^{\alpha - 1}$$

5.1.3 Calculation of the Reliability Function

The reliability of a Non-Homogeneous Discrete-Time Markov Chain (NHDTMC) as given in [12] is expressed as follows:

$$R(n) = \alpha_1 \left( \prod_{k=0}^{n-1} p_k^{U} \right) 1_v$$

where $\alpha_1 = [1 \;\; 0]$ and $1_v = [1 \;\; 1]^{T}$. Here $p_k^{U}$ denotes the one-step transition probability matrix at step k restricted to the operational states 0 and 1, and $\alpha_1$ is the initial distribution over those states, not the Weibull shape parameter.

Substituting the state transition probability matrices calculated in the previous step into the above equation and carrying out the necessary matrix multiplications yields the reliability of the system, R(n), at time n.

6. Implementation of the Reliability Model

The reliability model has been implemented in Mathematica 5.0. The reliability and failure rate functions have been evaluated using numerical integration with the Gauss-Kronrod quadrature method, an adaptive method that yields accurate estimates of the integrals [14].

To relax the assumption of a steady-state temperature, three different types of stress loading have been implemented: steady-state, cyclic and progressive [13]. Experiments have been designed around varying two sets of parameters. The first set is stress-related, while the second is related to the probability distribution. For steady-state stress, the parameter is simply the temperature. For cyclic stress, the parameter is the cycle period, and for progressive stress, it is the slope of the straight line. The parameter values have been chosen to span minimum, typical and maximum operational stress levels.
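As an illustration of how the reliability function of Section 5.1.3 might be evaluated in code, the Python sketch below propagates the initial row vector [1, 0] through a sequence of 2x2 blocks over the operational states and then sums the result. The helper `up_block` and the per-step failure probabilities are hypothetical inputs chosen for the example; they are not the A(n), B(n), C(n) expressions of the paper, and this is not the authors' Mathematica implementation.

```python
def up_block(q_m, q_v=0.0):
    """Hypothetical 2x2 block over the operational states 0 and 1.

    q_m and q_v are assumed per-step failure probabilities of a module
    and of the voter, respectively.
    """
    ok_m, ok_v = 1.0 - q_m, 1.0 - q_v
    return [[ok_v * ok_m ** 3, ok_v * 3.0 * q_m * ok_m ** 2],
            [0.0, ok_v * ok_m ** 2]]

def reliability(up_blocks):
    """R(n) as the row vector [1, 0] times the blocks in order, times a column of ones."""
    row = [1.0, 0.0]                      # the system starts in state 0
    for p in up_blocks:                   # one block per discrete time step
        row = [row[0] * p[0][0] + row[1] * p[1][0],
               row[0] * p[0][1] + row[1] * p[1][1]]
    return row[0] + row[1]                # probability of still being in state 0 or 1

# Time-varying example: the per-step module failure probability grows.
q_per_step = [0.01, 0.02, 0.04, 0.08]
r_varying = reliability([up_block(q, q_v=0.001) for q in q_per_step])

# Sanity check with a constant q and a perfect voter.
q, n = 0.1, 5
r_module = (1.0 - q) ** n
r_constant = reliability([up_block(q)] * n)
```

With a constant q and a perfect voter, the recursion reproduces the textbook TMR reliability 3r^2 - 2r^3 with r = (1 - q)^n, which is a convenient check before substituting time-varying probabilities.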
Steady-state is the most widely used type of stress loading, in which the temperature is kept constant throughout the product's operational lifetime. In cyclic stress loading, the product undergoes a cyclic change in temperature, varying between very high and very low values. Under progressive stress, the product experiences a continuous increase in temperature.

The shape parameter of the Weibull distribution, α, has also been varied to represent typical shapes of the Weibull distribution curves. The values of α have been adapted from [12]. Table 1 shows the exact values assigned to these two sets of parameters. Table 2 shows the three model constants used in the model. An activation energy of 0.7 eV is considered an industry standard and is used when a specific failure mechanism is not known.

Table 1: Parameters and Mathematical Functions of Stress Tests

Stress Test            Parameter      Parameter Values    Mathematical Function
Steady-State Stress    Temperature    328, 373, 423 K     T = constant
Cyclical Stress        Period         π, 2π, 4π           328 × sin(kn)
Progressive Stress     Slope          0.25, 0.5, 1.0      an + 273

Table 2: Model Constants

Model Constant            Value
Activation energy (eA)    0.7 eV
C                         2.4 × 10^-9
B = eA/KB                 8117.82

7. Results and Discussion

Results for this parametric study are shown in Figures 4-10. Figure 4 shows a large variation in reliability across the different types of temperature functions, with the steady-state temperature consistently yielding the lowest reliability. This supports the key motivation of the study: depending solely on steady-state temperature might lead to overly conservative decisions and hence to unbalanced resource allocation.
We can also see from Figures 6, 8 and 10 that the value of the shape parameter (α) has a negligible effect on the reliability, and from Figures 5, 7 and 9 that the stress test-related parameters have a visible impact on reliability for all types of stress tests. These observations indicate that the system reliability is more sensitive to the values of stress-related parameters, such as slope or period, than it is to parameters such as α, which model design variation.

8. Conclusions

In this paper, a reliability model for TMR-based designs has been investigated. We have shown that the two commonly used assumptions of steady-state temperature and constant failure rate could lead to errors in reliability prediction. Using a non-stationary Markov chain, the assumption of a constant failure rate has been relaxed. It has also been shown that the reliability predicted under steady-state temperature can be up to a factor of eight lower than that predicted under other stress loading functions such as cyclic and progressive stress. We have explained how the system reliability varies visibly with changes in the controllable stress-related parameters, a fact that encourages further investigation into models that take realistic lifetime scenarios into account.

References

[1] N.S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. Hu, M.J. Irwin, M. Kandemir, and V. Narayanan, “Leakage Current: Moore’s Law Meets Static Power”, Computer, Vol. 36(12), Dec. 2003, pp. 68-75.
[2] K. Banerjee, S. Lin, A. Keshavarzi, S. Narendra, and V. De, “A Self-Consistent Junction Temperature Estimation Methodology For Nanometer Scale ICs with Implications for Performance and Thermal Management”, Technical Digest of the IEEE International Electron Devices Meeting (IEDM’03), 2003, pp. 36.7.1-36.7.
[3] N. Rollins, M.J. Wirthlin, P.S.
Graham, “Evaluation of Power Costs in Applying TMR to FPGA Designs”, Proceedings of the 7th Annual Military and Aerospace Programmable Logic Devices International Conference (MAPLD), Sept., 2004.
[4] P. Lall, M.G. Pecht and E.B. Hakim, “Influence of Temperature on Microelectronics and System Reliability”, CRC Press LLC, 1997.
[5] U.S. Department of Defense, Reliability Prediction of Electronic Equipment, MIL-HDBK-217F, Washington, D.C., 1991.
[6] Siemens, SN29500 Reliability and Quality Specification Failure Rates of Components, 1986.
[7] U.S. Department of Defense, Reliability Prediction of Electronic Equipment, MIL-HDBK-217F, Washington, D.C., 1991.
[8] B. Pratt, D.E. Johnson, M.J. Wirthlin, M. Caffrey, K. Morgan, and P. Graham, “Improving FPGA Design Robustness with Partial TMR”, Proceedings of the 8th Annual Military and Aerospace Programmable Logic Devices International Conference (MAPLD), Sept., 2005. To Be Published.
[9] A. Mettas, P. Vassiliou, “Modeling and Analysis of Time-Dependent Stress Accelerated Life Data”, Proceedings of the 2002 Annual Reliability and Maintainability Symposium (RAMS), Jan., 2002, pp. 343-348.
[10] D.P. Siewiorek and R.S. Swarz, “Reliable Computer Systems”, Digital Press, 1992.
[11] ReliaSoft, “Accelerated Life Testing Reference”, [Online book], 2001, Available at HTTP: http://www.weibull.com/acceltestwebcontents.htm
[12] A. Platis, N. Limnios, and M.L. Du, “Hitting Time in a Finite Non-Homogeneous Markov Chain with Applications”, Journal of Applied Stochastic Models and Data Analysis, Vol. 14(3), 1998, pp. 241-253.
[13] W. Nelson, “Accelerated Testing: Statistical Models, Test Plans and Data Analyses”, John Wiley & Sons, 1990.
[14] Wolfram Research, "Gauss-Kronrod Quadrature", [Online document], Available at HTTP: http://mathworld.wolfram.com/GaussKronrodQuadrature.html

[Figure 4: Reliability vs. Type of Stress Test. Panels: (a) α = 0.8, (b) α = 1.0, (c) α = 1.4, (d) α = 2.0. Axes: Reliability vs. Time [hours]. Each panel compares a steady-state temperature, a cyclic period and a progressive slope.]

[Figure 5: Reliability vs. Steady-State Temperature; fixed α and changing temperatures. Panels: (a) α = 0.8, (b) α = 2.0. Axes: Reliability vs. Time [hours]. Curves: T = 328 K, 373 K, 423 K.]

[Figure 6: Reliability vs. Steady-State Temperature; fixed temperatures and changing α. Panels: (a) Temperature = 328 K, (b) Temperature = 423 K. Axes: Reliability vs. Time [hours]. Curves: α = 0.8, 1.0, 1.4, 2.0.]

[Figure 7: Reliability vs. Cyclic Stress; fixed α and changing cycles. Panels: (a) α = 0.8, (b) α = 2.0. Axes: Reliability vs. Time [hours]. Curves: one per cycle period.]

[Figure 8: Reliability vs. Cyclic Stress; fixed cycles and changing α. Panels: (a) Period = 4π, (b) Period = π. Axes: Reliability vs. Time [hours]. Curves: α = 0.8, 1.0, 1.4, 2.0.]

[Figure 9: Reliability vs. Progressive Stress; fixed α and changing slopes. Panels: (a) α = 0.8, (b) α = 2.0. Axes: Reliability vs. Time [hours]. Curves: Slope = 0.25, 0.5, 1.0.]

[Figure 10: Reliability vs. Progressive Stress; fixed slopes and changing α. Panels: Slope = 0.25 and Slope = 1.0. Axes: Reliability vs. Time [hours]. Curves: α = 0.8, 1.0, 1.4, 2.0.]