A STUDY OF THE IMPACT OF TEMPERATURE ON FPGA-BASED TMR
DESIGNS
Amr A. Ahmadain, Karen A. Tomko
Department of Electrical and Computer Engineering and Computer Science
University of Cincinnati, Cincinnati, OH 45220
[email protected], [email protected]
Abstract
Triple Modular Redundancy (TMR) is one of the most common
fault-tolerance methods for mission-critical systems, but it comes
at the cost of increased power consumption and higher temperature
levels. Temperature is one of the most critical factors that can
lead to failures in electronic systems. Integrated circuits are most
commonly tested at steady-state temperature, although this is not
the only type of thermal stress that electronic devices are exposed
to during their operational lifetime. Temperature cycles,
temperature gradients and even random changes in temperature can
all affect the reliability of integrated circuits and electronic
devices.
In this study, we argue that using steady-state temperature as the
only stress factor, rather than using more realistic
temperature-lifetime stress relationships, could easily lead to
pessimistic results and hence to overly-conservative decisions. We
relax the assumption of a constant failure rate by using an
inhomogeneous Markov chain and explore the preliminary relationship
between TMR-based designs, different temperature-lifetime models
and the overall impact on system reliability.
We show through Markov-based modeling that using steady-state
temperature as the only method of stress testing could result in
reliability estimates which are up to a factor of 8 more
pessimistic than the results of other types of temperature-lifetime
models.
1. Introduction
In current high-performance sub-90 nm technology, leakage power is
rising dramatically with ever-shrinking feature sizes and channel
lengths [1]. It is well established that leakage power, with its
two components, subthreshold and gate leakage, increases
exponentially with junction temperature. A recent study has shown
that leakage power is electrothermally coupled with junction
temperature [2]. The temperature-driven increase in leakage power
in turn leads to an increase in the junction temperature, so that
the two form a positive feedback loop, potentially leading to
thermal runaway.
FPGA-based Triple Modular Redundancy (TMR) designs have long been
used for hardening a given design against Single Event Upsets
(SEUs), which can cause a bit-flip in the contents of the
configuration SRAM of an FPGA and lead to serious logical errors.
TMR, however, has high costs which should be considered in future
designs. Since it triples all combinatorial and sequential design
components, TMR results in a considerable increase in total design
area. TMR also leads to a reduction in performance due to the
overhead of the voting and error detection and correction
circuitry. Worst of all, a recent study has demonstrated that TMR
designs consume more than triple the power consumed in simplex
designs [3]. This increase in total power dissipation feeds the
positive feedback loop between leakage power and temperature and
degrades system reliability, leading to an exponential reduction
in the Mean Time to Failure [4].
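To give a feel for how strongly the Mean Time to Failure can depend on temperature, the following minimal sketch assumes the Arrhenius relationship MTTF = C·exp(eA / (KB·T)) and the 0.7 eV activation energy that this paper adopts later (Table 2). The function name `mttf_ratio` and the value used for Boltzmann's constant are assumptions made for illustration only; the constant C cancels in the ratio.

```python
import math

K_B = 8.617e-5  # Boltzmann's constant in eV/K (assumed value for this sketch)
E_A = 0.7       # activation energy in eV, as listed in Table 2


def mttf_ratio(t_cool_k: float, t_hot_k: float, e_a: float = E_A) -> float:
    """Return MTTF(t_cool_k) / MTTF(t_hot_k) under the Arrhenius model.

    With MTTF = C * exp(e_a / (K_B * T)), the constant C cancels, so the
    ratio depends only on the two absolute temperatures.
    """
    return math.exp((e_a / K_B) * (1.0 / t_cool_k - 1.0 / t_hot_k))


# Compare the lowest and highest steady-state temperatures used in the study.
ratio = mttf_ratio(328.0, 423.0)
print(f"MTTF at 328 K is roughly {ratio:.0f} times the MTTF at 423 K")
```

Under these assumptions, a temperature increase of less than 100 K shortens the predicted MTTF by more than two orders of magnitude.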
Reliability prediction methods of electronic
systems such as those in [5] and [6] have
traditionally considered the effect of only
steady-state temperature and constant failure
rates which might eventually lead to errors in
reliability prediction.
The goal of this study is to develop a reliability prediction model
which accurately captures the evolution of the system with time and
temperature at each phase of its lifetime, where the assumptions of
steady-state temperature and constant failure rate are relaxed.
P217 MAPLD 2005
2. Notation
n       discrete time step
t       continuous time step
αm      Weibull shape parameter of a TMR module
αv      Weibull shape parameter of the TMR voter
R(n)    reliability as a function of both time and temperature
z(t)    hazard function as a function of both time and temperature
s(n)    temperature as a function of time
eA      activation energy of the Arrhenius equation
B       parameter of the Arrhenius relationship (eA/KB)
KB      Boltzmann's constant
C       parameter of the Arrhenius relationship that depends on
        product geometry and other factors
3. Related Work
Recent work has been done to alleviate the cost of a full TMR
design. Selective TMR is a technique that has been developed to
harden a design by selectively inserting TMR in the sensitive gates
as determined by signal probabilities [7]. In another effort to
mitigate the cost of excessive area increase, partial TMR has
investigated the application of TMR on a subset of circuit
components [8]. In that study, the number of persistent cross
sections for a partially-triple module redundant design has been
shown to quadruple as compared to an unmitigated design.
4. Study Motivations
This study has two key motivations. The first is that the
widely-used assumption of a constant failure rate may cause errors
in reliability predictions [9]. Looking at the Bathtub curve
(Figure 1), which depicts the typical reliability life cycle of an
electronic device, we notice that during the “useful” life phase of
the device, a constant failure rate is assumed. We argue that device
failure rates might change even during the useful phase of the life
cycle. The second motivation is driven by the fact that temperature,
much like the failure rate, might also vary with time.
[Figure 1: The Bathtub Curve. Failure rate versus lifetime, showing
the infant mortality phase, the normal (useful) life phase with a
“constant” failure rate, and the wear-out phase.]
These arguably erroneous assumptions of constant failure rate and
steady-state temperature could potentially lead to an
overly-conservative prediction instead of a realistic one, which
would ultimately result in an unbalanced resource allocation.
5. Solution Approach
In this paper, we employ Markov modeling to relax the two
assumptions of constant failure rate and steady-state temperature
and to capture the system progress with time. We use a
non-homogeneous (non-stationary) Markov Chain to model the
different system states and transitions. A non-homogeneous Markov
Chain is a chain where the system transition probabilities are a
function of time [10]. By allowing the transition probabilities to
vary with time, the assumption of a constant failure rate is
dropped and replaced by a failure rate that can change over time.
Using the Markov Chain, the TMR system reliability can then be
calculated as a function of the varying failure rate.
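The idea of time-dependent transition probabilities can be illustrated with a small sketch. The code below propagates a state distribution through a chain whose transition matrix is recomputed at every step. The two-state chain in `toy_matrix` and its linearly growing failure probability are hypothetical and chosen only for illustration; they are not the TMR model of this paper.

```python
from typing import Callable, List

Matrix = List[List[float]]


def step(dist: List[float], p: Matrix) -> List[float]:
    """Multiply the row vector `dist` by the transition matrix `p`."""
    return [sum(dist[i] * p[i][j] for i in range(len(dist)))
            for j in range(len(p[0]))]


def propagate(initial: List[float],
              transition_at: Callable[[int], Matrix],
              n_steps: int) -> List[float]:
    """Apply P(0), P(1), ..., P(n_steps - 1) in order to `initial`."""
    dist = list(initial)
    for k in range(n_steps):
        dist = step(dist, transition_at(k))
    return dist


def toy_matrix(k: int) -> Matrix:
    """Hypothetical chain: state 0 = working, state 1 = failed (absorbing).

    The per-step failure probability grows with the step index k, which is
    what makes the chain non-homogeneous.
    """
    p_fail = min(1.0, 0.01 * (k + 1))
    return [[1.0 - p_fail, p_fail],
            [0.0, 1.0]]


dist = propagate([1.0, 0.0], toy_matrix, 10)
print(f"P(working after 10 steps) = {dist[0]:.4f}")
```

In a homogeneous chain `transition_at` would return the same matrix for every `k`, which corresponds to the constant failure rate assumption.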
We assume that the TMR system has a
typical configuration that contains three
redundant modules and a majority voter as
shown below in figure 2.
[Figure 2: The TMR Configuration Model. Modules M1, M2 and M3 feed
a majority voter.]
There are three additional key assumptions in this study. The first
is that the module failure rates are time and temperature
dependent. The second is that failure rates are stochastically
independent, which means that the failure of one module does not,
in any way, affect the failure rate of any other module in the
system. The third is that the majority voter has a different hazard
rate (zv) than that of the TMR modules and hence a lower value of
the Weibull distribution shape parameter (αv). This is a valid
assumption as the majority voter is typically more reliable than
each of the redundant modules. It is important to note that the TMR
model implemented in this paper is a generic one and is not
specific to FPGAs. It can easily be used to analyze the reliability
of any custom integrated circuit-based system.
We use the well-known Arrhenius equation to model the relationship
between life and temperature:
$$\mathrm{MTTF} = C\, e^{\frac{e_A}{K_B T}}$$
where MTTF is the Mean Time To Failure of a system module or
majority voter and T is the absolute temperature. The activation
energy in the above equation is a measure of the effect that
temperature has on the system dynamics.
5.1 Derivation of the Reliability Model
5.1.1 Definition of System States
Before the Markov model is derived, system states and transitions
have to be defined. The system is modeled with three distinct
states. State ‘0’, the initial state, is the state where all the
system modules are functional. State ‘1’ is the state where the
system has one failed module whereas the other two modules are
still in a functional state. State ‘2’ is the system failure state.
The system fails when two or more modules have failed, sequentially
or simultaneously, or when the majority voter fails.
5.1.2 Calculation of State Transition Probabilities
A state transition diagram is a symbolic representation of the
states, transitions and transition probabilities of a Markov Chain.
In the reliability model derived in this paper, two state
transition diagrams are defined: a continuous-time diagram, where
the time-step interval approaches zero, and a discrete-time
diagram, where the time step is a finite interval. Two diagrams are
defined because of the complexity involved in calculating the
n-step transition probability matrix for a continuous-time Markov
Chain. By approximating the continuous-time process with its
discrete-time equivalent, closed-form solutions can be derived. We
follow this same approach in deriving the reliability model. For
details on the exact approximation steps, the reader is referred to
[10].
The Arrhenius-Weibull distribution is assumed to be the life
distribution of the TMR system modules [11]. The Probability
Density Function (PDF) of the Arrhenius-Weibull distribution is
expressed as follows:
$$f(t, s(t)) = \frac{\alpha}{C e^{B/s(t)}} \left( \frac{t}{C e^{B/s(t)}} \right)^{\alpha - 1} \exp\left[ -\left( \frac{t}{C e^{B/s(t)}} \right)^{\alpha} \right]$$
where α is the Weibull shape parameter (αm for a module, αv for the
voter).
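As a minimal sketch of the Arrhenius-Weibull life distribution, the following code evaluates the density in the form given in [11], with scale parameter C·exp(B/T) and shape parameter α, using the constants C and B listed in Table 2. It then checks numerically that the integral of the density up to a time t matches 1 minus the Weibull survival function. The constant temperature, the shape value α = 2.0, the midpoint integration rule and all function names are assumptions made for illustration; time is taken to be in hours, matching the axes of the figures.

```python
import math

C = 2.4e-9    # Arrhenius parameter C (Table 2)
B = 8117.82   # Arrhenius parameter B = eA/KB in kelvin (Table 2)


def scale(temp_k: float) -> float:
    """Weibull scale parameter C * exp(B / T) at absolute temperature T."""
    return C * math.exp(B / temp_k)


def pdf(t: float, temp_k: float, alpha: float) -> float:
    """Arrhenius-Weibull density at time t > 0 for a constant temperature."""
    eta = scale(temp_k)
    x = t / eta
    return (alpha / eta) * x ** (alpha - 1.0) * math.exp(-(x ** alpha))


def survival(t: float, temp_k: float, alpha: float) -> float:
    """Probability that a single unit is still working at time t."""
    return math.exp(-((t / scale(temp_k)) ** alpha))


def integrate_pdf(t_end: float, temp_k: float, alpha: float,
                  steps: int = 20000) -> float:
    """Midpoint-rule integral of the density over [0, t_end]."""
    h = t_end / steps
    return h * sum(pdf((i + 0.5) * h, temp_k, alpha) for i in range(steps))


t_end = 5.0  # hours (assumed unit)
failed = integrate_pdf(t_end, 373.0, 2.0)
print(f"integral of pdf: {failed:.6f}")
print(f"1 - survival:    {1.0 - survival(t_end, 373.0, 2.0):.6f}")
```

Evaluating the density at interval midpoints avoids t = 0, where the density is unbounded for shape values below 1.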
[Figure 3: System State Transition Diagram. (a) Continuous-time
diagram: transitions 0→1 labeled A(t), 0→2 labeled B(t) and 1→2
labeled C(t). (b) Discrete-time diagram: transitions 0→1 with
probability A(n), 0→2 with probability B(n) and 1→2 with
probability C(n); self-loops 1-[A(n)+B(n)] on state 0, 1-C(n) on
state 1 and 1.0 on state 2.]
In the above two diagrams, A(n), B(n) and C(n) are a concise way of
expressing the mathematically intricate state transition
probabilities. They are all expressed in terms of the hazard
function of the Arrhenius-Weibull distribution as given in [11]:
$$z(t, s(t)) = \frac{\alpha}{C e^{B/s(t)}} \left( \frac{t}{C e^{B/s(t)}} \right)^{\alpha - 1}$$
5.1.3 Calculation of the Reliability Function
The reliability of a Non-Homogeneous Discrete-Time Markov Chain
(NHDTMC) as given in [12] is expressed as follows:
$$R(n) = \mathbf{1} \left[ \prod_{k=0}^{n-1} \mathbf{p}_k^{U} \right] \mathbf{1}_v$$
where $\mathbf{1} = (1 \;\; 0)$ and $\mathbf{1}_v = (1 \;\; 1)^{T}$.
Substituting the state transition probability matrices calculated
in the previous step into the above equation and carrying out the
necessary matrix multiplications yields the reliability of the
system, R(n), at time n.
6. Implementation of the Reliability Model
The reliability model has been implemented in Mathematica 5.0. To
relax the assumption of a steady-state temperature, three different
types of stress loading have been implemented: steady-state, cyclic
and progressive [13]. Steady-state is the most widely used type of
stress loading, where the temperature is kept constant throughout
the product’s operational lifetime. In cyclic stress loading, the
product undergoes a cyclic change in temperature, which is varied
between very high and very low ranges. Under progressive stress,
the product experiences a continuous increase in the level of
temperature.
The reliability and failure rate functions have been implemented
using numerical integration techniques. The Gauss-Kronrod
quadrature method has been used. Gauss-Kronrod quadrature is an
adaptive method that yields accurate estimates of the integrals
[14].
Experiments have been designed based on changing the values of two
sets of parameters. The first set is stress-related while the
second is probability distribution-related. For steady-state
temperatures, the parameter is simply the temperature. For cyclic
stress, the parameter is the cycle period and for progressive
stress, it is the slope of the straight line. The values of the
parameters have been chosen to span minimum, typical and maximum
operational stress levels.
The shape parameter of the Weibull distribution, α, has also been
varied to represent typical shapes of the Weibull distribution
curves. The values of α have been adapted from [12]. Table 1 shows
the exact values assigned to these two sets of parameters. Table 2
shows the three constants used in the model. An activation energy
of 0.7 eV is considered an industry standard and is used when a
specific failure mechanism is not known.
Table 1: Parameters and Mathematical Functions of Stress Tests
Stress Test          Parameter    Parameter Values   Mathematical Function
Steady-State Stress  Temperature  328, 373, 423 K    T = constant
Cyclical Stress      Period       π, 2π, 4π          328×sin(kn)
Progressive Stress   Slope        0.25, 0.5, 1.0     an + 273
Table 2: Model Constants
Model Constant          Value
Activation energy (eA)  0.7 eV
C                       2.4×10^-9
B = eA/KB               8117.82
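The procedure described in Sections 5 and 6 can be sketched in a few lines of Python. This is not the Mathematica implementation used in the study. The paper does not spell out A(n), B(n) and C(n), so the first-order forms used here (A = 3·z_m·Δt, B = z_v·Δt, C = (2·z_m + z_v)·Δt) are assumptions, as are the time step, the shape values αm = 1.4 and αv = 1.0, the treatment of n as time in hours, and all function names. Only the steady-state and progressive profiles of Table 1 are covered.

```python
import math
from typing import Callable

C = 2.4e-9    # Arrhenius parameter C (Table 2)
B = 8117.82   # Arrhenius parameter B = eA/KB in kelvin (Table 2)


def hazard(t: float, temp_k: float, alpha: float) -> float:
    """Arrhenius-Weibull hazard rate at time t > 0 and temperature temp_k."""
    eta = C * math.exp(B / temp_k)
    return (alpha / eta) * (t / eta) ** (alpha - 1.0)


def steady(temp_k: float) -> Callable[[float], float]:
    """Steady-state stress: T = constant."""
    return lambda t: temp_k


def progressive(slope: float) -> Callable[[float], float]:
    """Progressive stress: T = slope * t + 273."""
    return lambda t: slope * t + 273.0


def tmr_reliability(profile: Callable[[float], float], t_end: float,
                    dt: float = 0.01, alpha_m: float = 1.4,
                    alpha_v: float = 1.0) -> float:
    """Probability of being in state 0 or 1 at t_end, starting in state 0."""
    u0, u1 = 1.0, 0.0                      # row vector (1 0) over up states
    for k in range(int(round(t_end / dt))):
        t_mid = (k + 0.5) * dt             # midpoint avoids t = 0
        temp = profile(t_mid)
        z_m = hazard(t_mid, temp, alpha_m)  # module hazard
        z_v = hazard(t_mid, temp, alpha_v)  # voter hazard
        # ASSUMED first-order transition probabilities, clipped to [0, 1].
        b = min(1.0, z_v * dt)                   # 0 -> 2: voter fails
        a = min(1.0 - b, 3.0 * z_m * dt)         # 0 -> 1: one of three modules
        c = min(1.0, (2.0 * z_m + z_v) * dt)     # 1 -> 2: second module or voter
        # Multiply by the up-state block [[1-(a+b), a], [0, 1-c]].
        u0, u1 = u0 * (1.0 - (a + b)), u0 * a + u1 * (1.0 - c)
    return u0 + u1                         # multiply by column vector (1 1)^T


for name, profile in [("steady 328 K", steady(328.0)),
                      ("steady 373 K", steady(373.0)),
                      ("progressive, slope 1.0", progressive(1.0))]:
    print(f"{name}: R(10 h) = {tmr_reliability(profile, 10.0):.4f}")
```

Keeping only the two up-state probabilities mirrors the matrix product in Section 5.1.3: the absorbing failure state never feeds probability back, so it does not need to be tracked.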
7. Results and Discussion
Results for this parametric study are shown in Figures 4-10. Figure
4 shows a large variation in reliability across the different types
of temperature functions, with the steady-state temperature
consistently yielding the lowest reliability. This supports the key
motivation of the study: depending solely on steady-state
temperature might lead to overly-conservative decisions and hence
unbalanced resource allocation.
Figures 6, 8 and 10 show that the value of the shape parameter (α)
has a negligible effect on the reliability, whereas Figures 5, 7
and 9 show that the stress test-related parameters have a visible
impact on reliability for all types of stress tests. These
observations indicate that the system reliability is more sensitive
to the values of stress-related parameters, such as slope or
period, than it is to parameters such as the value of α, which
model design variation.
8. Conclusions
In this paper, a reliability model for TMR-based designs has been
investigated. We have shown that the two commonly used assumptions
of steady-state temperature and constant failure rate could lead to
errors in reliability prediction. Using a non-stationary Markov
Chain, the assumption of a constant failure rate has been relaxed.
It has also been shown that, under different stress loading
functions such as cyclic and progressive stress, system reliability
can be up to a factor of eight higher than under steady-state
temperature. We have explained how the system reliability varies
visibly with changes in the controllable stress-related parameters,
a fact that encourages further investigation into models which take
into account realistic lifetime scenarios.
References
[1] N.S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. Hu,
M.J. Irwin, M. Kandemir, and V. Narayanan, “Leakage Current:
Moore’s Law Meets Static Power,” Computer, Vol. 36(12), Dec. 2003,
pp. 68-75.
[2] K. Banerjee, S. Lin, A. Keshavarzi, S. Narendra, and V. De, “A
Self-Consistent Junction Temperature Estimation Methodology For
Nanometer Scale ICs with Implications for Performance and Thermal
Management”, Technical Digest of the IEEE International Electron
Devices Meeting (IEDM’03), 2003, pp. 36.7.1-36.7.
[3] N. Rollins, M.J. Wirthlin, P.S. Graham, “Evaluation of Power
Costs in Applying TMR to FPGA Designs”, Proceedings of the 7th
Annual Military and Aerospace Programmable Logic Devices
International Conference (MAPLD), Sept., 2004.
[4] P. Lall, M.G. Pecht and E.B. Hakim, “Influence of Temperature
on Microelectronics and System Reliability”, CRC Press LLC, 1997.
[5] U.S. Department of Defense, Reliability Prediction of
Electronic Equipment, MIL-HDBK-217F, Washington, D.C., 1991.
[6] Siemens, SN29500 Reliability and Quality Specification Failure
Rates of Components, 1986.
[7] U.S. Department of Defense, Reliability Prediction of
Electronic Equipment, MIL-HDBK-217F, Washington, D.C., 1991.
[8] B. Pratt, D.E. Johnson, M.J. Wirthlin, M. Caffrey, K. Morgan,
and P. Graham, “Improving FPGA Design Robustness with Partial TMR”,
Proceedings of the 8th Annual Military and Aerospace Programmable
Logic Devices International Conference (MAPLD), Sept., 2005. To Be
Published.
[9] A. Mettas, P. Vassiliou, “Modeling and Analysis of
Time-Dependent Stress Accelerated Life Data”, Proceedings of the
2002 Annual Reliability and Maintainability Symposium (RAMS), Jan.,
2002, pp. 343-348.
[10] D.P. Siewiorek and R.S. Swarz, “Reliable Computer Systems”,
Digital Press, 1992.
[11] ReliaSoft, “Accelerated Life Testing Reference”, [Online
book], 2001, Available at HTTP:
http://www.weibull.com/acceltestwebcontents.htm
[12] A. Platis, N. Limnios, and M. Le Du, “Hitting Time in a Finite
Non-Homogeneous Markov Chain with Applications”, Journal of Applied
Stochastic Models and Data Analysis, Vol. 14(3), 1998, pp. 241-253.
[13] W. Nelson, “Accelerated Testing: Statistical Models, Test
Plans and Data Analyses”, John Wiley & Sons, 1990.
[14] Wolfram Research, "Gauss-Kronrod Quadrature", [Online
document], Available at HTTP:
http://mathworld.wolfram.com/GaussKronrodQuadrature.html
[Figure 4: Reliability vs. Type of Stress Test. Reliability versus
time [hours] for steady-state (T), cyclic (Period) and progressive
(Slope) stress; panels (a) α = 0.8, (b) α = 1.0, (c) α = 1.4,
(d) α = 2.0.]
[Figure 5: Reliability vs. Steady-State Temperature; fixed α and
changing temperatures. Reliability versus time [hours] for
T = 328 K, 373 K and 423 K; panels (a) α = 0.8, (b) α = 2.0.]
[Figure 6: Reliability vs. Steady-State Temperature; fixed
temperatures and changing α. Reliability versus time [hours] for
α = 0.8, 1.0, 1.4 and 2.0; panels (a) Temperature = 328 K,
(b) Temperature = 423 K.]
[Figure 7: Reliability vs. Cyclic Stress; fixed α and changing
cycles. Reliability versus time [hours] for Period = π, 2π and 4π;
panels (a) α = 0.8, (b) α = 2.0.]
[Figure 8: Reliability vs. Cyclic Stress; fixed cycles and changing
α. Reliability versus time [hours] for α = 0.8, 1.0, 1.4 and 2.0;
panels (a) Period = 4π, (b) Period = π.]
[Figure 9: Reliability vs. Progressive Stress; fixed α and changing
slopes. Reliability versus time [hours] for Slope = 0.25, 0.5 and
1.0; panels (a) α = 0.8, (b) α = 2.0.]
[Figure 10: Reliability vs. Progressive Stress; fixed slopes and
changing α. Reliability versus time [hours] for α = 0.8, 1.0, 1.4
and 2.0; panels for Slope = 0.25 and Slope = 1.0.]