FORMAL DIAGNOSIS OF HARDWARE
TRANSIENT ERRORS IN PROGRAMS
Layali Rashid, Karthik Pattabiraman and
Sathish Gopalakrishnan
THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT
THE UNIVERSITY OF BRITISH COLUMBIA
Contributions
• Software-driven diagnosis of hardware
transient errors
– Diagnosis: “isolate the first affected
instruction”
• Program-level analysis
– Guarantees on the diagnosis
• Completeness
• Accuracy
Why Software-Driven Diagnosis?
• No expensive hardware modifications.
• Minimal software instrumentation.
• Diagnose faults which manifest at the
program-level only.
• Direct access to the affected device is not
required.
Diagnosis Approach
[Figure: diagnosis flow with the stages Transient Error, Faulty inst, Detector Triggered, Dump File (e.g. failing detector, register file), and Error Diagnosis]
Diagnosis Approach
[Figure: the same diagnosis flow, with Model Checking as the diagnosis stage: Transient Error, Faulty inst, Detector Triggered, Dump File (e.g. failing detector, register file), and Model Checking]
Model Checking Using SymPLFIED
• Formal model for analyzing programs [DSN’08]
– Evaluate the effect of transient hardware errors on
programs.
• Symbolic error propagation technique
– Represent errors using a single symbol (err) to
avoid state space explosion.
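The single-symbol idea can be illustrated with a minimal Python sketch. This is not SymPLFIED's implementation; the names `ERR`, `sym_mul`, `sym_sub` and `sym_gt` are hypothetical. One sentinel stands for every possible corrupted value, arithmetic on it stays corrupted, and a comparison that reads it has an unknown outcome, so both outcomes must be explored.

```python
ERR = "err"  # one symbol stands for any corrupted value (assumed encoding)

def sym_mul(a, b):
    """Multiply; the result is unknown if either operand is corrupted."""
    return ERR if ERR in (a, b) else a * b

def sym_sub(a, b):
    """Subtract; the result is unknown if either operand is corrupted."""
    return ERR if ERR in (a, b) else a - b

def sym_gt(a, b):
    """Return the set of possible outcomes of a > b (1 = true, 0 = false)."""
    if ERR in (a, b):
        return {0, 1}          # unknown: both branches must be explored
    return {1 if a > b else 0}
```

Tracking one `ERR` state instead of one state per concrete corrupted value is what keeps the search space small in this sketch.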
Example: Factorial Program
1        movi $2, #1              ; Result variable
2        read $1                  ; User input
3        mov $3, $1
4        movi $4, #1
5  loop: setgt $5, $3, $4         ; Loops while $3 > $4
6        beq $5, #0, exit
7        mult $2, $2, $3
8        subi $3, $3, #1
9        assert($3 < $1 + 1)      ; Error detector
10       beq $0, #0, loop
11 exit: prints "Factorial = "
12       print $2
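For readers less familiar with the assembly, the listing can be mirrored in Python one instruction at a time. This is an illustrative translation only; the function name `factorial_with_detector` and passing the input as an argument (instead of `read`) are assumptions.

```python
def factorial_with_detector(n):
    r2 = 1                      # 1: movi $2, #1   (result variable)
    r1 = n                      # 2: read $1       (user input)
    r3 = r1                     # 3: mov $3, $1
    r4 = 1                      # 4: movi $4, #1
    while r3 > r4:              # 5-6: setgt, then exit once $3 <= $4
        r2 = r2 * r3            # 7: mult $2, $2, $3
        r3 = r3 - 1             # 8: subi $3, $3, #1
        assert r3 < r1 + 1      # 9: error detector
    return r2                   # 11-12: print the result

print(factorial_with_detector(5))  # 120
```

In a fault-free run the loop counter only decreases from the input value, so the detector condition always holds.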
Example: Error Propagation
1        movi $2, #1
2        read $1                  ; $1 = 5
3        mov $3, $1               ; A transient fault, $3 = 13
4        movi $4, #1
5  loop: setgt $5, $3, $4
6        beq $5, #0, exit
7        mult $2, $2, $3
8        subi $3, $3, #1
9        assert($3 < $1 + 1)      ; Detector is triggered
10       beq $0, #0, loop
11 exit: prints "Factorial = "
12       print $2
Example: Error Propagation
1        movi $2, #1
2        read $1                  ; $1 = 5
3        mov $3, $1               ; A transient fault, $3 = 13
4        movi $4, #1
5  loop: setgt $5, $3, $4
6        beq $5, #0, exit
7        mult $2, $2, $3
8        subi $3, $3, #1
9        assert($3 < $1 + 1)      ; Detector is triggered
10       beq $0, #0, loop
11 exit: prints "Factorial = "
12       print $2
Dump file:
  Detector triggered
  $1 = 5
  $2 = 13
  $3 = 12
  $4 = 1
  $5 = 1
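The dump values can be reproduced with a small concrete simulation. The Python sketch below is an assumption-laden illustration, not the authors' tooling: `run_factorial` and its `fault` triple (line, register, value) are hypothetical names, and the fault is applied once, right after the given line executes.

```python
def run_factorial(n, fault=None):
    """Run the factorial program on input n.

    fault is an optional (line, register, value) triple: after `line`
    executes, `register` is overwritten with `value`, once (transient).
    Returns (detector_triggered, register_dump).
    """
    regs = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0}
    pending = [fault] if fault else []

    def after(line):
        # Apply the pending fault the first time `line` has executed.
        if pending and pending[0][0] == line:
            _, reg, value = pending.pop()
            regs[reg] = value

    regs[2] = 1                           # 1: movi $2, #1
    after(1)
    regs[1] = n                           # 2: read $1
    after(2)
    regs[3] = regs[1]                     # 3: mov $3, $1
    after(3)
    regs[4] = 1                           # 4: movi $4, #1
    after(4)
    while True:
        regs[5] = int(regs[3] > regs[4])  # 5: setgt $5, $3, $4
        after(5)
        if regs[5] == 0:                  # 6: beq $5, #0, exit
            break
        regs[2] = regs[2] * regs[3]       # 7: mult $2, $2, $3
        after(7)
        regs[3] = regs[3] - 1             # 8: subi $3, $3, #1
        after(8)
        if not regs[3] < regs[1] + 1:     # 9: detector fails
            return True, dict(regs)
    return False, dict(regs)

print(run_factorial(5, fault=(3, 3, 13)))
# (True, {1: 5, 2: 13, 3: 12, 4: 1, 5: 1})
```

With input 5 and $3 corrupted to 13 after line 3, the first loop iteration computes $2 = 13 and $3 = 12, and the detector fails because 12 is not less than 6.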
Example: Error Diagnosis
1        movi $2, #1
2        read $1
3        mov $3, $1               ; A transient fault, $3 = err
4        movi $4, #1
5  loop: setgt $5, $3, $4
6        beq $5, #0, exit         ; True: Exit; False: Line 7
7        mult $2, $2, $3          ; $2 = err
8        subi $3, $3, #1
9        assert($3 < $1 + 1)      ; True: Line 10; False: Detector triggered
10       beq $0, #0, loop
11 exit: prints "Factorial = "
12       print $2
Example: Error Diagnosis
1        movi $2, #1
2        read $1
3        mov $3, $1               ; A transient fault, $3 = err
4        movi $4, #1
5  loop: setgt $5, $3, $4
6        beq $5, #0, exit         ; True: Exit; False: Line 7
7        mult $2, $2, $3          ; $2 = err
8        subi $3, $3, #1
9        assert($3 < $1 + 1)      ; True: Line 10; False: Detector triggered
10       beq $0, #0, loop
11 exit: prints "Factorial = "
12       print $2

SymPLFIED’s Solution:        Dump file:
  Instruction 3 Injected
  Detector triggered           Detector triggered
  $1 = 5                       $1 = 5
  $2 = err                     $2 = 13
  $3 = err                     $3 = 12
  $4 = 1                       $4 = 1
  $5 = 1                       $5 = 1
Example: Error Diagnosis
1        movi $2, #1
2        read $1
3        mov $3, $1               ; A transient fault, $3 = err
4        movi $4, #1
5  loop: setgt $5, $3, $4
6        beq $5, #0, exit         ; True: Exit; False: Line 7
7        mult $2, $2, $3          ; $2 = err
8        subi $3, $3, #1
9        assert($3 < $1 + 1)      ; True: Line 10; False: Detector triggered
10       beq $0, #0, loop
11 exit: prints "Factorial = "
12       print $2

SymPLFIED’s Solution:        Dump file:
  Instruction 3 Injected
  Detector triggered           Detector triggered
  $1 = 5                       $1 = 5
  $2 = err                     $2 = 13
  $3 = err                     $3 = 12
  $4 = 1                       $4 = 1
  $5 = 1                       $5 = 1

The crash dump file can be used to identify the faulty instruction.
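The matching of symbolic states against the dump can be sketched in Python. This is a simplified illustration under stated assumptions, not SymPLFIED itself: the names `PROGRAM`, `successors` and `diagnose` are hypothetical, a fault is modeled as corrupting the destination register of one instruction on its first execution, and `err` is treated as a wildcard with no constraint tracking. Because of that last simplification, the sketch can report extra candidates alongside instruction 3.

```python
ERR = "err"  # one symbol for any corrupted value

# Factorial program from the example; lines 11-12 only print.
PROGRAM = {
    1: ("movi", 2, 1), 2: ("read", 1), 3: ("mov", 3, 1), 4: ("movi", 4, 1),
    5: ("setgt", 5, 3, 4), 6: ("beqz", 5, 11), 7: ("mult", 2, 2, 3),
    8: ("subi", 3, 3, 1), 9: ("assert", 3, 1), 10: ("jump", 5),
}
EXIT = 11

def successors(pc, regs, user_input):
    """Yield (next_pc, regs, detector_fired) for the instruction at pc.
    A comparison, branch or assert that reads err forks into both outcomes."""
    op, *a = PROGRAM[pc]
    r = dict(regs)
    if op == "movi":
        r[a[0]] = a[1]
    elif op == "read":
        r[a[0]] = user_input
    elif op == "mov":
        r[a[0]] = r[a[1]]
    elif op == "mult":
        x, y = r[a[1]], r[a[2]]
        r[a[0]] = ERR if ERR in (x, y) else x * y
    elif op == "subi":
        r[a[0]] = ERR if r[a[1]] == ERR else r[a[1]] - a[2]
    elif op == "setgt":
        x, y = r[a[1]], r[a[2]]
        outcomes = (0, 1) if ERR in (x, y) else (int(x > y),)
        for bit in outcomes:
            yield pc + 1, {**r, a[0]: bit}, False
        return
    elif op == "beqz":
        v = r[a[0]]
        if v == ERR or v == 0:
            yield a[1], r, False          # branch taken
        if v == ERR or v != 0:
            yield pc + 1, r, False        # fall through
        return
    elif op == "assert":                  # detector: $a0 < $a1 + 1
        x, y = r[a[0]], r[a[1]]
        unknown = ERR in (x, y)
        if unknown or x < y + 1:
            yield pc + 1, r, False        # check passes
        if unknown or not x < y + 1:
            yield pc, r, True             # detector fires
        return
    elif op == "jump":
        yield a[0], r, False
        return
    yield pc + 1, r, False

def diagnose(dump, user_input, max_steps=200):
    """Lines whose corrupted destination register can reproduce `dump`."""
    candidates = []
    for line, (op, *a) in PROGRAM.items():
        if op in ("beqz", "assert", "jump"):
            continue                      # no destination register
        stack = [(1, dict.fromkeys(range(1, 6), 0), False, 0)]
        found = False
        while stack and not found:
            pc, regs, injected, steps = stack.pop()
            if pc == EXIT or steps > max_steps:
                continue                  # finished, or bounded out
            for npc, nregs, fired in successors(pc, regs, user_input):
                ninjected = injected
                if pc == line and not injected:
                    nregs = {**nregs, a[0]: ERR}   # inject once
                    ninjected = True
                if not fired:
                    stack.append((npc, nregs, ninjected, steps + 1))
                elif ninjected and all(
                        v in (ERR, dump[k]) for k, v in nregs.items()):
                    found = True          # symbolic state matches the dump
        if found:
            candidates.append(line)
    return candidates

dump = {1: 5, 2: 13, 3: 12, 4: 1, 5: 1}
candidates = diagnose(dump, user_input=5)
```

The step bound is needed because a corrupted loop operand can make the symbolic loop run indefinitely; a real model checker would handle this differently.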
Experimental Methodology
• Enhance SymPLFIED to diagnose errors.
• Modify SimpleScalar simulator to inject faults.
• Evaluate for Matrix Multiply and Insertion Sort.
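[Figure: fault-injection flowchart. Input: instructions that trigger a detector. While more instructions remain (More inst? Y): inject at a random bit in SimpleScalar. If a detector is triggered (Detector triggered? Y): create a dump file, then run error diagnosis; otherwise (N) move on. When no instructions remain: Done.]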
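Injecting at a random bit amounts to XOR-ing a value with a one-hot mask. A minimal Python sketch, assuming 32-bit values; `flip_random_bit` is a hypothetical helper, not SimpleScalar code.

```python
import random

def flip_random_bit(value, width=32, rng=random):
    """Return `value` with one uniformly chosen bit inverted."""
    bit = rng.randrange(width)      # bit position in [0, width)
    return value ^ (1 << bit)       # XOR with a one-hot mask flips that bit

corrupted = flip_random_bit(5)
```

For instance, flipping bit 3 of the value 5 gives 5 ^ (1 << 3) = 13, which is the corruption of $3 used in the factorial example.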
Results for Matrix Multiply
Number of detectors                          1     4     6
Number of faults injected in SimpleScalar  167   275   286
Number of faults detected in SimpleScalar   74   135   150
Diagnosed faults (%)                       100    77    80
Undiagnosed faults (%)                       0    23    20
Results for Matrix Multiply (1)
• The proposed technique diagnoses 77%-100% of the
detected errors for the matrix multiply program.
• The undiagnosed errors are implementation artifacts
of the SymPLFIED tool.
Results for Matrix Multiply (2)
• The number of faults injected in SimpleScalar increases with
the number of detectors (167, 275 and 286 for 1, 4 and 6 detectors).
• Adding more detectors increases the diagnosis
accuracy.
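The share of injected faults that a detector catches is not reported on the slides, but it follows directly from the matrix multiply counts. A short Python calculation, with the derived quantity labeled as such:

```python
# Matrix multiply counts from the results table, keyed by number of detectors.
injected = {1: 167, 4: 275, 6: 286}
detected = {1: 74, 4: 135, 6: 150}

# Derived quantity (not reported in the slides): detected / injected, in percent.
detection_rate = {d: round(100 * detected[d] / injected[d], 1) for d in injected}
print(detection_rate)  # {1: 44.3, 4: 49.1, 6: 52.4}
```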
Conclusions and Future Work
• Software diagnosis of hardware faults is possible
and can be automated using formal techniques.
– Our diagnosis method is able to diagnose a significant
number of errors using a few detectors.
• Future Work
– Investigate improvements with limited hardware
support.
– Improve scalability using heuristics.
– Extend to intermittent & permanent faults.
Backup Slides
Related Work
Hardware Fault Diagnosis
• Hardware-Based Techniques
• Probabilistic Techniques
• Formal Methods
• Periodic-Testing Techniques
Results for Insertion Sort
Number of detectors                          1     4     7
Number of faults injected in SimpleScalar   11   165   198
Number of faults detected in SimpleScalar    8    64    83
Diagnosed faults (%)                       100    87    89
Undiagnosed faults (%)                       0    13    11