false read error - University of Pittsburgh

DSN 2016
Leveraging ECC to Mitigate Read Disturbance, False Reads and Write Faults in STT-RAM
Mohammad Seyedzadeh, Rakan Maddah, Alex Jones, Rami Melhem
University of Pittsburgh

Executive Summary
Traditional ECC mitigates write faults and false read errors.
Observation: Read disturbance errors are correlated with repeated read operations.
Our Approach: On-demand refresh policy
• Write After Error detection (WAE): for when the read disturbance error rate is close to the other bit error rates
• Write After Persistent error (WAP): for when the false read error rate is higher than the read disturbance error rate
• Write After error Threshold (WAT): for when the false read error is the dominant error
Key results: Two orders of magnitude improvement in the product of reliability and energy across different ranges of bit error rates

Background
[Figure: STT-RAM cell and circuit. (a) The MTJ consists of a free layer, a barrier layer, and a fixed layer. Parallel (P) state: low resistance, logic 0. Anti-parallel (AP) state: high resistance, logic 1. The cell connects the MTJ and an NMOS access transistor (gate, source, drain) to the bit line (BL), word line (WL), and source line (SL).]

Background
STT-RAM error types: write error, read disturbance, false read error.
Write errors and false read errors are typically mitigated using ECC.
[Figure annotations: 10^-11, 10^-7, 10000 reads]
Key Observation
Because the effect of read disturbance is cumulative, even a relatively low per-read fault probability can result in a relatively high probability of failure.
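
The key observation can be checked with a short calculation. This is a minimal sketch, assuming each read disturbs a cell independently with probability p, and assuming the figure's 10^-11 is that per-read probability and 10000 is the number of reads (the slide does not state the relation explicitly). The function name is hypothetical.

```python
import math

def cumulative_disturb_probability(p_per_read: float, reads: int) -> float:
    """Probability that at least one of `reads` independent reads disturbs the cell.

    Computes 1 - (1 - p)^n using log1p/expm1 so that very small p
    does not lose precision.
    """
    return -math.expm1(reads * math.log1p(-p_per_read))

# Assumed roles of the numbers shown on the slide.
p = 1e-11
n = 10_000
print(cumulative_disturb_probability(p, n))  # close to 1e-7
```

Under these assumptions, 10000 reads raise the failure probability by about four orders of magnitude, from 10^-11 to roughly 10^-7.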

Solution to Mitigate Read Disturbance
Writing back data after every read operation (WAR)
[Figure: a data word passing through Read, ECC, and Write-back]
Advantage: Highest reliability as long as the write error rate is low
Disadvantage: High energy cost
Can we do better?

New Solution
• Do not write back after every read
• First detect an error, and then write back
• Mitigate read disturbance, false read errors, and write errors using ECC
Write After Error detection (WAE)
• Read disturbance error rate is close to the other bit error rates.
Write After Persistent error (WAP)
• False read error rate is higher than the read disturbance error rate.
Write After error Threshold (WAT)
• False read error is the dominant error.

Proposed Techniques
[Figure: panels (a), (b), (c) showing WAE, WAP, and WAT. Annotations: write-back after error detection; check persistent error; leaving behind false read errors.]
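
The write-back decision of each policy can be sketched as follows. This is an illustrative reading of the slides, not the authors' implementation: it assumes that WAP distinguishes a persistent error from a transient false read by reading a second time, and that WAT compares the number of errors reported by ECC against a threshold. The names `Policy`, `should_write_back`, `count_errors`, and `threshold` are hypothetical.

```python
from enum import Enum
from typing import Callable

class Policy(Enum):
    WAR = "write-back after every read"
    WAE = "write-back after error detection"
    WAP = "write-back after persistent error"
    WAT = "write-back after error threshold"

def should_write_back(policy: Policy,
                      count_errors: Callable[[], int],
                      threshold: int = 2) -> bool:
    """Decide whether to write the corrected word back after a read.

    `count_errors` performs one read plus ECC decode and returns the
    number of bit errors ECC reported (an assumed interface).
    """
    if policy is Policy.WAR:
        return True                      # always refresh
    first = count_errors()
    if policy is Policy.WAE:
        return first > 0                 # refresh on any detected error
    if policy is Policy.WAP:
        if first == 0:
            return False
        # Second read: a transient false read is assumed to vanish,
        # while a persistent (stored) error shows up again.
        return count_errors() > 0
    if policy is Policy.WAT:
        return first >= threshold        # tolerate errors below threshold
    raise ValueError(f"unknown policy: {policy}")

# Example: an error seen on the first read but not on the second
# is treated as transient, so WAP skips the write-back.
reads = iter([1, 0])
print(should_write_back(Policy.WAP, lambda: next(reads)))  # False
```

The sketch makes the energy trade-off visible: WAR writes on every read, while the other three write only when the condition holds.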

Why Markov Model?
• Monte-Carlo simulation is only feasible for high RBER. It is inadequate for systems with persistent errors because capturing the effect of low RBER requires prohibitive simulation times.
• The cumulative effect of read disturbance is captured by the different Markov states.
Markov Chain Process
• Reliability: the expected number of transitions before absorption
• Energy: the number of system writes or system reads
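
The reliability metric above, the expected number of transitions before absorption, has a standard closed form for absorbing Markov chains: if Q holds the transition probabilities among transient states, the expected step counts t solve (I - Q) t = 1. The following is a minimal sketch of that calculation on a toy chain. The states and the probability 0.1 are hypothetical and are not the model from the paper.

```python
def expected_steps_to_absorption(Q):
    """Solve (I - Q) t = 1 by Gauss-Jordan elimination.

    Q[i][j] is the probability of moving from transient state i to
    transient state j; the missing row mass goes to the absorbing state.
    Returns t, where t[i] is the expected number of transitions before
    absorption when starting in state i.
    """
    n = len(Q)
    # Augmented matrix [I - Q | 1].
    A = [[(1.0 if i == j else 0.0) - Q[i][j] for j in range(n)] + [1.0]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-15:
            raise ValueError("I - Q is singular; no absorbing state is reachable")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                for c in range(col, n + 1):
                    A[r][c] -= f * A[col][c]
    return [A[i][n] / A[i][i] for i in range(n)]

# Toy chain: state 0 = no error, state 1 = one (correctable) error,
# absorbing state = uncorrectable. A new error occurs with probability 0.1.
no_refresh = [[0.9, 0.1],
              [0.0, 0.9]]   # the error stays until a second one arrives
refresh    = [[0.9, 0.1],
              [0.9, 0.0]]   # write-back returns state 1 to state 0
print(expected_steps_to_absorption(no_refresh))  # approximately [20, 10]
print(expected_steps_to_absorption(refresh))     # approximately [110, 100]
```

In this toy example, returning to the error-free state after a correctable error lengthens the expected time to failure, which is the effect a write-back policy aims for.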

Modeling Write-back after every read by Markov chain
S1: No error
S2: At most one transient error
S3: One persistent error
S4: At least two persistent errors
S5: At least two errors
[Figure: state diagram with write-back transitions]

Markov Model for Proposed Techniques
[Figure: Markov chains for the three techniques. WAE: write-back. WAP: second read, write-back. WAT: write-back after threshold.]
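
Evaluation
Notation: pf is the false read error rate, pd is the read disturbance error rate, and pw is the write error rate.
• False read error rate is the highest bit error rate (single MTJ): pf > pd
• Read disturbance error rate is the highest bit error rate (double MTJ): pd > pf
• Write error rate is the highest bit error rate: pw > pd, pf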

ECC-1 for single MTJ
[Figure: Uncorrectable bit error rate (UBER) versus log(pf/pd) for ECC1, WAE, WAR, and WAP, in two panels. One panel: a=50%, b=50% (equal probability of read and write). Other panel: a=99.9%, b=0.1% (one write every 1000 reads). The log(pf/pd) axis runs over 6.000, 4.954, 3.903, 2.845, 1.778, 0.699; annotation: pd ~ pf.]
• Conclusion 1: WAR, WAE and WAP achieve acceptable UBER levels.
• Conclusion 2: When the user read-to-write ratio increases, system reliability varies significantly if pd is comparable to pf.
We evaluate a "worst-case" ratio of 1000 user reads to each user write (a=99.9% vs. b=0.1%).
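
The axis ticks are easier to interpret as plain ratios. The sketch below assumes the axis is a base-10 logarithm, which is consistent with the tick values (for example 0.699 is log10 of 5); the helper name is hypothetical.

```python
def ratio_from_log10(tick: float) -> float:
    """Convert a log10(pf/pd) tick to pf/pd, rounded to one significant figure."""
    return float(f"{10 ** tick:.1g}")

# Tick values of the log(pf/pd) axis for the single-MTJ plots.
single_mtj_ticks = [6.000, 4.954, 3.903, 2.845, 1.778, 0.699]
print([ratio_from_log10(t) for t in single_mtj_ticks])
# [1000000.0, 90000.0, 8000.0, 700.0, 60.0, 5.0]
```

Read this way, the sweep goes from pf about a million times larger than pd down to pf about five times larger than pd, which is the region annotated pd ~ pf.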

Single MTJ
[Figure: Energy overhead (log scale) versus Log (pf/pd), with Log (pf/pd) values 6.000, 4.954, 3.903, 2.845, 1.778, 0.699. ECC-1 panel: WAE, WAR, WAP; energy overhead axis from 1.E-03 to 1.E+01. ECC-2 panel: WAR, WAE, WAP, WAT; energy overhead axis from 1.E-05 to 1.E+01.]
Annotations:
• Two orders of magnitude improvement by WAE and WAP
• Three orders of magnitude improvement by WAT
• Leaving behind persistent errors when pd increases
Conclusion:
WAR incurs a large energy overhead. The other approaches dramatically reduce this overhead while achieving a similar or acceptable UBER level.

Single MTJ: Energy UBER Product (EUP)
[Figure: EUP (log scale) versus Log (pf/pd), with Log (pf/pd) values 6.000, 4.954, 3.903, 2.845, 1.778, 0.699, ranging from pf > pd to pd ~ pf. ECC-1 panel: WAR, WAE, WAP. ECC-2 panel: WAR, WAE, WAP, WAT.]
Annotations:
• Two orders of magnitude improvement by WAE and WAP
• Four orders of magnitude improvement by WAT
• Still two orders of magnitude improvement by WAR and WAE

Double MTJ: Energy UBER Product (EUP)
[Figure: EUP (log scale) versus Log (pf/pd), with Log (pf/pd) values 3.000, 1.845, 0.602, -0.155, -1.222, -2.699, ranging from pd < pf to pd > pf. (a) ECC-1 panel: WAR, WAE, WAP. (b) ECC-2 panel: WAR, WAE, WAP, WAT.]
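
As a reading aid for the EUP plots, the sketch below shows how an "orders of magnitude" improvement follows from two EUP values. It assumes EUP is the plain product of energy overhead and UBER, as the name suggests. All numbers are made up for illustration and are not results from the paper; the function names are hypothetical.

```python
import math

def eup(energy_overhead: float, uber: float) -> float:
    """Energy UBER Product, assumed here to be a plain product."""
    return energy_overhead * uber

def orders_of_magnitude(baseline: float, improved: float) -> float:
    """How many powers of ten `improved` is below `baseline`."""
    return math.log10(baseline / improved)

# Hypothetical values: same UBER, 100x lower energy overhead.
baseline_eup = eup(energy_overhead=1.0, uber=1e-8)
improved_eup = eup(energy_overhead=1e-2, uber=1e-8)
print(orders_of_magnitude(baseline_eup, improved_eup))  # about 2
```

Because the metric is a product, a policy can improve EUP by lowering energy, by lowering UBER, or both, so EUP alone does not say which of the two changed.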

Conclusion
Traditional ECC mitigates write faults and false read errors.
Observation: Read disturbance errors are correlated with repeated read operations.
Our Approach: On-demand refresh policy
• Write After Error detection (WAE): for when the read disturbance error rate is close to the other bit error rates
• Write After Persistent error (WAP): for when the false read error rate is higher than the read disturbance error rate
• Write After error Threshold (WAT): for when the false read error is the dominant error
Key results: Two orders of magnitude improvement in the product of reliability and energy across different ranges of bit error rates