
School of Electrical, Computer and Energy Engineering
PhD Final Oral Defense
Thesis title: Designing Low Cost Error Correction Schemes for Improving Memory
Reliability
by
Hsing-Min Chen
04/06/2017
9:30AM – 11:30AM
GWC 409
Committee:
Dr. Chaitali Chakrabarti (chair)
Dr. Carole-Jean Wu
Dr. Trevor Mudge
Dr. Umit Ogras
Abstract
Memory reliability is a major challenge in the design of large scale computing systems.
In this presentation, we describe three low-cost error protection schemes for 2D and 3D
DRAM systems. First, we present a low-overhead solution for improving the reliability of
commodity DRAM systems with no change in the existing memory architecture.
Specifically, we propose five erasure and error correction (E-ECC) schemes that provide
at least Chipkill-Correct protection for x4 (Schemes 1, 2 and 3), x8 (Scheme 4) and x16
(Scheme 5) DRAM systems. All schemes have superior error correction performance due
to the use of strong symbol-based codes. In addition, we make use of erasure codes to
extend the lifetime of the DRAM systems.
Second, we propose a rate-adaptive, two-tiered error correction scheme, RATT-ECC, that
provides strong reliability (10^10 reduction in raw FIT rate) for an HBM-like 3D DRAM
system for CPU applications. The tier-1 code is a strong symbol-based code that can
correct errors due to small granularity faults and detect errors caused by large granularity
faults; the tier-2 code is an XOR-based code that corrects errors detected by the tier-1
code. The rate-adaptive feature of RATT-ECC enables permanent bank failures to be
handled through sparing. It can also be used to significantly reduce the refresh power
consumption without decreasing the reliability and timing performance.
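The division of labor between the two tiers can be illustrated with a minimal sketch. It is not the RATT-ECC construction itself: it assumes a plain bytewise XOR parity over equal-sized data blocks, and it assumes the tier-1 code has already flagged which block is bad, so the tier-2 step reduces to erasure recovery. The names xor_blocks and recover_block and the block contents are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_block(blocks, parity, bad_index):
    """Rebuild the block at bad_index from the other blocks and the parity.

    Assumption: a tier-1 style code has already detected that the block at
    bad_index is unreliable, so its position is known (an erasure).
    """
    survivors = [b for i, b in enumerate(blocks) if i != bad_index]
    return xor_blocks(survivors + [parity])


# Hypothetical data: three 4-byte blocks protected by one parity block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xff\x00\xaa\x55"]
parity = xor_blocks(data)

# Simulate a large granularity fault that wipes out block 1.
damaged = list(data)
damaged[1] = b"\x00\x00\x00\x00"
restored = recover_block(damaged, parity, bad_index=1)
```

A single XOR parity can rebuild only one flagged block per parity group, which is why this kind of correction depends on a detection step to locate the failure first.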
Third, we propose a novel memory protection scheme, Configurable-ECC, which
provides strong reliability for different sized accesses (32B, 64B and 128B) in HBM-GPU
applications. The proposed scheme is also two-tiered, as in RATT-ECC. However,
the tier-1 code now consists of an inner code to detect errors due to small and large
granularity faults and an outer code to correct errors due to small granularity faults. To
support different sized accesses, the tier-1 code has a modular structure. Overall,
Configurable-ECC has a flexible structure that supports different access sizes with strong
reliability and without energy wastage.