School of Electrical, Computer and Energy Engineering
PhD Final Oral Defense

Thesis title: Designing Low Cost Error Correction Schemes for Improving Memory Reliability
by Hsing-Min Chen

04/06/2017, 9:30AM – 11:30AM
GWC 409

Committee:
Dr. Chaitali Chakrabarti (chair)
Dr. Carole-Jean Wu
Dr. Trevor Mudge
Dr. Umit Ogras

Abstract

Memory reliability is a major challenge in the design of large scale computing systems. In this presentation, we present three low cost error protection schemes for 2D and 3D DRAM systems.

First, we present a low overhead solution to improving the reliability of commodity DRAM systems with no change in the existing memory architecture. Specifically, we propose five erasure and error correction (E-ECC) schemes that provide at least Chipkill-Correct protection for x4 (Schemes 1, 2 and 3), x8 (Scheme 4) and x16 (Scheme 5) DRAM systems. All schemes have superior error correction performance due to the use of strong symbol-based codes. In addition, we make use of erasure codes to extend the lifetime of the DRAM systems.

Second, we propose a rate-adaptive, two-tiered error correction scheme, RATT-ECC, that provides strong reliability (10^10 reduction in raw FIT rate) for an HBM-like 3D DRAM system for CPU applications. The tier-1 code is a strong symbol-based code that can correct errors due to small granularity faults and detect errors caused by large granularity faults; the tier-2 code is an XOR-based code that corrects errors detected by the tier-1 code. The rate-adaptive feature of RATT-ECC enables permanent bank failures to be handled through sparing. It can also be used to significantly reduce the refresh power consumption without decreasing the reliability and timing performance.

Third, we propose a novel memory protection scheme, Configurable-ECC, which provides strong reliability for different sized accesses (32B, 64B and 128B) in HBM GPU applications. The proposed scheme is also two-tiered, as in RATT-ECC.
However, the tier-1 code now consists of an inner code that detects errors due to small and large granularity faults and an outer code that corrects errors due to small granularity faults. To support different sized accesses, the tier-1 code has a modular structure. Overall, Configurable-ECC has a flexible structure that supports different access sizes with strong reliability and without energy wastage.
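
The two-tiered idea described above, in which one code flags a bad block and an XOR-based code rebuilds it, can be illustrated with a minimal sketch. This is not the code construction used in the thesis: the abstract does not give the block layout or code parameters, so the sketch assumes a single XOR parity block over a group of equal-sized data blocks, and it uses a CRC32 checksum as a stand-in for the detection role of the tier-1 code. The function names (xor_blocks, encode, recover) are hypothetical.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Compute per-block checksums and one XOR parity block.

    Assumption: CRC32 stands in for tier-1 detection; the thesis uses
    symbol-based codes, which are not modeled here.
    """
    checks = [zlib.crc32(block) for block in data_blocks]
    parity = xor_blocks(data_blocks)  # tier-2 stand-in: XOR of all data blocks
    return checks, parity


def recover(stored_blocks, checks, parity):
    """Rebuild at most one block whose checksum no longer matches."""
    bad = [i for i, block in enumerate(stored_blocks)
           if zlib.crc32(block) != checks[i]]
    if not bad:
        return list(stored_blocks)
    if len(bad) > 1:
        raise ValueError("one XOR parity block can rebuild only one flagged block")
    failed = bad[0]
    survivors = [b for i, b in enumerate(stored_blocks) if i != failed]
    repaired = list(stored_blocks)
    # XOR of the parity with every surviving block yields the missing block.
    repaired[failed] = xor_blocks(survivors + [parity])
    return repaired


if __name__ == "__main__":
    data = [bytes([value] * 8) for value in (1, 2, 3, 4)]
    checks, parity = encode(data)
    corrupted = list(data)
    corrupted[2] = b"\x00" + data[2][1:]  # corrupt one byte of block 2
    print(recover(corrupted, checks, parity) == data)
```

In this simplified setting the detecting code only needs to locate the failed block; the XOR parity then treats it as an erasure, which is why a single parity block is enough to rebuild it.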